Warehouse scale server repair, more benign than borg.
Bynar is an open source system for automating server maintenance across the datacenter. It builds upon many years of experience automating the drudgery of server repair. The goal is to have the datacenter maintain itself. Large clusters require a lot of maintenance: Cassandra, Ceph, Gluster, Hadoop and others all require quick replacement of server parts as they break down, or the cluster can become degraded. As your clusters grow, you generally need more people to maintain them. Bynar hopes to break this cycle and free up your time so that your clusters can scale to ever greater sizes without requiring more people to maintain them.
The project is divided into different binaries that all communicate over protobuf, among them disk-manager, which is called to add the new disk back into the server, and bynar.
Bynar requires a Postgres database to be set up. Setting up a production-ready Postgres is outside the scope of this document. For testing Bynar, a docker postgres container is quick to set up. The database maintains information about hardware status and ongoing operations.
Edit the /etc/bynar/bynar.json file to configure Bynar.
The slack_* fields are optional. They allow Bynar to send alerts to a Slack channel while it is performing maintenance. The daemon_* fields are also optional.
{
"proxy": "https://my.proxy",
"manager_host": "localhost",
"manager_port": 5555,
"slack_webhook": "https://hooks.slack.com/services/ID",
"slack_channel": "#my-channel",
"slack_botname": "my-bot",
"jira_user": "test_user",
"jira_password": "user_password",
"jira_host": "https://tickets.jira.com",
"jira_issue_type": "3",
"jira_priority": "4",
"jira_project_id": "MyProject",
"jira_ticket_assignee": "assignee_username",
"vault_endpoint": "https://my_vault.com",
"vault_token": "token_98706420",
"database": {
"username": "postgres_user",
"password": "postgres_passwd",
"port": "8888",
"dbname": "database_name",
"endpoint": "some.endpoint"
},
"daemon_output": "bynar_daemon.out",
"daemon_error" : "bynar_daemon.err",
"daemon_pid" : "bynar_daemon.pid"
}
The disk-manager binary handles adding and removing disks from a server. It uses protobuf serialization to allow RPC usage. Please check the api crate or the bynar-client for more information.
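Configure it through the /etc/bynar/disk-manager.json file. This file should be deployed on each server that runs disk-manager. If vault is enabled, the public key is stored at /bynar/{hostname}.pem, and any clients wanting to connect to disk-manager will need to contact vault first. If vault is not enabled, it will save the public key to /etc/bynar/.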
{
"backend": "ceph",
"vault_endpoint": "https://my_vault:8888",
"vault_token": "token_98706420"
}
When Bynar runs on Ceph, it needs a ceph.json file that tells it where to look for the Ceph configuration, user details, etc.
An example /etc/bynar/ceph.json file:
{
"config_file": "/etc/ceph/ceph.conf",
"user_id": "admin",
"pool_name": "pool_name",
"target_weight": 1.0,
"system_disks": [
{
"device": "/dev/sdc"
}
],
"journal_devices": [
{
"device": "/dev/sda"
},
{
"device": "/dev/sdb",
"partition_id": 1
}
],
"osd_config": [
{
"is_lvm": false,
"dev_path": "/dev/sdx",
"journal_path": "/dev/sdxY",
"rdb_path": "/dev/sdxZ"
}
],
"udev_rule_path": "/etc/udev/rules.d"
}
The pool_name is the name of the pool used to measure latency in the cluster, and target_weight is the desired weight of OSDs in the cluster.
System disks must be specified so that the Ceph backend can filter them out.
This is the list of all disks that Ceph should not run on.
It must include any disk that holds the root or boot partition, as well as the device paths of the root and boot (/boot, /boot/efi) partitions themselves.
Bynar needs to be able to distinguish these disks so that it does not try to wipe a boot partition.
If they are not provided, the Ceph backend will attempt to add or remove the disk or partition as an OSD.
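The filtering described above can be pictured with a small Rust sketch. This is not Bynar's actual code: the function name `osd_candidates` is hypothetical, and exact path matching is an assumption based on the requirement that both the disk and its root/boot partition paths be listed.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of Bynar's API): returns the devices that
/// are NOT listed as system disks and may therefore be considered for OSDs.
/// Assumption: entries are matched by exact device path.
fn osd_candidates<'a>(all_devices: &[&'a str], system_disks: &[&str]) -> Vec<&'a str> {
    // Collect the configured system disks into a set for quick lookups.
    let excluded: HashSet<&str> = system_disks.iter().copied().collect();
    all_devices
        .iter()
        .copied()
        // Drop every device whose path appears in the system disk list.
        .filter(|dev| !excluded.contains(*dev))
        .collect()
}
```

With system disks `/dev/sdc` and `/dev/sdc1` configured, a server exposing `/dev/sda`, `/dev/sdc`, `/dev/sdc1` and `/dev/sdx` would leave only `/dev/sda` and `/dev/sdx` as candidates in this sketch.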
Optionally, latency_cap, backfill_cap, and increment can be specified for Ceph to use.
Bynar gradually weights in an OSD that is added to the cluster so as not to introduce too much latency or cause issues with PGs stuck in backfill.
Bynar has its own defaults, but explicit values can be set to override them.
Note that latency_cap is in milliseconds.
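As a rough illustration of gradual weighting, the sketch below computes the next weight for a newly added OSD. It is a simplified model, not Bynar's implementation: the `WeightPolicy` struct and `next_weight` function are hypothetical, and treating backfill_cap as a limit on the number of PGs currently in backfill is an assumption.

```rust
/// Hypothetical container for the tuning values named in ceph.json.
struct WeightPolicy {
    target_weight: f64,
    increment: f64,
    latency_cap_ms: f64, // latency_cap is given in milliseconds
    backfill_cap: u64,   // assumed here: max PGs allowed in backfill
}

/// Returns the weight to apply next, or None when the OSD should stay at its
/// current weight because the cluster is over one of the caps.
fn next_weight(
    current: f64,
    latency_ms: f64,
    backfilling_pgs: u64,
    policy: &WeightPolicy,
) -> Option<f64> {
    if latency_ms > policy.latency_cap_ms || backfilling_pgs > policy.backfill_cap {
        // Cluster is under pressure: wait before adding more weight.
        return None;
    }
    // Step up by one increment, never overshooting the target weight.
    Some((current + policy.increment).min(policy.target_weight))
}
```

A caller would invoke `next_weight` repeatedly, applying each returned weight and re-measuring the cluster, until the OSD reaches target_weight.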
Journal devices can optionally be specified for Ceph to use. Bynar will attempt to balance the number of partitions across the devices given. If an explicit partition_id is also given, Bynar will make use of it. If no partition_id is given, Bynar will create new partitions when disks are added. The partition size will be equal to the ceph.conf osd journal size configuration setting, which is given in megabytes.
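One simple way to balance partitions across journal devices is to always pick the device that currently holds the fewest. The sketch below shows that idea only; `pick_journal_device` is a hypothetical name and Bynar's real balancing logic may differ.

```rust
use std::collections::HashMap;

/// Hypothetical helper: picks the journal device with the fewest existing
/// partitions. Devices missing from `partition_counts` count as empty, and
/// ties go to the device listed first.
fn pick_journal_device<'a>(
    devices: &[&'a str],
    partition_counts: &HashMap<&str, usize>,
) -> Option<&'a str> {
    devices
        .iter()
        .copied()
        // min_by_key returns the first element among equal minimums.
        .min_by_key(|dev| partition_counts.get(*dev).copied().unwrap_or(0))
}
```

For example, with `/dev/sda` holding three partitions and `/dev/sdb` holding one, the next journal partition would be created on `/dev/sdb` in this sketch.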
OSD configs should be specified for each OSD device on the server.
This lets Bynar know whether to add an OSD device manually or through LVM.
When configuring a Bluestore device that will not be added through LVM, you can also specify the journal path and the RocksDB path (the block.wal and block.db symlinks respectively); the two should not point to the same location.
The udev_rule_path is needed when adding an OSD device manually, as the kernel needs to recognize that the device is owned by ceph:ceph.
Launch the disk-manager and bynar services on every server you want maintained.
This community repository hosts all information about building Bynar from source, how to contribute code and documentation, who to contact about what, etc.
Ensure there is enough space on the root partition of your development system; the typical recommendation is at least 25GB. The following packages are required. Install each one using:
sudo apt install <package_name>
CLI command to install all the dependencies:
sudo apt install libzmq3-dev libprotobuf-dev librados2 libatasmart-dev libssl-dev libblkid-dev libudev-dev librados-dev pkg-config libclang-dev llvm libdevmapper-dev liblvm2-dev liblvm2app2.2 gcc clang smartmontools parted
Install Rust and point it to the nightly build. The stable version is not sufficient to run the test cases, which need a feature only available on nightly.
$ curl https://sh.rustup.rs -sSf | sh
$ rustup override set nightly
Log in to your GitHub account and check out the latest source code from this repository. Then, to create the executable binary, run:
$ cargo build --release
To check your code without building the binary:
$ cargo check
Hardware issues crop up all the time as part of the regular cycle of things in servers. Bynar can almost completely automate the maintenance around a hard drive failure, except for the actual replacing of the drive. The typical workflow by a human would look something like this:
So how can Bynar help? It can handle steps 1, 2, 3, 4 and 6. Nearly everything! While it is replacing your drives it can also inform you over Slack or other channels to keep you in the loop. The time saved here multiplies with each piece of hardware replaced, and you can focus your time and energy on other things. It's a positive snowball effect!
Note that root permissions are required for integration testing, because the test functions attempt to create loopback devices, mount them, check their filesystems and so on, all of which requires root. The nightly compiler is also required for testing because mocktopus makes use of features that haven't landed in stable yet. To run the tests:
sudo ~/.cargo/bin/cargo test -- --nocapture
If you need support, start by checking the issues page. If that doesn't answer your questions, or if you think you found a bug, please file an issue.
That said, if you have questions, reach out to us through the communication channels.
Want to contribute to Bynar? Awesome! Check out the contributing guide.