scholarsmate opened this issue 4 years ago
I haven't played much with swarm, but as far as I understand each stack can be a separate compose file / project. Currently we filter containers in the same project:
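For reference, a minimal docker-py sketch of that kind of label filter (not necessarily the project's exact code; the label key is the one docker-compose sets, shown further down):

```python
import docker

client = docker.from_env()

# Only consider containers that belong to the same compose project
# as the backup container itself.
project = "restic-compose-backup"  # normally read from our own container's labels
containers = client.containers.list(
    filters={"label": f"com.docker.compose.project={project}"}
)
for container in containers:
    print(container.name, container.labels.get("com.docker.compose.service"))
```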
I'm guessing that should be trivial to fix. Could make it configurable in some way.
Also, I am planning to revamp this project a bit, making the code a lot cleaner and more extensible. It works pretty well, but it is the result of experimenting with the design over time 😄
Should probably try to add a test case for swarm setups.
Traefik (another awesome project) uses a command line switch, namely --providers.docker.swarmmode=true, to inform the service that it's running in swarm mode so it knows how to read and interpret the service configurations from the docker socket.
Looks like it's not that hard to separate traditional compose setups from stacks.
Labels set for containers in compose projects:
"com.docker.compose.project": "restic-compose-backup",
"com.docker.compose.service": "mariadb",
Labels set for containers in stacks:
"com.docker.stack.namespace": "test",
"com.docker.swarm.service.name": "test_mariadb",
As long as we have some prefix and a service name we are good.
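As a sketch, normalizing the two label schemes could look like this (the helper name and return shape are made up for illustration):

```python
def project_and_service(labels: dict, swarm_mode: bool = False):
    """Extract a (project, service) pair from container labels.

    Hypothetical helper: compose projects and swarm stacks use
    different label keys, but both provide a prefix and a service name.
    """
    if swarm_mode:
        return (
            labels.get("com.docker.stack.namespace"),
            labels.get("com.docker.swarm.service.name"),
        )
    return (
        labels.get("com.docker.compose.project"),
        labels.get("com.docker.compose.service"),
    )
```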
Released version 0.5 with experimental swarm/stack support. If SWARM_MODE is defined, we evaluate containers in stacks. I'm not entirely sure how this works when scale is > 1, but at least this is a start.
Docs: https://restic-compose-backup.readthedocs.io/en/latest/guide/configuration.html#swarm-mode
Leaving this issue open for now. I'm sure there will be more to discuss.
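For reference, enabling it is just an environment variable on the backup service. A minimal compose sketch (the service layout here is illustrative, not a complete setup):

```yaml
version: "3.7"
services:
  backup:
    image: zettaio/restic-compose-backup:latest
    environment:
      - SWARM_MODE=true
    env_file:
      - restic.env
    volumes:
      # rcb reads container metadata from the docker socket
      - /var/run/docker.sock:/var/run/docker.sock
```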
Great, I hope to give this a try today!
Testing it on 3 nodes, and it looks like there is still some work to do. It appears that it does not detect services running on other nodes, and for services running on the same node I get mixed results. The docker stack includes Jira, Confluence and their databases, as well as the restic-compose-backup service, as shown in my original post (the only modification I made was to set SWARM_MODE=true in the /etc/restic/restic.env file). Here's what I have:
[vagrant@docker-server ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
71089e59f288 gitlab/gitlab-ee:latest "/assets/wrapper" 2 hours ago Up 2 hours (healthy) 22/tcp, 80/tcp, 443/tcp devops_gitlab.2.jyqsl9o247y7n8qm35njtc8ko
9887505825d0 atlassian/jira-software:latest "/tini -- /entrypoin…" 2 hours ago Up 2 hours 8080/tcp devops_jira.1.a6lqxyksm7tk8cbd3pxkei1s7
5b0cd831966e mongo:4.0 "docker-entrypoint.s…" 2 hours ago Up 2 hours 27017/tcp devops_mongo.1.v0see33mn6v71h24rtfu3edg2
aacdd99c4623 rocketchat/rocket.chat:latest "bash -c 'for i in `…" 2 hours ago Up 2 hours 3000/tcp devops_rocketchat.1.avx6d8papf9q7b1j4lezc2q6i
edbb95bdf3f6 grafana/grafana:latest "/run.sh" 3 hours ago Up 3 hours 3000/tcp devops_grafana.1.yazm5uk5gbi3blu8e9tnh4wue
5ee27ee7c637 sonatype/nexus3:latest "sh -c ${SONATYPE_DI…" 3 hours ago Up 3 hours 8081/tcp devops_nexus.1.ujnaqbn3zxytma4fvraq6erkw
3c8ccceb2723 postgres:latest "docker-entrypoint.s…" 3 hours ago Up 3 hours 5432/tcp devops_confluence_db.1.5yfhgmocnocs8k7f8cof0u2nk
f58fd3da4f76 zettaio/restic-compose-backup:latest "./entrypoint.sh" 3 hours ago Up 3 hours devops_backup.1.jprju4z108j7d29510c8i48ta
7ff12d3826ef traefik:v2.1 "/entrypoint.sh --lo…" 3 hours ago Up 3 hours 0.0.0.0:80->80/tcp, 0.0.0.0:2222->2222/tcp devops_traefik.yejmt4cjzol1x3020zkuwt6co.ofdsw5v7q94li6slzjivo51bj
0995f730221e portainer/agent:latest "./agent" 3 hours ago Up 3 hours 0.0.0.0:9001->9001/tcp devops_portainer-agent-internal.yejmt4cjzol1x3020zkuwt6co.o00ragu1nyrpnrj62u428iynv
[vagrant@docker-server ~]$
Now to see what the backup service detected:
[vagrant@docker-server ~]$ docker exec -it f58fd3da4f76 sh
/restic-compose-backup # rcp status
sh: rcp: not found
/restic-compose-backup # rcb status
2020-03-07 18:21:49,708 - INFO: Status for compose project 'None'
2020-03-07 18:21:49,708 - INFO: Repository: 'rest:http://10.4.16.6:8000'
2020-03-07 18:21:49,708 - INFO: Backup currently running?: False
2020-03-07 18:21:49,709 - INFO: Checking docker availability
2020-03-07 18:21:49,858 - ERROR: ---------- stderr ----------
2020-03-07 18:21:49,858 - ERROR: Fatal: unable to open config file: <config/> does not exist
2020-03-07 18:21:49,859 - ERROR: Is there a repository at the following location?
2020-03-07 18:21:49,859 - ERROR: rest:http://10.4.16.6:8000
2020-03-07 18:21:49,859 - ERROR: ----------------------------
2020-03-07 18:21:49,859 - INFO: Could not get repository info. Attempting to initialize it.
2020-03-07 18:21:52,269 - INFO: Successfully initialized repository: rest:http://10.4.16.6:8000
2020-03-07 18:21:52,270 - INFO: ------------------------- Detected Config -------------------------
2020-03-07 18:21:52,271 - INFO: service: devops_confluence_db
2020-03-07 18:21:52,313 - INFO: - postgres (is_ready=True)
2020-03-07 18:21:52,314 - INFO: -------------------------------------------------------------------
/restic-compose-backup # rcb status
2020-03-07 18:22:17,130 - INFO: Status for compose project 'None'
2020-03-07 18:22:17,131 - INFO: Repository: 'rest:http://10.4.16.6:8000'
2020-03-07 18:22:17,131 - INFO: Backup currently running?: False
2020-03-07 18:22:17,131 - INFO: Checking docker availability
2020-03-07 18:22:18,019 - INFO: ------------------------- Detected Config -------------------------
2020-03-07 18:22:18,020 - INFO: service: devops_confluence_db
2020-03-07 18:22:18,038 - INFO: - postgres (is_ready=True)
2020-03-07 18:22:18,039 - INFO: -------------------------------------------------------------------
/restic-compose-backup #
It's really cool that the service initializes the restic backup location for me the first time, and that it discovered the Postgres instance running on the same node. What's missing is that Jira is also running on that same node, and it has been labeled with restic-compose-backup.volumes=true. If we can get the backup service to at least handle the services running on the same node as it, that's progress enough that I think I can use it in my project. If we can get it working for all nodes across the entire swarm, that would be complete and total victory!
Hmm. I guess at least for databases the service endpoint could potentially be used to detect containers across the different nodes. We only need to gather credentials and host/port.
For volumes/binds it gets a lot more complicated, depending on whether the setup has some form of volume sharing across the nodes. The backup container would have to run on all nodes to cover all volumes. I've tried streaming data using get_archive in the past, but it is way too unreliable.
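As a rough docker-py sketch of what "gather credentials and host/port" could look like for a postgres service (the env var names are the stock postgres image ones; everything else is illustrative):

```python
import docker

client = docker.from_env()
container = client.containers.get("devops_confluence_db.1.9sndy6nj9hkto2j7lpqv0fkyd")

# Container env is a list of "KEY=VALUE" strings in the inspect data.
env = dict(
    item.split("=", 1)
    for item in container.attrs["Config"]["Env"]
    if "=" in item
)

# In swarm, the service name resolves over the overlay network,
# so the host can be the service rather than a node address.
credentials = {
    "host": container.labels["com.docker.swarm.service.name"],
    "port": 5432,
    "user": env.get("POSTGRES_USER", "postgres"),
    "password": env.get("POSTGRES_PASSWORD"),
    "database": env.get("POSTGRES_DB", env.get("POSTGRES_USER", "postgres")),
}
```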
In your setup it should ideally find:
It seems confluence_db and jira are located on the node you listed. It's weird that it doesn't detect the jira volumes; I would assume we could find jira_db as well. For the volumes in confluence... that's where it gets complicated. The backup service right now is only able to handle volumes on the node it currently runs on.
I'm wondering if the simplest solution is to make the backup service run a container on each node in the swarm cluster when doing operations. It would still keep the management simple and it's easier to coordinate and configure how to approach volume backups.
Precisely. If it can work for a single node in the cluster (DB backups and volumes), then we just run it on every node in the swarm and that ought to work.
After looking into labels: in your jira service you have to set the labels on the container instead of the service (labels under deploy are only exposed on the service, not on the container itself). We should probably support service labels as well, but that is not on the "quick fix" list.
Loading service data should be part of the rewrite/cleanup. Right now we are just passing around a list of containers instead of a higher-level environment object. It's still kind of prototype code 😄
Can you verify that moving the label actually makes rcb detect the container and volumes?
If this works fine, I think 2 of 3 issues are resolved to make basic swarm support work. The last one is to detect the swarm node config and make rcb operations run per node in a loop with docker run and node constraints.
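Mechanically, a one-shot swarm service pinned to a node could serve as that "docker run with node constraints". A hedged CLI sketch (service name, node name and mounts are illustrative):

```sh
# Spawn a throwaway task pinned to one node; --restart-condition none
# makes it behave like `docker run` instead of a long-lived service.
docker service create \
  --name rcb-backup-node1 \
  --restart-condition none \
  --constraint "node.hostname==docker-node-1" \
  --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
  zettaio/restic-compose-backup:latest \
  rcb backup
```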
I made the following changes:
$ git diff
diff --git a/atlassian/docker-compose.yml b/atlassian/docker-compose.yml
index 7a10f86..4a1e61a 100644
--- a/atlassian/docker-compose.yml
+++ b/atlassian/docker-compose.yml
@@ -51,12 +51,14 @@ services:
- default
depends_on:
- jira_db
+ labels:
+ - restic-compose-backup.volumes=true
# Control the number of instances deployed
deploy:
mode: replicated
replicas: 1
labels:
- - restic-compose-backup.volumes=true
+ #- restic-compose-backup.volumes=true
- traefik.enable=true
- traefik.http.routers.jira.entrypoints=web
- traefik.http.routers.jira.rule=Host(`jira.${SVC_DOMAIN?}`)
@@ -81,12 +83,14 @@ services:
- default
depends_on:
- confluence_db
+ labels:
+ - restic-compose-backup.volumes=true
# Control the number of instances deployed
deploy:
mode: replicated
replicas: 1
labels:
- - restic-compose-backup.volumes=true
+ #- restic-compose-backup.volumes=true
- traefik.enable=true
- traefik.http.routers.confluence.entrypoints=web
- traefik.http.routers.confluence.rule=Host(`confluence.${SVC_DOMAIN?}`)
Then I re-rolled my install; this time Confluence and the Confluence database ended up on the same VM as the backup service:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
84e68b25e462 gitlab/gitlab-ee:latest "/assets/wrapper" 17 minutes ago Up 17 minutes (healthy) 22/tcp, 80/tcp, 443/tcp devops_gitlab.1.zs5jg5zu3ldwno6vmo5csv9ua
9b4ec876fc1a atlassian/confluence-server:latest "/sbin/tini -- /entr…" 29 minutes ago Up 29 minutes 8090-8091/tcp devops_confluence.1.scx0xyjvogziwfwr4x91shavj
9bf02dc68d53 rocketchat/rocket.chat:latest "bash -c 'for i in `…" 32 minutes ago Up 32 minutes 3000/tcp devops_rocketchat.2.jygljec3mnke5zxwmddl61902
41464af9452d mongo:4.0 "docker-entrypoint.s…" 35 minutes ago Up 35 minutes 27017/tcp devops_mongo.1.u1uad4yap162hpkoxrbvc0f4m
b66d093d72ce postgres:latest "docker-entrypoint.s…" 41 minutes ago Up 41 minutes 5432/tcp devops_confluence_db.1.9sndy6nj9hkto2j7lpqv0fkyd
813796e0f3d1 grafana/grafana:latest "/run.sh" 46 minutes ago Up 46 minutes 3000/tcp devops_grafana.1.1nso09jvako3ihlx47mmlv5pl
97983038e5f1 zettaio/restic-compose-backup:latest "./entrypoint.sh" About an hour ago Up About an hour devops_backup.1.jd9kf01r3wb26wdf60lrr6qwj
2118f090255d traefik:v2.1 "/entrypoint.sh --lo…" About an hour ago Up About an hour 0.0.0.0:80->80/tcp, 0.0.0.0:2222->2222/tcp devops_traefik.ph0orqj7nii9pspy440q5n6ws.t3w9b8mkfkvfpjtok5089mlye
014c45bf586f portainer/agent:latest "./agent" About an hour ago Up About an hour 0.0.0.0:9001->9001/tcp devops_portainer-agent-internal.ph0orqj7nii9pspy440q5n6ws.ohv6r2105xxevehjc28d9gw26
Getting the status we see:
$ docker exec -it 97983038e5f1 sh
/restic-compose-backup # rcb status
2020-03-08 21:16:09,284 - INFO: Status for compose project 'None'
2020-03-08 21:16:09,284 - INFO: Repository: 'rest:http://10.4.16.6:8000'
2020-03-08 21:16:09,285 - INFO: Backup currently running?: False
2020-03-08 21:16:09,285 - INFO: Checking docker availability
2020-03-08 21:16:10,112 - INFO: ------------------------- Detected Config -------------------------
2020-03-08 21:16:10,113 - INFO: service: devops_confluence
2020-03-08 21:16:10,114 - INFO: - volume: /var/lib/docker/volumes/devops_confluence_data/_data
2020-03-08 21:16:10,114 - INFO: - volume: /etc/localtime
2020-03-08 21:16:10,114 - INFO: service: devops_confluence_db
2020-03-08 21:16:10,132 - INFO: - postgres (is_ready=True)
2020-03-08 21:16:10,133 - INFO: -------------------------------------------------------------------
/restic-compose-backup #
Putting the labels at the container level made a difference.
Awesome. That means we can move to step 3.
TODO:
- SWARM_MODE
- Run rcb operations per node (status and backup)

In a 3-node cluster, setting the replication of the backup service to 3 does not guarantee that the service gets evenly spread out across the nodes. In my last test I found 1 backup service running on one node, 2 on another and 0 on the last. I think I can solve that problem, though.
On the node that had 1 backup service running, here's the docker ps:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b8bf2b793f34 rocketchat/rocket.chat:latest "bash -c 'for i in `…" 13 hours ago Up 13 hours 3000/tcp devops_rocketchat.1.zw1i7wx1c6kzlgqhn6vka9yho
842c6977ee41 zettaio/restic-compose-backup:latest "./entrypoint.sh" 13 hours ago Up 13 hours devops_backup.1.acglv44dttbyvmasrojixudff
9b6baca4716c traefik:v2.1 "/entrypoint.sh --lo…" 13 hours ago Up 13 hours 0.0.0.0:80->80/tcp, 0.0.0.0:2222->2222/tcp devops_traefik.dazqixxfwh1r5tcjcbvo2iuii.aqncrb55xwidne5jlkjewqyym
6a93f80e1a3c portainer/agent:latest "./agent" 13 hours ago Up 13 hours 0.0.0.0:9001->9001/tcp devops_portainer-agent-internal.dazqixxfwh1r5tcjcbvo2iuii.5ahpj9apv1tqg3qzsi1vdnl69
Rocket Chat was labelled as follows:
    volumes:
      - rocketchat_uploads:/app/uploads
    labels:
      - restic-compose-backup.volumes=true
Here is rcb status:
$ docker exec -it 842c6977ee41 sh
/restic-compose-backup # rcb status
2020-03-10 13:04:07,445 - INFO: Status for compose project 'None'
2020-03-10 13:04:07,445 - INFO: Repository: 'rest:http://10.4.16.6:8000'
2020-03-10 13:04:07,445 - INFO: Backup currently running?: False
2020-03-10 13:04:07,445 - INFO: Checking docker availability
2020-03-10 13:04:08,003 - ERROR: ---------- stderr ----------
2020-03-10 13:04:08,004 - ERROR: Fatal: Fatal: config cannot be loaded: ciphertext verification failed
2020-03-10 13:04:08,004 - ERROR: ----------------------------
2020-03-10 13:04:08,004 - INFO: Could not get repository info. Attempting to initialize it.
2020-03-10 13:04:08,036 - ERROR: ---------- stderr ----------
2020-03-10 13:04:08,037 - ERROR: Fatal: create repository at rest:http://10.4.16.6:8000 failed: Fatal: config file already exists
2020-03-10 13:04:08,037 - ERROR:
2020-03-10 13:04:08,037 - ERROR: ----------------------------
2020-03-10 13:04:08,038 - ERROR: Failed to initialize repository
2020-03-10 13:04:08,038 - INFO: ------------------------- Detected Config -------------------------
2020-03-10 13:04:08,038 - INFO: service: devops_rocketchat
2020-03-10 13:04:08,038 - INFO: - volume: /var/lib/docker/volumes/devops_rocketchat_uploads/_data
2020-03-10 13:04:08,038 - INFO: -------------------------------------------------------------------
And here is a backup attempt:
/restic-compose-backup # rcb backup
2020-03-10 13:05:36,543 - INFO: Starting backup container
2020-03-10 13:05:36,806 - INFO: Backup process container: strange_hoover
2020-03-10 13:05:37,614 - INFO: 2020-03-10 13:05:37,608 - INFO: Status for compose project 'None'
2020-03-10 13:05:37,616 - INFO: 2020-03-10 13:05:37,608 - INFO: Repository: 'rest:http://10.4.16.6:8000'
2020-03-10 13:05:37,617 - INFO: 2020-03-10 13:05:37,609 - INFO: Backup currently running?: False
2020-03-10 13:05:37,618 - INFO: 2020-03-10 13:05:37,609 - INFO: Checking docker availability
2020-03-10 13:05:38,199 - INFO: 2020-03-10 13:05:38,193 - ERROR: ---------- stderr ----------
2020-03-10 13:05:38,200 - INFO: 2020-03-10 13:05:38,194 - ERROR: Fatal: Fatal: config cannot be loaded: ciphertext verification failed
2020-03-10 13:05:38,201 - INFO: 2020-03-10 13:05:38,194 - ERROR: ----------------------------
2020-03-10 13:05:38,203 - INFO: 2020-03-10 13:05:38,194 - INFO: Could not get repository info. Attempting to initialize it.
2020-03-10 13:05:38,229 - INFO: 2020-03-10 13:05:38,223 - ERROR: ---------- stderr ----------
2020-03-10 13:05:38,231 - INFO: 2020-03-10 13:05:38,223 - ERROR: Fatal: create repository at rest:http://10.4.16.6:8000 failed: Fatal: config file already exists
2020-03-10 13:05:38,231 - INFO: 2020-03-10 13:05:38,223 - ERROR:
2020-03-10 13:05:38,232 - INFO: 2020-03-10 13:05:38,223 - ERROR: ----------------------------
2020-03-10 13:05:38,233 - INFO: 2020-03-10 13:05:38,224 - ERROR: Failed to initialize repository
2020-03-10 13:05:38,235 - INFO: 2020-03-10 13:05:38,224 - INFO: ------------------------- Detected Config -------------------------
2020-03-10 13:05:38,236 - INFO: 2020-03-10 13:05:38,224 - INFO: service: devops_rocketchat
2020-03-10 13:05:38,238 - INFO: 2020-03-10 13:05:38,225 - INFO: - volume: /var/lib/docker/volumes/devops_rocketchat_uploads/_data
2020-03-10 13:05:38,239 - INFO: 2020-03-10 13:05:38,225 - INFO: -------------------------------------------------------------------
2020-03-10 13:05:38,240 - INFO: 2020-03-10 13:05:38,225 - INFO: Backing up volumes
2020-03-10 13:05:38,785 - INFO: 2020-03-10 13:05:38,777 - ERROR: ---------- stdout ----------
2020-03-10 13:05:38,786 - INFO: 2020-03-10 13:05:38,777 - ERROR: open repository
2020-03-10 13:05:38,787 - INFO: 2020-03-10 13:05:38,778 - ERROR: ----------------------------
2020-03-10 13:05:38,788 - INFO: 2020-03-10 13:05:38,778 - ERROR: ---------- stderr ----------
2020-03-10 13:05:38,790 - INFO: 2020-03-10 13:05:38,778 - ERROR: Fatal: Fatal: config cannot be loaded: ciphertext verification failed
2020-03-10 13:05:38,791 - INFO: 2020-03-10 13:05:38,778 - ERROR: ----------------------------
2020-03-10 13:05:38,792 - INFO: 2020-03-10 13:05:38,778 - ERROR: Volume backup exited with non-zero code: 1
2020-03-10 13:05:38,793 - INFO: 2020-03-10 13:05:38,779 - INFO: Backing up databases
2020-03-10 13:05:38,794 - INFO: 2020-03-10 13:05:38,779 - ERROR: Exit code: True
2020-03-10 13:05:39,025 - INFO: Backup container exit code: 1
2020-03-10 13:05:39,025 - INFO: No alerts configured
I'll work on ensuring the backup service is running, one per node, since this has to do with how I've got my swarm deployment configured. Beyond that, there appears to be another problem with repository initialization. Perhaps it's a race, due to the fact that the 3 backup services are all set to run at the same time. Maybe we need to set up a helper that just does the repository initialization once at startup, then exits. Restic should be able to handle multiple simultaneous backups, but perhaps not to the same repository.
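One way to make that initialization race-tolerant, sketched on top of the restic CLI (the error-string match is an assumption based on the log above):

```python
import subprocess

def ensure_repository(repository: str) -> bool:
    """Initialize the restic repo, tolerating a concurrent winner.

    If another node initialized the repository between our failed
    status check and our init, restic fails with
    "config file already exists" -- treat that as success.
    """
    result = subprocess.run(
        ["restic", "-r", repository, "init"],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return True
    # Assumption: matching on restic's error message, as seen in the log.
    if "config file already exists" in result.stderr:
        return True
    return False
```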
I'm beginning to think that all required backup configuration (backup endpoint, credentials and schedule) could be given in the labels (overriding the environment file), so that each backed-up service can be backed up to different repositories at different times, if desired. Basically using the Traefik model for a more dynamic backup configuration.
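In the Traefik spirit, that could look something like the following; the override labels here are hypothetical, not something rcb reads today:

```yaml
  rocketchat:
    labels:
      - restic-compose-backup.volumes=true
      # Hypothetical per-service overrides:
      - restic-compose-backup.repository=rest:http://10.4.16.6:8000
      - restic-compose-backup.schedule=0 2 * * *
```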
Thanks for testing this. That definitely looks like a race condition on repo initialization, so running a backup service per node is a no-go.
The backup service needs to do the following (for status and backup):
- Run a status_node command on each node, in a spawned container with a node constraint
- Run a backup_node command on each node, in a spawned container with a node constraint

This is exactly how we do it for a single node as well, except that we are able to handle each node sequentially. The new status_node and backup_node commands would really only need to do a subset of the operations compared to the backup and status commands.
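A docker-py sketch of that per-node loop (command names from the list above; image, naming and mounts are illustrative):

```python
import docker
from docker.types import RestartPolicy

def run_on_all_nodes(command: str,
                     image: str = "zettaio/restic-compose-backup:latest"):
    """Spawn a one-shot task per swarm node, constrained to that node."""
    client = docker.from_env()
    for node in client.nodes.list():
        hostname = node.attrs["Description"]["Hostname"]
        client.services.create(
            image,
            command=["rcb", command],
            name=f"rcb-{command}-{hostname}",
            constraints=[f"node.hostname=={hostname}"],
            # One-shot: do not restart the task when it exits.
            restart_policy=RestartPolicy(condition="none"),
            mounts=["/var/run/docker.sock:/var/run/docker.sock:rw"],
        )

# e.g. run_on_all_nodes("status_node") or run_on_all_nodes("backup_node")
```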
The fancier configuration you mention is desirable, but probably more of a 1.0 thing. I'd have to look into how restic handles multiple hosts; I know it can handle this when set up correctly.
Still, your experiment was useful.
First of all, let me say this is an awesome project. There are many Restic docker projects out there, but this one seems to have gotten it right from a design perspective.
I'm trying to integrate the service into a docker-compose swarm. I have this:
I deploy that into a stack that is running Jira, Confluence, 2 PostgreSQL databases and Traefik 2.
I have a Restic REST service running in another VM. When I check the status, it doesn't appear that the backup service can read the labels despite the labels being present.
Is reading labels from services running in a swarm stack something that is currently supported by this project?
Thanks.