ZettaIO / restic-compose-backup

Automatic restic backup of a docker-compose setup. https://hub.docker.com/r/zettaio/restic-compose-backup
MIT License

Docker swarm support #23

Open scholarsmate opened 4 years ago

scholarsmate commented 4 years ago

First of all, let me say this is an awesome project. There are many Restic docker projects out there, but this one seems to have gotten it right from a design perspective.

I'm trying to integrate the service into a Docker swarm stack. I have this:

version: '3'

services:
  backup:
    image: 'zettaio/restic-compose-backup'
    env_file:
      - /etc/restic/restic.env
    volumes:
      # We need to communicate with docker
      - /var/run/docker.sock:/tmp/docker.sock:ro
      # Persistent storage of restic cache (greatly speeds up all restic operations)
      - cache:/cache
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.role == manager

volumes:
  cache:

I deploy that into a stack that is running Jira, Confluence, two PostgreSQL databases, and Traefik 2.

version: '3'

services:

  jira_db:
    image: postgres
    environment:
      - POSTGRES_DB=jiradb
      - POSTGRES_USER=atlassian
      - POSTGRES_PASSWORD=atlassian
      - POSTGRES_ENCODING=UTF8
      - POSTGRES_COLLATE=C
      - POSTGRES_COLLATE_TYPE=C
    command: ["-c", "shared_buffers=256MB", "-c", "max_connections=200"]
    volumes:
      - postgres_jira_data:/var/lib/postgresql/data
    labels:
      - restic-compose-backup.postgres=true

  confluence_db:
    image: postgres
    environment:
      - POSTGRES_DB=confluencedb
      - POSTGRES_USER=atlassian
      - POSTGRES_PASSWORD=atlassian
      - POSTGRES_ENCODING=UTF8
      - POSTGRES_COLLATE=C
      - POSTGRES_COLLATE_TYPE=C
    command: ["-c", "shared_buffers=256MB", "-c", "max_connections=200"]
    volumes:
      - postgres_confluence_data:/var/lib/postgresql/data
    labels:
      - restic-compose-backup.postgres=true

  jira:
    image: atlassian/jira-software
    environment:
      - ATL_PROXY_NAME=jira.${SVC_DOMAIN?}
      - ATL_PROXY_PORT=443
      - ATL_TOMCAT_SCHEME=https
      - ATL_JDBC_URL=jdbc:postgresql://jira_db:5432/jiradb
      - ATL_JDBC_USER=atlassian
      - ATL_JDBC_PASSWORD=atlassian
      - ATL_DB_TYPE=postgres72
      - ATL_DB_DRIVER=org.postgresql.Driver
    volumes:
      - jira_data:/var/atlassian/application-data/jira
      - /etc/localtime:/etc/localtime:ro
    networks:
      - traefik-proxy
      - default
    depends_on:
      - jira_db
    # Control the number of instances deployed
    deploy:
      mode: replicated
      replicas: 1
      labels:
        - restic-compose-backup.volumes=true
        - traefik.enable=true
        - traefik.http.routers.jira.entrypoints=web
        - traefik.http.routers.jira.rule=Host(`jira.${SVC_DOMAIN?}`)
        - traefik.http.services.jira.loadbalancer.server.port=8080
        - traefik.docker.network=traefik-proxy

  confluence:
    image: atlassian/confluence-server:latest
    environment:
      - ATL_PROXY_NAME=confluence.${SVC_DOMAIN?}
      - ATL_PROXY_PORT=443
      - ATL_TOMCAT_SCHEME=https
      - ATL_JDBC_URL=jdbc:postgresql://confluence_db:5432/confluencedb
      - ATL_JDBC_USER=atlassian
      - ATL_JDBC_PASSWORD=atlassian
      - ATL_DB_TYPE=postgresql
    volumes:
      - confluence_data:/var/atlassian/application-data/confluence
      - /etc/localtime:/etc/localtime:ro
    networks:
      - traefik-proxy
      - default
    depends_on:
      - confluence_db
    # Control the number of instances deployed
    deploy:
      mode: replicated
      replicas: 1
      labels:
        - restic-compose-backup.volumes=true
        - traefik.enable=true
        - traefik.http.routers.confluence.entrypoints=web
        - traefik.http.routers.confluence.rule=Host(`confluence.${SVC_DOMAIN?}`)
        - traefik.http.services.confluence.loadbalancer.server.port=8090
        - traefik.docker.network=traefik-proxy

networks:
  traefik-proxy:
    external: true

volumes:
  postgres_jira_data:
  postgres_confluence_data:
  confluence_data:
  jira_data:

I have a Restic REST service running in another VM. When I check the status, it doesn't appear that the backup service can read the labels despite the labels being present.

/restic-compose-backup # rcb status
2020-03-03 21:49:11,185 - INFO: Status for compose project 'None'
2020-03-03 21:49:11,185 - INFO: Repository: 'rest:http://10.4.16.6:8000'
2020-03-03 21:49:11,186 - INFO: Backup currently running?: False
2020-03-03 21:49:11,186 - INFO: Checking docker availability
2020-03-03 21:49:11,926 - INFO: ------------------------- Detected Config -------------------------
2020-03-03 21:49:11,927 - INFO: No containers in the project has 'restic-compose-backup.*' label
2020-03-03 21:49:11,927 - INFO: -------------------------------------------------------------------

Is reading labels from services running in a swarm stack something that is currently supported by this project?

Thanks.

einarf commented 4 years ago

I haven't played much with swarm, but as far as I understand, each stack can be a separate compose file / project. Currently we filter containers in the same project:

https://github.com/ZettaIO/restic-compose-backup/blob/6817f0999f7b6294811fcedb170abc81a5b2922f/src/restic_compose_backup/containers.py#L357-L361
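
In rough terms, the linked filter keeps only containers whose compose project label matches the backup container's own project. A paraphrase (not the verbatim code at that permalink), assuming a docker SDK client:

# Paraphrase of the project filter: keep only containers belonging
# to the same compose project as the backup container itself.
def containers_in_project(client, project_name):
    matched = []
    for container in client.containers.list():
        labels = container.labels or {}
        if labels.get('com.docker.compose.project') == project_name:
            matched.append(container)
    return matched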

I'm guessing that should be trivial to fix. Could make it configurable in some way.

Also, I am planning to revamp this project a bit, making the code a lot cleaner and more extensible. It works pretty well, but it is a result of experimenting with the design over time 😄

Should probably try to add a test case for swarm setups.

scholarsmate commented 4 years ago

Traefik (another awesome project) uses a command-line switch, --providers.docker.swarmmode=true, to inform the service that it's running in swarm mode, so it knows how to read and interpret service configurations from the Docker socket.
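
In a stack file, that flag lives on the traefik service, roughly like this (a minimal sketch of a Traefik v2 service, not the poster's actual config):

  traefik:
    image: traefik:v2.1
    command:
      # Read labels from swarm services instead of plain containers.
      - --providers.docker=true
      - --providers.docker.swarmMode=true
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro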

einarf commented 4 years ago

Looks like it's not that hard to distinguish traditional compose setups from stacks.

Labels set for containers in compose projects:

"com.docker.compose.project": "restic-compose-backup",
"com.docker.compose.service": "mariadb",

Labels set for containers in stacks:

"com.docker.stack.namespace": "test",
"com.docker.swarm.service.name": "test_mariadb",

As long as we have some prefix and a service name we are good.
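
A minimal sketch of that idea (assuming labels arrive as a plain dict, as the docker SDK exposes them):

# Derive (project, service) from whichever label set is present:
# swarm stack labels first, plain compose labels otherwise.
def project_and_service(labels):
    if 'com.docker.stack.namespace' in labels:
        return (labels['com.docker.stack.namespace'],
                labels['com.docker.swarm.service.name'])
    return (labels.get('com.docker.compose.project'),
            labels.get('com.docker.compose.service'))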

einarf commented 4 years ago

Released version 0.5 with experimental swarm/stack support. If SWARM_MODE is defined we evaluate containers in stacks. I'm not entirely sure how this works when scale is > 1, but at least this is a start.

Docs: https://restic-compose-backup.readthedocs.io/en/latest/guide/configuration.html#swarm-mode

Leaving this issue open for now. I'm sure there will be more to discuss.
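
With that release, the env file from the first post would presumably grow one line; something like this (values are placeholders):

# /etc/restic/restic.env
RESTIC_REPOSITORY=rest:http://10.4.16.6:8000
RESTIC_PASSWORD=<your-repository-password>
SWARM_MODE=true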

scholarsmate commented 4 years ago

Great, I hope to give this a try today!

scholarsmate commented 4 years ago

Testing it on 3 nodes, it looks like there is still some work to do. It appears that it does not detect services running on other nodes, and for services running on the same node I get mixed results. The docker stack includes jira, confluence, and their databases, as well as restic-compose-backup, as shown in my original post (the only modification I made was adding SWARM_MODE=true to the /etc/restic/restic.env file). Here's what I have:

[vagrant@docker-server ~]$ docker ps
CONTAINER ID        IMAGE                                  COMMAND                  CREATED             STATUS                 PORTS                                        NAMES
71089e59f288        gitlab/gitlab-ee:latest                "/assets/wrapper"        2 hours ago         Up 2 hours (healthy)   22/tcp, 80/tcp, 443/tcp                      devops_gitlab.2.jyqsl9o247y7n8qm35njtc8ko
9887505825d0        atlassian/jira-software:latest         "/tini -- /entrypoin…"   2 hours ago         Up 2 hours             8080/tcp                                     devops_jira.1.a6lqxyksm7tk8cbd3pxkei1s7
5b0cd831966e        mongo:4.0                              "docker-entrypoint.s…"   2 hours ago         Up 2 hours             27017/tcp                                    devops_mongo.1.v0see33mn6v71h24rtfu3edg2
aacdd99c4623        rocketchat/rocket.chat:latest          "bash -c 'for i in `…"   2 hours ago         Up 2 hours             3000/tcp                                     devops_rocketchat.1.avx6d8papf9q7b1j4lezc2q6i
edbb95bdf3f6        grafana/grafana:latest                 "/run.sh"                3 hours ago         Up 3 hours             3000/tcp                                     devops_grafana.1.yazm5uk5gbi3blu8e9tnh4wue
5ee27ee7c637        sonatype/nexus3:latest                 "sh -c ${SONATYPE_DI…"   3 hours ago         Up 3 hours             8081/tcp                                     devops_nexus.1.ujnaqbn3zxytma4fvraq6erkw
3c8ccceb2723        postgres:latest                        "docker-entrypoint.s…"   3 hours ago         Up 3 hours             5432/tcp                                     devops_confluence_db.1.5yfhgmocnocs8k7f8cof0u2nk
f58fd3da4f76        zettaio/restic-compose-backup:latest   "./entrypoint.sh"        3 hours ago         Up 3 hours                                                          devops_backup.1.jprju4z108j7d29510c8i48ta
7ff12d3826ef        traefik:v2.1                           "/entrypoint.sh --lo…"   3 hours ago         Up 3 hours             0.0.0.0:80->80/tcp, 0.0.0.0:2222->2222/tcp   devops_traefik.yejmt4cjzol1x3020zkuwt6co.ofdsw5v7q94li6slzjivo51bj
0995f730221e        portainer/agent:latest                 "./agent"                3 hours ago         Up 3 hours             0.0.0.0:9001->9001/tcp                       devops_portainer-agent-internal.yejmt4cjzol1x3020zkuwt6co.o00ragu1nyrpnrj62u428iynv
[vagrant@docker-server ~]$

Now to see what the backup service detected:

[vagrant@docker-server ~]$ docker exec -it f58fd3da4f76 sh
/restic-compose-backup # rcp status
sh: rcp: not found
/restic-compose-backup # rcb status
2020-03-07 18:21:49,708 - INFO: Status for compose project 'None'
2020-03-07 18:21:49,708 - INFO: Repository: 'rest:http://10.4.16.6:8000'
2020-03-07 18:21:49,708 - INFO: Backup currently running?: False
2020-03-07 18:21:49,709 - INFO: Checking docker availability
2020-03-07 18:21:49,858 - ERROR: ---------- stderr ----------
2020-03-07 18:21:49,858 - ERROR: Fatal: unable to open config file: <config/> does not exist
2020-03-07 18:21:49,859 - ERROR: Is there a repository at the following location?
2020-03-07 18:21:49,859 - ERROR: rest:http://10.4.16.6:8000
2020-03-07 18:21:49,859 - ERROR: ----------------------------
2020-03-07 18:21:49,859 - INFO: Could not get repository info. Attempting to initialize it.
2020-03-07 18:21:52,269 - INFO: Successfully initialized repository: rest:http://10.4.16.6:8000
2020-03-07 18:21:52,270 - INFO: ------------------------- Detected Config -------------------------
2020-03-07 18:21:52,271 - INFO: service: devops_confluence_db
2020-03-07 18:21:52,313 - INFO:  - postgres (is_ready=True)
2020-03-07 18:21:52,314 - INFO: -------------------------------------------------------------------
/restic-compose-backup # rcb status
2020-03-07 18:22:17,130 - INFO: Status for compose project 'None'
2020-03-07 18:22:17,131 - INFO: Repository: 'rest:http://10.4.16.6:8000'
2020-03-07 18:22:17,131 - INFO: Backup currently running?: False
2020-03-07 18:22:17,131 - INFO: Checking docker availability
2020-03-07 18:22:18,019 - INFO: ------------------------- Detected Config -------------------------
2020-03-07 18:22:18,020 - INFO: service: devops_confluence_db
2020-03-07 18:22:18,038 - INFO:  - postgres (is_ready=True)
2020-03-07 18:22:18,039 - INFO: -------------------------------------------------------------------
/restic-compose-backup #

It's really cool that the service initializes the restic repository for me on first run and that it discovered the Postgres instance running on the same node. What's missing: jira is also running on that same node and is labeled with restic-compose-backup.volumes=true, yet its volumes are not detected. If we can get the backup service to at least handle the services running on the same node as itself, that's progress enough that I think I can use it in my project. If we can get it working for all nodes across the entire swarm, that would be complete and total victory!

einarf commented 4 years ago

hmm. I guess at least for databases the service endpoint could potentially be used to detect containers across the different nodes. We would only need to gather credentials and host/port.

For volumes/binds it gets a lot more complicated, depending on whether the setup has some form of volume sharing across the nodes. The backup container would have to run on all nodes to cover all volumes. I've tried streaming data using get_archive in the past, but that is way too unreliable.

In your setup it should ideally find more than it does: confluence_db and jira are located on the node you listed, so it's weird that it doesn't detect the jira volumes, and I would assume we could find jira_db as well. The volumes in confluence are where it gets complicated: right now the backup service can only handle volumes on the node it runs on.

I'm wondering if the simplest solution is to have the backup service run a container on each node in the swarm cluster when doing operations. That would still keep management simple, and it would be easier to coordinate and configure how to approach volume backups.

scholarsmate commented 4 years ago

Precisely. If it can work for a single node in the cluster (DB backups and volumes), then we just run it on every node in the swarm, and that ought to work.

einarf commented 4 years ago

After looking into labels:

In your jira service you have to set the labels on the container instead of the service (labels under deploy are only exposed on the service, not on the container itself). We should probably support service labels as well, but that is not on the "quick fix" list.

Loading service data should be part of the rewrite/cleanup. Right now we are just passing around a list of containers instead of a higher-level environment object. It's still kind of prototype code 😄

Can you verify that moving the label actually makes rcb detect the container and volumes? If that works, I think 2 of the 3 issues blocking basic swarm support are resolved. The last one is to detect the swarm node config and make rcb operations run per node in a loop, with docker run and node constraints (see the sketch below).
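
That last step could look roughly like this. A sketch only: the image, command, and mount are assumptions based on this thread, and a one-shot swarm service with a placement constraint stands in for plain docker run, since docker run itself is not node-aware:

# Sketch: run a one-shot backup task pinned to each node in turn,
# so each node's local volumes are covered sequentially.
import docker
from docker.types import RestartPolicy

client = docker.from_env()

for node in client.nodes.list():
    # Skip nodes that are down or draining.
    if node.attrs['Status']['State'] != 'ready':
        continue
    # restart_policy 'none' makes the service behave like a run-once task.
    # The real per-node command may end up being a dedicated subcommand.
    client.services.create(
        image='zettaio/restic-compose-backup:latest',
        command=['rcb', 'backup'],
        constraints=['node.id==' + node.id],
        restart_policy=RestartPolicy(condition='none'),
        mounts=['/var/run/docker.sock:/tmp/docker.sock:ro'],
    )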

scholarsmate commented 4 years ago

I made the following changes:

$ git diff
diff --git a/atlassian/docker-compose.yml b/atlassian/docker-compose.yml
index 7a10f86..4a1e61a 100644
--- a/atlassian/docker-compose.yml
+++ b/atlassian/docker-compose.yml
@@ -51,12 +51,14 @@ services:
       - default
     depends_on:
       - jira_db
+    labels:
+      - restic-compose-backup.volumes=true
     # Control the number of instances deployed
     deploy:
       mode: replicated
       replicas: 1
       labels:
-        - restic-compose-backup.volumes=true
+       #- restic-compose-backup.volumes=true
         - traefik.enable=true
         - traefik.http.routers.jira.entrypoints=web
         - traefik.http.routers.jira.rule=Host(`jira.${SVC_DOMAIN?}`)
@@ -81,12 +83,14 @@ services:
       - default
     depends_on:
       - confluence_db
+    labels:
+      - restic-compose-backup.volumes=true
     # Control the number of instances deployed
     deploy:
       mode: replicated
       replicas: 1
       labels:
-        - restic-compose-backup.volumes=true
+       #- restic-compose-backup.volumes=true
         - traefik.enable=true
         - traefik.http.routers.confluence.entrypoints=web
         - traefik.http.routers.confluence.rule=Host(`confluence.${SVC_DOMAIN?}`)

Then I re-rolled my install; this time confluence and the confluence database ended up on the same VM as the backup service:

$ docker ps
CONTAINER ID        IMAGE                                  COMMAND                  CREATED             STATUS                    PORTS                                        NAMES
84e68b25e462        gitlab/gitlab-ee:latest                "/assets/wrapper"        17 minutes ago      Up 17 minutes (healthy)   22/tcp, 80/tcp, 443/tcp                      devops_gitlab.1.zs5jg5zu3ldwno6vmo5csv9ua
9b4ec876fc1a        atlassian/confluence-server:latest     "/sbin/tini -- /entr…"   29 minutes ago      Up 29 minutes             8090-8091/tcp                                devops_confluence.1.scx0xyjvogziwfwr4x91shavj
9bf02dc68d53        rocketchat/rocket.chat:latest          "bash -c 'for i in `…"   32 minutes ago      Up 32 minutes             3000/tcp                                     devops_rocketchat.2.jygljec3mnke5zxwmddl61902
41464af9452d        mongo:4.0                              "docker-entrypoint.s…"   35 minutes ago      Up 35 minutes             27017/tcp                                    devops_mongo.1.u1uad4yap162hpkoxrbvc0f4m
b66d093d72ce        postgres:latest                        "docker-entrypoint.s…"   41 minutes ago      Up 41 minutes             5432/tcp                                     devops_confluence_db.1.9sndy6nj9hkto2j7lpqv0fkyd
813796e0f3d1        grafana/grafana:latest                 "/run.sh"                46 minutes ago      Up 46 minutes             3000/tcp                                     devops_grafana.1.1nso09jvako3ihlx47mmlv5pl
97983038e5f1        zettaio/restic-compose-backup:latest   "./entrypoint.sh"        About an hour ago   Up About an hour                                                       devops_backup.1.jd9kf01r3wb26wdf60lrr6qwj
2118f090255d        traefik:v2.1                           "/entrypoint.sh --lo…"   About an hour ago   Up About an hour          0.0.0.0:80->80/tcp, 0.0.0.0:2222->2222/tcp   devops_traefik.ph0orqj7nii9pspy440q5n6ws.t3w9b8mkfkvfpjtok5089mlye
014c45bf586f        portainer/agent:latest                 "./agent"                About an hour ago   Up About an hour          0.0.0.0:9001->9001/tcp                       devops_portainer-agent-internal.ph0orqj7nii9pspy440q5n6ws.ohv6r2105xxevehjc28d9gw26

Getting the status, we see:

$ docker exec -it 97983038e5f1 sh
/restic-compose-backup # rcb status
2020-03-08 21:16:09,284 - INFO: Status for compose project 'None'
2020-03-08 21:16:09,284 - INFO: Repository: 'rest:http://10.4.16.6:8000'
2020-03-08 21:16:09,285 - INFO: Backup currently running?: False
2020-03-08 21:16:09,285 - INFO: Checking docker availability
2020-03-08 21:16:10,112 - INFO: ------------------------- Detected Config -------------------------
2020-03-08 21:16:10,113 - INFO: service: devops_confluence
2020-03-08 21:16:10,114 - INFO:  - volume: /var/lib/docker/volumes/devops_confluence_data/_data
2020-03-08 21:16:10,114 - INFO:  - volume: /etc/localtime
2020-03-08 21:16:10,114 - INFO: service: devops_confluence_db
2020-03-08 21:16:10,132 - INFO:  - postgres (is_ready=True)
2020-03-08 21:16:10,133 - INFO: -------------------------------------------------------------------
/restic-compose-backup #

Putting the labels at the container level made the difference.

einarf commented 4 years ago

Awesome. That means we can move to step 3.

TODO:

scholarsmate commented 4 years ago

In a 3-node cluster, setting the backup service's replica count to 3 does not guarantee that the tasks get spread evenly across the nodes. In my last test I found 1 backup service running on one node, 2 on another, and 0 on the last. I think I can solve that problem, though (see the sketch below).
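
For what it's worth, swarm's global service mode schedules exactly one task per node, which would sidestep the uneven spread. A sketch of the backup service from the first post, untested here:

  backup:
    image: zettaio/restic-compose-backup
    env_file:
      - /etc/restic/restic.env
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock:ro
    deploy:
      # 'global' runs exactly one task on every node, unlike
      # 'replicated', which lets the scheduler stack replicas.
      mode: global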

On the node that had 1 backup service running, here's the docker ps:

$ docker ps
CONTAINER ID        IMAGE                                  COMMAND                  CREATED             STATUS              PORTS                                        NAMES
b8bf2b793f34        rocketchat/rocket.chat:latest          "bash -c 'for i in `…"   13 hours ago        Up 13 hours         3000/tcp                                     devops_rocketchat.1.zw1i7wx1c6kzlgqhn6vka9yho
842c6977ee41        zettaio/restic-compose-backup:latest   "./entrypoint.sh"        13 hours ago        Up 13 hours                                                      devops_backup.1.acglv44dttbyvmasrojixudff
9b6baca4716c        traefik:v2.1                           "/entrypoint.sh --lo…"   13 hours ago        Up 13 hours         0.0.0.0:80->80/tcp, 0.0.0.0:2222->2222/tcp   devops_traefik.dazqixxfwh1r5tcjcbvo2iuii.aqncrb55xwidne5jlkjewqyym
6a93f80e1a3c        portainer/agent:latest                 "./agent"                13 hours ago        Up 13 hours         0.0.0.0:9001->9001/tcp                       devops_portainer-agent-internal.dazqixxfwh1r5tcjcbvo2iuii.5ahpj9apv1tqg3qzsi1vdnl69

Rocket Chat was labelled as follows:

    volumes:
      - rocketchat_uploads:/app/uploads
    labels:
      - restic-compose-backup.volumes=true

Here is rcb status:

$ docker exec -it 842c6977ee41 sh
/restic-compose-backup # rcb status
2020-03-10 13:04:07,445 - INFO: Status for compose project 'None'
2020-03-10 13:04:07,445 - INFO: Repository: 'rest:http://10.4.16.6:8000'
2020-03-10 13:04:07,445 - INFO: Backup currently running?: False
2020-03-10 13:04:07,445 - INFO: Checking docker availability
2020-03-10 13:04:08,003 - ERROR: ---------- stderr ----------
2020-03-10 13:04:08,004 - ERROR: Fatal: Fatal: config cannot be loaded: ciphertext verification failed
2020-03-10 13:04:08,004 - ERROR: ----------------------------
2020-03-10 13:04:08,004 - INFO: Could not get repository info. Attempting to initialize it.
2020-03-10 13:04:08,036 - ERROR: ---------- stderr ----------
2020-03-10 13:04:08,037 - ERROR: Fatal: create repository at rest:http://10.4.16.6:8000 failed: Fatal: config file already exists
2020-03-10 13:04:08,037 - ERROR:
2020-03-10 13:04:08,037 - ERROR: ----------------------------
2020-03-10 13:04:08,038 - ERROR: Failed to initialize repository
2020-03-10 13:04:08,038 - INFO: ------------------------- Detected Config -------------------------
2020-03-10 13:04:08,038 - INFO: service: devops_rocketchat
2020-03-10 13:04:08,038 - INFO:  - volume: /var/lib/docker/volumes/devops_rocketchat_uploads/_data
2020-03-10 13:04:08,038 - INFO: -------------------------------------------------------------------

And here is a backup attempt:

/restic-compose-backup # rcb backup
2020-03-10 13:05:36,543 - INFO: Starting backup container
2020-03-10 13:05:36,806 - INFO: Backup process container: strange_hoover
2020-03-10 13:05:37,614 - INFO: 2020-03-10 13:05:37,608 - INFO: Status for compose project 'None'
2020-03-10 13:05:37,616 - INFO: 2020-03-10 13:05:37,608 - INFO: Repository: 'rest:http://10.4.16.6:8000'
2020-03-10 13:05:37,617 - INFO: 2020-03-10 13:05:37,609 - INFO: Backup currently running?: False
2020-03-10 13:05:37,618 - INFO: 2020-03-10 13:05:37,609 - INFO: Checking docker availability
2020-03-10 13:05:38,199 - INFO: 2020-03-10 13:05:38,193 - ERROR: ---------- stderr ----------
2020-03-10 13:05:38,200 - INFO: 2020-03-10 13:05:38,194 - ERROR: Fatal: Fatal: config cannot be loaded: ciphertext verification failed
2020-03-10 13:05:38,201 - INFO: 2020-03-10 13:05:38,194 - ERROR: ----------------------------
2020-03-10 13:05:38,203 - INFO: 2020-03-10 13:05:38,194 - INFO: Could not get repository info. Attempting to initialize it.
2020-03-10 13:05:38,229 - INFO: 2020-03-10 13:05:38,223 - ERROR: ---------- stderr ----------
2020-03-10 13:05:38,231 - INFO: 2020-03-10 13:05:38,223 - ERROR: Fatal: create repository at rest:http://10.4.16.6:8000 failed: Fatal: config file already exists
2020-03-10 13:05:38,231 - INFO: 2020-03-10 13:05:38,223 - ERROR:
2020-03-10 13:05:38,232 - INFO: 2020-03-10 13:05:38,223 - ERROR: ----------------------------
2020-03-10 13:05:38,233 - INFO: 2020-03-10 13:05:38,224 - ERROR: Failed to initialize repository
2020-03-10 13:05:38,235 - INFO: 2020-03-10 13:05:38,224 - INFO: ------------------------- Detected Config -------------------------
2020-03-10 13:05:38,236 - INFO: 2020-03-10 13:05:38,224 - INFO: service: devops_rocketchat
2020-03-10 13:05:38,238 - INFO: 2020-03-10 13:05:38,225 - INFO:  - volume: /var/lib/docker/volumes/devops_rocketchat_uploads/_data
2020-03-10 13:05:38,239 - INFO: 2020-03-10 13:05:38,225 - INFO: -------------------------------------------------------------------
2020-03-10 13:05:38,240 - INFO: 2020-03-10 13:05:38,225 - INFO: Backing up volumes
2020-03-10 13:05:38,785 - INFO: 2020-03-10 13:05:38,777 - ERROR: ---------- stdout ----------
2020-03-10 13:05:38,786 - INFO: 2020-03-10 13:05:38,777 - ERROR: open repository
2020-03-10 13:05:38,787 - INFO: 2020-03-10 13:05:38,778 - ERROR: ----------------------------
2020-03-10 13:05:38,788 - INFO: 2020-03-10 13:05:38,778 - ERROR: ---------- stderr ----------
2020-03-10 13:05:38,790 - INFO: 2020-03-10 13:05:38,778 - ERROR: Fatal: Fatal: config cannot be loaded: ciphertext verification failed
2020-03-10 13:05:38,791 - INFO: 2020-03-10 13:05:38,778 - ERROR: ----------------------------
2020-03-10 13:05:38,792 - INFO: 2020-03-10 13:05:38,778 - ERROR: Volume backup exited with non-zero code: 1
2020-03-10 13:05:38,793 - INFO: 2020-03-10 13:05:38,779 - INFO: Backing up databases
2020-03-10 13:05:38,794 - INFO: 2020-03-10 13:05:38,779 - ERROR: Exit code: True
2020-03-10 13:05:39,025 - INFO: Backup container exit code: 1
2020-03-10 13:05:39,025 - INFO: No alerts configured

I'll work on ensuring the backup service runs one per node, since that has to do with how I've got my swarm deployment configured. Beyond that, there appears to be another problem with repository initialization. Perhaps it's a race, since the 3 backup services all start at the same time. Maybe we need to set up a helper that does the repository initialization once at startup and then exits (see the sketch below). Restic should be able to handle multiple simultaneous backups, but perhaps not to the same repository.
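
A minimal sketch of that helper idea, assuming RESTIC_REPOSITORY and RESTIC_PASSWORD are exported as in the env file above:

#!/bin/sh
# Initialize the repository only if its config object is missing;
# 'restic cat config' fails when the repo is uninitialized.
# Note: this narrows the race but does not fully eliminate it if
# several helpers start at the exact same moment.
if ! restic cat config >/dev/null 2>&1; then
    restic init
fi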

I'm beginning to think that all required backup configuration (backup endpoint, credentials, and schedule) could be given in the labels (overriding the environment file), so that each backed-up service can be backed up to different repositories at different times, if desired. Basically the Traefik model, for a more dynamic backup configuration.
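
Something like the following, where every restic-compose-backup.* key beyond volumes is purely hypothetical (none of these exist in the project today):

    labels:
      - restic-compose-backup.volumes=true
      # Hypothetical per-service overrides, Traefik-style:
      - restic-compose-backup.repository=rest:http://10.4.16.6:8000
      - restic-compose-backup.cron=0 2 * * *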

einarf commented 4 years ago

Thanks for testing this. That definitely looks like a race condition on repo initialization, so blindly running one backup service per node is a no-go.

The backup service needs to do the following (for status and backup):

This is exactly how we do it for a single node as well, except that this way we are able to handle each node sequentially. The new status_node and backup_node commands would really only need to do a subset of the operations of the backup and status commands.

The fancier configuration you mention is desirable, but probably more of a 1.0 thing. I'd have to look into how restic handles multiple hosts; I know it can handle this when set up correctly.

Still, your experiment was useful.