CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

pgbackrest cronjob concurrency - is another pgBackRest process running? #3145

Open djtaylor opened 2 years ago

djtaylor commented 2 years ago

Overview

I have a full and incremental backup schedule defined for a postgres cluster. I consistently see more than one job get scheduled, causing the namespace to fill up with pods in an Error state; the duplicate jobs (the ones that didn't run first) throw the error:

[ERROR: [050]: unable to acquire lock on file '/tmp/pgbackrest/db-backup.lock': Resource temporarily unavailable\n HINT: is another pgBackRest process running?\n]

As far as I can tell, this is not preventing backups from running, as at least one of the jobs succeeds, leaving the others in an error state.

Environment

Please provide the following details:

Steps to Reproduce

REPRO

  1. Build a postgres cluster with the operator, using S3 as the backend for pgBackRest
  2. Define a full and an incremental backup schedule (see the example spec below)
  3. See the full backup jobs conflict with each other

EXPECTED

  1. I would expect only one job per schedule invocation to be created at a time, until it finishes

ACTUAL

  1. More than one job gets created, causing the errors seen above
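
For reference, the schedule in step 2 is defined per repo in the PostgresCluster spec; a minimal sketch (the cron expressions are just examples and the repo details are elided):

spec:
  backups:
    pgbackrest:
      repos:
      - name: repo1
        schedules:
          # nightly full backup at 01:00, incremental every 30 minutes
          full: "0 1 * * *"
          incremental: "*/30 * * * *"
        s3:
          bucket: "..."
          endpoint: "..."
          region: "..."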

Logs

time="2022-04-05T01:00:19Z" level=info msg="crunchy-pgbackrest starts"
time="2022-04-05T01:00:19Z" level=info msg="debug flag set to false"
time="2022-04-05T01:00:19Z" level=info msg="backrest backup command requested"
time="2022-04-05T01:00:19Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=full]"
time="2022-04-05T01:00:19Z" level=info msg="output=[]"
time="2022-04-05T01:00:19Z" level=info msg="stderr=[ERROR: [050]: unable to acquire lock on file '/tmp/pgbackrest/db-backup.lock': Resource temporarily unavailable\n       HINT: is another pgBackRest process running?\n]"
time="2022-04-05T01:00:19Z" level=fatal msg="command terminated with exit code 50"

Additional Information

When inspecting the CronJob definition, there is a field ConcurrencyPolicy set to Allow. Would it not make more sense to set this value to Forbid (or at least have the option of doing so via the CRD), to prevent this type of scheduling conflict from happening?
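
For context, concurrencyPolicy is a standard field on the Kubernetes CronJob itself; a minimal sketch of what Forbid would look like on the generated CronJob (the name and schedule are illustrative):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cluster-pgbackrest-repo1-full   # illustrative name
spec:
  schedule: "0 1 * * *"
  # Forbid skips a new run while a Job from this CronJob is still active
  concurrencyPolicy: Forbid
  jobTemplate:
    ...                                    # backup pod spec unchanged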

marianskrzypekk commented 2 years ago

Facing the same problem: a full backup takes ~1h, so 2 incrementals with */30 will end up as failed jobs. It's really disrupting pod monitoring of the backup jobs.

flokli commented 2 years ago

In that case, isn't the operator creating two separate CronJob resources (postgres-pgbackrest-repo1-{full,incr})?

How would you define concurrency between them? I think ConcurrencyPolicy only applies to Jobs based on the same CronJob.

I think the easier way would be to have the Jobs retry in case there's a lock. People usually have alerts on jobs taking too long in the cluster too, so we should get alerts if the backups don't happen due to some erroneously leftover lock.

As far as ConcurrencyPolicy goes, once we wait on locks, we should probably always set it to Forbid. There's no point in queuing a ton of Job resources if a job takes longer to finish than the cron interval allows (and there are probably alerts about this as well).
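
For what it's worth, retrying at the Job level could lean on the built-in retry knob rather than anything pgBackRest-specific; a minimal sketch, not necessarily what the operator sets today (the limit value is just an example):

apiVersion: batch/v1
kind: Job
spec:
  # re-run the backup pod a few times if it exits non-zero, e.g. on a lock conflict
  backoffLimit: 3
  template:
    ...                      # pgbackrest backup container unchanged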

djtaylor commented 2 years ago

There are two different CronJob resources, indeed. Looking at some recent output in my namespace, I can only see the problem happening with the full backup currently, but the issue is still the same:

NAME                              SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
harbor-db-pgbackrest-repo1-full   0 1 * * *      False     0        12h             81d
harbor-db-pgbackrest-repo1-incr   */30 * * * *   False     0        8m13s           81d

The CronJob with the last scheduled time of 12h ago created a number of job pods that are overlapping, and only one of them succeeds:

harbor-db-pgbackrest-repo1-full-1650934800-mb27g   0/1     Completed   0          12h
harbor-db-pgbackrest-repo1-full-1650934800-qdmwc   0/1     Error       0          12h
harbor-db-pgbackrest-repo1-full-1650934800-sw5cr   0/1     Error       0          12h
harbor-db-pgbackrest-repo1-full-1650934800-ztctx   0/1     Error       0          12h

I don't believe the problem is how long the job takes to run, as the database is currently quite small, so the backup job completes very quickly. My question would be: for the scheduled run of the CronJob posted above (the full job), why is more than one job created?

shaojun-tes commented 2 years ago

Facing the same problem, is there any update?

oliveratprimer commented 2 years ago

Same issue here; I do not believe they are overlapping either.

gloaec commented 2 years ago

I noticed the job only fails when it attempts to back up from the primary postgres pod:

postgres-cluster-postgres-ha-w4g5-0 (Primary instance with write permissions)

2022-08-10 14:24:12,590 INFO: no action. I am (postgres-cluster-postgres-ha-w4g5-0), the leader with the lock

postgres-cluster-repo1-full-27668460--1-dxpm7 (Backup job)

time="2022-08-10T14:24:12Z" level=info msg="stderr=[WARN: unable to check pg3: [LockAcquireError] raised from remote-0 tls protocol on 'postgres-cluster-postgres-ha-w4g5-0.postgres-cluster-pods.postgres.svc.cluster.local.': unable to acquire lock on file '/tmp/pgbackrest/db-backup.lock': Resource temporarily unavailable\n      HINT: is another pgBackRest process running?\nERROR: [056]: unable to find primary cluster - cannot proceed\n]"

I figured it may have nothing to do with concurrency, just a problem in the generated pgBackRest config:

postgres-cluster-pgbackrest-config->pgbackrest_repo.conf

# Generated by postgres-operator. DO NOT EDIT.
# Your changes will not be saved.
[db]
pg1-host = postgres-cluster-postgres-ha-qpw7-0.postgres-cluster-pods.postgres.svc.cluster.local.
...
pg2-host = postgres-cluster-postgres-ha-v2vm-0.postgres-cluster-pods.postgres.svc.cluster.local.
...
pg3-host = postgres-cluster-postgres-ha-w4g5-0.postgres-cluster-pods.postgres.svc.cluster.local.
...

All 3 instances are used for backup: primary + replicas.

Would it be fixable by using the postgres-cluster-replicas service instead of postgres-cluster-pods?

PS: I'm not using any pooling solution yet (PgBouncer/PgPool).

szelenka commented 2 years ago

I have a PR (#3440) which should address this.

benjaminjb commented 1 year ago

@szelenka's PR above was merged in Nov (and thanks for that!) and was released in our v5.3.0, so if you update to 5.3.0 you should no longer run into this problem, but if you still are, please let us know.

waldner commented 1 year ago

I'm still seeing this with 5.3.0. First of all, a backup starts just after the cluster is created, even though I did not specify a schedule in the definition; second, since this backup takes a long time, other backup jobs are started after a while (again: why?), which then fail because they're unable to get the lock.

szelenka commented 1 year ago

@waldner AFAIK, concurrencyPolicy should exist in the K8s version you're using (1.20.15), though it was still in beta back then.

Do you see "Forbid" set in the CronJob the operator creates?

waldner commented 1 year ago

Hi @szelenka, thanks, I suppose the problem is that the backup jobs I see are NOT coming from CronJobs (which is where you added the concurrencyPolicy); they are just plain Jobs, so there is no protection when many of them overlap (except the failure to get the lock).
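
For anyone checking the same thing, the quickest tell is the Job's ownerReferences: a Job spawned by a CronJob carries something like the sketch below, while these backup Jobs point at an operator-managed owner instead (presumably the PostgresCluster); names are illustrative:

metadata:
  ownerReferences:
  - apiVersion: batch/v1
    kind: CronJob                   # only present when a CronJob created the Job
    name: test-cluster-repo1-full   # illustrative name
    controller: true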

What's not clear to me is where all these jobs come from; they're all unsolicited, and furthermore I don't understand the point of running a backup on a newly created cluster, since:

Thanks

EDIT: I can share the cluster config, it's very simple:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: test-cluster
  namespace: testns
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-14.6-2
  postgresVersion: 14
  instances:
    - dataVolumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.41-2
      configuration:
      - secret:
          name: pgo-s3-creds
      global:
        repo1-path: /test-cluster
        repo1-bundle: "y"
        repo1-s3-uri-style: path
        repo1-retention-full: "5"
        repo1-retention-full-type: count
        start-fast: "y"
      repos:
      - name: repo1
        s3:
          bucket: "..."
          endpoint: "...."
          region: "..."

waldner commented 1 year ago

I see from the docs (here: https://access.crunchydata.com/documentation/postgres-operator/4.7.0/architecture/provisioning/) that the initial backup is done to facilitate the creation of replicas. While I still think it could be done only when a replica is actually requested by the user, I understand the rationale behind the decision. So I guess my problem is that I had flaky object storage which prevented this initial backup from completing (with lots of timeouts and retries), so reconciliation kept kicking in to start it again and again. I haven't checked the code, but perhaps a check could be added to avoid starting a new job if one is already running.
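
For reference, the "replica actually requested" case I mean is just a replicas count on the instance set; a sketch (values are an example):

spec:
  instances:
  - name: instance1
    replicas: 2        # a second pod, which could then be seeded from the pgBackRest repo
    dataVolumeClaimSpec:
      accessModes:
      - "ReadWriteOnce"
      resources:
        requests:
          storage: 1Gi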

ValClarkson commented 1 year ago

@waldner I just want to clarify: were you ever able to get an initial backup to occur? One suggestion would be to remove the cron and just let the initial backup happen. We are unable to reproduce this issue, but you can turn up your debugging, which will help narrow down the underlying cause of your issue, which, as you said, may be with storage.

waldner commented 1 year ago

I didn't have any cron job running yet. It was the initial backup itself that was taking a very long time due to network/storage errors, and what I imagine were further attempts by the reconciler were overlapping and failing because the very first backup had not finished yet. Admittedly not a very common situation. If I could make a suggestion, I'd either:

Just my 2 cents.

David-Angel commented 1 year ago

We ran into this same issue on startup with one backup pod failing. Our validation steps on installation had to be rewritten to ignore the failing pod. A graceful exit or not starting two pods would be much appreciated.

szelenka commented 1 year ago

Just ran into another caveat related to this. The concurrencyPolicy change I made applies to Jobs which are triggered by a CronJob, and that works well. BUT there can be multiple CronJobs (full/diff/incr), each of which only looks at its own concurrencyPolicy.

So if a full CronJob is triggered by its schedule and takes a while to complete, and an incr CronJob is then triggered by its schedule before the full completes, you'll run into the same failed Jobs.

AFAIK, there's no k8s-native method to check concurrency across multiple CronJobs, so any adjustment to this would need to be done within the initContainer or entrypoint of the Job itself.

However, for the above, it's kind of nice to know there's a job taking longer than normal, and it should fail the incr backup. It'd then be on the admin to adjust the CronJob schedules so they don't overlap with one another.
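
As a concrete illustration of that last point, the schedules could be staggered so the incrementals skip the hour the full runs in; a sketch reusing the cron expressions from earlier in the thread (values are just an example):

spec:
  backups:
    pgbackrest:
      repos:
      - name: repo1
        schedules:
          full: "0 1 * * *"                 # nightly full at 01:00
          incremental: "*/30 0,2-23 * * *"  # every 30 min, skipping the 01:00 hour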