djtaylor opened 2 years ago
Facing the same problem: a full backup takes ~1h, so two incrementals scheduled with */30 end up as failed jobs, which really disrupts our pod-based monitoring of backup jobs.
In that case, isn't the operator creating two separate CronJob resources (postgres-pgbackrest-repo1-{full,incr})? How would concurrency between them be handled? I think concurrencyPolicy only applies to Jobs created from the same CronJob.
I think the easier way would be to have the Jobs retry in case there's a lock. People usually have alerts on jobs taking too long in the cluster anyway, so we should get alerts if the backups don't happen due to some erroneously leftover lock.
As far as concurrencyPolicy goes, once we wait on locks, we should probably always set it to Forbid. There's no point in queuing a pile of Job resources if a job takes longer to finish than the cron interval allows (and there are probably alerts about that as well).
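For reference, setting that on the generated CronJob would look roughly like the sketch below. This is a minimal illustration, not the exact manifest the operator renders (the image is a placeholder), and as noted above it only prevents overlap between Jobs spawned from this one CronJob, not between the full and incr CronJobs:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-pgbackrest-repo1-full    # name pattern taken from the discussion above
spec:
  schedule: "0 1 * * *"
  concurrencyPolicy: Forbid               # skip this run if the previous backup Job is still active
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pgbackrest
              image: crunchy-pgbackrest   # placeholder; not the actual image reference the operator uses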
There are two different CronJob resources, indeed. Looking at some recent output in my namespace, I can only see the problem happening with the full backup currently, but the issue is still the same:
NAME                              SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
harbor-db-pgbackrest-repo1-full   0 1 * * *      False     0        12h             81d
harbor-db-pgbackrest-repo1-incr   */30 * * * *   False     0        8m13s           81d
The CronJob with the last scheduled time of 12h ago created a number of job pods that are overlapping, and only one of them succeeds:
harbor-db-pgbackrest-repo1-full-1650934800-mb27g   0/1   Completed   0   12h
harbor-db-pgbackrest-repo1-full-1650934800-qdmwc   0/1   Error       0   12h
harbor-db-pgbackrest-repo1-full-1650934800-sw5cr   0/1   Error       0   12h
harbor-db-pgbackrest-repo1-full-1650934800-ztctx   0/1   Error       0   12h
I don't believe the problem is how long the job takes to run; the database is currently quite small, so the backup job completes very quickly. My question would be: for the scheduled run of the CronJob posted above (the full job), why is more than one job created?
Facing the same problem; is there any update?
Same issue here, I do not believe they are overlapping either.
I noticed the job only fails when it attempts to back up from the primary postgres pod:
postgres-cluster-postgres-ha-w4g5-0
(Primary instance with write permissions)
2022-08-10 14:24:12,590 INFO: no action. I am (postgres-cluster-postgres-ha-w4g5-0), the leader with the lock
postgres-cluster-repo1-full-27668460--1-dxpm7
(Backup job)
time="2022-08-10T14:24:12Z" level=info msg="stderr=[WARN: unable to check pg3: [LockAcquireError] raised from remote-0 tls protocol on 'postgres-cluster-postgres-ha-w4g5-0.postgres-cluster-pods.postgres.svc.cluster.local.': unable to acquire lock on file '/tmp/pgbackrest/db-backup.lock': Resource temporarily unavailable\n HINT: is another pgBackRest process running?\nERROR: [056]: unable to find primary cluster - cannot proceed\n]"
I figured it may have nothing to do with concurrency, just a problem in the generated pgBackRest config:
postgres-cluster-pgbackrest-config -> pgbackrest_repo.conf
# Generated by postgres-operator. DO NOT EDIT.
# Your changes will not be saved.
[db]
pg1-host = postgres-cluster-postgres-ha-qpw7-0.postgres-cluster-pods.postgres.svc.cluster.local.
...
pg2-host = postgres-cluster-postgres-ha-v2vm-0.postgres-cluster-pods.postgres.svc.cluster.local.
...
pg3-host = postgres-cluster-postgres-ha-w4g5-0.postgres-cluster-pods.postgres.svc.cluster.local.
...
All 3 instances are used for backup: primary + replicas.
Would it be fixable by using the postgres-cluster-replicas service instead of postgres-cluster-pods?
PS: I'm not using any pooling solution yet (PgBouncer/PgPool).
I have a PR #3440 which should address this.
@szelenka 's PR above was merged in Nov (and thanks for that!) and was released in our v5.3.0, so if you update to 5.3.0, you should no longer run into this problem, but if you are, please let us know.
I'm still seeing this with 5.3.0. First of all, a backup starts just after the cluster is created, even if I did not specify a schedule in the definition; second, since this backup takes a long time, after some time other backup jobs are started (again: why?), which fail because they're unable to get the lock.
@waldner AFAIK, concurrencyPolicy should exist in K8s version 1.20.15 that you're using. Though it was still in beta back then.
Do you see "Forbid" set in the CronJob the operator creates?
Hi @szelenka, thanks, I suppose the problem is that the backup jobs I see are NOT coming from CronJobs (which is where you added the concurrencyPolicy); they are just plain Jobs, so there is no protection when many of them overlap (except the failure to get the lock).
What's not clear to me is where all these jobs come from; they're all unsolicited, and furthermore I don't understand the point of running a backup on a newly created cluster, since:
Thanks
EDIT: I can share the cluster config, it's very simple:
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: test-cluster
  namespace: testns
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-14.6-2
  postgresVersion: 14
  instances:
    - dataVolumeClaimSpec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.41-2
      configuration:
        - secret:
            name: pgo-s3-creds
      global:
        repo1-path: /test-cluster
        repo1-bundle: "y"
        repo1-s3-uri-style: path
        repo1-retention-full: "5"
        repo1-retention-full-type: count
        start-fast: "y"
      repos:
        - name: repo1
          s3:
            bucket: "..."
            endpoint: "...."
            region: "..."
I see from the docs (here: https://access.crunchydata.com/documentation/postgres-operator/4.7.0/architecture/provisioning/) that the initial backup is done to facilitate the creation of replicas. While I still think it could instead be done only when a replica is actually requested by the user, I understand the rationale behind the decision. So I guess my problem is that I had flaky object storage which prevented this initial backup from completing (with lots of timeouts and retries), so the reconciliation kept kicking in to start it again and again. I haven't checked the code, but perhaps a check could be added to avoid starting a new job if one is already running.
@waldner I just want to clarify: were you ever able to get an initial backup to occur? One suggestion would be to remove the cron and just let the initial backup happen. We are unable to reproduce this issue, but you can turn up your debugging, which will help narrow down the underlying cause, which, as you said, may be with storage.
I didn't have any cron job running yet. It was the initial backup itself that was taking a very long time due to network/storage errors, and what I imagine were further attempts by the reconciler overlapped and failed because the very first backup had not finished yet. Admittedly not a very common situation. If I could suggest, I'd either:
Just my 2 cents.
We ran into this same issue on startup with one backup pod failing. Our validation steps on installation had to be rewritten to ignore the failing pod. A graceful exit or not starting two pods would be much appreciated.
Just ran into another caveat related to this. The concurrencyPolicy change I made applies to Jobs that are triggered by a CronJob, and that works well. BUT there could be multiple CronJobs (full/diff/incr), each of which only looks at its own concurrencyPolicy.
So if a full CronJob was triggered by its schedule and is taking a while to complete, and an incr CronJob is then triggered by its schedule before the full could complete, you'll run into the same failed Jobs.
AFAIK, there's no Kubernetes-native method to check concurrency across multiple CronJobs, so any adjustment to this would need to be done within the initContainer or entrypoint of the Job itself.
However, for the above, it's kind of nice to know there's a job taking longer than normal, and it should fail the incr backup. It would then be on the admin to adjust the CronJob schedules so they don't overlap with one another.
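Since those schedules live in the PostgresCluster spec, staggering them mostly means choosing cron expressions that cannot collide. A rough sketch, assuming a full backup comfortably finishes within a few hours (the cron values are only illustrative):

spec:
  backups:
    pgbackrest:
      repos:
        - name: repo1
          schedules:
            full: "0 1 * * 0"             # weekly full backup, Sundays at 01:00
            differential: "0 1 * * 1-6"   # daily differential on the other days
            incremental: "0 */4 * * *"    # incrementals every 4 hours, starting at a different minute/hour than the 01:00 runs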
Overview
I have a full and incremental backup schedule defined for a postgres cluster. I consistently see more than one job get scheduled, causing the namespace to be filled with pods in an Error state, with duplicate jobs (the ones that didn't run first) throwing the error: [ERROR: [050]: unable to acquire lock on file '/tmp/pgbackrest/db-backup.lock': Resource temporarily unavailable\n HINT: is another pgBackRest process running?\n]
As far as I can tell, this is not preventing backups from running, as at least one of the jobs succeeds, leaving the others in an error state.
Environment
Please provide the following details:
Kubernetes
1.20.15
ubi8-5.0.4-0
13
iSCSI PVCs for pods, S3 for pgBackRest
Steps to Reproduce
REPRO
EXPECTED
ACTUAL
Logs
Additional Information
When inspecting the CronJob definition, there is a field ConcurrencyPolicy set to Allow. Would it not make more sense to set this value to Forbid (or at least have the option of doing so via the CRD), to prevent this type of scheduling conflict from happening?
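If such an option were exposed, it could look something like the sketch below; to be clear, the concurrency field shown here is hypothetical and does not exist in the current PostgresCluster CRD, it is only meant to illustrate the idea (the schedules mirror the ones from the output above):

spec:
  backups:
    pgbackrest:
      repos:
        - name: repo1
          schedules:
            full: "0 1 * * *"
            incremental: "*/30 * * * *"
      # hypothetical field, not part of the actual CRD, shown only to illustrate the request:
      scheduledBackupConcurrencyPolicy: Forbid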