deis / postgres

A PostgreSQL database used by Deis Workflow.
https://deis.com
MIT License
36 stars 22 forks source link

inspect why /bin/backup stops running after a while #157

Closed bacongobbler closed 7 years ago

bacongobbler commented 7 years ago

We've found that /bin/backup eventually stops running or fails to execute, causing some databases to be running forever without a checkpoint being created in the last 12 hours. This causes users to manually update the readinessCheck as well as the PGCTLTIMEOUT value in https://github.com/deis/postgres/pull/155 to allow the database to finish a full replay from the last checkpoint. The longest (and only) checkpoint we've seen was a month from the time the database had last performed a base backup, which took 3 hours to completely recover.

We need to investigate these situations why it would fail, and perhaps look at more robust solutions (like cron) in the container so the backgrounded process continues to work for months without failure.

bacongobbler commented 7 years ago

As it turns out, wal-e delete can block /bin/backup indefinitely if a file is missing. This error loops forever:

04:59:59.000 HINT: A better error message should be written to handle this exception. Please report this output and, if possible, the situation under which it arises.
04:59:59.000 STRUCTURED: time=2016-10-01T11:59:59.780119-00 pid=2704
04:59:59.000 wal_e.retries WARNING MSG: retrying after encountering exception
04:59:59.000 DETAIL: Exception information dump:
04:59:59.000 Traceback (most recent call last):
04:59:59.000 File "/usr/local/lib/python2.7/dist-packages/wal_e/retries.py", line 62, in shim
04:59:59.000 return f(*args, **kwargs)
04:59:59.000 File "/usr/local/lib/python2.7/dist-packages/wal_e/worker/gs/gs_deleter.py", line 27, in _delete_batch
04:59:59.000 bucket.delete_blobs(page)
04:59:59.000 File "/usr/local/lib/python2.7/dist-packages/gcloud/storage/bucket.py", line 429, in delete_blobs
04:59:59.000 self.delete_blob(blob_name, client=client)
04:59:59.000 File "/usr/local/lib/python2.7/dist-packages/gcloud/storage/bucket.py", line 402, in delete_blob
04:59:59.000 _target_object=None)
04:59:59.000 File "/usr/local/lib/python2.7/dist-packages/gcloud/connection.py", line 344, in api_request
04:59:59.000 error_info=method + ' ' + url)
04:59:59.000 NotFound: 404 Not Found (DELETE https://www.googleapis.com/storage/v1/b/REDACTED-database/o/%2Fbasebackups_005%2Fbase_000000080000002A00000056_00000040%2Fextended_version.txt)
bacongobbler commented 7 years ago

related: #108

bacongobbler commented 7 years ago

See https://github.com/wal-e/wal-e/issues/293 as well as the upstream fix in https://github.com/wal-e/wal-e/pull/294

bacongobbler commented 7 years ago

this error was actually fixed upstream in https://github.com/wal-e/wal-e/commit/b934a416861e9f3e74b0f56d68ae31ddf5e339b1 which is in wal-e v1.0.1.

I'll cut a v2.3.0 so others can upgrade to this release and we can add it to v2.9.

bacongobbler commented 7 years ago

done. Anyone who wants this fix for GCS can use postgres v2.3.0 which includes wal-e v1.0.1.