deis / postgres

A PostgreSQL database used by Deis Workflow.
https://deis.com
MIT License

integration tests are finding 6 backups #67

Closed (bacongobbler closed this issue 8 years ago)

bacongobbler commented 8 years ago

see https://github.com/deis/postgres/pull/65#issuecomment-196917516

bacongobbler commented 8 years ago

So I found the core issue:

STRUCTURED: time=2016-03-16T16:14:27.012096-00 pid=377
wal_e.retries WARNING  MSG: retrying after encountering exception
        DETAIL: Exception information dump: 
        Traceback (most recent call last):
          File "/usr/local/lib/python2.7/dist-packages/wal_e/retries.py", line 62, in shim
            return f(*args, **kwargs)
          File "/usr/local/lib/python2.7/dist-packages/wal_e/worker/s3/s3_deleter.py", line 17, in _delete_batch
            bucket_name = page[0].bucket.name
        AttributeError: 's3.ObjectSummary' object has no attribute 'bucket'

        HINT: A better error message should be written to handle this exception.  Please report this output and, if possible, the situation under which it arises.

It turns out that the image on CI does not have the latest changes from deis/wal-e#3. Busting the cache by adding --no-cache to docker build fixes this, but we should probably pin to a commit instead of a branch so that Docker's cache is invalidated whenever we make changes upstream.

bacongobbler commented 8 years ago

That was related, but this is still occurring on master.

bacongobbler commented 8 years ago

This is a bug but not a core issue. One of the older releases is being stubborn about getting removed from minio, so I'm going to remove this from showstopper. It's not a significant issue that affects the platform, and we could potentially ship beta with this bug.

bacongobbler commented 8 years ago

this popped up again in CI: https://travis-ci.org/deis/postgres/builds/122642210

mboersma commented 8 years ago

Another sighting: https://travis-ci.org/deis/postgres/builds/123393663

bacongobbler commented 8 years ago

Unfortunately, as soon as we restart the job to go green, the old build goes away. Anyhoo, this is a problem, but it doesn't seem like a core issue; just a CI/delay issue.

mboersma commented 8 years ago

Here is a failure in Travis CI with logging from #102:

pg_ctl: server is running (PID: 1)
/usr/lib/postgresql/9.4/bin/postgres
-----> checking if minio has 5 backups
!!!    did not find 5 base backups, which is the default (found 6)
!!!    base_00000001000000000000000D_00000040_backup_stop_sentinel.json
base_00000001000000000000000E_00000040_backup_stop_sentinel.json
base_00000001000000000000000F_00000040_backup_stop_sentinel.json
base_000000010000000000000010_00000040_backup_stop_sentinel.json
base_000000010000000000000011_00000040_backup_stop_sentinel.json
base_000000010000000000000012_00000040_backup_stop_sentinel.json
make: *** [test-functional] Error 1

The very next test run against the same master commit passed.
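For reference, the check that failed above can be sketched roughly as follows: count the `backup_stop_sentinel.json` objects in the listing and compare against the expected retention of 5. This is only an illustration, not the repository's actual test script; the function names and the naming pattern (inferred from the six names in the log) are assumptions.

```python
import re

# Assumed naming pattern, inferred from the sentinel names in the log above:
# base_<24 hex chars>_<8 hex chars>_backup_stop_sentinel.json
SENTINEL_RE = re.compile(
    r"^base_[0-9A-F]{24}_[0-9A-F]{8}_backup_stop_sentinel\.json$"
)


def find_base_backups(object_names):
    """Return the names that look like base backup sentinels (hypothetical helper)."""
    stripped = (name.strip() for name in object_names)
    return [name for name in stripped if SENTINEL_RE.match(name)]


def check_backup_count(object_names, expected=5):
    """Return (ok, found) where ok is True when exactly `expected` backups exist."""
    found = find_base_backups(object_names)
    return len(found) == expected, found
```

Fed the six names from the log, `check_backup_count` reports a mismatch, which is the failure shown above.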

vdice commented 8 years ago

It appears the proposed fix in #102 unfortunately didn't completely address the issue; therefore moving this to the v2.0 milestone.

bacongobbler commented 8 years ago

No, #102 was just to expose what the issue is, which is exactly what I assumed: a new base backup was taken and the old one wasn't deleted yet due to sync issues. This is a low-priority fix because it's not a massive issue, nor does it cause any damage to the database. It's just a little lag combined with checking right in the middle of a backup operation. Perhaps a fix would be to stop the database and then check the number of backups retained, as the database should shut down gracefully with only 5 backups after the backup has been pushed to minio.
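A rough sketch of one way to make the check tolerant of that lag: poll the backup listing until it settles at the expected count or a timeout expires. Note that this is a swapped-in alternative to stopping the database, not the fix suggested in this comment. `wait_for_backup_count`, the `list_backups` callable, and the timeout values are all hypothetical.

```python
import time


def wait_for_backup_count(list_backups, expected=5, timeout=30.0, interval=1.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll until list_backups() returns `expected` items or `timeout` elapses.

    list_backups is a hypothetical callable returning the current base backup
    names (for example, the sentinel objects listed from minio).
    Returns True if the count settled at `expected`, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if len(list_backups()) == expected:
            return True
        if clock() >= deadline:
            return False
        # The old backup may still be pending deletion; wait and retry.
        sleep(interval)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised in a test without actually waiting.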