cloudfoundry / postgres-release

BOSH release for PostgreSQL
Apache License 2.0

Upgrading stemcell for postgres release (v28) vm resulted in data loss #42

Closed: joshzarrabi closed this issue 6 years ago

joshzarrabi commented 6 years ago

We manage a Concourse that uses postgres-release v28, and upgrading the stemcell resulted in our Concourse displaying very old data (it looks like the state of the database right after the original fly set-pipeline command). Based on the output of du -h we still have the data sitting on the persistent disk, but the db is serving old data instead. We could not find anything in the bosh logs that would point to something obvious happening, and we are not sure where to look in the postgres logs.

cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/158278085

The labels on this github issue will be updated when the story is started.

valeriap commented 6 years ago

@joshzarrabi we have never faced such a problem before.

I'm assuming that postgres-release was already on v28 before the stemcell upgrade (in which case the DATADIR is /var/vcap/store/postgres/postgres-9.6.8), and that the data that has been lost was already in the database when the stemcell upgrade started.

The first log to check is the pre-start one. The pre-start checks the persistent disk for the existence of the DATADIR/postgresql.conf file; based on its presence, the DATADIR is either reused or initialized.
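In rough pseudocode, the decision looks something like this (a simplified sketch based on the description above, not the actual template; paths assume the v28 layout):

# Simplified sketch: reuse the data directory only if it already looks initialized
DATADIR=/var/vcap/store/postgres/postgres-9.6.8

if [ -f "${DATADIR}/postgresql.conf" ]; then
  echo "postgresql.conf found: reusing existing DATADIR ${DATADIR}"
else
  # No postgresql.conf: the directory is treated as new and initialized from scratch
  /var/vcap/packages/postgres-9.6.8/bin/initdb -D "${DATADIR}"
fi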

If the DATADIR was correctly reused, another check you can try is to connect directly to postgres and list the content of a table that has data missing, for example:

/var/vcap/packages/postgres-9.6.8/bin/psql -p 5432 -h localhost -U vcap atc -c "select * from pipelines"

The postgres logs (startup.log and the postgresql server log) can also be checked for any errors during the stop and start of postgresql around the stemcell upgrade.
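For example, something along these lines gives a quick overview (the log directory is the one used by this release; adjust the file names to whatever is actually present on your vm):

# List the postgres job logs and scan them for errors around the upgrade
ls -l /var/vcap/sys/log/postgres/
grep -iE "error|fatal|panic" /var/vcap/sys/log/postgres/*.log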

joshzarrabi commented 6 years ago

Hi @valeriap! We tried to fix the issue by recreating the vm after the initial failure, but we did not download the logs first, so they are lost :(

We were previously on GCP stemcell v3541.5, which has some (overly strict) permissions issues, and we upgraded to v3586.16. We speculate the data loss might have gone like this:

  1. happy Concourse, happy postgres, old and strange stemcell
  2. upgrade stemcell, and on the recreate of the db vm postgres-release does not have permission to see the postgresql.conf file.
  3. postgres-release makes a new db and chugs along. The old db is sitting alongside the new one, possibly in postgres-previous or somewhere similar?
  4. we recreate the vm, losing our logs, but maybe permissions are still strange or something? We're not sure where to find the lost data, and are particularly suspicious that the running DB seems to have more data in it than it should (i.e. /var/vcap/store/postgres/postgres-9.6.8/base is possibly bigger than it should be given the data loss).

All of that feels a bit speculative and handwavey. We are really not sure what happened. We would like to know if there is unused data sitting around in that directory but we are not sure how to find out.

valeriap commented 6 years ago

@joshzarrabi It's really hard to guess what happened here. Some considerations:

A couple of questions:

joshzarrabi commented 6 years ago

@valeriap We can confirm for sure that the only change in the installation was the stemcell; we did not change the postgres-release version.

db/8e62d0a8-0ed3-4f37-870f-2c97fca35250:/var/vcap/store# tree -d
.
├── lost+found
└── postgres
    ├── pg_upgrade_tmp
    ├── postgres-9.6.8
    │   ├── base
    │   │   ├── 1
    │   │   ├── 12406
    │   │   ├── 12407
    │   │   ├── 16385
    │   │   ├── 16400
    │   │   ├── 18124
    │   │   └── pgsql_tmp
    │   ├── global
    │   ├── pg_clog
    │   ├── pg_commit_ts
    │   ├── pg_dynshmem
    │   ├── pg_logical
    │   │   ├── mappings
    │   │   └── snapshots
    │   ├── pg_multixact
    │   │   ├── members
    │   │   └── offsets
    │   ├── pg_notify
    │   ├── pg_replslot
    │   ├── pg_serial
    │   ├── pg_snapshots
    │   ├── pg_stat
    │   ├── pg_stat_tmp
    │   ├── pg_subtrans
    │   ├── pg_tblspc
    │   ├── pg_twophase
    │   └── pg_xlog
    │       └── archive_status
    └── postgres-previous
        ├── base
        │   ├── 1
        │   ├── 12406
        │   ├── 12407
        │   ├── 16400
        │   └── 18124
        ├── global
        ├── pg_clog
        ├── pg_commit_ts
        ├── pg_dynshmem
        ├── pg_logical
        │   ├── mappings
        │   └── snapshots
        ├── pg_multixact
        │   ├── members
        │   └── offsets
        ├── pg_notify
        ├── pg_replslot
        ├── pg_serial
        ├── pg_snapshots
        ├── pg_stat
        ├── pg_stat_tmp
        ├── pg_subtrans
        ├── pg_tblspc
        ├── pg_twophase
        └── pg_xlog
            └── archive_status

61 directories

Thanks so much for helping out with this; we would like to get to the bottom of it so no one loses data that they need.

valeriap commented 6 years ago

@joshzarrabi The presence of old data is meaningful. It's unlikely that a bug in the pre-start or in the control job could cause data to be moved back in time, because the control job only manipulates the data directory in three cases:

This seems more like a disk problem.

PostgreSQL periodically creates checkpoints. Under some particular conditions, e.g. if the last transaction log files are lost, it is possible to lose the data written since the last checkpoint. But you have fallen too far back in the past for that to be the case here.

A mount problem could have caused it if you had been working for a period on the local disk, or on a different disk, instead of the current persistent disk. But that implies you should have experienced another data inconsistency in the past.

You can check with a mount command whether the persistent disk is properly mounted now, but I see no way to check what the situation was before the stemcell upgrade. Did you check whether you have any orphaned disks or a snapshot of the disk? I'm not an expert on the bosh side, but maybe you can get some more information about the disks from the bosh task logs.
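For example (the bosh command assumes the v2 CLI, so treat this as a sketch):

# Check that /var/vcap/store is backed by the persistent disk
mount | grep /var/vcap/store
df -h /var/vcap/store

# List orphaned persistent disks that might still hold the old data
bosh disks --orphaned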

Regarding the unexpected size of the base directory that you mentioned in your previous post, did you check whether the size of the pgsql_tmp directory might explain it?
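If you want to see what is actually using the space, you can also map the base/<oid> subdirectories to databases, for example (same psql binary and port as in the earlier example; I'm connecting to the default postgres database here):

# Compare the on-disk sizes of base/<oid> with the sizes postgres reports per database
du -sh /var/vcap/store/postgres/postgres-9.6.8/base/*
/var/vcap/packages/postgres-9.6.8/bin/psql -p 5432 -h localhost -U vcap postgres \
  -c "select oid, datname, pg_size_pretty(pg_database_size(datname)) from pg_database;"

Each base/<oid> directory should correspond to an oid in pg_database; roughly speaking, a base/<oid> directory with no matching row is space the running cluster no longer references.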

joshzarrabi commented 6 years ago

Hi @valeriap, we looked pretty closely at the BOSH logs and we are fairly sure that the disk that existed before the upgrade got mounted back to the vm after the upgrade (the CID of the disk that got unmounted is the same as the one that got mounted).

It looks like there is nothing in the pgsql_tmp directory.

db/8e62d0a8-0ed3-4f37-870f-2c97fca35250:/var/vcap/store/postgres/postgres-9.6.8/base/pgsql_tmp# tree
.

0 directories, 0 files

pivotal-jwynne commented 6 years ago

@valeriap

We seem to be seeing similar behaviour after an upgrade from v22 to v28 for Concourse.

The postgres db that is starting on the Concourse db vm still seems to be pointing to the 9.6.4 directories, even though all of the data seems to have been moved to the 9.6.8 directory.

There is a file called pgconfig.sh that seems to have some logic about versions (this is currently showing the wrong versions, btw). I tried to manually change it, but no joy:

current_version="9.6.4"
pgversion_current="postgres-${current_version}"
pgversion_old="postgres-9.6.3"
pgversion_older="postgres-9.6.2"

Also, the atc is failing to start with an error along the lines of "failed to migrate database: file does not exist".

I'm convinced the data is there but everything seems to be wired up screwy.

James

valeriap commented 6 years ago

@pivotal-jwynne The pgconfig.sh that you are showing is the v22 one; if bosh has installed the new v28 templates, the /var/vcap/jobs/postgres/bin directory should not point to it.

The v28 pgconfig.sh would look different.
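A quick way to check what is actually installed and in use (just a sketch; same psql invocation as earlier in the thread, adjust the package path if your binaries differ):

# Which version does the installed pgconfig.sh declare?
grep -n "current_version" /var/vcap/jobs/postgres/bin/pgconfig.sh

# Which data directory is the running server actually using?
ps -ef | grep [p]ostgres
/var/vcap/packages/postgres-9.6.8/bin/psql -p 5432 -h localhost -U vcap postgres -c "show data_directory;"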

Can you share the content of the logs in the /var/vcap/sys/log/postgres directory? And also the listing of the /var/vcap/store/postgres content.

Note that v22 had a bug, fixed in v24, where errors during the upgrade were not reported back, so please check the pre-start logs for errors.

Once we have checked what the current situation is, we can suggest how to recover and move on.

I'm assuming this is not the same environment that @joshzarrabi is using, so it's not necessarily the same problem.

valeriap commented 6 years ago

@pivotal-jwynne As per our discussion in chat, your problem is different from the original one. The logs are not there anymore so I cannot verify it, but it seems that in your case the following happened:

  1. first bosh deploy from v22 to v28:
    • the v28 pre-start created postgres-previous as a copy of /var/vcap/store/postgres/postgres-9.6.4
    • the v28 pre-start moved /var/vcap/store/postgres/postgres-9.6.4 to /var/vcap/store/postgres/postgres-9.6.8
    • the deploy was killed because it was hanging on some job
  2. second bosh deploy:
    • for some reason the job templates linked in /var/vcap/jobs/postgres at this time were the v22 ones (as per your pgconfig.sh snippet)
    • the v22 pre-start checked for the existence of /var/vcap/store/postgres/postgres-9.6.4 and, since it was not present, initialized the directory with an initdb
    • postgres was started on the new directory, with no data in it.
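If it helps to confirm which directory actually ended up with the original data and which one was freshly initialized, comparing their on-disk sizes is usually enough (a freshly initialized data directory is only a few tens of MB; adjust the path to whatever is present on your disk):

# Compare the sizes of the data directories on the persistent disk;
# the freshly initialized one will be much smaller than the one holding the real data
du -sh /var/vcap/store/postgres/postgres-*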

@joshzarrabi Without logs I cannot dig further into the problem. We are closing the issue; you can reopen it if the problem occurs again. In that case, please collect all the logs.