Closed: joshzarrabi closed this issue 6 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/158278085
The labels on this github issue will be updated when the story is started.
@joshzarrabi we never faced such a problem before.
I'm assuming that postgres was already on v28 before the stemcell upgrade (in this case the DATADIR is /var/vcap/store/postgres/postgres-9.6.8), and that the data that has been lost was already in the database when the stemcell upgrade was started.
The first log to check is the pre-start one. The pre-start checks the persistent disk for the existence of the DATADIR/postgresql.conf file; based on its presence, the DATADIR is either reused or initialized.
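For reference, a rough sketch of the kind of check described above; this is not the actual postgres-release pre-start script, and the log paths are just the standard BOSH locations:

# Hypothetical sketch of the decision described above, not the real pre-start script
DATADIR=/var/vcap/store/postgres/postgres-9.6.8
if [ -f "${DATADIR}/postgresql.conf" ]; then
  echo "postgresql.conf found: reusing existing DATADIR"
else
  echo "postgresql.conf missing: DATADIR would be initialized from scratch"
fi
# The pre-start output itself is typically written to:
#   /var/vcap/sys/log/postgres/pre-start.stdout.log
#   /var/vcap/sys/log/postgres/pre-start.stderr.log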
If the DATADIR was correctly reused, another check you can try is to connect directly to postgres and list the content of a table that has data missing, for example:
/var/vcap/packages/postgres-9.6.8/bin/psql -p 5432 -h localhost -U vcap atc -c "select * from pipelines"
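If you are not sure which tables to look at, the same psql binary can be used to list databases and tables first (the port, user, and database name are just the ones from the example above):

# List tables in the atc database, then list all databases
/var/vcap/packages/postgres-9.6.8/bin/psql -p 5432 -h localhost -U vcap atc -c "\dt"
/var/vcap/packages/postgres-9.6.8/bin/psql -p 5432 -h localhost -U vcap -l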
The postgres logs, postgresql.conf and startup.log, can also be checked for any errors during the stop and start of postgresql around the stemcell upgrade.
Hi @valeriap! We tried to fix the issue by recreating the VM after the initial failure, but we did not download the logs, so they are lost :(.
We were previously on GCP stemcell v3541.5, which has some (overly strict) permissions issues, and we upgraded to v3586.16. We speculate the data loss might have had something to do with those permissions issues and the check on the postgres.conf file, but all of that feels a bit speculative and handwavey. We are really not sure what happened. We would like to know if there is unused data sitting around in that directory, but we are not sure how to find out.
@joshzarrabi It's really hard to guess what happened here. Some considerations: postgres started using /var/vcap/store/postgres/postgres-9.6.8/postgresql.conf, so at least at that point the permissions were correct. By the way, do you confirm that before the stemcell upgrade postgres was already on 9.6.8?
A couple of questions:
Was anything else changed in the /var/vcap/store directory?
Was /var/vcap/store properly mounted before and after the stemcell upgrade?

@valeriap We can confirm for sure that the only change in the installation was the stemcell; we did not change the postgres-release version.
We are not sure if /var/vcap/store was properly mounted; is there an easy way to find out?

db/8e62d0a8-0ed3-4f37-870f-2c97fca35250:/var/vcap/store# tree -d
.
├── lost+found
└── postgres
├── pg_upgrade_tmp
├── postgres-9.6.8
│ ├── base
│ │ ├── 1
│ │ ├── 12406
│ │ ├── 12407
│ │ ├── 16385
│ │ ├── 16400
│ │ ├── 18124
│ │ └── pgsql_tmp
│ ├── global
│ ├── pg_clog
│ ├── pg_commit_ts
│ ├── pg_dynshmem
│ ├── pg_logical
│ │ ├── mappings
│ │ └── snapshots
│ ├── pg_multixact
│ │ ├── members
│ │ └── offsets
│ ├── pg_notify
│ ├── pg_replslot
│ ├── pg_serial
│ ├── pg_snapshots
│ ├── pg_stat
│ ├── pg_stat_tmp
│ ├── pg_subtrans
│ ├── pg_tblspc
│ ├── pg_twophase
│ └── pg_xlog
│ └── archive_status
└── postgres-previous
├── base
│ ├── 1
│ ├── 12406
│ ├── 12407
│ ├── 16400
│ └── 18124
├── global
├── pg_clog
├── pg_commit_ts
├── pg_dynshmem
├── pg_logical
│ ├── mappings
│ └── snapshots
├── pg_multixact
│ ├── members
│ └── offsets
├── pg_notify
├── pg_replslot
├── pg_serial
├── pg_snapshots
├── pg_stat
├── pg_stat_tmp
├── pg_subtrans
├── pg_tblspc
├── pg_twophase
└── pg_xlog
└── archive_status
61 directories
Thanks so much for helping out with this, we would like to get to the bottom of this so no one loses data that they need.
@joshzarrabi The presence of old data is meaningful. It's unlikely that a bug in the pre-start or in the control job could cause data to be moved back in time, because the control job only manipulates the data directory in three cases, and any attempt to re-initialize an already-populated DATADIR would have failed with:
initdb: directory "/var/vcap/store/postgres/postgres-9.6.8" exists but is not empty
This seems more like a disk problem.
PostgreSQL periodically creates checkpoints. Under some particular conditions, e.g. if the last transaction log files are lost, it is possible to lose the data written since the last checkpoint. But your data went too far back in the past for that to be the case.
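One way to see how far back the latest checkpoint is would be pg_controldata against the data directory; this is just a generic PostgreSQL check, not something postgres-release provides, and the paths assume the layout discussed above:

# Prints, among other things, "Time of latest checkpoint" and the latest checkpoint's WAL location
/var/vcap/packages/postgres-9.6.8/bin/pg_controldata /var/vcap/store/postgres/postgres-9.6.8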
A mount problem could have caused it if you had been working for a period on the local disk, or on a different disk instead of the current persistent disk. But that implies you should have experienced other data inconsistencies in the past.
You can check with a mount command whether the persistent disk is properly mounted now, but I see no way to check what the situation was before the stemcell upgrade.
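For example, something like the following would show whether a separate device is currently mounted at the store path (a generic Linux check, nothing specific to postgres-release):

# Is /var/vcap/store backed by the persistent disk device right now?
mount | grep /var/vcap/store
df -h /var/vcap/store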
Did you check if you have some orphaned disks or a snapshot of the disk?
I'm not an expert on the bosh side, but maybe you can get some more information about disks from the bosh tasks logs.
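Assuming the v2 bosh CLI, commands along these lines could help spot orphaned disks or reconstruct what happened to the disk during the deploy (the task id is a placeholder):

bosh disks --orphaned        # any orphaned persistent disks?
bosh tasks --recent          # find the deploy task around the stemcell upgrade
bosh task <task-id> --debug  # full debug log of that task, including disk attach/detach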
Regarding the unusual size of the base directory that you mentioned in a previous post, did you check whether the pgsql_tmp directory size may explain it?
Hi @valeriap, we looked pretty closely at the BOSH logs and we are fairly sure that the disk that existed before the upgrade got mounted back onto the VM after the upgrade (the CID of the disk that was unmounted is the same as the CID of the disk that was mounted).
It looks like there is nothing in the pgsql_tmp directory.
db/8e62d0a8-0ed3-4f37-870f-2c97fca35250:/var/vcap/store/postgres/postgres-9.6.8/base/pgsql_tmp# tree
.
0 directories, 0 files
@valeriap
We seem to be seeing similar behaviour after an upgrade from .22 to .28 for Concourse.
The postgres that starts on the Concourse db VM still seems to be pointing to the 9.6.4 directories, even though all of the data seems to have been moved to the 9.6.8 directory.
There is a file called pgconfig.sh that seems to have some logic about versions (it is currently showing the wrong versions, by the way). I tried to manually change this, but no joy.
current_version="9.6.4"
pgversion_current="postgres-${current_version}"
pgversion_old="postgres-9.6.3"
pgversion_older="postgres-9.6.2"
Also, the atc is failing to start with an error along the lines of "failed to migrate database: file does not exist".
I'm convinced the data is there but everything seems to be wired up screwy.
James
@pivotal-jwynne
The pgconfig.sh that you are showing is the .22 one, and if bosh has installed the new v28 templates, the /var/vcap/jobs/postgres/bin directory should not point to it.
The v28 pgconfig.sh would look different.
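To double-check which templates are actually installed on the db VM, something along these lines could help (the variable names are taken from the snippet above):

# On the db VM: which versions does the installed pgconfig.sh reference?
grep -E 'current_version|pgversion' /var/vcap/jobs/postgres/bin/pgconfig.sh
# From the bosh director: which release versions is the deployment using?
bosh deployments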
Can you share the content of the logs in the /var/vcap/sys/log/postgres directory, and also a listing of the /var/vcap/store/postgres content?
Note that v22 had a bug, fixed in v24, in reporting back errors during the upgrade, so please check the pre-start logs for errors.
Once we have checked what the current situation is, we can suggest how to recover and move on.
I'm assuming this is not the same environment that @joshzarrabi is using, so it's not necessarily the same problem.
@pivotal-jwynne As per our discussion in chat, your problem is different from the original one. The logs are not there anymore and I cannot verify it, but it seems that in your case the following happened:
1. pre-start created postgres-previous as a copy of /var/vcap/store/postgres/postgres-9.6.4
2. pre-start moved /var/vcap/store/postgres/postgres-9.6.4 to /var/vcap/store/postgres/postgres-9.6.8
3. the templates in /var/vcap/jobs/postgres at that time were still the v22 ones (as per your snippet of the pgconfig.sh)
4. the v22 control job checked for the existence of /var/vcap/store/postgres/postgres-9.6.4, and since it was not present it initialized the directory with an initdb

@joshzarrabi Without logs I cannot dig further into the problem. We are closing the issue; you can reopen it if the problem occurs again. In that case please collect all the logs.
We manage a Concourse that uses postgres-release v28, and upgrading the stemcell resulted in our Concourse displaying very old data (it looks like the state of the database right after the original fly set-pipeline command). Based on the result of du -h we still have the data sitting on the persistent disk, but the db is serving old data instead. We could not find anything in the bosh logs that would point to something obvious happening, and we are not sure where to look in the postgres logs.
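For what it's worth, the kind of du check mentioned above could look like this, comparing the size of the active data directory with postgres-previous (paths assume the tree shown earlier in the thread):

# Compare how much data sits in each copy of the data directory
du -sh /var/vcap/store/postgres/*
du -sh /var/vcap/store/postgres/postgres-9.6.8/base/*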