CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Some question and problems.. #1963

Closed IJOL closed 4 years ago

IJOL commented 4 years ago

We have repeatedly found that when trying to add a replica, the replica gets stuck, and we see a stream of messages like this in the logs:

2020-10-12 07:39:26.751 UTC [48630] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000003D0000005D000000D1 has already been removed
2020-10-12 07:39:31.795 UTC [48654] LOG: started streaming WAL from primary at 5D/D1000000 on timeline 61
2020-10-12 07:39:31.795 UTC [48654] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000003D0000005D000000D1 has already been removed
2020-10-12 07:39:36.646 UTC [48661] LOG: started streaming WAL from primary at 5D/D1000000 on timeline 61
2020-10-12 07:39:36.646 UTC [48661] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000003D0000005D000000D1 has already been removed
2020-10-12 07:39:41.752 UTC [48730] LOG: started streaming WAL from primary at 5D/D1000000 on timeline 61
2020-10-12 07:39:41.752 UTC [48730] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000003D0000005D000000D1 has already been removed

Why is this occurring? Why does a replica ask for a WAL segment that is no longer there? Where exactly is the problem?

We also need a way to start a replica with a different PVC size than the one the cluster was started with. We use Longhorn for PVC and volume management, and something keeps our volumes continually needing expansion: databases that consumed 6 to 7 GB on our old systems now suddenly need 10 GB more to keep running. We suspect the WAL, of course, but we have not found where the problem is exactly. We also suspect this second problem is the main culprit behind the one above. We are trying hard to understand what the pods are doing.
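For reference, a quick way to check whether pg_wal is really what is growing (a sketch only, assuming PostgreSQL 10 or newer and a role allowed to call pg_ls_waldir(), e.g. a superuser or a member of pg_monitor) is something like:

-- On the primary: total size of the WAL directory
SELECT count(*)                  AS wal_files,
       pg_size_pretty(sum(size)) AS wal_total
FROM   pg_ls_waldir();

-- Compare against the size of each database
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS db_size
FROM   pg_database
ORDER  BY pg_database_size(datname) DESC;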

Related to the constant volume expansion: the backrest pods, the ones that have the pgbr-repo volume, end up disconnected from the job that does the backups, and we need to recreate the jobs to get backups running again.. that little annoying POD_NAME or

As said, a number of glitches and small annoyances make daily "serious" use of cpgo a source of continuous little crises and noise that is very time-consuming to resolve. It makes "production" use more cumbersome than it should be for an "operator" that is meant to solve the many small annoyances of deploying Postgres itself, which defeats the purpose of an "operator".

jkatz commented 4 years ago

I happen to cover replica synchronization quite extensively in this article: https://info.crunchydata.com/blog/wheres-my-replica-troubleshooting-streaming-replication-synchronization-in-postgresql. I suggest reading through this, as my answer will build on that.

This is also one of the reasons why, by default, the Postgres Operator uses pgBackRest to house the WAL archives: it contains a "permanent" (or at least as long as your retention policy allows) set of archives, so when you try to create a replica, or if your replica should lag (a topic I also cover extensively in https://info.crunchydata.com/blog/postgresql-monitoring-for-app-developers-alerts-troubleshooting and https://info.crunchydata.com/blog/postgresql-monitoring-for-application-developers-dba-stats), the missing WAL segments can be pulled from the archive instead of having to still exist on the primary.
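To make that concrete, here is roughly what the pgBackRest-based WAL archiving wiring looks like (a sketch only; the stanza name db is an assumption, and the Operator configures the real values for you):

# On the primary: ship every completed WAL segment to the pgBackRest repository
archive_mode = on
archive_command = 'pgbackrest --stanza=db archive-push %p'

# On the replica (postgresql.conf on PostgreSQL 12+, recovery.conf on older versions):
# if streaming falls behind and the primary has recycled a segment, the replica
# falls back to fetching it from the repository.
restore_command = 'pgbackrest --stanza=db archive-get %f "%p"'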

(I will be providing a much more detailed answer in regards to pgBackRest and its advantages and tradeoffs in deployment models in #1965).

That said, I also do not have any insight into what customizations you may have made to how your PostgreSQL clusters are configured, so I cannot pinpoint the exact issue. I don't know if you've adjusted wal_keep_segments, or if you have your replicas pointed at a WAL archive to ensure they can continue to load WAL should the replication link be severed. These are all things that the Operator helps with out of the box, but adding customizations can skew this behavior. If you're deleting WAL logs directly (which is not recommended, as covered in one of the above blogs) and a replica is lagging, that could also affect the replica's ability to replay logs.
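If you want to see where you stand, a couple of queries along these lines, run on the primary, will show the retention setting and how far each replica has fallen behind (a sketch; note that on PostgreSQL 13+ wal_keep_segments is replaced by wal_keep_size):

-- How many WAL segments are kept for streaming replicas before being recycled
SHOW wal_keep_segments;

-- Current replication connections and how far behind each one is
SELECT application_name, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM   pg_stat_replication;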

If you're insistent on using pg_basebackup, I'd recommend using replication slots as part of replication, but this skews heavily from the default Operator behavior and, as mentioned in earlier issues, is not recommended.
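For completeness, the slot-based pg_basebackup flow looks roughly like this (a sketch with a hypothetical slot name replica1 and placeholder host/user; again, this diverges from how the Operator manages replicas):

-- On the primary: reserve WAL for the new replica so segments are not
-- recycled before it has consumed them
SELECT pg_create_physical_replication_slot('replica1');

# Take the base backup, streaming WAL through the same slot (PostgreSQL 10+)
pg_basebackup -h <primary-host> -U <replication-user> -D /pgdata/replica1 \
    --wal-method=stream --slot=replica1

# The replica should then keep using that slot, e.g. primary_slot_name = 'replica1'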

The PostgreSQL Operator is designed to help with overall cluster management (HA, backups, DR, monitoring) and provides constructs to help with all of this. That said, there is still a stateful service, i.e. PostgreSQL, running underneath it. In addition to monitoring what is going on, running it requires an understanding of what it takes to operate the service in production.

I believe you actually mentioned this yourself, on the importance of understanding the system components, in https://github.com/CrunchyData/postgres-operator/issues/1786#issuecomment-674409440, i.e.

it works great despite my issues, was a fundamental lack of knowledge on our part, we are seasoned patroni users, but rookies a pgbackrest, sorry for the noises.. great software!!

With regards to the ongoing volume expansion, I would also recommend checking your pg_log directory to see if the files there are taking up too much space. If needed, you can adjust your logging settings, e.g. redirect the PostgreSQL logs to stderr (with the risk that if you don't have a logging aggregator in your Kubernetes environment, you may lose valuable information about the state of your system).
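As a concrete example of that check and the corresponding logging change (a sketch; the /pgdata path layout is an assumption based on typical Crunchy containers, and logging_collector only takes effect after a restart):

# Inside the database pod: how much space are the collected log files using?
du -sh /pgdata/*/pg_log

-- In psql: send logs to stderr only; log_destination applies on reload,
-- but logging_collector is a postmaster parameter and needs a restart
ALTER SYSTEM SET log_destination = 'stderr';
ALTER SYSTEM SET logging_collector = off;
SELECT pg_reload_conf();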