Commits for pg10+ change functional behaviour in a possibly unintended manner.

TritonDataCenter / pg_prefaulter

Faults pages into PostgreSQL shared_buffers or filesystem caches in advance of WAL apply

Apache License 2.0


Commits for pg10+ change functional behaviour in a possibly unintended manner. #50

Open bschofield opened 3 years ago

bschofield commented 3 years ago

In this commit, a translation layer is added to enable pg_prefaulter to work with postgres 10+.

However, the old SQL query

SELECT timeline_id, redo_location, pg_last_xlog_replay_location() FROM pg_control_checkpoint()

is translated as

SELECT timeline_id, redo_lsn, pg_last_wal_receive_lsn() FROM pg_control_checkpoint()

This seems to have the effect of causing the code to attempt to prefault files just ahead of the most-recently-received WAL. The old behaviour appears to have been to prefault just ahead of the most-recently-replayed WAL. I am unsure whether this change in functionality was intended, but the old behaviour does seem more logical to me.

To revert to the old behaviour, change pg_last_wal_receive_lsn() to pg_last_wal_replay_lsn().
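As a rough illustration of the intended translation, the sketch below picks the query by server version and uses the replay function in both branches. The helper name `checkpoint_query` and the use of `server_version_num` as the switch are assumptions made for this example; this is not the actual pg_prefaulter code, which is written in Go.

```python
def checkpoint_query(server_version_num: int) -> str:
    """Return a checkpoint query that reports the last *replayed* WAL position.

    Hypothetical helper for illustration only. server_version_num follows
    PostgreSQL's numeric form, e.g. 90600 for 9.6 and 100000 for 10.0.
    """
    if server_version_num >= 100000:
        # PostgreSQL 10+ renamed "xlog" to "wal" and "location" to "lsn".
        # Note the replay function here, not pg_last_wal_receive_lsn().
        return (
            "SELECT timeline_id, redo_lsn, pg_last_wal_replay_lsn() "
            "FROM pg_control_checkpoint()"
        )
    # Pre-10 naming, as in the old query quoted above.
    return (
        "SELECT timeline_id, redo_location, pg_last_xlog_replay_location() "
        "FROM pg_control_checkpoint()"
    )


if __name__ == "__main__":
    print(checkpoint_query(90600))
    print(checkpoint_query(130000))
```

With this shape, both branches ask the same question of the server (where replay currently is), and only the identifiers differ between versions.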

bschofield commented 3 years ago

I have created a (somewhat butchered) fork with this change at https://github.com/bschofield/pg_prefaulter/.

bahamat commented 3 years ago

@chudley What's your thought on this?

chudley commented 3 years ago

This all sounds reasonable to me. Looking at 9.6 docs and 10.0 docs, @bschofield is right. Likely a typo in my work, so thanks for catching!

The testing I did for this is outlined in MANTA-4020. I don't know where we stand with making internal tickets public these days, but @bahamat feel free to mark as public if that'll help here. I found the prefaulter hard to test/verify overall. My tests show that we're prefaulting something, though given this ticket it's likely the wrong thing.

I'm going to struggle prioritising this work at the moment, but I'm happy to review a change and possibly walk through setting up a test environment. Likely the fix is a s/receive/replay as @bschofield said, with testing setup etc. taking most of the time.

bschofield commented 3 years ago

Thanks for taking a look, @chudley.

I think it's worth mentioning that in the steady state, there won't actually be much difference between prefaulting just-received WAL, and prefaulting about-to-be-replayed WAL. When the replica is caught up with the primary, those two categories should be pretty much identical. So, if your main use-case for this is to accelerate performance of in-sync replicas, then you may not actually be seeing any issues arising from the typo.

The difference does matter when catching up a replica which is well behind the primary. In that situation, the current code prefaults WAL files which are well ahead of the actual replay point, so its effectiveness is limited (and it may actually be a net negative). With the fix in place, I found that pg_prefaulter was very effective in speeding up WAL replay on postgres 13.
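To make the size of that gap concrete, here is a minimal sketch that converts LSNs in PostgreSQL's textual `high/low` hexadecimal form into byte offsets and subtracts them. The function names and the LSN values are hypothetical, chosen only for illustration; on a real replica the two inputs would come from `pg_last_wal_receive_lsn()` and `pg_last_wal_replay_lsn()`.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN such as '1D/B000000' to an absolute byte offset."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after the lower 32.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_gap_bytes(ahead: str, behind: str) -> int:
    """Return how many bytes of WAL separate two LSNs."""
    return lsn_to_bytes(ahead) - lsn_to_bytes(behind)


if __name__ == "__main__":
    # Hypothetical values for a replica that is far behind.
    received = "9A/0"
    replayed = "1D/0"
    gap = lsn_gap_bytes(received, replayed)
    print(f"receive is {gap / 2**30:.0f} GiB ahead of replay")

    # Hypothetical values for a caught-up replica: the gap collapses to zero.
    print(lsn_gap_bytes("9A/0", "9A/0"))
```

When the gap is near zero, prefaulting from either position touches roughly the same pages; when it is hundreds of gigabytes, prefaulting from the receive position warms pages that replay will not reach for a long time.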

By the way, thank you very much for making this utility public. With this bit of minor tweaking, I successfully used it to catch up a replica which had gotten 500GB behind the primary. That saved my weekend!