cybertec-postgresql / pg_squeeze

A PostgreSQL extension for automatic bloat cleanup

Segmentation fault #70

Closed ramkly closed 1 month ago

ramkly commented 7 months ago

Hi, I'm using pg_squeeze REL1_6 and have an issue on some of my servers. All servers run PostgreSQL v14.11 with the same pg_squeeze version, and the pg_squeeze configuration in postgresql.conf is identical on all of them. The same table is added to squeeze.tables on every server so that it is squeezed automatically. I have two issues now:

  1. On some servers, squeeze never starts automatically (I checked pg_stat_statement and the squeeze worker does exist, see screenshot).
  2. If I run squeeze.squeeze_table manually to squeeze the table, it causes a segmentation fault (before running squeeze.squeeze_table, I stop the worker with squeeze.stop_worker()). It's weird because I get the segmentation fault only for this table; squeeze.squeeze_table works as expected for other tables. I tried dropping and recreating the table, but the issue persists. This table is used frequently in the database.

I attach the PostgreSQL logs for the segmentation fault.

"logical decoding found initial starting point at 4FCC/E827FD48","Waiting for transactions (approximately 10) older than 725779050 to end.",,,,,,,,"","squeeze worker"
"logical decoding found initial consistent point at 4FCD/16D3C3E0","Waiting for transactions (approximately 6) older than 725781876 to end.",,,,,,,,"","squeeze worker"
"logical decoding found consistent point at 4FCD/244960A0","There are no old transactions anymore.",,,,,,,,"","squeeze worker"
"starting logical decoding for slot ""pg_squeeze_slot_16401_71778""","Streaming transactions committing after 4FCD/244960E0, reading WAL from 4FCC/E8175020.",,,,,,,,"","squeeze worker"
"invalid memory alloc request size 14888425372",,,,,,,,,"","squeeze worker"
"background worker ""squeeze worker"" (PID 71778) was terminated by signal 11: Segmentation fault","Failed process was running: INSERT INTO squeeze.errors(tabschema, tabname, sql_state, err_msg, err_detail) VALUES ('fleet', 'terminal_status', 'XX000', 'invalid memory alloc request size 14888425372', '')"

kovmir commented 7 months ago

Could you please try to come up with a way to reproduce this bug starting with a clean cluster? If you cannot find the reason, then please attach a stack trace.

ahouska commented 7 months ago

Right, the stack trace would be useful.

What I find weird is that in PG 14, the message "starting logical decoding for slot "pg_squeeze_slot_16401_71778"" is printed out by CreateDecodingContext(), but pg_squeeze 1.6 does not call this function. Are you sure you are using pg_squeeze REL1_6?

ramkly commented 7 months ago

Hi. It happened again, and I attempted to collect GDB logs. Please find the attached file. Regarding the pg_squeeze version, I can confirm that I installed REL1_6.

debuglog1.txt

ahouska commented 7 months ago

Unfortunately, the version number has not been updated in the master branch, so the pg_extension catalog shows version 1.6 even for the master branch. Please check which branch you have checked out from the repository. (I think it's master.)

Regarding the log, it does not mention the "segmentation fault" (SIGSEGV) error.

Do you happen to find the core file (e.g. postgres.core) in your data directory? If not, please tell me which operating system you're using. If you do see it, please try to get the stack trace from the core file using gdb according to https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Debugging_the_core_dump_-_example
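
For reference, the core-file step described on that wiki page can be sketched as a small shell helper. The function name `core_backtrace` and both paths in the example line are hypothetical, and the `postgres` executable passed in must be the one that produced the core file.

```shell
# Print a full backtrace from a core file without an interactive gdb session.
# core_backtrace is a hypothetical helper name used only for this sketch.
#   $1: path to the postgres executable that crashed
#   $2: path to the core file
core_backtrace() {
    gdb --batch -ex "bt full" "$1" "$2"
}

# Example (both paths are assumptions, adjust them to your installation):
#   core_backtrace /usr/pgsql-14/bin/postgres /var/lib/pgsql/14/data/core > backtrace.txt
```

Note that `bt full` also prints local variables, so the output is worth reviewing before posting; a plain `bt` gives a shorter trace.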

ramkly commented 7 months ago

Unfortunately, core dumps were not enabled on the server. Regarding the version, I selected REL1_6 and then downloaded it. I'll try to enable core dumps, and if it happens again I will share the core dump file.
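
One possible way to enable core files, as a sketch only: `pg_ctl` accepts a `-c` (`--core-files`) option that attempts to lift the soft resource limit on core files for the server it starts. The helper name `restart_with_core_files` and the data directory in the example are hypothetical. A server started by a service manager is subject to that manager's own core limit setting instead, and a restart interrupts existing connections.

```shell
# Restart the server with core files allowed, where the platform permits it.
# restart_with_core_files is a hypothetical helper name used only for this sketch.
#   $1: path to the PostgreSQL data directory
restart_with_core_files() {
    pg_ctl restart -c -D "$1"
}

# Example (the data directory is an assumption):
#   restart_with_core_files /var/lib/pgsql/14/data
```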

ahouska commented 7 months ago

Please do not share the core dump - it's huge and might contain some data of your database (possibly confidential). I'm only interested in the backtrace. I can assist in getting it from the dump, if needed.

ahouska commented 6 months ago

I'm still looking at debuglog1.txt that you provided earlier. Some backtraces in there look quite weird.

Have you built the binary from source? And if so, did you always run make clean before building a different branch? I wonder if object files of different branches got mixed up somehow ...
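
A minimal sketch of switching branches without leaving stale object files behind. It assumes the usual `make` / `make install` build with the intended `pg_config` found first in `PATH`; the helper name `rebuild_branch` is hypothetical.

```shell
# Check out a branch and rebuild the extension from a clean tree.
# rebuild_branch is a hypothetical helper name used only for this sketch.
#   $1: branch to build, e.g. REL1_6
rebuild_branch() {
    git checkout "$1" || return 1
    # Print the branch that is actually checked out (prints HEAD if detached).
    git rev-parse --abbrev-ref HEAD
    # Remove object files left over from the previously built branch.
    make clean || return 1
    make || return 1
    make install
}

# Example:
#   rebuild_branch REL1_6
```

The server has to be restarted afterwards so that the newly installed library is the one that gets loaded.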

ramkly commented 6 months ago

Yes, I built it from source, but I didn't run make clean before building a different branch. I just deleted pg_squeeze.so from the PostgreSQL lib directory and then built the other branch. I first installed the "master" branch and got a segmentation fault; then I removed "squeeze.so", compiled "REL1_6", and faced a segmentation fault again. After I downgraded to REL1_5, it started working. With "master" and "REL1_6", the segmentation fault is not the only issue (it's the biggest one, as it sends the database into recovery mode). Sometimes I also get the following errors:

  1. "initial slot snapshot too large" (I received this error on almost all my servers)
  2. "invalid memory alloc request size xxxxxxxx" (for example "invalid memory alloc request size 17209330808", although the bloated table is much smaller than this number; I don't know why squeeze needs this amount of memory to squeeze a tiny table)
  3. "Unexpected number of TOAST indexes"
  4. "all replication slots are in use" (sometimes squeeze doesn't delete the replication slot it created)
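
On the second error in the list above: the number in "invalid memory alloc request size" is the size of one rejected allocation request, not the size of the table. PostgreSQL's regular palloc() refuses any single request larger than MaxAllocSize (0x3fffffff, that is 1 GB minus 1 byte), and both sizes quoted in this thread are far above that limit. A quick arithmetic check, with `exceeds_max_alloc` as a hypothetical helper name:

```shell
# MaxAllocSize as defined in PostgreSQL's memutils.h: 0x3fffffff bytes.
MAX_ALLOC_SIZE=1073741823

# Print "yes" if a request of $1 bytes is above MaxAllocSize, otherwise "no".
# exceeds_max_alloc is a hypothetical helper name used only for this sketch.
exceeds_max_alloc() {
    if [ "$1" -gt "$MAX_ALLOC_SIZE" ]; then
        echo yes
    else
        echo no
    fi
}

exceeds_max_alloc 14888425372   # size from the log in the first comment
exceeds_max_alloc 17209330808   # size quoted in the comment above
```

Both calls print yes. A request of roughly 14 to 16 GB for a small table may point to a bogus length value rather than a real memory need, which would be consistent with mixed-up object files.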

ahouska commented 6 months ago

These look like too many problems unrelated to one another. I still suspect that the binary (pg_squeeze.so) is broken. To rule this out, can you please try to install REL1_6 from the community repository (https://www.postgresql.org/download/)?

Also, if you still have the library that you built from source, I'd be interested in the output of nm pg_squeeze.so

Thanks

ramkly commented 6 months ago

nm.log

Please find attached the log file with the output of the nm command.

ahouska commented 6 months ago

Thanks. I'm not seeing an obvious problem there. No idea what else I can do without the core dump.

kovmir commented 2 months ago

https://github.com/cybertec-postgresql/pg_squeeze/issues/71#issuecomment-2331352960

kovmir commented 1 month ago

@ramkly, re-open if you are still interested.