citusdata / pg_cron

Run periodic jobs in PostgreSQL
PostgreSQL License
2.8k stars 191 forks source link

Segmentation fault error in version 1.6 #340

Open slothyrulez opened 1 month ago

slothyrulez commented 1 month ago

Hi!

We are running postgresql 14.10 on AWS RDS, we have installed recently the pg_cron 1.6 extension, following this guide: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL_pg_cron.html

We only have one job configured, to launch every day out of peak hours, a VACUUM VERBOSE ANALYZE on the whole database.

I have found we have 2 crashes related to pg_cron:


    2024-07-24T07:18:48.000Z    2024-07-24 07:18:48 UTC::@:[703]:LOG: background worker "pg_cron" (PID 24839) was terminated by signal 11: Segmentation fault
    2024-07-24T07:18:48.000Z    2024-07-24 07:18:48 UTC::@:[703]:DETAIL: Failed process was running: VACUUM VERBOSE ANALYZE
    2024-07-24T07:18:48.000Z    2024-07-24 07:18:48 UTC::@:[703]:LOG: terminating any other active server processes /etc/rds/dbbin/pgscripts/rds_wal_archive: line 29: 29367 Quit (core dumped) usleep 50000
    2024-07-24T07:18:48.000Z    2024-07-24 07:18:48 UTC::@:[2788]:FATAL: archive command failed with exit code 131
    2024-07-24T07:18:48.000Z    2024-07-24 07:18:48 UTC::@:[2788]:DETAIL: The failed archive command was: /etc/rds/dbbin/pgscripts/rds_wal_archive pg_wal/0000000100004C6600000027 skipTimeCheckForWalSegmentAndReadyFile
    2024-07-24T07:18:48.000Z    2024-07-24 07:18:48 UTC::@:[703]:LOG: all server processes terminated; reinitializing
    2024-07-24T07:18:48.000Z    2024-07-24 07:18:48 UTC::@:[29368]:LOG: database system was interrupted; last known up at 2024-07-24 07:16:46 UTC
    2024-07-24T07:18:48.000Z    2024-07-24 07:18:48 UTC:10.2.0.249(58010):rdsrepladmin@[unknown]:[29369]:FATAL: the database system is in recovery mode

and

2024-07-21 07:18:59 UTC::@:[723]:LOG: cron job 1: INFO: vacuuming "pg_toast.pg_toast_5855542"
    2024-07-21T07:18:59.000Z    2024-07-21 07:18:59 UTC::@:[703]:LOG: background worker "pg_cron" (PID 30526) was terminated by signal 11: Segmentation fault
    2024-07-21T07:18:59.000Z    2024-07-21 07:18:59 UTC::@:[703]:DETAIL: Failed process was running: VACUUM VERBOSE ANALYZE
    2024-07-21T07:18:59.000Z    2024-07-21 07:18:59 UTC::@:[703]:LOG: terminating any other active server processes
    2024-07-21T07:18:59.000Z    2024-07-21 07:18:59 UTC::@:[703]:LOG: all server processes terminated; reinitializing
    2024-07-21T07:18:59.000Z    2024-07-21 07:18:59 UTC::@:[2582]:LOG: database system was interrupted; last known up at 2024-07-21 07:18:13 UTC
    2024-07-21T07:18:59.000Z    2024-07-21 07:18:59 UTC:10.2.0.249(39436):rdsrepladmin@[unknown]:[2583]:FATAL: the database system is in recovery mode
    2024-07-21T07:18:59.000Z    2024-07-21 07:18:59 UTC::@:[2582]:LOG: database system was not properly shut down; automatic recovery in progress
    2024-07-21T07:18:59.000Z    2024-07-21 07:18:59 UTC::@:[2582]:LOG: redo starts at 4C55/52048718
    2024-07-21T07:18:59.000Z    2024-07-21 07:18:59 UTC::@:[2582]:LOG: invalid record length at 4C55/5956D5B0: wanted 24, got 0
    2024-07-21T07:18:59.000Z    2024-07-21 07:18:59 UTC::@:[2582]:LOG: redo done at 4C55/5956D588 system usage: CPU: user: 0.19 s, system: 0.06 s, elapsed: 0.27 s
    2024-07-21T07:18:59.000Z    2024-07-21 07:18:59 UTC::@:[2582]:LOG: checkpoint starting: end-of-recovery immediate
    2024-07-21T07:18:59.000Z    2024-07-21 07:18:59 UTC:172.31.21.40(48308)::[2584]:FATAL: the database system is in recovery mode

Do you have any hints ?

marcoslot commented 1 month ago

That's an interesting one. It's possible that the VACUUM VERBOSE ANALYZE crashes itself (e.g. due to another extension), though I wonder if the amount of notices produced by VERBOSE is causing an issue with the background worker mode used by RDS. Can you try to run without VERBOSE to see whether the issue goes away? (which is a strong indication that pg_cron is indeed the problem)

slothyrulez commented 1 month ago

Ey!, thanks for your time.

Yes, I think is possible that the cause is not pg_cron itself, but VACUUM. Ok, I'll reconfigure the job to run without the VERBOSE, and let you know.

Thanks again.

slothyrulez commented 1 month ago

@marcoslot update here: we have been running VACUUM ANALYZE 5 days without any issue for the moment. imagen

slothyrulez commented 1 month ago

Checked again and running as expected, I don't know if i can help you somehow.

rinshadka commented 4 weeks ago

Hi, Me too getting this issue in aurora postgreSQL RDS database. while executing a materialized view via pg_cron.

2024-08-21 06:47:00 UTC::@:[554]:LOG: background worker "pg_cron" (PID 22225) was terminated by signal 11: Segmentation fault
2024-08-21 06:47:00 UTC::@:[554]:DETAIL: Failed process was running: SELECT ingress.refresh_interactions_materialized_view();
2024-08-21 06:47:00 UTC::@:[554]:LOG: terminating any other active server processes
2024-08-21 06:47:00 UTC::@:[554]:FATAL: Aurora Runtime process unexpectedly exited
2024-08-21 06:47:00 UTC::@:[554]:LOG: database system is shut down

When I manually executes this, it will take around 4 minutes to finish.

Thanks.

marcoslot commented 4 weeks ago

@bdrouvotAWS do you happen to have any way of getting a stack trace?

bdrouvotAWS commented 3 weeks ago

@rinshadka @slothyrulez could you please ping me at bdrouvot@amazon.com?