IntersectMBO / cardano-db-sync

A component that follows the Cardano chain and stores blocks and transactions in PostgreSQL
Apache License 2.0
292 stars 160 forks source link

Out of memory: Killed process 256968 (postgres) #1408

Open conraddit opened 1 year ago

conraddit commented 1 year ago

OS Your OS: Ubuntu 22

Versions The db-sync version (eg cardano-db-sync --version): PostgreSQL version: 15

Build/Install Method The method you use to build or install cardano-db-sync: build

Run method The method you used to run cardano-db-sync (eg Nix/Docker/systemd/none): systemd

Additional context We're running some queries which result in Postgres crashing. The attached file shows the entire crash log.

The server has over 40% free RAM. It's a 64GB server with 32GB allocated for Postgres via PGtune. We're running the latest version of DB Sync (13.1.0.2) and Cardano Node 1.35.5. This is a 16-core server. It has 1TB of NVMe storage.

Problem Report May 19 15:28:03 ubuntu systemd[1]: Finished PostgreSQL RDBMS. May 19 15:35:01 ubuntu CRON[256627]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) May 19 15:45:01 ubuntu CRON[256979]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) May 19 15:46:55 ubuntu kernel: [145715.504439] cardano-db-sy:w invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 May 19 15:46:55 ubuntu kernel: [145715.505078] CPU: 11 PID: 251385 Comm: cardano-db-sy:w Not tainted 5.15.0-72-generic #79-Ubuntu May 19 15:46:55 ubuntu kernel: [145715.505082] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 May 19 15:46:55 ubuntu kernel: [145715.505092] Call Trace: May 19 15:46:55 ubuntu kernel: [145715.505095] May 19 15:46:55 ubuntu kernel: [145715.505108] show_stack+0x52/0x5c May 19 15:46:55 ubuntu kernel: [145715.505115] dump_stack_lvl+0x4a/0x63 May 19 15:46:55 ubuntu kernel: [145715.505119] dump_stack+0x10/0x16 May 19 15:46:55 ubuntu kernel: [145715.505121] dump_header+0x53/0x228 May 19 15:46:55 ubuntu kernel: [145715.505126] oom_kill_process.cold+0xb/0x10 May 19 15:46:55 ubuntu kernel: [145715.505129] out_of_memory+0x106/0x2e0 May 19 15:46:55 ubuntu kernel: [145715.505135] alloc_pages_slowpath.constprop.0+0x9b7/0xa80 May 19 15:46:55 ubuntu kernel: [145715.505139] alloc_pages+0x311/0x330 May 19 15:46:55 ubuntu kernel: [145715.505142] alloc_pages+0x9e/0x1e0 May 19 15:46:55 ubuntu kernel: [145715.505144] page_cache_alloc+0x7e/0x90 May 19 15:46:55 ubuntu kernel: [145715.505146] pagecache_get_page+0x152/0x590 May 19 15:46:55 ubuntu kernel: [145715.505149] ? page_cache_ra_unbounded+0x166/0x210 May 19 15:46:55 ubuntu kernel: [145715.505151] filemap_fault+0x488/0xab0 May 19 15:46:55 ubuntu kernel: [145715.505154] ? filemap_map_pages+0x309/0x400 May 19 15:46:55 ubuntu kernel: [145715.505156] do_fault+0x39/0x120 May 19 15:46:55 ubuntu kernel: [145715.505160] do_read_fault+0xeb/0x160 May 19 15:46:55 ubuntu kernel: [145715.505162] do_fault+0xa0/0x2e0 May 19 15:46:55 ubuntu kernel: [145715.505164] handle_pte_fault+0x1cd/0x240 May 19 15:46:55 ubuntu kernel: [145715.505167] __handle_mm_fault+0x405/0x6f0 May 19 15:46:55 ubuntu kernel: [145715.505170] handle_mm_fault+0xd8/0x2c0 May 19 15:46:55 ubuntu kernel: [145715.505172] do_user_addr_fault+0x1c9/0x670 May 19 15:46:55 ubuntu kernel: [145715.505177] exc_page_fault+0x77/0x170 May 19 15:46:55 ubuntu kernel: [145715.505180] asm_exc_page_fault+0x27/0x30 May 19 15:46:55 ubuntu kernel: [145715.505183] RIP: 0033:0x419f56d May 19 15:46:55 ubuntu kernel: [145715.505206] Code: Unable to access opcode bytes at RIP 0x419f543. May 19 15:46:55 ubuntu kernel: [145715.505207] RSP: 002b:00007ff6ca055858 EFLAGS: 00010217 May 19 15:46:55 ubuntu kernel: [145715.505209] RAX: 0000000000000003 RBX: 00000042f4b7ad9f RCX: 0000000000000980 May 19 15:46:55 ubuntu kernel: [145715.505210] RDX: 00000043016fb000 RSI: 00000043016fbfff RDI: 0000000005a24fe0 May 19 15:46:55 ubuntu kernel: [145715.505212] RBP: 000000438e37fcf0 R08: 0000000000000001 R09: 0000000000000000 May 19 15:46:55 ubuntu kernel: [145715.505213] R10: 0000000000000001 R11: 0000000000000001 R12: 00000043016fb978 May 19 15:46:55 ubuntu kernel: [145715.505215] R13: 00000000080ed618 R14: 0000000000000001 R15: 000000438e3780c0 May 19 15:46:55 ubuntu kernel: [145715.505217] May 19 15:46:55 ubuntu kernel: [145715.505220] Mem-Info: May 19 15:46:55 ubuntu kernel: [145715.505227] active_anon:1230896 inactive_anon:14975339 isolated_anon:0 May 19 15:46:55 ubuntu kernel: [145715.505227] active_file:245 inactive_file:119 isolated_file:0 May 19 15:46:55 ubuntu kernel: [145715.505227] unevictable:6910 dirty:0 writeback:0 May 19 15:46:55 ubuntu kernel: [145715.505227] slab_reclaimable:64584 slab_unreclaimable:27191 May 19 15:46:55 ubuntu kernel: [145715.505227] mapped:1256574 shmem:2144859 pagetables:34731 bounce:0 May 19 15:46:55 ubuntu kernel: [145715.505227] kernel_misc_reclaimable:0 May 19 15:46:55 ubuntu kernel: [145715.505227] free:81966 free_pcp:754 free_cma:0 May 19 15:46:55 ubuntu kernel: [145715.505234] Node 0 active_anon:4923584kB inactive_anon:59901356kB active_file:980kB inactive_file:476kB unevictable:27640kB isolated(anon):0kB isolated(file):0kB mapped:5026296kB dirty:0kB writeback:0kB shmem:8579436kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:6992kB pagetables:138924kB all_unreclaimable? yes May 19 15:46:55 ubuntu kernel: [145715.505238] Node 0 DMA free:11264kB min:12kB low:24kB high:36kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB May 19 15:46:55 ubuntu kernel: [145715.505250] lowmem_reserve[]: 0 2960 64225 64225 64225 May 19 15:46:55 ubuntu kernel: [145715.505253] Node 0 DMA32 free:248088kB min:3112kB low:6140kB high:9168kB reserved_highatomic:0KB active_anon:67656kB inactive_anon:2723548kB active_file:312kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129196kB managed:3063660kB mlocked:0kB bounce:0kB free_pcp:1168kB local_pcp:48kB free_cma:0kB May 19 15:46:55 ubuntu kernel: [145715.505259] lowmem_reserve[]: 0 0 61264 61264 61264 May 19 15:46:55 ubuntu kernel: [145715.505262] Node 0 Normal free:68512kB min:64452kB low:127184kB high:189916kB reserved_highatomic:4096KB active_anon:4855928kB inactive_anon:57177808kB active_file:204kB inactive_file:60kB unevictable:27640kB writepending:0kB present:63963136kB managed:62743536kB mlocked:27640kB bounce:0kB free_pcp:1848kB local_pcp:212kB free_cma:0kB May 19 15:46:55 ubuntu kernel: [145715.505267] lowmem_reserve[]: 0 0 0 0 0 May 19 15:46:55 ubuntu kernel: [145715.505269] Node 0 DMA: 04kB 08kB 016kB 032kB 064kB 0128kB 0256kB 0512kB 11024kB (U) 12048kB (M) 24096kB (M) = 11264kB May 19 15:46:55 ubuntu kernel: [145715.505278] Node 0 DMA32: 424kB (UME) 228kB (UE) 19616kB (UE) 16432kB (UE) 17564kB (UME) 169128kB (UE) 98256kB (UME) 37512kB (UME) 91024kB (UE) 32048kB (UE) 364096kB (UM) = 248408kB May 19 15:46:55 ubuntu kernel: [145715.505290] Node 0 Normal: 14174kB (UME) 7568kB (UME) 80616kB (UME) 83732kB (UME) 23164kB (UME) 24128kB (UME) 0256kB 0512kB 01024kB 02048kB 0*4096kB = 69252kB May 19 15:46:55 ubuntu kernel: [145715.505301] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB May 19 15:46:55 ubuntu kernel: [145715.505303] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB May 19 15:46:55 ubuntu kernel: [145715.505304] 2147680 total pagecache pages May 19 15:46:55 ubuntu kernel: [145715.505307] 0 pages in swap cache May 19 15:46:55 ubuntu kernel: [145715.505309] Swap cache stats: add 0, delete 0, find 0/0 May 19 15:46:55 ubuntu kernel: [145715.505310] Free swap = 0kB May 19 15:46:55 ubuntu kernel: [145715.505311] Total swap = 0kB May 19 15:46:55 ubuntu kernel: [145715.505311] 16777081 pages RAM May 19 15:46:55 ubuntu kernel: [145715.505312] 0 pages HighMem/MovableOnly May 19 15:46:55 ubuntu kernel: [145715.505313] 321442 pages reserved May 19 15:46:55 ubuntu kernel: [145715.505313] 0 pages hwpoisoned May 19 15:46:55 ubuntu kernel: [145715.505314] Tasks state (memory values in pages): May 19 15:46:55 ubuntu kernel: [145715.505315] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name May 19 15:46:55 ubuntu kernel: [145715.505330] [ 496] 0 496 18115 939 139264 0 -250 systemd-journal May 19 15:46:55 ubuntu kernel: [145715.505335] [ 534] 0 534 72327 6774 110592 0 -1000 multipathd May 19 15:46:55 ubuntu kernel: [145715.505337] [ 538] 0 538 5823 1013 65536 0 -1000 systemd-udevd May 19 15:46:55 ubuntu kernel: [145715.505340] [ 669] 103 669 22339 752 77824 0 0 systemd-timesyn May 19 15:46:55 ubuntu kernel: [145715.505342] [ 730] 100 730 4030 737 69632 0 0 systemd-network May 19 15:46:55 ubuntu kernel: [145715.505344] [ 732] 101 732 6381 1571 86016 0 0 systemd-resolve May 19 15:46:55 ubuntu kernel: [145715.505347] [ 743] 0 743 1821 607 57344 0 0 cron May 19 15:46:55 ubuntu kernel: [145715.505350] [ 745] 102 745 2213 749 53248 0 -900 dbus-daemon May 19 15:46:55 ubuntu kernel: [145715.505353] [ 751] 0 751 20696 425 61440 0 0 irqbalance May 19 15:46:55 ubuntu kernel: [145715.505356] [ 752] 0 752 8249 3031 98304 0 0 networkd-dispat May 19 15:46:55 ubuntu kernel: [145715.505358] [ 753] 0 753 58622 961 81920 0 0 polkitd May 19 15:46:55 ubuntu kernel: [145715.505360] [ 754] 104 754 55600 978 73728 0 0 rsyslogd May 19 15:46:55 ubuntu kernel: [145715.505363] [ 758] 0 758 570112 2736 368640 0 -900 snapd May 19 15:46:55 ubuntu kernel: [145715.505366] [ 763] 0 763 4008 980 65536 0 0 systemd-logind May 19 15:46:55 ubuntu kernel: [145715.505368] [ 765] 0 765 98112 656 126976 0 0 udisksd May 19 15:46:55 ubuntu kernel: [145715.505369] [ 790] 0 790 3856 982 69632 0 -1000 sshd May 19 15:46:55 ubuntu kernel: [145715.505372] [ 794] 0 794 79488 491 118784 0 0 ModemManager May 19 15:46:55 ubuntu kernel: [145715.505374] [ 799] 0 799 1554 192 53248 0 0 agetty May 19 15:46:55 ubuntu kernel: [145715.505376] [ 806] 0 806 1543 211 45056 0 0 agetty May 19 15:46:55 ubuntu kernel: [145715.505378] [ 811] 0 811 27518 2854 114688 0 0 unattended-upgr May 19 15:46:55 ubuntu kernel: [145715.505380] [ 958] 0 958 4269 757 73728 0 0 systemd May 19 15:46:55 ubuntu kernel: [145715.505382] [ 959] 0 959 42318 947 90112 0 0 (sd-pam) May 19 15:46:55 ubuntu kernel: [145715.505385] [ 1061] 0 1061 3082 1595 57344 0 0 tmux: server May 19 15:46:55 ubuntu kernel: [145715.505387] [ 1314] 0 1314 74011 792 159744 0 0 packagekitd May 19 15:46:55 ubuntu kernel: [145715.505390] [ 170183] 0 170183 2301 898 61440 0 0 bash May 19 15:46:55 ubuntu kernel: [145715.505392] [ 179975] 0 179975 2306 982 53248 0 0 bash May 19 15:46:55 ubuntu kernel: [145715.505397] [ 242484] 0 242484 4297 1319 73728 0 0 sshd May 19 15:46:55 ubuntu kernel: [145715.505400] [ 242565] 0 242565 2375 1062 57344 0 0 bash May 19 15:46:55 ubuntu kernel: [145715.505402] [ 242701] 0 242701 2301 682 53248 0 0 bash May 19 15:46:55 ubuntu kernel: [145715.505405] [ 242702] 0 242702 268813908 3150997 25657344 0 0 cardano-node May 19 15:46:55 ubuntu kernel: [145715.505407] [ 251379] 0 251379 268745974 4191978 33992704 0 0 cardano-db-sync May 19 15:46:55 ubuntu kernel: [145715.505411] [ 255959] 0 255959 4296 1334 73728 0 0 sshd May 19 15:46:55 ubuntu kernel: [145715.505414] [ 256039] 0 256039 2274 962 57344 0 0 bash May 19 15:46:55 ubuntu kernel: [145715.505416] [ 256130] 0 256130 1555 169 45056 0 0 tail May 19 15:46:55 ubuntu kernel: [145715.505418] [ 256361] 113 256361 2169806 48203 528384 0 -900 postgres May 19 15:46:55 ubuntu kernel: [145715.505421] [ 256362] 113 256362 2169895 115453 2338816 0 0 postgres May 19 15:46:55 ubuntu kernel: [145715.505423] [ 256363] 113 256363 2169848 39956 1236992 0 0 postgres May 19 15:46:55 ubuntu kernel: [145715.505425] [ 256365] 113 256365 2169806 5060 167936 0 0 postgres May 19 15:46:55 ubuntu kernel: [145715.505428] [ 256366] 113 256366 2170232 1046 180224 0 0 postgres May 19 15:46:55 ubuntu kernel: [145715.505430] [ 256367] 113 256367 2170202 1255 151552 0 0 postgres May 19 15:46:55 ubuntu kernel: [145715.505432] [ 256380] 0 256380 2446 1051 61440 0 0 htop May 19 15:46:55 ubuntu kernel: [145715.505435] [ 256386] 113 256386 2199336 1032752 13484032 0 0 postgres May 19 15:46:55 ubuntu kernel: [145715.505437] [ 256390] 113 256390 2501564 941034 10391552 0 0 postgres May 19 15:46:55 ubuntu kernel: [145715.505439] [ 256968] 113 256968 8576808 6402717 51597312 0 0 postgres May 19 15:46:55 ubuntu kernel: [145715.505441] [ 257033] 0 257033 3326 69 53248 0 0 sshd May 19 15:46:55 ubuntu kernel: [145715.505443] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/system.slice/system-postgresql.slice/postgresql@15-main.service,task=postgres,pid=256968,uid=113 May 19 15:46:55 ubuntu kernel: [145715.505490] Out of memory: Killed process 256968 (postgres) total-vm:34307232kB, anon-rss:25542468kB, file-rss:1184kB, shmem-rss:67216kB, UID:113 pgtables:50388kB oom_score_adj:0 May 19 15:46:55 ubuntu systemd[1]: postgresql@15-main.service: A process of this unit has been killed by the OOM killer. May 19 15:46:57 ubuntu kernel: [145717.543999] oom_reaper: reaped process 256968 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:67216kB May 19 15:46:58 ubuntu systemd[1]: postgresql@15-main.service: Failed with result 'oom-kill'. May 19 15:46:58 ubuntu systemd[1]: postgresql@15-main.service: Consumed 11min 8.015s CPU time.

Fell-x27 commented 1 year ago

Got the same problem.

kderme commented 1 year ago

We're running some queries which result in Postgres crashing.

I would need more information to help here. What kind of queries and why they consume so much memory.

conraddit commented 1 year ago

We're running some queries which result in Postgres crashing.

I would need more information to help here. What kind of queries and why they consume so much memory.

Thank you @kderme!

We are not actually running out of memory. The server never got close to the total available RAM.

rdlrt commented 1 year ago

Well that's not what the line below suggests (rss + shared w/pgtables), the actual rule for oom to kick in might be based on different reasons (availability/absence of swap to overcommit, sharing of swap to memory, etc - upperbound being 32G)

May 19 15:46:55 ubuntu kernel: [145715.505490] Out of memory: Killed process 256968 (postgres) total-vm:34307232kB, anon-rss:25542468kB, file-rss:1184kB, shmem-rss:67216kB, UID:113 pgtables:50388kB oom_score_adj:0

But more importantly, since you mention you receive OOM (which is actioned by kernel upon reaching usage thresholds) when running some queries - maybe start with the queries that are being run?