ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

has the disk I/O requirement increased in v22.11.1? #5920

Closed x3nb63 closed 1 year ago

x3nb63 commented 1 year ago

Issue and Steps to Reproduce

Since upgrading to v22.11.1 (from 0.11.1) I observe my node getting gradually slower over about three weeks until it is eventually killed (and restarted) after failing its liveness probe on a timeout.

After that it is fast again and the degradation repeats.

The primary bottleneck of my machine is its disks: it has spinning disks set up as a triple mirror using btrfs ("raid1c3"). So they are very low in random I/O performance compared to even cheap SSDs.

However, this was fine until v22.11.1, since the machine has plenty of memory.

With top and iostat/iotop I observe a lot of I/O waiting as these three weeks pass. I can trigger even more I/O wait by using the summary plugin to view my channels or - worse - by logging into the RideTheLightning UI (it appears to query many more things during the login procedure).

The processes stalling on I/O are mainly topology and bitcoind (which is a full node on the same machine). topology is usually invisible in the I/O wait process ranking but becomes visible as soon as I use any of the above. Then bitcoind also becomes more active ...

To me it looks as if topology somehow starts inspecting things on the blockchain in addition to what it does in the lightning database (which is sqlite, btw, about 150 MB in my case). These two together then push my disks over the edge.

Does that ring a bell regarding v22? Do I need SSDs now?

I have not seen this with the 0.11.1 I ran before, on that very machine. (note: I did not try v22.11 without ".1" nor anything else in-between)

grubles commented 1 year ago

In my experience, lightningd.sqlite3 will become very fragmented over time and will need a periodic defrag when using a CoW filesystem like btrfs on spinning rust. With SSDs it's imperceptible, but on HDDs it will really affect performance.
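The SQLite side of this is easy to demonstrate on a throwaway database: deleting rows leaves free pages scattered through the file, and VACUUM rewrites it contiguously. A minimal Python sketch (the demo table and sizes are illustrative, not the real lightningd schema):

```python
import os
import sqlite3
import tempfile

# Throwaway database: bloat it, then delete most rows so the file is
# left full of free pages -- roughly what months of churn leave behind.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite3")
db = sqlite3.connect(path)
db.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, blob BLOB)")
db.executemany("INSERT INTO demo (blob) VALUES (?)",
               [(b"x" * 1024,) for _ in range(5000)])
db.commit()
db.execute("DELETE FROM demo WHERE id % 10 != 0")
db.commit()
before = os.path.getsize(path)

# VACUUM rewrites the whole file contiguously, dropping the free pages.
db.execute("VACUUM")
after = os.path.getsize(path)
db.close()
print(before, after)  # after is substantially smaller than before
```

Note that VACUUM needs roughly the database's size in free disk space while it runs; on btrfs, running `btrfs filesystem defragment` on the file (ideally with the node stopped) additionally tackles the extent fragmentation that the CoW layer adds on top.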

x3nb63 commented 1 year ago

will try vacuuming later tonight, thanks!

Has anyone tried performance tuning lightningd.sqlite3 as described in this blog? I read elsewhere that particularly synchronous=off (no fsync) gives lots of queries/sec.

I don't understand enough to say whether synchronous=off together with WAL mode is "safe enough" on a filesystem that detects corruption via checksums, such as btrfs?
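For reference, these PRAGMAs are set per-connection; a minimal sketch on a scratch database, with the trade-off spelled out in the comments (illustrative only - lightningd manages its own connection):

```python
import os
import sqlite3
import tempfile

# Scratch database to show the PRAGMAs that tuning blog posts suggest.
path = os.path.join(tempfile.mkdtemp(), "tune.sqlite3")
db = sqlite3.connect(path)

# synchronous=OFF skips fsync entirely: fast, but an OS crash or power
# loss can lose or corrupt recent writes. btrfs checksums only *detect*
# damage afterwards; they cannot replay writes sqlite never flushed.
db.execute("PRAGMA synchronous=OFF")

# WAL replaces the rollback journal with a write-ahead log; whether it
# is safe to combine with lightningd is exactly the open question here.
mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]

sync = db.execute("PRAGMA synchronous").fetchone()[0]  # 0 == OFF
print(sync, mode)
db.close()
```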

(I also run lightningd --wallet=${main}:${backup} with $backup on a different btrfs subvolume - on the same disks, unfortunately)

grubles commented 1 year ago

No, do not use WAL. Your node will crash randomly. (I've tried WAL, if you can't tell.)

cdecker commented 1 year ago

topology is a plugin that does not have direct access to lightningd.sqlite3; instead it relies on the gossip_store to read the gossip it uses for listnodes and listchannels. Restarting just that plugin might be an option (lightning-cli plugin stop topology followed by lightning-cli plugin start topology). If that does not help, it is likely that there is something suboptimal in the way we handle the gossip messages (a flamegraph would be useful to see where we're spending time).

Restarts can also "fix" this because they cause gossipd to compact the gossip_store, reducing the number of messages in it and potentially the computation we perform.

endothermicdev commented 1 year ago

If it is gossip related, the gossip_store accepts ~24% more gossip starting in 0.12.0, but compacting the gossip_store (restarting the node) should clear out most of the difference.

x3nb63 commented 1 year ago

vacuum shrank lightningd.sqlite3 from 253 MB to 211 MB, for both the main and backup db files.

the gossip_store was 150 MB before the restart and, about an hour after the restart, is at 59 MB.

More I can't say yet, since it has been fast since restarting. Neither the summary plugin nor RTL logins trigger I/O wait, and topology does not show up in the process list.

I collected a flamegraph an hour after start to compare with later. I hope ~10 min of sampling is sufficient?

x3nb63 commented 1 year ago

after 4 days it's getting noticeably slower; lightning_gossipd appears in the iotop listing every few minutes. As soon as I run lightning-cli summary, the topology process joins the list.

gossip_store file is at 150 MB again.

made two flamegraphs: the first is a sampling over many minutes (I forgot how many); it contains at least one summary plugin run

lightningd after-4d long-sampling

the second one covers just one call to the summary plugin, which took close to 3 min to print its output.

lightningd after-4d show-summary

nothing got killed so far.

looking at top, I/O wait peaks at 37 and drops to 0.x after summary is done

topology then disappears from iotop while lightning_gossipd remains listed with some 20 K/s read and 1.6 K/s write over 10 sec (fluctuating quite a lot)

x3nb63 commented 1 year ago

it crashed again (or actually was OOM-killed this time), and I failed to make another flamegraph before that ...

The last thing I observed was the gossip_store file at 179 MB a day or so before. After multiple restart attempts (because the Tor listening port was not freed) it got cleaned down to 59 MB again, and a gossip_store.corrupt file of that size appeared.

Because of the many restarts I lost the output of the initial instance, but then found in the journal that the oom-killer was involved. I never observed this before, and I have not changed the memory limit for at least a year. Here is the journal:

Feb 04 13:19:54 hostname kernel: lightning_gossi invoked oom-killer: gfp_mask=0x8c40(GFP_NOFS|__GFP_NOFAIL), order=0, oom_score_adj=985
Feb 04 13:19:54 hostname kernel: CPU: 1 PID: 31742 Comm: lightning_gossi Not tainted 6.1.7-arch1-1 #1 a2d6f1dcaa775aaae1f25aaf758ae968e3493665
Feb 04 13:19:54 hostname kernel: Hardware name:  /DH67BL, BIOS BLH6710H.86A.0163.2018.1023.1559 10/23/2018
Feb 04 13:19:54 hostname kernel: Call Trace:
Feb 04 13:19:54 hostname kernel:  <TASK>
Feb 04 13:19:54 hostname kernel:  dump_stack_lvl+0x48/0x60
Feb 04 13:19:54 hostname kernel:  dump_header+0x4a/0x211
Feb 04 13:19:54 hostname kernel:  oom_kill_process.cold+0xb/0x10
Feb 04 13:19:54 hostname kernel:  out_of_memory+0x1f1/0x520
Feb 04 13:19:54 hostname kernel:  mem_cgroup_out_of_memory+0x134/0x150
Feb 04 13:19:54 hostname kernel:  try_charge_memcg+0x693/0x780
Feb 04 13:19:54 hostname kernel:  ? lru_gen_look_around+0x3d7/0x820
Feb 04 13:19:54 hostname kernel:  charge_memcg+0x39/0xf0
Feb 04 13:19:54 hostname kernel:  __mem_cgroup_charge+0x2c/0x80
Feb 04 13:19:54 hostname kernel:  __filemap_add_folio+0x217/0x430
Feb 04 13:19:54 hostname kernel:  ? scan_shadow_nodes+0x30/0x30
Feb 04 13:19:54 hostname kernel:  filemap_add_folio+0x3c/0xa0
Feb 04 13:19:54 hostname kernel:  __filemap_get_folio+0x193/0x380
Feb 04 13:19:54 hostname kernel:  pagecache_get_page+0x15/0x70
Feb 04 13:19:54 hostname kernel:  alloc_extent_buffer+0x25c/0x780 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel:  read_tree_block+0x25/0xa0 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel:  read_block_for_search+0x220/0x330 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel:  btrfs_search_slot+0x32b/0xc90 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel:  btrfs_lookup_csum+0x73/0x160 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel:  btrfs_lookup_bio_sums+0x230/0x560 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel:  btrfs_submit_data_read_bio+0x48/0x80 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel:  submit_one_bio+0x90/0xc0 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel:  submit_extent_page+0x15d/0x540 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel:  ? __set_extent_bit+0x178/0x6b0 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel:  btrfs_do_readpage+0x2d9/0x740 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel:  extent_readahead+0x31e/0x3e0 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel:  ? btrfs_repair_one_sector+0x3a0/0x3a0 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel:  read_pages+0x88/0x260
Feb 04 13:19:54 hostname kernel:  page_cache_ra_unbounded+0x105/0x180
Feb 04 13:19:54 hostname kernel:  filemap_fault+0x552/0x920
Feb 04 13:19:54 hostname kernel:  ? filemap_map_pages+0x154/0x6f0
Feb 04 13:19:54 hostname kernel:  __do_fault+0x33/0x110
Feb 04 13:19:54 hostname kernel:  do_fault+0x1b9/0x410
Feb 04 13:19:54 hostname kernel:  __handle_mm_fault+0x660/0xfa0
Feb 04 13:19:54 hostname kernel:  handle_mm_fault+0xdf/0x2d0
Feb 04 13:19:54 hostname kernel:  do_user_addr_fault+0x1be/0x6a0
Feb 04 13:19:54 hostname kernel:  exc_page_fault+0x74/0x170
Feb 04 13:19:54 hostname kernel:  asm_exc_page_fault+0x26/0x30
Feb 04 13:19:54 hostname kernel: RIP: 0033:0x55835dbbb4d2
Feb 04 13:19:54 hostname kernel: Code: Unable to access opcode bytes at 0x55835dbbb4a8.
Feb 04 13:19:54 hostname kernel: RSP: 002b:00007ffd657c9120 EFLAGS: 00010212
Feb 04 13:19:54 hostname kernel: RAX: 0000ffffffffffff RBX: 003ffffffffffffc RCX: 00047754d739c3e3
Feb 04 13:19:54 hostname kernel: RDX: 00000000000da593 RSI: 000205e6fb443ace RDI: 0001000003d10000
Feb 04 13:19:54 hostname kernel: RBP: 00007ffd657c9a20 R08: 000aa5d081dc209f R09: 0000000000000000
Feb 04 13:19:54 hostname kernel: R10: d1f8c6ee278e24f0 R11: 0000e4c8f019a40b R12: 0003fffffffffffc
Feb 04 13:19:54 hostname kernel: R13: 003ffffbfffff0bc R14: 00007ffd657c9a20 R15: 00007ffd657c9270
Feb 04 13:19:54 hostname kernel:  </TASK>
Feb 04 13:19:54 hostname kernel: memory: usage 512028kB, limit 512000kB, failcnt 123853
Feb 04 13:19:54 hostname kernel: swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Feb 04 13:19:54 hostname kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod743bd9ce_733f_44fe_8984_2f370e141e0a.slice/cri-containerd-6a0845fb6affaf8bd87978cf7674e203be02b74f704ec403da83e0935892d292.scope:
Feb 04 13:19:54 hostname kernel: anon 509526016
                                file 2347008
                                kernel 7282688
                                kernel_stack 524288
                                pagetables 2293760
                                sec_pagetables 0
                                percpu 0
                                sock 5160960
                                vmalloc 0
                                shmem 0
                                zswap 0
                                zswapped 0
                                file_mapped 0
                                file_dirty 8192
                                file_writeback 0
                                swapcached 0
                                anon_thp 190840832
                                file_thp 0
                                shmem_thp 0
                                inactive_anon 317571072
                                active_anon 191950848
                                inactive_file 1544192
                                active_file 794624
                                unevictable 4096
                                slab_reclaimable 2505512
                                slab_unreclaimable 1176648
                                slab 3682160
                                workingset_refault_anon 0
                                workingset_refault_file 87202038
                                workingset_activate_anon 0
                                workingset_activate_file 539718
                                workingset_restore_anon 0
                                workingset_restore_file 10302541
                                workingset_nodereclaim 0
                                pgscan 115540186
                                pgsteal 114113256
                                pgscan_kswapd 0
                                pgscan_direct 115540186
                                pgsteal_kswapd 0
                                pgsteal_direct 114113256
                                pgfault 52093671
                                pgmajfault 44641
                                pgrefill 51731843
                                pgactivate 444399
                                pgdeactivate 0
                                pglazyfree 1
                                pglazyfreed 0
                                zswpin 0
                                zswpout 0
                                thp_fault_alloc 3044
                                thp_collapse_alloc 1591
Feb 04 13:19:54 hostname kernel: Tasks state (memory values in pages):
Feb 04 13:19:54 hostname kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Feb 04 13:19:54 hostname kernel: [  27787] 10001 27787      595       18    40960        0           985 tini
Feb 04 13:19:54 hostname kernel: [  27801] 10001 27801    65760    63545   565248        0           985 lightningd
Feb 04 13:19:54 hostname kernel: [  28071] 10001 28071     1106       33    49152        0           985 autoclean
Feb 04 13:19:54 hostname kernel: [  28072] 10001 28072     1106       21    49152        0           985 chanbackup
Feb 04 13:19:54 hostname kernel: [  28073] 10001 28073     6700     5310    90112        0           985 bcli
Feb 04 13:19:54 hostname kernel: [  28074] 10001 28074     1112       30    49152        0           985 commando
Feb 04 13:19:54 hostname kernel: [  28076] 10001 28076     1111       36    49152        0           985 funder
Feb 04 13:19:54 hostname kernel: [  28077] 10001 28077    65197     1738   118784        0           985 topology
Feb 04 13:19:54 hostname kernel: [  28078] 10001 28078     1450       33    49152        0           985 keysend
Feb 04 13:19:54 hostname kernel: [  28079] 10001 28079     1133       34    45056        0           985 offers
Feb 04 13:19:54 hostname kernel: [  28080] 10001 28080     1463       24    45056        0           985 pay
Feb 04 13:19:54 hostname kernel: [  28081] 10001 28081     1106       22    45056        0           985 txprepare
Feb 04 13:19:54 hostname kernel: [  28082] 10001 28082     1115       22    49152        0           985 spenderp
Feb 04 13:19:54 hostname kernel: [  28083] 10001 28083     1866       54    57344        0           985 bookkeeper
Feb 04 13:19:54 hostname kernel: [  28084] 10001 28084    48531     5589   143360        0           985 python3
Feb 04 13:19:54 hostname kernel: [  28085] 10001 28085     9652     3805   110592        0           985 python3
Feb 04 13:19:54 hostname kernel: [  28086] 10001 28086     9984     4084   106496        0           985 python3
Feb 04 13:19:54 hostname kernel: [  28469] 10001 28469     1169       90    49152        0           985 lightning_hsmd
Feb 04 13:19:54 hostname kernel: [  28551] 10001 28551    27706    26577   266240        0           985 lightning_conne
Feb 04 13:19:54 hostname kernel: [  31742] 10001 31742    13972    12836   155648        0           985 lightning_gossi
Feb 04 13:19:54 hostname kernel: [3115526] 10001 3115526     1138       37    45056        0           985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3115581] 10001 3115581     1138       35    49152        0           985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3123221] 10001 3123221     1138       36    40960        0           985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3124854] 10001 3124854     1138       22    45056        0           985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3127401] 10001 3127401     1138       37    49152        0           985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3129090] 10001 3129090     1138       36    45056        0           985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3130661] 10001 3130661     1138       35    45056        0           985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3132298] 10001 3132298     1138       34    45056        0           985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3133827] 10001 3133827     1138       33    49152        0           985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3135390] 10001 3135390     1138       37    53248        0           985 lightning_chann
Feb 04 13:19:54 hostname kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-6a0845fb6affaf8bd87978cf7674e203be02b74f704ec403da83e0935892d292.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod743bd9ce_733f_44fe_8984_2f370e141e0a.slice/cri-containerd-6a0845fb6affaf8bd87978cf7674e203be02b74f704ec403da83e0935892d292.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod743bd9ce_733f_44fe_8984_2f370e141e0a.slice/cri-containerd-6a0845fb6affaf8bd87978cf7674e203be02b74f704ec403da83e0935892d292.scope,task=lightningd,pid=27801,uid=10001
Feb 04 13:19:54 hostname kernel: Memory cgroup out of memory: Killed process 27801 (lightningd) total-vm:263040kB, anon-rss:254180kB, file-rss:0kB, shmem-rss:0kB, UID:10001 pgtables:552kB oom_score_adj:985

note: it runs on btrfs now, since about two months ago; before that (still on 0.10.x) the node ran on ZFS for over a year - don't know if that matters

x3nb63 commented 1 year ago

gossip_store is at 200 MB now; below is another flamegraph taken while I issued lightning-cli summary. This time I Ctrl-C'ed the summary plugin without seeing any output after 28 minutes.

Looking at top I see I/O wait of 25..50 and a load of 4..5, which isn't going away, and it's not due to CPU.

With iotop -d 10 -P (manually sorted by DISK READ) it looks as if topology never finishes. Also lightning_gossipd now appears almost constantly.

Total DISK READ :       6.82 M/s | Total DISK WRITE :     562.44 K/s
Actual DISK READ:       6.84 M/s | Actual DISK WRITE:    1964.59 K/s
    PID  PRIO  USER    DISK READ>  DISK WRITE  SWAPIN      IO    COMMAND
3152394 be/4 10001       3.35 M/s    0.00 B/s  ?unavailable?  topology
   6986 be/4 arch     1930.53 K/s    0.00 B/s  ?unavailable?  bitcoind -conf=/~coin/bitcoin.conf
3152637 be/4 10001    1375.61 K/s    0.00 B/s  ?unavailable?  lightning_connectd
3154078 be/4 10001     251.91 K/s    0.00 B/s  ?unavailable?  lightning_gossipd
3152387 be/4 10001       4.75 K/s    0.00 B/s  ?unavailable?  lightningd --net~n/lightningd.conf
      1 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  systemd --system --deserialize 99
      2 be/4 root        0.00 B/s    0.00 B/s  ?unavailable?  [kthreadd]
...

Flamegraph: lightningd gossip_store200M

rustyrussell commented 1 year ago

I wonder if we've hit some tipping point? In particular, perhaps lightning network growth has outpaced your RAM growth, and keeping the gossip_map in memory (particularly since it no longer trims spam messages as of 0.12) has meant you're now in swapping hell.

It's hard to know what to do about this. We can add an option to keep the gossip_store smaller, but it's at best going to make 30% difference, and the network keeps growing. We could elect to only keep partial gossip, maybe?

x3nb63 commented 1 year ago

No swap! I am running the machine without swap.

I just doubled the memory limit to 1 GB and restarted ... let's see.

x3nb63 commented 1 year ago

it ran into more OOM kills, so I increased the limit multiple times, to now 3 GB, and with that it has been running stable for a week.

in the process I had to disable the summary plugin because I hit this issue, which caused very excessive logging in my case ... so I can't trigger the slowdown as before ...

let's see if it remains stable for another week

my current working hypothesis is: the memory limit (which is the absolute cgroup size, so it also caps the page cache available to the process) left lower-level I/O (libc? the kernel page cache?) unable to buffer much, causing a lot of evict-and-read-back-in operations.
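The oom-killer dump above is consistent with that: workingset_refault_file is enormous while the resident file cache is tiny. A small sketch (values copied from the memory.stat excerpt above) that makes the ratio explicit:

```python
# Values copied from the memory.stat excerpt in the oom-killer report.
stat = """\
file 2347008
workingset_refault_file 87202038
pgscan_direct 115540186
pgmajfault 44641
"""

fields = dict(line.split() for line in stat.splitlines())
refaults = int(fields["workingset_refault_file"])  # file pages re-read after eviction
file_bytes = int(fields["file"])                   # resident file cache, in bytes

# ~87 million refaulted 4 KiB pages (~330 GiB re-read) against only ~2 MiB
# of resident file cache: the cgroup limit left no room to cache the
# gossip_store / sqlite pages, so the same data was evicted and re-read
# continuously -- cache thrash rather than a classic memory leak.
reread_gib = refaults * 4096 / 2**30
print(round(reread_gib), file_bytes // (1024 * 1024))
```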

x3nb63 commented 1 year ago

Yep, no issue for another week.

Bottom line: use enough memory.