In my experience, over time `lightningd.sqlite3` will become very fragmented and will need a defrag periodically when using a CoW filesystem like btrfs on spinning rust. With SSDs it's imperceptible, but on HDDs it will really affect performance.
will try vacuuming later tonight, thanks!
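For reference, the offline route looks roughly like this (a sketch assuming the node can be stopped briefly; the database path shown is the CLN default and may differ):

```sh
# Stop the node so nothing writes to the database while we rewrite it
lightning-cli stop

# VACUUM rewrites the whole database into contiguous, densely packed pages
sqlite3 ~/.lightning/bitcoin/lightningd.sqlite3 'VACUUM;'

# On btrfs, additionally defragment the file's on-disk extents
# (note: this unshares extents with any snapshots/reflinks of the file)
btrfs filesystem defragment ~/.lightning/bitcoin/lightningd.sqlite3

# filefrag reports the extent count, to compare before/after
filefrag ~/.lightning/bitcoin/lightningd.sqlite3
```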
Has anyone tried performance tuning `lightningd.sqlite3` like described in this blog? I read in other places that particularly `synchronous=off` (no fsync) gives lots of queries/sec. I don't understand enough to say whether `synchronous=off` together with WAL mode is "safe enough" on a corruption-safe-by-checksum filesystem such as btrfs?
(I also run with `lightningd --wallet=${main}:${backup}`, with `$backup` on a different btrfs subvolume - on the same disks, unfortunately.)
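For context, the pragmas under discussion look like this. Note that `journal_mode` is stored in the database file itself, while `synchronous` is per-connection, so the latter would only matter if set by the process that opens the db (lightningd), not from a separate shell; and see the warning in the next reply before touching WAL:

```sh
# Show the current journal mode (this setting persists in the db file)
sqlite3 lightningd.sqlite3 'PRAGMA journal_mode;'

# Switching to WAL would persist across reopen -- but see the crash
# warning below before trying it:
#   sqlite3 lightningd.sqlite3 'PRAGMA journal_mode=wal;'

# synchronous=off skips fsync entirely, but it is per-connection:
# setting it from a shell does NOT affect lightningd's own connection
sqlite3 lightningd.sqlite3 'PRAGMA synchronous;'
```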
No, do not use WAL. Your node will crash randomly. (I've tried WAL, if you can't tell.)
`topology` is a plugin that does not have direct access to `lightningd.sqlite3`; instead it relies on the `gossip_store` to read the gossip it uses for `listnodes` and `listchannels`. Restarting just that plugin might be an option (`lightning-cli plugin stop topology` followed by `lightning-cli plugin start topology`); if that does not help, it is likely that there is something suboptimal in the way we handle the gossip messages (a flamegraph would be useful to see where we're spending time; a way to collect one is sketched below).
Restarts can also "fix" this because they cause `gossipd` to compact the `gossip_store`, reducing the number of messages in it and potentially reducing the computations we perform.
If it is gossip related: the `gossip_store` accepts ~24% more gossip starting in 0.12.0, but compacting the `gossip_store` (restarting the node) should clear out most of the difference.
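One way to collect such a flamegraph (a sketch assuming `perf` and Brendan Gregg's FlameGraph scripts are available, and that the plugin process is named `topology` as in the process listings below):

```sh
# Sample the topology plugin's call stacks at 99 Hz for ~10 minutes
perf record -F 99 -g -p "$(pidof topology)" -- sleep 600

# Fold the stacks and render an interactive SVG
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > topology.svg
```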
Vacuum shrunk `lightningd.sqlite3` from 253 MB to 211 MB, for both the main and the backup db file.
The `gossip_store` was 150 MB before restart and about an hour after restart it is at 59 MB.
More I can't say yet, since it's been fast since restarting. Neither use of the summary plugin nor RTL logins trigger I/O wait, and `topology` does not become visible in the process list.
I collected a flamegraph an hour after start to compare with later. Hope ~10min of sampling is sufficient?
After 4 days it's getting noticeably slower; `lightning_gossipd` appears in the iotop listing every few minutes. As soon as I run `lightning-cli summary`, the `topology` process joins the list.
The gossip_store file is at 150 MB again.
I made two flamegraphs: the first is a sampling over many minutes (I forgot how many); it contains at least one summary plugin run. The second one covers just one call to the summary plugin, which took close to 3 min to print its output.
Nothing got killed so far.
Looking at top, the I/O wait peaks at 37 and drops to 0.x after summary is done. `topology` then disappears from iotop, while `lightning_gossipd` remains listed with some 20 K/s read and 1.6 K/s write over 10 s (which fluctuates quite a lot).
It crashed again (or actually was OOM-killed this time) and I failed to make another flamegraph before that ...
The last thing I observed was the `gossip_store` file at 179 MB a day or so before. After multiple restart attempts (because of a not-freed Tor listening port) it got cleaned down to 59 MB again, and a `gossip_store.corrupt` file of that size appeared.
Because of the many restarts I lost the output of the initial instance, but then found in the journal that the oom-killer was involved. I have never observed this before, and the memory limit has not changed for at least a year. Here is the journal:
```
Feb 04 13:19:54 hostname kernel: lightning_gossi invoked oom-killer: gfp_mask=0x8c40(GFP_NOFS|__GFP_NOFAIL), order=0, oom_score_adj=985
Feb 04 13:19:54 hostname kernel: CPU: 1 PID: 31742 Comm: lightning_gossi Not tainted 6.1.7-arch1-1 #1 a2d6f1dcaa775aaae1f25aaf758ae968e3493665
Feb 04 13:19:54 hostname kernel: Hardware name: /DH67BL, BIOS BLH6710H.86A.0163.2018.1023.1559 10/23/2018
Feb 04 13:19:54 hostname kernel: Call Trace:
Feb 04 13:19:54 hostname kernel: <TASK>
Feb 04 13:19:54 hostname kernel: dump_stack_lvl+0x48/0x60
Feb 04 13:19:54 hostname kernel: dump_header+0x4a/0x211
Feb 04 13:19:54 hostname kernel: oom_kill_process.cold+0xb/0x10
Feb 04 13:19:54 hostname kernel: out_of_memory+0x1f1/0x520
Feb 04 13:19:54 hostname kernel: mem_cgroup_out_of_memory+0x134/0x150
Feb 04 13:19:54 hostname kernel: try_charge_memcg+0x693/0x780
Feb 04 13:19:54 hostname kernel: ? lru_gen_look_around+0x3d7/0x820
Feb 04 13:19:54 hostname kernel: charge_memcg+0x39/0xf0
Feb 04 13:19:54 hostname kernel: __mem_cgroup_charge+0x2c/0x80
Feb 04 13:19:54 hostname kernel: __filemap_add_folio+0x217/0x430
Feb 04 13:19:54 hostname kernel: ? scan_shadow_nodes+0x30/0x30
Feb 04 13:19:54 hostname kernel: filemap_add_folio+0x3c/0xa0
Feb 04 13:19:54 hostname kernel: __filemap_get_folio+0x193/0x380
Feb 04 13:19:54 hostname kernel: pagecache_get_page+0x15/0x70
Feb 04 13:19:54 hostname kernel: alloc_extent_buffer+0x25c/0x780 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel: read_tree_block+0x25/0xa0 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel: read_block_for_search+0x220/0x330 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel: btrfs_search_slot+0x32b/0xc90 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel: btrfs_lookup_csum+0x73/0x160 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel: btrfs_lookup_bio_sums+0x230/0x560 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel: btrfs_submit_data_read_bio+0x48/0x80 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel: submit_one_bio+0x90/0xc0 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel: submit_extent_page+0x15d/0x540 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel: ? __set_extent_bit+0x178/0x6b0 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel: btrfs_do_readpage+0x2d9/0x740 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel: extent_readahead+0x31e/0x3e0 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel: ? btrfs_repair_one_sector+0x3a0/0x3a0 [btrfs 58bc029900f145fbc1c0c21b495c5650ba90e7f0]
Feb 04 13:19:54 hostname kernel: read_pages+0x88/0x260
Feb 04 13:19:54 hostname kernel: page_cache_ra_unbounded+0x105/0x180
Feb 04 13:19:54 hostname kernel: filemap_fault+0x552/0x920
Feb 04 13:19:54 hostname kernel: ? filemap_map_pages+0x154/0x6f0
Feb 04 13:19:54 hostname kernel: __do_fault+0x33/0x110
Feb 04 13:19:54 hostname kernel: do_fault+0x1b9/0x410
Feb 04 13:19:54 hostname kernel: __handle_mm_fault+0x660/0xfa0
Feb 04 13:19:54 hostname kernel: handle_mm_fault+0xdf/0x2d0
Feb 04 13:19:54 hostname kernel: do_user_addr_fault+0x1be/0x6a0
Feb 04 13:19:54 hostname kernel: exc_page_fault+0x74/0x170
Feb 04 13:19:54 hostname kernel: asm_exc_page_fault+0x26/0x30
Feb 04 13:19:54 hostname kernel: RIP: 0033:0x55835dbbb4d2
Feb 04 13:19:54 hostname kernel: Code: Unable to access opcode bytes at 0x55835dbbb4a8.
Feb 04 13:19:54 hostname kernel: RSP: 002b:00007ffd657c9120 EFLAGS: 00010212
Feb 04 13:19:54 hostname kernel: RAX: 0000ffffffffffff RBX: 003ffffffffffffc RCX: 00047754d739c3e3
Feb 04 13:19:54 hostname kernel: RDX: 00000000000da593 RSI: 000205e6fb443ace RDI: 0001000003d10000
Feb 04 13:19:54 hostname kernel: RBP: 00007ffd657c9a20 R08: 000aa5d081dc209f R09: 0000000000000000
Feb 04 13:19:54 hostname kernel: R10: d1f8c6ee278e24f0 R11: 0000e4c8f019a40b R12: 0003fffffffffffc
Feb 04 13:19:54 hostname kernel: R13: 003ffffbfffff0bc R14: 00007ffd657c9a20 R15: 00007ffd657c9270
Feb 04 13:19:54 hostname kernel: </TASK>
Feb 04 13:19:54 hostname kernel: memory: usage 512028kB, limit 512000kB, failcnt 123853
Feb 04 13:19:54 hostname kernel: swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Feb 04 13:19:54 hostname kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod743bd9ce_733f_44fe_8984_2f370e141e0a.slice/cri-containerd-6a0845fb6affaf8bd87978cf7674e203be02b74f704ec403da83e0935892d292.scope:
Feb 04 13:19:54 hostname kernel: anon 509526016
file 2347008
kernel 7282688
kernel_stack 524288
pagetables 2293760
sec_pagetables 0
percpu 0
sock 5160960
vmalloc 0
shmem 0
zswap 0
zswapped 0
file_mapped 0
file_dirty 8192
file_writeback 0
swapcached 0
anon_thp 190840832
file_thp 0
shmem_thp 0
inactive_anon 317571072
active_anon 191950848
inactive_file 1544192
active_file 794624
unevictable 4096
slab_reclaimable 2505512
slab_unreclaimable 1176648
slab 3682160
workingset_refault_anon 0
workingset_refault_file 87202038
workingset_activate_anon 0
workingset_activate_file 539718
workingset_restore_anon 0
workingset_restore_file 10302541
workingset_nodereclaim 0
pgscan 115540186
pgsteal 114113256
pgscan_kswapd 0
pgscan_direct 115540186
pgsteal_kswapd 0
pgsteal_direct 114113256
pgfault 52093671
pgmajfault 44641
pgrefill 51731843
pgactivate 444399
pgdeactivate 0
pglazyfree 1
pglazyfreed 0
zswpin 0
zswpout 0
thp_fault_alloc 3044
thp_collapse_alloc 1591
Feb 04 13:19:54 hostname kernel: Tasks state (memory values in pages):
Feb 04 13:19:54 hostname kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 04 13:19:54 hostname kernel: [ 27787] 10001 27787 595 18 40960 0 985 tini
Feb 04 13:19:54 hostname kernel: [ 27801] 10001 27801 65760 63545 565248 0 985 lightningd
Feb 04 13:19:54 hostname kernel: [ 28071] 10001 28071 1106 33 49152 0 985 autoclean
Feb 04 13:19:54 hostname kernel: [ 28072] 10001 28072 1106 21 49152 0 985 chanbackup
Feb 04 13:19:54 hostname kernel: [ 28073] 10001 28073 6700 5310 90112 0 985 bcli
Feb 04 13:19:54 hostname kernel: [ 28074] 10001 28074 1112 30 49152 0 985 commando
Feb 04 13:19:54 hostname kernel: [ 28076] 10001 28076 1111 36 49152 0 985 funder
Feb 04 13:19:54 hostname kernel: [ 28077] 10001 28077 65197 1738 118784 0 985 topology
Feb 04 13:19:54 hostname kernel: [ 28078] 10001 28078 1450 33 49152 0 985 keysend
Feb 04 13:19:54 hostname kernel: [ 28079] 10001 28079 1133 34 45056 0 985 offers
Feb 04 13:19:54 hostname kernel: [ 28080] 10001 28080 1463 24 45056 0 985 pay
Feb 04 13:19:54 hostname kernel: [ 28081] 10001 28081 1106 22 45056 0 985 txprepare
Feb 04 13:19:54 hostname kernel: [ 28082] 10001 28082 1115 22 49152 0 985 spenderp
Feb 04 13:19:54 hostname kernel: [ 28083] 10001 28083 1866 54 57344 0 985 bookkeeper
Feb 04 13:19:54 hostname kernel: [ 28084] 10001 28084 48531 5589 143360 0 985 python3
Feb 04 13:19:54 hostname kernel: [ 28085] 10001 28085 9652 3805 110592 0 985 python3
Feb 04 13:19:54 hostname kernel: [ 28086] 10001 28086 9984 4084 106496 0 985 python3
Feb 04 13:19:54 hostname kernel: [ 28469] 10001 28469 1169 90 49152 0 985 lightning_hsmd
Feb 04 13:19:54 hostname kernel: [ 28551] 10001 28551 27706 26577 266240 0 985 lightning_conne
Feb 04 13:19:54 hostname kernel: [ 31742] 10001 31742 13972 12836 155648 0 985 lightning_gossi
Feb 04 13:19:54 hostname kernel: [3115526] 10001 3115526 1138 37 45056 0 985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3115581] 10001 3115581 1138 35 49152 0 985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3123221] 10001 3123221 1138 36 40960 0 985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3124854] 10001 3124854 1138 22 45056 0 985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3127401] 10001 3127401 1138 37 49152 0 985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3129090] 10001 3129090 1138 36 45056 0 985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3130661] 10001 3130661 1138 35 45056 0 985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3132298] 10001 3132298 1138 34 45056 0 985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3133827] 10001 3133827 1138 33 49152 0 985 lightning_chann
Feb 04 13:19:54 hostname kernel: [3135390] 10001 3135390 1138 37 53248 0 985 lightning_chann
Feb 04 13:19:54 hostname kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-6a0845fb6affaf8bd87978cf7674e203be02b74f704ec403da83e0935892d292.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod743bd9ce_733f_44fe_8984_2f370e141e0a.slice/cri-containerd-6a0845fb6affaf8bd87978cf7674e203be02b74f704ec403da83e0935892d292.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod743bd9ce_733f_44fe_8984_2f370e141e0a.slice/cri-containerd-6a0845fb6affaf8bd87978cf7674e203be02b74f704ec403da83e0935892d292.scope,task=lightningd,pid=27801,uid=10001
Feb 04 13:19:54 hostname kernel: Memory cgroup out of memory: Killed process 27801 (lightningd) total-vm:263040kB, anon-rss:254180kB, file-rss:0kB, shmem-rss:0kB, UID:10001 pgtables:552kB oom_score_adj:985
```
Note: it now runs on btrfs; until about two months ago (still using 0.10.x) the node was running on ZFS for over a year. Don't know if that matters.
`gossip_store` is at 200 MB now; below is another flamegraph taken while I issued `lightning-cli summary`. This time I Ctrl-C'ed the summary plugin without seeing any output after 28 minutes.
Looking with `top` I see an I/O wait of 25..50 and a load of 4..5, which isn't going away, and it's not due to CPU.
With `iotop -d 10 -P` (and manual sort by DISK READ) it looks as if `topology` never gets done. Also `lightning_gossipd` appears almost constantly now.
```
Total DISK READ : 6.82 M/s | Total DISK WRITE : 562.44 K/s
Actual DISK READ: 6.84 M/s | Actual DISK WRITE: 1964.59 K/s
PID PRIO USER DISK READ> DISK WRITE SWAPIN IO COMMAND
3152394 be/4 10001 3.35 M/s 0.00 B/s ?unavailable? topology
6986 be/4 arch 1930.53 K/s 0.00 B/s ?unavailable? bitcoind -conf=/~coin/bitcoin.conf
3152637 be/4 10001 1375.61 K/s 0.00 B/s ?unavailable? lightning_connectd
3154078 be/4 10001 251.91 K/s 0.00 B/s ?unavailable? lightning_gossipd
3152387 be/4 10001 4.75 K/s 0.00 B/s ?unavailable? lightningd --net~n/lightningd.conf
1 be/4 root 0.00 B/s 0.00 B/s ?unavailable? systemd --system --deserialize 99
2 be/4 root 0.00 B/s 0.00 B/s ?unavailable? [kthreadd]
...
```
Flamegraph:
I wonder if we've hit some tipping point? In particular, perhaps lightning network growth has outpaced your RAM growth, and keeping the gossip_map in memory (particularly since it no longer trims spam messages as of 0.12) has meant you're now in swapping hell.
It's hard to know what to do about this. We can add an option to keep the `gossip_store` smaller, but it's at best going to make a 30% difference, and the network keeps growing. We could elect to keep only partial gossip, maybe?
No swap! I'm running the machine without swap.
Just doubled the memory limit to 1 GB and restarted... let's see.
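The cgroup path in the OOM report above (`kubepods.slice/...`) suggests the node runs under Kubernetes, so raising the limit would look roughly like this (the deployment name is a placeholder for the actual one):

```sh
# Raise the container memory limit to 1 GiB
kubectl set resources deployment/lightningd --limits=memory=1Gi
```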
It ran into more OOM kills, so I increased the limit multiple times, to now 3 GB, and with that it has been running stable for a week.
In the process I had to disable the summary plugin because I hit this issue, which caused very excessive logging in my case ... so I can't trigger the slowdown as before ...
Let's see if it remains stable for another week.
My current working hypothesis is: the mem limit (which is the absolute cgroup size and thus visible to the process) forced some lower-level I/O functions (libc?) to not be able to buffer, causing a lot of forget-and-read-back-in operations.
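The memcg counters in the OOM report point the same way: the cgroup limit also covers page cache, and `workingset_refault_file 87202038` means file pages were evicted and re-read tens of millions of times. A way to watch those counters live (a sketch assuming a cgroup v2 unified hierarchy):

```sh
# Resolve lightningd's cgroup and dump its reclaim/refault counters
cg=$(cut -d: -f3- "/proc/$(pidof lightningd)/cgroup")
grep -E 'workingset_refault|pgscan|pgsteal' "/sys/fs/cgroup${cg}/memory.stat"
```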
Yep, no issue for another week.
Bottom line: use enough memory.
Issue and Steps to Reproduce
Since upgrading to v22.11.1 (from 0.11.1) I observe my node getting gradually slower over about 3 weeks, until it eventually gets killed (and restarted) after failing its liveness probe on timeout.
After that it is fast again and the degradation repeats.
The primary bottleneck of my machine is its disks: it has spinning disks set up as a "triple mirror" using btrfs ("raid1c3"), so they are very low in random I/O performance compared to even cheap SSDs.
However, this was alright until v22.11.1, since the machine has plenty of memory.
With `top` and `iostat`/`iotop` I observe a lot of I/O waiting as these 3 weeks pass. I can trigger even more I/O wait when using the `summary` plugin to view my channels or - worse - when logging into the Ride The Lightning UI (it appears to query way more things during the login procedure).
The processes stalling on I/O are mainly `topology` and `bitcoind` (which is a full node on the same machine). `topology` is usually invisible in the I/O wait process ranking but becomes visible as soon as I use any of the above. Then `bitcoind` also becomes more active ...
To me it looks as if `topology` somehow starts inspecting things on the blockchain in addition to what it does in the lightning database (which is sqlite, btw., about 150 MB in my case). These two together then push my disks over the edge.
Does that ring a bell regarding v22? Do I need SSDs now?
I have not seen this with the 0.11.1 I ran before, on that very machine. (Note: I did not try v22.11 without ".1", nor anything else in between.)
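To quantify the disk bottleneck independently of any one process, something like this could run alongside (a sketch; `iostat -x` comes from the sysstat package):

```sh
# Extended per-device statistics every 10 s:
# watch %util and r_await climb on the HDDs while summary/RTL runs
iostat -x 10
```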