ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

Huge memory usage of topology process: memory leak? #4721

Closed whitslack closed 2 years ago

whitslack commented 3 years ago

Issue and Steps to Reproduce

Is it expected that the topology plugin process should be using multiple gigabytes of RAM?

# grep -E '^(Vm|Rss)' "/proc/$(pidof topology)/status"
VmPeak:  2626832 kB
VmSize:  2626832 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   1212888 kB
VmRSS:       284 kB
RssAnon:               0 kB
RssFile:             284 kB
RssShmem:              0 kB
VmData:  2505364 kB
VmStk:       132 kB
VmExe:       492 kB
VmLib:      4288 kB
VmPTE:      5200 kB
VmSwap:  2505060 kB

Almost all of the process's memory is swapped out, suggesting that the process is not actively referencing those pages.

Immediately after running lightning-cli listnodes; lightning-cli listchannels; lightning-cli listincoming, the process has swapped back in only a small portion (61 MiB) of its huge (2583 MiB) memory footprint.

I suspect a memory leak. What would be the best way of determining if that is indeed occurring?
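For reference, one low-overhead way to tell a genuine leak from one-time growth, assuming the process keeps the name topology, is to sample the same counters periodically and watch whether VmData and VmSwap keep climbing rather than plateauing:

$ while sleep 600; do date; grep -E '^(VmData|VmRSS|VmSwap)' "/proc/$(pidof topology)/status"; done | tee -a topology-mem.log

A mapping- or cache-backed footprint should level off around the size of the data it holds; a leak keeps growing across repeated RPC calls.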

getinfo output

This is the release version 0.10.1.

ghost commented 3 years ago

I am also experiencing memory-leak-ish instability after upgrading to 0.10.1, which has led me to disable the plugin I am using (clboss) for now.

rustyrussell commented 3 years ago

I think this is a false positive: you're seeing the mmap of the gossip store?

If it grows significantly over time, that's an issue...

whitslack commented 3 years ago

@rustyrussell: Memory-mapped gossip store pages wouldn't be accounted in VmSwap. Since they're file-backed, they'd simply be evicted from RAM, to be re-fetched from file as needed. Only anonymous pages (and COW'd private copies of file-backed pages) go into VmSwap.
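That distinction can be checked directly from the per-mapping accounting in /proc/<pid>/smaps: a purely file-backed gossip store mapping should report Swap: 0 kB, so any large Swap figure has to come from anonymous memory. A quick sketch, assuming the store is mapped from its default path (~/.lightning/bitcoin/gossip_store):

$ ls -lh ~/.lightning/bitcoin/gossip_store
$ grep -A 20 gossip_store "/proc/$(pidof topology)/smaps" | grep -E '^(Size|Rss|Swap):'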

rustyrussell commented 3 years ago

Hmm, usage here is much lighter, but jumped 200M the first time I called (lightning-cli listnodes; lightning-cli listchannels; lightning-cli listincoming) > /dev/null.

(Then nothing moved it again). I think I know what it must be, let me see if I'm right...

whitslack commented 3 years ago

A possibly helpful piece of information: I was able to ramp topology up to 2.5 GB with lots of calls to listchannels specifying an SCID. In fact, at one point, due to the performance regression, I had several processes all hammering on listchannels <scid> concurrently and continuously for many hours. (I have since rewritten that script so that it doesn't call listchannels at all anymore.)
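For anyone trying to reproduce this, the load was roughly equivalent to the following (a sketch; SCIDS stands in for a list of short channel IDs, and several copies of the loop were run concurrently):

$ while true; do for scid in $SCIDS; do lightning-cli listchannels "$scid" > /dev/null; done; done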

rustyrussell commented 3 years ago

Thanks for the great report! Indeed, I found one (not where I was expecting, in fact). I'm testing it on my node now...

rustyrussell commented 3 years ago

This seems to help, however, I still get significant growth:

$ grep -E '^(Vm|Rss)' "/proc/$(pidof topology)/status"
VmPeak:   166344 kB
VmSize:   124544 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    150084 kB
VmRSS:     70600 kB
RssAnon:       59944 kB
RssFile:       10656 kB
RssShmem:          0 kB
VmData:    60316 kB
VmStk:       132 kB
VmExe:      1100 kB
VmLib:      3864 kB
VmPTE:       292 kB
VmSwap:        0 kB

Then I run listchannels: lightning-cli listchannels > /dev/null

And now:

$ grep -E '^(Vm|Rss)' "/proc/$(pidof topology)/status"
VmPeak:   418140 kB
VmSize:   345944 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    390976 kB
VmRSS:    336624 kB
RssAnon:      281524 kB
RssFile:       55100 kB
RssShmem:          0 kB
VmData:   281684 kB
VmStk:       132 kB
VmExe:      1100 kB
VmLib:      3864 kB
VmPTE:       724 kB
VmSwap:        0 kB

Running it multiple times doesn't make it worse, but running malloc_trim(0) does return an awful lot of RAM to the system (until I run listchannels again; one way to invoke malloc_trim in the running process is sketched after the listings below):

$ grep -E '^(Vm|Rss)' "/proc/$(pidof topology)/status"
VmPeak:   418140 kB
VmSize:   123820 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    390976 kB
VmRSS:     96120 kB
RssAnon:       59512 kB
RssFile:       36608 kB
RssShmem:          0 kB
VmData:    59552 kB
VmStk:       132 kB
VmExe:      1100 kB
VmLib:      3864 kB
VmPTE:       288 kB
VmSwap:        0 kB
rusty@ubuntu-1gb-sgp1-01:~/lightning$ lightning-cli listchannels > /dev/null
rusty@ubuntu-1gb-sgp1-01:~/lightning$ grep -E '^(Vm|Rss)' "/proc/$(pidof topology)/status"
VmPeak:   418152 kB
VmSize:   345956 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    390980 kB
VmRSS:    336644 kB
RssAnon:      281536 kB
RssFile:       55108 kB
RssShmem:          0 kB
VmData:   281684 kB
VmStk:       132 kB
VmExe:      1100 kB
VmLib:      3864 kB
VmPTE:       724 kB
VmSwap:        0 kB
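For reference, one way to trigger that trim in the running plugin without patching it, assuming gdb can attach and glibc symbols are resolvable, is to call malloc_trim(0) from the debugger:

$ gdb -p "$(pidof topology)" -batch -ex 'call (int) malloc_trim(0)' -ex detach

malloc_trim(0) only asks glibc to hand free arena memory back to the kernel; it cannot release anything still referenced, so a large drop here points at allocator retention rather than live data.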

whitslack commented 3 years ago

@rustyrussell: What in the world are you allocating that's so huge? Is your JSON parser that inefficient? :grimacing:

rustyrussell commented 3 years ago

It's 77MB of JSON. But let me run massif and see what the rest of the RAM is for!
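For context, the generic massif workflow, independent of how the plugin actually gets launched under lightningd, looks roughly like this (a sketch; the launch command is a placeholder):

$ valgrind --tool=massif <command that starts the plugin>
$ ms_print massif.out.<pid> | less

By default massif tracks heap allocations only, so it would catch a leak in the JSON handling but not memory that was correctly freed and merely retained by the allocator.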

rustyrussell commented 3 years ago

Huh, weird. On my laptop, topology after listchannels (on regtest, developer mode, importing gossip_store) gives a much more expected result:

VmPeak:   195772 kB
VmSize:   123576 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    175864 kB
VmRSS:    121528 kB
RssAnon:       70628 kB
RssFile:       50900 kB
RssShmem:          0 kB
VmData:    70656 kB
VmStk:       136 kB
VmExe:       720 kB
VmLib:      2664 kB
VmPTE:       276 kB
VmSwap:        0 kB

And massif shows nothing surprising. OK, let me try running massif on my actual live machine...

rustyrussell commented 3 years ago

Nope, massif on my actual machine shows the same thing: we peak at 130MB, as expected. glibc's allocator hates us?
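If glibc's allocator retaining freed arena memory is indeed the explanation, one way to test that hypothesis is to restart lightningd with the standard glibc malloc environment tunables tightened and see whether the post-listchannels footprint shrinks (a sketch; the values are arbitrary, not a recommendation):

$ MALLOC_ARENA_MAX=1 MALLOC_TRIM_THRESHOLD_=131072 lightningd <usual options>

Since lightningd spawns the topology plugin, the plugin inherits that environment; a noticeably smaller retained footprint under these settings would point at allocator behaviour rather than a leak in the plugin itself.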