ethereum / go-ethereum

Go implementation of the Ethereum protocol
https://geth.ethereum.org
GNU Lesser General Public License v3.0

There appears to be a memory leak in RLP after several hours of heavy use #24397

Closed: jwinkler2083233 closed this issue 2 years ago

jwinkler2083233 commented 2 years ago

System information

OS & Version: Linux
Commit hash: latest

Expected behaviour

I would expect RLP allocations to be garbage collected after a few hours of heavy use.

Actual behaviour

'pprof' is reporting that the 'encbuf.list()' function is allocating memory, and that memory is not getting garbage collected.

I verified that the allocation done by the only 'append' call in the 'encbuf.list()' method in 'rlp/encode.go' could be eliminated via the use of an additional 'sync.Pool' for 'listhead' objects. That worked great; I can submit the change if desired.

Steps to reproduce the behaviour

I enabled memory profiling in my application, which makes heavy use of the public 'EncodeToBytes()' method. I ran a three-hour load test to see what the GC did, and then captured a memory profile.
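A typical way to collect such a heap profile is the standard net/http/pprof endpoint; the sketch below shows that generic setup, which may differ from the exact configuration used here.

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Heap profiles can then be pulled with:
    //   go tool pprof http://localhost:6060/debug/pprof/heap
    log.Println(http.ListenAndServe("localhost:6060", nil))
}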

Here is the relevant data from pprof:

(pprof) Showing nodes accounting for 2557379131, 86.13% of 2969340570 total
Dropped 1799 nodes (cum <= 14846702)
Showing top 100 nodes out of 162
      flat  flat%   sum%        cum   cum%
 905631166 30.50% 30.50%  905631166 30.50%  github.com/ethereum/go-ethereum/rlp.(*encbuf).list (inline)
 130792360  4.40% 71.80%  130792360  4.40%  github.com/ethereum/go-ethereum/rlp.(*encbuf).toBytes

Memory allocated in the 'toBytes' function doesn't appear to be GC'd either, and I don't have a workaround for that one. The reason I don't think the GC reclaimed these objects is that the 'cumulative' value from the 'pprof' output above matches the 'flat' value. That is not the case for other, smaller allocations in my test; the GC worked normally for those objects. I've run this test several times, along with multiple variations, and I'm seeing similar results, where the 'cumulative' value matches the 'flat' column.

My theory is that the 'encbuf.reset()' method is somehow creating a shortcut and the sync.Pool loses track of the allocations. The other alternative I can think of is that there is a bug in pprof, but I doubt that.

Backtrace

[backtrace]


holiman commented 2 years ago

Interesting. Does it make any difference if you call reset() before returning it to the pool in EncodeToBytes?
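For readers following along, the flow in question looks roughly like the sketch below; the types and names are simplified stand-ins, not the actual geth code.

package main

import (
    "fmt"
    "sync"
)

// encbufSketch and listheadSketch are hypothetical stand-ins for rlp's
// internal encoder buffer and list header types.
type listheadSketch struct{ offset, size int }

type encbufSketch struct {
    str    []byte
    lheads []listheadSketch
}

func (e *encbufSketch) reset() {
    e.str = e.str[:0]
    e.lheads = e.lheads[:0]
}

var encbufPool = sync.Pool{
    New: func() interface{} { return new(encbufSketch) },
}

// encodeToBytesSketch mirrors the Get / reset / encode / Put flow; the
// question above is whether moving reset() to just before the Put makes
// any difference.
func encodeToBytesSketch() []byte {
    eb := encbufPool.Get().(*encbufSketch)
    defer encbufPool.Put(eb)
    eb.reset()

    // ... real encoding would append to eb.str and eb.lheads here ...
    eb.str = append(eb.str, 0xc0)

    out := make([]byte, len(eb.str))
    copy(out, eb.str)
    return out
}

func main() {
    fmt.Printf("%x\n", encodeToBytesSketch())
}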

fjl commented 2 years ago

In memory profiles, the 'cumulative' and 'flat' measurements are not related to GC. Rather, what they mean is:

- flat: memory allocated directly within the function itself
- cum (cumulative): memory allocated within the function plus all functions it calls

The same is also true for CPU profiles, but instead of memory allocated, it is 'time spent'.

Since allocation happens directly in list and toBytes and they do not call other functions that allocate, the flat and cumulative measurements are the same. But this does not mean that memory allocated by these functions can't be released by GC. In fact, the reporting of held memory through memory profiles is not super precise, because it only reports where allocation happened and not how much of that memory is still held.

If your load test is about RLP encoding, then it's pretty much expected that most allocation will happen in package rlp.

You can identify whether there is a memory leak by observing the memory held by your process. If the memory is steadily increasing while your load test runs, there is a leak. If possible, please provide a link to your test code, so that we are able to reproduce your issue.
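As a generic illustration (not specific to package rlp), the memory held by the process can be sampled like this:

package main

import (
    "log"
    "runtime"
    "time"
)

// Periodically samples the live heap; if these numbers climb without
// bound while the load test runs at a steady rate, that points to a leak.
func main() {
    for range time.Tick(10 * time.Second) {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        log.Printf("heap in use: %d MiB, live objects: %d",
            m.HeapInuse>>20, m.HeapObjects)
    }
}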

fjl commented 2 years ago

btw, I think adding a sync.Pool for listhead will not help because list headers are not allocated on their own. The slice of list headers in rlp.encbuf contains them as values and not pointers, so the allocation you are seeing in the profile is allocation of the slice's backing array when it is extended by append.
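To illustrate (with a hypothetical stand-in type, not the actual rlp code): appending values to a slice forces append to grow the slice's backing array, no matter where the individual values came from.

package main

import "fmt"

// listheadSketch stands in for rlp's listhead; the point is that a slice
// of values owns a backing array that append must grow.
type listheadSketch struct{ offset, size int }

func main() {
    var lheads []listheadSketch
    for i := 0; i < 6; i++ {
        lheads = append(lheads, listheadSketch{offset: i})
        // Whenever cap changes, append has allocated a new, larger
        // backing array; that growth is what shows up in the profile.
        fmt.Println("len:", len(lheads), "cap:", cap(lheads))
    }
}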

If you allocate the listhead value in sync.Pool and then pass the pooled value to append, you are moving the allocation to another function (i.e. 'flat' and 'cumulative' will not match anymore), but it ultimately just allocates more.

One thing that should be noted is that rlp.encbuf does not have any limits on the size of objects that will be returned to the pool. What I mean is: when you encode a large object where the output contains many RLP lists, the encbuf list header slice will be very long at the end of the encoding process. encbuf will be returned to the pool with the full slice length. When subsequent encoding operations reuse the buffer, they truncate the len of the list header slice to zero, but leave the cap as-is, so the backing array will never shrink again.
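To make that reuse behavior concrete, here is a minimal sketch: truncating a slice to zero length keeps its capacity, so the backing array behind a buffer that once grew large stays large.

package main

import "fmt"

func main() {
    // Grow a buffer the way a large encoding operation would.
    buf := make([]byte, 0)
    for i := 0; i < 1_000_000; i++ {
        buf = append(buf, byte(i))
    }
    fmt.Println("after a large encoding: len", len(buf), "cap", cap(buf))

    // "Reset" it the way pooled buffers are reused: the length drops to
    // zero, but the capacity (and the memory behind it) is retained.
    buf = buf[:0]
    fmt.Println("after reset:            len", len(buf), "cap", cap(buf))
}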

I decided on this behavior because it is not trivial to determine when and by how much these internal buffers should shrink. For go-ethereum's RLP encoding needs, this has never been a problem because our encoding operations only create at most single-digit megabytes of output. Also, GC will occasionally empty the pool and thus remove the large buffers. If your use of package rlp involves heavy concurrent encoding with very large outputs, then I would not be surprised to see high memory use. It still doesn't mean there is a leak, but if you are constantly encoding, GC will not empty the pool and those large buffers can stick around for a while.

holiman commented 2 years ago

Seems to be a non-issue. Please reopen if there is indeed an issue here somewhere.

jwinkler2083233 commented 2 years ago

Thank you for the comments. I'll have to create a minimum example to prove my point.

fjl commented 2 years ago

Yeah, closing the ticket was a bit premature. I'm happy to check out your example!

jwinkler2083233 commented 2 years ago

This is just a draft, but it's closer to how I'd do this in native code (C++). I noticed a recent change that added an encbuffer.go file and an 'internal' directory. I haven't integrated those changes yet:

https://github.com/onflow/flow-go/blob/a42e624e8b77ce819e053e328485a5703374adcd/model/flow/rlp/encode.go

My tests are showing a 1.5% speed improvement, along with a drop in heap usage. Your mileage may vary.

I'm still wondering how the 'append' method would increase the backing store by so much over time, for the original version. When I modified it to store pointers instead, it dropped to zero.

Let me know if you like it, and I'll put together a formal PR. thx

fjl commented 2 years ago

Hmm, so now you are talking about optimizations though.

This issue was originally about a 'leak'. Can you provide a code example (it can be your entire application) that exhibits a memory leak related to package rlp? And by leak, I mean unbounded heap growth.

> I'm still wondering how the 'append' method would increase the backing store by so much over time, for the original version. When I modified it to store pointers instead, it dropped to zero.

I would like to know how you arrive at this conclusion ("dropped to zero"). append will always allocate if the existing slice capacity is not sufficient to hold another item; it doesn't matter whether the slice holds pointers or not. Specifically, append in Go grows slices by roughly doubling their capacity, i.e. append(slice, x) behaves roughly like this sketch:

// Simplified model of appending one element: when the backing array is
// full, a larger one is allocated and the existing elements are copied.
func appendOne[X any](slice []X, x X) []X {
    if len(slice) == cap(slice) {
        newcap := 2 * cap(slice)
        if newcap == 0 {
            newcap = 1
        }
        newslice := make([]X, len(slice), newcap)
        copy(newslice, slice)
        slice = newslice
    }
    end := len(slice)
    slice = slice[:end+1]
    slice[end] = x
    return slice
}

Please provide additional data about your use of package rlp. For example, you could sample and collect the list header slice len and cap at the end of encoding and then provide the data here.

jwinkler2083233 commented 2 years ago

Exactly. I'm going by what the profiler is reporting. When I changed 'list()' to use pointers, the apparent memory leak went away, according to pprof. By using pointers in the 'lheads' slice, I was able to store/reuse pointers to 'listhead' objects that were retrieved from sync.Pool. What's odd is that the pointers should have taken up the same amount of space as the list head object (8 bytes for a 64-bit pointer), and the pointers were allocated via append(), but the memory they used didn't even register with pprof. It's possible that the garbage collector has something to do with this. Maybe it's tuned to skip over the 'listhead'-style allocations because they're values.
The impetus for this is that our servers are overloaded with billions of small allocations, and then the garbage collection eventually takes up > 30% of CPU.

Here's the latest before and after comparison.

Before:

(pprof) Showing nodes accounting for 96635006, 82.81% of 116700167 total
Dropped 989 nodes (cum <= 583500)
Showing top 100 nodes out of 226
      flat  flat%   sum%        cum   cum%
  28836259 24.71% 24.71%   28836259 24.71%  github.com/ethereum/go-ethereum/rlp.(*encbuf).list (inline)
   4239344  3.63% 63.58%    4239344  3.63%  github.com/ethereum/go-ethereum/rlp.(*encbuf).toBytes

After:

(pprof) Showing nodes accounting for 1713208044, 82.77% of 2069818576 total
Dropped 1757 nodes (cum <= 10349092)
Showing top 100 nodes out of 183
      flat  flat%   sum%        cum   cum%
 128612311  6.21% 58.03%  132281690  6.39%  github.com/onflow/flow-go/model/flow/rlp.EncodeToBytes

As you can see, the percentage dropped from 24.7% + 3.6% = 28.3% to 6.4%.

holiman commented 2 years ago

> What's odd is that the pointers should have taken up the same amount of space as the list head object (8 bytes for a 64-bit pointer),

A listhead is a struct with two ints, so 16 bytes. The pointer is 8 bytes. So a pointer is only half the size. And if you pool the listheads, avoiding those allocs, then I'm not surprised you see a drop in allocs. Am I missing something?
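A quick check of those sizes on a 64-bit platform (a minimal sketch using a stand-in struct of the same shape, not the actual listhead type):

package main

import (
    "fmt"
    "unsafe"
)

// listheadSketch has the same shape as rlp's listhead: two int fields.
type listheadSketch struct {
    offset int
    size   int
}

func main() {
    var lh listheadSketch
    fmt.Println("struct value:", unsafe.Sizeof(lh), "bytes")  // 16 on 64-bit
    fmt.Println("pointer:     ", unsafe.Sizeof(&lh), "bytes") // 8 on 64-bit
}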

fjl commented 2 years ago

I still don't understand this issue.

As mentioned earlier, I think our existing pooling strategy should cover this: when encoding, an encBuffer instance is taken from the pool, then used, growing encBuffer.lheads. After encoding, the buffer instance is returned to the pool. When encoding is heavily used, all pooled encBuffer instances will have reasonably-sized lheads and no allocation should happen at all.

My own profiling of geth itself has confirmed this in the past: we don't usually allocate much in encBuffer.list. However, in the profile shown here, there is significant allocation in this method.

fjl commented 2 years ago

@jwinkler2083233 What I would really like to see is a report of encBuffer sizes before and after encoding.

To get this, you could apply this change for example:

https://github.com/fjl/go-ethereum/commit/a51db444cf38d1347c448956579f81dd0db17b84

And then observe the metrics somehow.

fjl commented 2 years ago

One more thing:

On the go-ethereum master branch, the recent changes in package rlp permit doing manual encoder buffer management, like this:


var (
    buf      = rlp.NewEncoderBuffer(nil)
    encbytes []byte
)

for {
    buf.Reset(nil)
    err := rlp.Encode(buf, value) // value: whatever object is being encoded
    if err != nil {
        // handle it
    }
    encbytes = buf.AppendToBytes(encbytes[:0])
    // here, encbytes contains the RLP encoding of value
}

Using it like this will for sure allocate less.