Shopify / ghostferry

The swiss army knife of live data migrations
https://shopify.github.io/ghostferry
MIT License

Fixed two newly introduced memory leaks #266

Closed · shuhaowu closed this 3 years ago

shuhaowu commented 3 years ago

Over the last little while, we introduced some memory leaks while resolving other issues. The two problems are:

  1. a slice-to-string conversion in DMLEventBase.Annotation, from now on referred to as the string leak. This leak only occurs when the target verifier is on.
  2. a case where we called defer in a loop, from now on referred to as the defer leak. This leak occurs whenever the binlog streamer runs (so in all normal Ghostferry configurations).

A benchmark run conducted with our internal setup showed the following Ghostferry RSS usage before and after the fixes.

[Image: Ghostferry RSS usage benchmark plot, before and after the fixes]

We originally thought that we were impacted by this Golang issue, so all tests in the above plot (except the green line) were conducted with MADV_DONTNEED. It turns out that we don't need to set the env var to work around the MADV_FREE problem, as seen by the negligible difference between the green and yellow lines.

The rest of this post documents how @Scalvando and I found these bugs, for future reference, along with any discussion that may be needed for each bug.

string leak investigation

We originally noticed the problem due to consistent OOM kills we've gotten while running Ghostferry in production. We reproduced this via a test setup and saw, by logging the output of ps, that Ghostferry used 2GB of RAM within 10 minutes of launch (blue line above). This suggests a memory leak.

@Scalvando hooked up pprof to Ghostferry, which allowed us to dump the heap profile from a live process. Using this along with the command go tool pprof -http=:8080 `which ghostferry-sharding` 'http://localhost:6060/debug/pprof/heap?debug=1', we could visualize the amount of memory currently in use and the lines of code responsible for its allocation.
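For future reference, exposing these pprof endpoints from a Go process usually just requires importing net/http/pprof and serving HTTP on a debug port. A minimal sketch of that wiring (the port and placement are assumptions, not necessarily how Ghostferry sets it up):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints in the background; the heap profile is
	// then available at http://localhost:6060/debug/pprof/heap.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application runs as usual ...
	select {}
}
```

This immediately gave us the following graph: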

[Image: pprof heap profile graph (string+defer-pprof)]

as well as the following source code trace:

[Image: pprof source code trace (string+defer-source)]

This narrows the culprit down to the line .FindStringSubmatch(string(e.query)) within the function DMLEventBase.Annotation(). Specifically, the CALL runtime.slicebytetostring(SB) suggests that the problem lies in the string(e.query) part of the line. This line is called for every binlog row event on the target database, which means we end up converting a lot of queries from byte slices to strings. If a memory leak occurs here, it's understandable that it'll blow up if the target database is very busy (and it is in our case).

I'm unfortunately not an expert in how strings are allocated, so I'm still not sure why this code causes a memory leak. I would have assumed that the string would be allocated and then deallocated by the end of this function call.

In any case, @Scalvando hypothesized that we could replace the FindStringSubmatch function call with FindSubmatch and skip the conversion to string. This solved one problem, bringing memory usage down to the orange line. Since we still saw a linear increase in RSS usage, we hypothesized a second memory leak and continued our investigation.
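The shape of the change is roughly the following (the regex pattern and variable names here are made up for illustration; Ghostferry's actual annotation regex and event internals are not reproduced):

```go
package main

import (
	"fmt"
	"regexp"
)

// annotationRegex stands in for the real annotation-matching regex.
var annotationRegex = regexp.MustCompile(`/\*(.*?)\*/`)

func main() {
	query := []byte("/*application:ghostferry*/ INSERT INTO t VALUES (1)")

	// Before: string(query) copies the whole byte slice into a new string
	// for every binlog event, just to run the regex.
	_ = annotationRegex.FindStringSubmatch(string(query))

	// After: FindSubmatch runs the same regex directly on the byte slice,
	// skipping the per-event string conversion.
	if m := annotationRegex.FindSubmatch(query); m != nil {
		fmt.Println(string(m[1])) // application:ghostferry
	}
}
```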

Note: a similar thing may have occurred in this StackOverflow question, but we didn't investigate further.

defer leak investigation

Since there's likely a second memory leak, we looked again with pprof and saw the following:

[Image: pprof heap profile graph (defer-pprof)]

as well as the following code trace:

[Image: pprof source code trace (defer-source)]

There are only a few places where we use context in Ghostferry, and we did recently touch the context handling code in the BinlogStreamer. In that PR, we (myself included, as one of the reviewers) incorrectly concluded that we should call defer cancel() within the binlog streamer main loop. Since the binlog streamer main loop does not exit until Ghostferry exits, the deferred function calls accumulate without bound, which causes our memory leak. It's also possible that by not cancelling the context.WithDeadline() context, we leak memory further.
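For reference, the shape of the bug and the fix look roughly like this (the loop body and deadline are placeholders, not Ghostferry's actual streamer code):

```go
package binlogexample

import (
	"context"
	"time"
)

// Leaky: defer only runs when the enclosing function returns. In a loop that
// lives as long as the process, every iteration adds another pending cancel
// call (and keeps its context alive), so memory grows without bound.
func runLoopLeaky(step func(context.Context)) {
	for {
		ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(5*time.Second))
		defer cancel()
		step(ctx)
	}
}

// Fixed: release each context at the end of its iteration, for example by
// calling cancel directly (or by wrapping the iteration body in a function
// so the defer runs once per iteration).
func runLoopFixed(step func(context.Context)) {
	for {
		ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(5*time.Second))
		step(ctx)
		cancel()
	}
}
```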

Once this is fixed, we get the yellow line.

Since at the beginning of the investigation we thought we might have been impacted by the MADV_FREE problem, we turned on MADV_DONTNEED via an environment variable. After fixing both leaks, I decided to test whether we actually need that environment variable. The tests suggest that we are not impacted, as the difference in RSS usage remains negligible.

Note: you might notice that the slopes of the blue and orange lines appear to be similar. It is possible that the string leak caused a large increase in RSS usage at the beginning, and the rest may be due to the defer leak. I didn't try to isolate just the string leak and observe its behaviour.