Update: this is a new shutdown race bug, but isn't the droid we're looking for.
Briefly, we're wedged in a tight loop here:
goroutine 12367 [runnable]:
github.com/estuary/flow/go/shuffle.(*read).next(0xc004a4bb80, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00508f740, 0x0)
/home/runner/work/flow/flow/go/shuffle/read.go:380 +0x85
github.com/estuary/flow/go/shuffle.StartReplayRead.func1(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/runner/work/flow/flow/go/shuffle/reader.go:108 +0x10e
It happens because we're pulling a zero-valued readResult from a closed channel. The channel was closed without delivering a final error, because we use a non-blocking, best-effort send of that final error. Since no error was ever delivered, the reader never observes a terminal condition and sits in a tight loop re-reading the closed channel.
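For illustration, here's a minimal sketch of that failure mode and the obvious remedy (the names are hypothetical stand-ins, not the actual types in go/shuffle/read.go):

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// readResult is a stand-in for the shuffle package's per-read result type.
type readResult struct {
	doc string
	err error
}

// finish does a non-blocking, best-effort send of the terminal error before
// closing the channel. If no receiver is ready, the error is silently dropped.
func finish(ch chan readResult, err error) {
	select {
	case ch <- readResult{err: err}:
	default: // best-effort: the final error may never be delivered
	}
	close(ch)
}

// next models the buggy consumer: receiving from a closed channel yields a
// zero-valued readResult with a nil error, so the caller never sees a
// terminal condition and spins, calling next() again and again.
func next(ch chan readResult) (string, error) {
	rr := <-ch
	return rr.doc, rr.err
}

// nextFixed treats a closed channel as an explicit terminal condition.
func nextFixed(ch chan readResult) (string, error) {
	rr, ok := <-ch
	if !ok {
		return "", io.EOF // channel closed without a final error
	}
	return rr.doc, rr.err
}

func main() {
	ch := make(chan readResult) // unbuffered: the best-effort send is often dropped
	go finish(ch, errors.New("shutdown"))

	// With next(), a dropped final error means looping forever on zero values.
	// With nextFixed(), the closed channel surfaces as io.EOF.
	if _, err := nextFixed(ch); err != nil {
		fmt.Println("read terminated:", err)
	}
}
```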
I captured these from a recent build that's been getting OOM-killed:
Another suspicious trace in the shuffle package:
goroutine 511 [chan send, 40 minutes]:
github.com/estuary/flow/go/shuffle.StartReadingMessages.func1(0xc000a1f920, 0xc000fa3380, 0x2e27608, 0xc0008b70c0)
/home/runner/work/flow/flow/go/shuffle/reader.go:81 +0x1f2
created by github.com/estuary/flow/go/shuffle.StartReadingMessages
/home/runner/work/flow/flow/go/shuffle/reader.go:69 +0x2ae
It's blocked attempting to send into the channel of message envelopes delivered to the consumer framework.
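A hedged sketch of how a pump goroutine like this can wedge, and the usual remedy of also selecting on context cancellation (the names here are hypothetical, not the consumer framework's actual API):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// envelope is a stand-in for the message envelopes handed to the consumer.
type envelope struct{ seq int }

// pump forwards envelopes to the consumer's channel. A bare `out <- env`
// blocks forever if the consumer stops draining the channel, which is how a
// goroutine ends up parked in "chan send" for 40 minutes. Selecting on
// ctx.Done() lets it exit on shutdown instead.
func pump(ctx context.Context, in <-chan envelope, out chan<- envelope) error {
	for env := range in {
		select {
		case out <- env:
		case <-ctx.Done():
			return ctx.Err() // consumer went away; stop pumping
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	in := make(chan envelope)
	out := make(chan envelope) // nobody reads from this: simulates a stalled consumer

	done := make(chan error, 1)
	go func() { done <- pump(ctx, in, out) }()

	in <- envelope{seq: 1} // pump accepts this but cannot send it onward

	// Cancellation stands in for the consumer framework tearing down the read.
	time.Sleep(10 * time.Millisecond)
	cancel()
	fmt.Println("pump exited with:", <-done)
}
```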
I've got a pretty good reproduction using the citi-bike dataset. Findings so far:
Okay! Managed to wire up heaptrack to my reproduction and got some interesting results after letting it run for a while:
heaptrack stats:
allocations: 277515513
leaked allocations: 2346
temporary allocations: 767061
Notably, the "leaks" are all compiled static constants like regex's. So there are no actual resource leak.
Analyzing the dump, heaptrack reports a peak heap usage of 300MB and a peak RSS of 2.1GB. We've got a very fragmented heap.
Still analyzing the output, but 99% of the allocation work is, of course, serde_json::Value.
I switched Flow over to using jemalloc in libbindings. I wrote a simple Flow catalog that captures citi-bike rides from S3 and runs them through a derivation that's just an identity function. Here are some graphs!
Here are the drained document counts, which correlate directly to transactions. The left side of the graph shows the run with jemalloc, and the right side shows the run with the system allocator.
Here's a graph of allocated memory for the run that's using the system allocator. Notice how the drop in allocations is correlated with the commit of each derivation transaction.
Here's the allocated-memory graph from the run using jemalloc. Again, you can see how the drops correlate with the derivation transactions.
Here's the resident set size, again with the jemalloc version on the left and system allocator on the right. With jemalloc, the RSS goes down immediately as allocations are freed.
At least in this example, I don't think heap fragmentation was actually a major factor, even with the system allocator. The RSS got up to 12.5GiB, which isn't much higher than the peak allocated memory of ~12GiB. It seems that, in this case, the major benefit of jemalloc is simply that it returns unused memory to the OS, whereas the system allocator holds onto it. There also appears to be a real performance win: the throughput of the derivation was significantly higher with jemalloc.
This issue has been a smorgasbord as we've knocked out little bugs and investigated performance.
Key outcomes are:
There's more to do here, but future work will be under separate issues.
( To be filled in with details )