PeterBooker / wpds

Slurps the latest stable version of every Plugin and/or Theme in the WordPress Directory.

Explore reducing memory usage #6

Open PeterBooker opened 6 years ago

PeterBooker commented 6 years ago

Running a full slurp at a concurrency of 100-200 can easily result in memory usage of 3GB+. We should explore ways to bring this down and ease the pressure on the host machine while slurping.

The pprof tool attributes most of the memory usage to holding each zip file in memory while we download and then extract it.
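For context, a minimal sketch of the pattern pprof is pointing at (function name and structure are illustrative, not the actual wpds code): the whole archive is buffered before extraction, so memory scales with archive size times concurrency.

```go
package download

import (
	"archive/zip"
	"bytes"
	"io"
	"log"
	"net/http"
)

// downloadAndExtract is a hypothetical stand-in for the current flow:
// the full zip is read into memory, then extracted from that buffer.
func downloadAndExtract(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// The entire archive (6 KB to 20+ MB) lives in this buffer...
	buf := new(bytes.Buffer)
	if _, err := io.Copy(buf, resp.Body); err != nil {
		return err
	}

	// ...and must stay resident until every file has been extracted,
	// because zip.NewReader needs random access to the archive.
	zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		return err
	}
	for _, f := range zr.File {
		log.Println(f.Name) // actual extraction elided
	}
	return nil
}
```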

PeterBooker commented 6 years ago

I explored using a sync.Pool to share bytes.Buffers between downloads, but due to the difficulty of sizing byte slices and the large variance in zip sizes (6 KB to over 20 MB) it results in high memory usage too. This could potentially be reduced by holding several pools of bytes.Buffers at different sizes, but that adds a large amount of complexity.
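For reference, the pooled-buffer approach looked roughly like this (a sketch, not the committed code). The core problem is that Reset keeps the underlying capacity, so buffers in the pool drift toward the size of the largest archives they have held.

```go
package download

import (
	"bytes"
	"sync"
)

// bufPool reuses bytes.Buffers across downloads.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func getBuffer() *bytes.Buffer {
	buf := bufPool.Get().(*bytes.Buffer)
	// Reset empties the buffer but keeps its capacity, so a buffer that
	// once held a 20 MB archive continues to pin 20 MB of memory.
	buf.Reset()
	return buf
}

func putBuffer(buf *bytes.Buffer) {
	bufPool.Put(buf)
}
```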

A tip from the #performance channel on the Go Slack community has led to a potentially large saving. By streaming the zip archive to disk as it downloads, we can significantly reduce memory usage during the download. We can then pass a file handle to the zip extractor, which streams the relevant parts of the file as needed. Because both stages only stream data, the executable holds far less in memory at any one time. The cost is writing the zip archive to disk, but the potential memory savings seem worth exploring.
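A rough sketch of what that flow looks like (file naming and error handling are illustrative): only io.Copy's small internal buffer is held in memory during the download, and the *os.File satisfies the io.ReaderAt that zip.NewReader needs.

```go
package download

import (
	"archive/zip"
	"io"
	"net/http"
	"os"
)

func downloadToDiskAndExtract(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	tmp, err := os.CreateTemp("", "wpds-*.zip")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name())
	defer tmp.Close()

	// Stream the archive straight to disk; memory use is bounded by
	// io.Copy's internal buffer, not the archive size.
	size, err := io.Copy(tmp, resp.Body)
	if err != nil {
		return err
	}

	// The file handle doubles as the io.ReaderAt the zip reader needs,
	// so extraction reads the relevant parts of the file on demand.
	zr, err := zip.NewReader(tmp, size)
	if err != nil {
		return err
	}
	for _, f := range zr.File {
		rc, err := f.Open()
		if err != nil {
			return err
		}
		// copy rc to its destination here; elided for brevity
		rc.Close()
	}
	return nil
}
```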

An initial test at low concurrency reduced memory usage from 600 MB to 50 MB. I will now test a full slurp at high concurrency.

PeterBooker commented 6 years ago

The latest update is that a slurp never completes with the current code. While memory usage is reduced, I cannot identify what is causing the block. At some point it simply stops downloading and extracting archives.

A stack trace taken while it is blocked shows the goroutines stuck at net/http.RoundTrip and net/http.setRequestCancel, whereas near the start many are at net/http.WriteLoop. I do not currently understand the mechanics of Go's HTTP requests well enough to read much into this, so I will continue researching.
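For anyone else digging into this: one convenient way to grab that goroutine dump while the process is wedged (standard tooling, not something already wired into wpds) is to expose the runtime profiles over HTTP and fetch /debug/pprof/goroutine?debug=2.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"time"
)

func main() {
	// Serve pprof endpoints on localhost. While the slurp is wedged,
	// `curl localhost:6060/debug/pprof/goroutine?debug=2` dumps where
	// every goroutine is blocked.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the normal slurp would run here; a sleep stands in for it.
	time.Sleep(time.Hour)
}
```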

It also occurred to me that I have drastically changed how disk I/O works without fully understanding the implications. It is entirely possible that the disk I/O is blocking the HTTP requests, given that each download is now tied directly to a disk write.

Either way, this needs more investigation and research. Very promising but currently flawed.