bloomberg / memray

Memray is a memory profiler for Python
https://bloomberg.github.io/memray/
Apache License 2.0

Support writes to S3 #462

Closed · pedro93 closed this issue 1 year ago

pedro93 commented 1 year ago

Is your feature request related to a problem?

I have an issue when using Memray on Python processes running in Kubernetes: when the pod running the process gets evicted or OOM-killed, I lose access to the Memray capture files unless I configure persistent volumes, and those complicate my deployment significantly.

Describe the solution you'd like

I would love it if Memray could support writing to S3 rather than to a local file.

Alternatives you considered

Using a persistent volume claim in the Python pods that I want to profile.

godlygeek commented 1 year ago

Unfortunately, this doesn't seem architecturally feasible, for a few different reasons.

First off, as things stand today, we've got two different output methods. One writes records to disk using memory-mapped file IO, the other streams records over a socket. Both of these handle the OOM killer fairly gracefully, as intermediate data is continually written while the process is running, but that's not easy to do with an S3 output. We can't do memory mapped IO with S3, nor can we feasibly write each individual record to S3 (the records are generally only a handful of bytes, some as small as 1 byte). We could in theory do some in-memory batching and only write to S3 when the buffer has filled up, but then if the OOM killer kills your process you lose any buffered data, including which particular allocation put you over the top - which is likely the most valuable thing to learn.
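To make the buffering tradeoff concrete, here is a hypothetical sketch of the in-memory batching approach described above (this is not Memray code; the class, bucket, and buffer size are made up for illustration). Records accumulate in a buffer and only reach S3 when the buffer fills, so anything still buffered when the OOM killer fires is lost:

```python
import io

import boto3


class BufferedS3Writer:
    """Hypothetical batching writer; illustration only, not part of Memray."""

    def __init__(self, bucket, key_prefix, flush_bytes=5 * 1024 * 1024):
        self._s3 = boto3.client("s3")
        self._bucket = bucket
        self._key_prefix = key_prefix
        self._flush_bytes = flush_bytes
        self._buffer = io.BytesIO()
        self._part = 0

    def write_record(self, record: bytes) -> None:
        # Records are tiny (sometimes a single byte), so each one is
        # appended to an in-memory buffer rather than uploaded individually.
        self._buffer.write(record)
        if self._buffer.tell() >= self._flush_bytes:
            self.flush()

    def flush(self) -> None:
        # Only now does anything reach S3. If the OOM killer fires before
        # this point, everything still in the buffer is lost, including
        # the allocation records that pushed memory over the top.
        self._s3.put_object(
            Bucket=self._bucket,
            Key=f"{self._key_prefix}.part{self._part:05d}",
            Body=self._buffer.getvalue(),
        )
        self._part += 1
        self._buffer = io.BytesIO()
```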

Secondly, all the record writing happens in C++. We can't reasonably move this to Python land, because the record writing occurs while the GIL is not held. We could buffer records until the GIL can be acquired and some Python writer could pick them up and process them, but that would require arbitrarily large buffers, potentially causing the process's memory usage to balloon and exacerbating whatever memory problem you're trying to debug. Since this record writing happens in C++, we'd need to use a C++ S3 SDK to upload to S3, but that's quite a heavy dependency for us to pick up. Even if we decided to forgo an Amazon SDK and roll the S3 uploads by hand, we'd still need to pick up new dependencies for the crypto required to sign the requests, and for the HTTP requests themselves. These new dependencies would make Memray considerably harder for Linux distributions to pick up and package, and would make it considerably harder to install Memray in some of the more locked down corporate environments.

Have you considered rolling this yourself by driving things through the Memray API? Off the top of my head, it seems like it would be reasonable for you to create a Tracker and enter its context (either using a with block or an explicit call to __enter__), and then monitor your own memory usage using something like resource.getrusage(). Once your memory usage exceeds some threshold, you could cause the Tracker to exit (either by throwing an exception that causes the with block to be exited, or by calling its __exit__ method explicitly). Once the tracker has exited, its output file will have been flushed, and you can upload it to S3 yourself using something like boto3.
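For concreteness, here is a minimal sketch of that workaround. The output path, threshold, bucket, key, and workload functions are hypothetical placeholders; note that ru_maxrss is reported in KiB on Linux but in bytes on macOS:

```python
import resource

import boto3
import memray

OUTPUT_PATH = "/tmp/memray-capture.bin"  # local file the Tracker flushes on exit
RSS_LIMIT_KIB = 2 * 1024 * 1024  # hypothetical threshold: 2 GiB, in KiB as Linux reports it


def work_items():
    """Placeholder: yield your application's units of work."""
    yield from range(1_000_000)


def process(item):
    """Placeholder: one unit of the real work being profiled."""


class MemoryThresholdExceeded(Exception):
    """Raised to make the Tracker's with block exit early."""


try:
    with memray.Tracker(OUTPUT_PATH):
        for item in work_items():
            process(item)
            # ru_maxrss is peak RSS: KiB on Linux, bytes on macOS.
            if resource.getrusage(resource.RUSAGE_SELF).ru_maxrss > RSS_LIMIT_KIB:
                raise MemoryThresholdExceeded
except MemoryThresholdExceeded:
    pass  # the Tracker has exited, so the capture file is fully flushed

# Ship the flushed capture off the pod before it can be evicted.
boto3.client("s3").upload_file(OUTPUT_PATH, "my-profiling-bucket", "captures/memray-capture.bin")
```

Using a with block plus an exception keeps the flush-on-exit guarantee even if the workload itself raises; calling Tracker.__enter__ and __exit__ explicitly works the same way if you need to manage the lifetime manually.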

pablogsal commented 1 year ago

I agree with @godlygeek. I am closing this as per the above, but feel free to keep the discussion going if you need to.