Thanks for opening this issue @nrs-status. I tried using your Docker container but I cannot reproduce your issue. Indeed, `memray` correctly shows the leak both in the plots and the flamegraph. Take a look:
That shows that the resident memory (and the heap memory) doesn't stop growing.
And in this flamegraph you can see that most memory comes from the pytorch profiler (`torch::autograd::profiler::disableProfiler() at <unknown>:0`).
Notice that the resident size is much much bigger than the heap size. Check out the docs to understand why it could be:
We will investigate to see if we can understand why the resident memory is growing without the heap memory growing.
We did some investigation and this is indeed heap fragmentation. We profile a single iteration (between two hits of the breakpoint) and memray correctly accounts for every allocation and deallocation that happens. The problem is that internally `posix_memalign` (which is the allocator that the PyTorch profiler is using underneath) calls `brk`, and when freeing the pointer it never shrinks the heap (the heap always grows - also notice that "heap" here is the `brk`-based heap). You can check this by running `strace -e brk -p PID` between iterations:
brk(0xaaab3660f000) = 0xaaab3660f000
brk(0xaaab3678f000) = 0xaaab3678f000
brk(0xaaab3690f000) = 0xaaab3690f000
brk(0xaaab36a8f000) = 0xaaab36a8f000
brk(0xaaab36c0f000) = 0xaaab36c0f000
brk(0xaaab36d8f000) = 0xaaab36d8f000
brk(0xaaab36f0f000) = 0xaaab36f0f000
brk(0xaaab3714f000) = 0xaaab3714f000
brk(0xaaab3750f000) = 0xaaab3750f000
brk(0xaaab37e0f000) = 0xaaab37e0f000
brk(0xaaab3840f000) = 0xaaab3840f000
brk(0xaaab3720f000) = 0xaaab3720f000
brk(0xaaab3744f000) = 0xaaab3744f000
brk(0xaaab3750f000) = 0xaaab3750f000
brk(0xaaab3768f000) = 0xaaab3768f000
brk(0xaaab3780f000) = 0xaaab3780f000
brk(0xaaab3798f000) = 0xaaab3798f000
brk(0xaaab37b0f000) = 0xaaab37b0f000
brk(0xaaab37c8f000) = 0xaaab37c8f000
brk(0xaaab37ecf000) = 0xaaab37ecf000
brk(0xaaab38290000) = 0xaaab38290000
brk(0xaaab38b90000) = 0xaaab38b90000
brk(0xaaab39190000) = 0xaaab39190000
brk(0xaaab37e10000) = 0xaaab37e10000
brk(0xaaab38050000) = 0xaaab38050000
brk(0xaaab38110000) = 0xaaab38110000
brk(0xaaab38290000) = 0xaaab38290000
brk(0xaaab38410000) = 0xaaab38410000
brk(0xaaab38590000) = 0xaaab38590000
brk(0xaaab38710000) = 0xaaab38710000
brk(0xaaab38890000) = 0xaaab38890000
brk(0xaaab38a10000) = 0xaaab38a10000
brk(0xaaab38c50000) = 0xaaab38c50000
brk(0xaaab39310000) = 0xaaab39310000
brk(0xaaab39910000) = 0xaaab39910000
brk(0xaaab39a90000) = 0xaaab39a90000
brk(0xaaab3a390000) = 0xaaab3a390000
brk(0xaaab3a990000) = 0xaaab3a990000
brk(0xaaab3b110000) = 0xaaab3b110000
brk(0xaaab3b710000) = 0xaaab3b710000
brk(0xaaab3a210000) = 0xaaab3a210000
brk(0xaaab3a450000) = 0xaaab3a450000
brk(0xaaab3a510000) = 0xaaab3a510000
brk(0xaaab3a690000) = 0xaaab3a690000
brk(0xaaab3a810000) = 0xaaab3a810000
brk(0xaaab3a990000) = 0xaaab3a990000
brk(0xaaab3ab10000) = 0xaaab3ab10000
brk(0xaaab3ac90000) = 0xaaab3ac90000
brk(0xaaab3aed0000) = 0xaaab3aed0000
brk(0xaaab3b590000) = 0xaaab3b590000
brk(0xaaab3bb90000) = 0xaaab3bb90000
As you can see, the heap pointer always grows. Notice that `brk` being called is an implementation detail of glibc when it calls `posix_memalign`, and what's being "leaked" is the resident size (because the actual allocation is being freed later and memray doesn't complain about it). The problem is that this leaves the heap fragmented. You can read more about this here:
https://bloomberg.github.io/memray/memory.html#memory-can-be-fragmented
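For anyone who wants to see this effect outside of PyTorch, here is a minimal sketch (my own illustration, assuming Linux with glibc; it is not taken from the repro) that drives `posix_memalign`/`free` through `ctypes` and watches the program break with `sbrk(0)`, which is the same pointer the `brk` calls above are moving:

```python
# Minimal sketch, assuming Linux + glibc: keep one block from every batch alive so the
# top of the brk-based heap can never be trimmed, then watch the program break.
import ctypes

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.posix_memalign.argtypes = [ctypes.POINTER(ctypes.c_void_p),
                                ctypes.c_size_t, ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]
libc.sbrk.restype = ctypes.c_void_p
libc.sbrk.argtypes = [ctypes.c_long]

pinned = []  # blocks we never free, pinning the top of the heap

for iteration in range(5):
    batch = []
    for _ in range(2000):
        p = ctypes.c_void_p()
        # 64-byte aligned, 32 KiB: small enough to be served from the brk heap
        # rather than mmap (below glibc's default M_MMAP_THRESHOLD of 128 KiB).
        libc.posix_memalign(ctypes.byref(p), 64, 32 * 1024)
        batch.append(p)
    pinned.append(batch.pop())  # "leak" the topmost block of this batch on purpose
    for p in batch:
        libc.free(p)            # everything else is freed, and memray would see that...
    # ...yet the program break never comes back down, just like the strace output above.
    print(f"iteration {iteration}: program break at {hex(libc.sbrk(0))}")
```

The freed space below the pinned block stays inside glibc's free lists instead of going back to the kernel, which is exactly the resident-vs-heap gap described above.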
I am closing the issue as this is not a problem in memray.
Thank you very much for taking the time; I wasn't making proper use of the heap vs. resident distinction. I realize this might not be a memray issue, but I'm sharing the extra info in case it is of any use. I got a bit lost in my debugging attempts and failed to notice that the flamegraph in the OP reported the existence of the profiler, but if you run the test again by cloning my repo https://github.com/nrs-status/shared (I've included the .bin this time) and building the Dockerfile there, the report instead looks like the next image. As you can see, there would have been no way to infer that the problem was mainly due to the PyTorch profiler.
Sorry for not giving the proper setup; I was doing some debugging on my own and stopped midway to make the Docker image. Also, thanks for the explanation of how the PyTorch profiler allocates memory.
As an aside, is there any way to profile the heap memory to be able to detect cases like these in the future? Cheers!
I modified `memray` to show in the plot the memory that's fragmented (taken from calling `mallinfo2()`):
As an aside, is there any way to profile the heap memory to be able to detect cases like these in the future? Cheers!
I think we can do something like I showed in the previous plot. It's still unclear how someone could fix things with this info, but at least you can know what's happening.
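For reference, something along these lines can read the same counters from an unmodified process. This is only a sketch of mine, assuming glibc 2.33 or newer (where `mallinfo2` was added), with the field layout taken from the `mallinfo2(3)` man page:

```python
# Rough fragmentation probe via glibc's mallinfo2(); assumes Linux + glibc >= 2.33.
import ctypes

class Mallinfo2(ctypes.Structure):
    # Field order follows the mallinfo2(3) man page.
    _fields_ = [(name, ctypes.c_size_t) for name in (
        "arena", "ordblks", "smblks", "hblks", "hblkhd",
        "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost")]

libc = ctypes.CDLL("libc.so.6")
libc.mallinfo2.restype = Mallinfo2

def heap_report() -> str:
    mi = libc.mallinfo2()
    # arena: total size of the brk-based heap; uordblks: bytes currently in use;
    # fordblks: bytes that are free but still held by the allocator (the "fragmented" part).
    return (f"heap={mi.arena / 1e6:.1f} MB, "
            f"in-use={mi.uordblks / 1e6:.1f} MB, "
            f"free-but-not-returned={mi.fordblks / 1e6:.1f} MB")

# For example, print this between iterations / breakpoint hits to watch fragmentation grow.
print(heap_report())
```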
Very cool for the modification to memray! On my side, I've continued playing with this issue for a bit, this time using `heaptrack` and profiling CPython itself. Here's the result of an allocations flamegraph:
The above makes evident that the torch profiler is making a disproportionate number of allocations relative to the rest of the program. Maybe this could help diagnose fragmentation issues with memray? I'm seeing we already have those numbers available on the current flamegraphs when we hover with mouse
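(For reference, a `heaptrack` capture of a Python program looks roughly like this; the output file name below is illustrative, since it varies by version and PID.)

```console
heaptrack python3 bert.py                    # writes something like heaptrack.python3.<pid>.gz
heaptrack_gui heaptrack.python3.<pid>.gz     # the flame graph view can be switched to allocation counts
```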
The above makes evident that the torch profiler is making a disproportionate number of allocations relative to the rest of the program.
Isn't that more or less the same information as in the flamegraphs in this comment: https://github.com/bloomberg/memray/issues/565#issuecomment-1991197469 ?
Maybe this could help diagnose fragmentation issues with memray? I'm seeing we already have those numbers available on the current flamegraphs when we hover with mouse
What numbers are you referring to? I guess I am failing to see what information is helping you here from the heaptrack flamegraph
With respect to the difference with your flamegraphs: the run you benchmarked indeed suggested that the profiler might be related to the memory issue. But take a look at my first flamegraph screenshot:
As you can see, it is totally impossible, using only total memory allocation by size, to see that the torch profiler is in any way related to the problem at hand. I'm dealing with a similar situation in my current benchmark: considering total memory allocation, the torch profiler uses a total of 0 to 20 MB at most out of a max usage of 700 MB, so it would be totally invisible if only considered with respect to this metric. Yet, if we look at the total allocations not by size but by the simple number of times an allocation has been made, it represents a staggering 42% of total allocations (I can send you the Dockerfile with the `heaptrack` setup if you want).
With respect to the numbers I'm referring to: the current memray flamegraph shows the number of allocations when you hover over the components of the flamegraph.
As you can see, it is totally impossible, using only total memory allocation by size, to see that the torch profiler is in any way related to the problem at hand.
To include C++ code you need to pass `--native` to `memray run` (that's how I generated my run). That will allow you to get the same information you are getting in the flamegraph that you are showing. It will include the same information as `heaptrack` is giving you, plus the Python frames instead of `_PyEval_EvalFrameDefault`.
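(For anyone following along, the invocation is roughly the following; the file names are just placeholders.)

```console
memray run --native -o bert_native.bin bert.py
memray flamegraph bert_native.bin   # HTML flamegraph with native C/C++ frames interleaved with Python ones
```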
Yet, if we look at the total allocations not by size but by the simple number of times an allocation has been made, it represents a staggering 42% of total allocations (I can send you the Dockerfile with the `heaptrack` setup if you want)
Ah, this is an interesting point
Ah my bad, I completely missed the `--native` option.
Check out https://bloomberg.github.io/memray/run.html#native-tracking
Can confirm that my new flamegraph looks a bit more similar to yours, but we have the same phenomenon I pointed out: the profiler has a max usage of about 30% of the max allocation by size, but in my run the torch profiler was doing 99% of allocations by simple number of allocations (lol!). This is using the exact setup available in my shared repo: https://github.com/nrs-status/shared
We capture that information, so it's there. You can get at it a bit better in other reporters such as `memray summary`, where you can sort by number of allocations. The flamegraph doesn't allow changing the size by number of allocations, but we can modify it to allow that.
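(A sketch of that, with a placeholder capture name; `memray summary --help` lists the available sorting options.)

```console
# Per-location table; one of its columns is the allocation count.
memray summary bert_native.bin
```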
Yeah, thanks for pointing out the `--native` option, which was the main thing I was missing. Thanks for taking the time!
Another bit of feedback (I imagine you guys are pretty busy, so I hope these are at least useful and not just a bother heheh): using the summary command, the allocations indeed show up, but not the profiler (it shows up, but not with the enormous allocation count, like it does in the flamegraph when you hover over it). I've updated the `.bin` file in my shared repo to be the last one I've been reporting on.
Is there an existing issue for this?
Current Behavior
`memray summary` shows that memory does not go above a certain threshold, yet sampling current memory usage with `memory_profiler` and watching `htop` shows there's a memory leak. I fixed my own issue: it was caused by `pytorch`'s profiler, so removing the pytorch profiler lines in the code that will be provided will fix the memory leak. But the main issue is that memray wasn't catching it.
Sorry in advance for the VERY sketchy Dockerfile and code; I put it together quickly by request from a Jupyter notebook I was working from, as well as my previous debugging attempts, and currently don't have time to clean it.
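(For the record, the RSS sampling mentioned above can be done with something like the following; `train_loop` is a placeholder for the leaking code.)

```python
# Sample the process RSS (in MiB) with memory_profiler while the suspect code runs.
from memory_profiler import memory_usage

def train_loop():
    ...  # placeholder for the code under test

samples = memory_usage((train_loop, (), {}), interval=1)
print(f"start={samples[0]:.1f} MiB, peak={max(samples):.1f} MiB, end={samples[-1]:.1f} MiB")
```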
Steps To Reproduce
Here follows a Dockerfile where the problem should be reproducible. Upon running it, run `memray run bert.py` and exit whenever you want to produce the bin, then get the bin with `docker cp`.
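(All names below are placeholders for the image, container, and paths from the Dockerfile; adjust as needed.)

```console
docker build -t memray-repro .
docker run -it --name repro memray-repro       # inside the container: memray run bert.py, then exit
docker cp repro:/path/to/the/output.bin .      # the exact path depends on the container's working directory
```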
Memray Version
1.11.0
Python Version
3.11
Operating System
Linux
Anything else?