Thanks for opening this issue @nrs-status. I tried using your Docker container but I cannot reproduce your issue. Indeed, `memray` correctly shows the leak both in the plots and the flamegraph. Take a look:
That shows that the resident memory (and the heap memory) doesn't stop growing.
And in this flamegraph you can see that most memory comes from the pytorch profiler (`torch::autograd::profiler::disableProfiler() at <unknown>:0`).
Notice that the resident size is much much bigger than the heap size. Check out the docs to understand why it could be:
We will investigate to see if we can understand why the resident memory is growing without the heap memory growing.
We did some investigation and this is indeed heap fragmentation. We profile a single iteration (between two hits of the breakpoint) and memray correctly accounts for every allocation and deallocation that happens. The problem is that internally `posix_memalign` (which is the allocator that the PyTorch profiler is using underneath) calls `brk`, and when freeing the pointer it never shrinks the heap (the heap always grows - also notice that "heap" here is the `brk`-based heap). You can check this by running `strace -e brk -p PID` between iterations:
brk(0xaaab3660f000) = 0xaaab3660f000
brk(0xaaab3678f000) = 0xaaab3678f000
brk(0xaaab3690f000) = 0xaaab3690f000
brk(0xaaab36a8f000) = 0xaaab36a8f000
brk(0xaaab36c0f000) = 0xaaab36c0f000
brk(0xaaab36d8f000) = 0xaaab36d8f000
brk(0xaaab36f0f000) = 0xaaab36f0f000
brk(0xaaab3714f000) = 0xaaab3714f000
brk(0xaaab3750f000) = 0xaaab3750f000
brk(0xaaab37e0f000) = 0xaaab37e0f000
brk(0xaaab3840f000) = 0xaaab3840f000
brk(0xaaab3720f000) = 0xaaab3720f000
brk(0xaaab3744f000) = 0xaaab3744f000
brk(0xaaab3750f000) = 0xaaab3750f000
brk(0xaaab3768f000) = 0xaaab3768f000
brk(0xaaab3780f000) = 0xaaab3780f000
brk(0xaaab3798f000) = 0xaaab3798f000
brk(0xaaab37b0f000) = 0xaaab37b0f000
brk(0xaaab37c8f000) = 0xaaab37c8f000
brk(0xaaab37ecf000) = 0xaaab37ecf000
brk(0xaaab38290000) = 0xaaab38290000
brk(0xaaab38b90000) = 0xaaab38b90000
brk(0xaaab39190000) = 0xaaab39190000
brk(0xaaab37e10000) = 0xaaab37e10000
brk(0xaaab38050000) = 0xaaab38050000
brk(0xaaab38110000) = 0xaaab38110000
brk(0xaaab38290000) = 0xaaab38290000
brk(0xaaab38410000) = 0xaaab38410000
brk(0xaaab38590000) = 0xaaab38590000
brk(0xaaab38710000) = 0xaaab38710000
brk(0xaaab38890000) = 0xaaab38890000
brk(0xaaab38a10000) = 0xaaab38a10000
brk(0xaaab38c50000) = 0xaaab38c50000
brk(0xaaab39310000) = 0xaaab39310000
brk(0xaaab39910000) = 0xaaab39910000
brk(0xaaab39a90000) = 0xaaab39a90000
brk(0xaaab3a390000) = 0xaaab3a390000
brk(0xaaab3a990000) = 0xaaab3a990000
brk(0xaaab3b110000) = 0xaaab3b110000
brk(0xaaab3b710000) = 0xaaab3b710000
brk(0xaaab3a210000) = 0xaaab3a210000
brk(0xaaab3a450000) = 0xaaab3a450000
brk(0xaaab3a510000) = 0xaaab3a510000
brk(0xaaab3a690000) = 0xaaab3a690000
brk(0xaaab3a810000) = 0xaaab3a810000
brk(0xaaab3a990000) = 0xaaab3a990000
brk(0xaaab3ab10000) = 0xaaab3ab10000
brk(0xaaab3ac90000) = 0xaaab3ac90000
brk(0xaaab3aed0000) = 0xaaab3aed0000
brk(0xaaab3b590000) = 0xaaab3b590000
brk(0xaaab3bb90000) = 0xaaab3bb90000
As you can see, the heap pointer always grows. Notice that `brk` being called is an implementation detail of glibc when it calls `posix_memalign`, and what's being "leaked" is the resident size (because the actual allocation is being freed later and memray doesn't complain about it). The problem is that this leaves the heap fragmented. You can read more about this here:
https://bloomberg.github.io/memray/memory.html#memory-can-be-fragmented
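For anyone who wants to see this effect outside of PyTorch, here is a minimal sketch (my own illustration, assuming Linux with glibc; it is not taken from the repro) that drives `posix_memalign`/`free` through `ctypes` and watches the program break with `sbrk(0)`, which is the same pointer the `brk` calls above are moving:

```python
# Minimal sketch, assuming Linux + glibc: keep one block from every batch alive so the
# top of the brk-based heap can never be trimmed, then watch the program break.
import ctypes

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.posix_memalign.argtypes = [ctypes.POINTER(ctypes.c_void_p),
                                ctypes.c_size_t, ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]
libc.sbrk.restype = ctypes.c_void_p
libc.sbrk.argtypes = [ctypes.c_long]

pinned = []  # blocks we never free, pinning the top of the heap

for iteration in range(5):
    batch = []
    for _ in range(2000):
        p = ctypes.c_void_p()
        # 64-byte aligned, 32 KiB: small enough to be served from the brk heap
        # rather than mmap (below glibc's default M_MMAP_THRESHOLD of 128 KiB).
        libc.posix_memalign(ctypes.byref(p), 64, 32 * 1024)
        batch.append(p)
    pinned.append(batch.pop())  # "leak" the topmost block of this batch on purpose
    for p in batch:
        libc.free(p)            # everything else is freed, and memray would see that...
    # ...yet the program break never comes back down, just like the strace output above.
    print(f"iteration {iteration}: program break at {hex(libc.sbrk(0))}")
```

The freed space below the pinned block stays inside glibc's free lists instead of going back to the kernel, which is exactly the resident-vs-heap gap described above.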
I am closing the issue as this is not a problem in memray.
Thank you very much for taking the time; I wasn't making proper use of the heap vs. resident distinction. I realize this might not be a memray issue, but I'm sharing the extra info in case it is of any use. I got a bit lost in my debugging attempts and failed to notice that the flamegraph in the OP reported the existence of the profiler, but if you run the test again by cloning my repo https://github.com/nrs-status/shared (I've included the .bin this time) and building the Dockerfile there, the report instead looks like the next image. As you can see, there would have been no way to infer that the problem was mainly due to the PyTorch profiler.
Sorry for not giving the proper setup; I was doing some debugging on my own and stopped midway to make the Docker image. Also, thanks for the explanation of how the PyTorch profiler allocates memory.
As an aside, is there any way to profile the heap memory to be able to detect cases like these in the future? Cheers!
I modified `memray` to show in the plot the memory that's fragmented (taken from calling `mallinfo2()`):
As an aside, is there any way to profile the heap memory to be able to detect cases like these in the future? Cheers!
I think we can do something like I showed in the previous plot. It's still unclear how someone could fix things with this info, but at least you can know what's happening.
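For reference, something along these lines can read the same counters from an unmodified process. This is only a sketch of mine, assuming glibc 2.33 or newer (where `mallinfo2` was added), with the field layout taken from the `mallinfo2(3)` man page:

```python
# Rough fragmentation probe via glibc's mallinfo2(); assumes Linux + glibc >= 2.33.
import ctypes

class Mallinfo2(ctypes.Structure):
    # Field order follows the mallinfo2(3) man page.
    _fields_ = [(name, ctypes.c_size_t) for name in (
        "arena", "ordblks", "smblks", "hblks", "hblkhd",
        "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost")]

libc = ctypes.CDLL("libc.so.6")
libc.mallinfo2.restype = Mallinfo2

def heap_report() -> str:
    mi = libc.mallinfo2()
    # arena: total size of the brk-based heap; uordblks: bytes currently in use;
    # fordblks: bytes that are free but still held by the allocator (the "fragmented" part).
    return (f"heap={mi.arena / 1e6:.1f} MB, "
            f"in-use={mi.uordblks / 1e6:.1f} MB, "
            f"free-but-not-returned={mi.fordblks / 1e6:.1f} MB")

# For example, print this between iterations / breakpoint hits to watch fragmentation grow.
print(heap_report())
```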
Very cool for the modification to memray! On my side, I've continued playing with this issue for a bit, this time using `heaptrack` and profiling CPython itself. Here's the result of an allocations flamegraph:
The above makes evident that the torch profiler is making a disproportionate number of allocations relative to the rest of the program. Maybe this could help diagnose fragmentation issues with memray? I'm seeing we already have those numbers available on the current flamegraphs when we hover with mouse
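(For reference, a `heaptrack` capture of a Python program looks roughly like this; the output file name below is illustrative, since it varies by version and PID.)

```console
heaptrack python3 bert.py                    # writes something like heaptrack.python3.<pid>.gz
heaptrack_gui heaptrack.python3.<pid>.gz     # the flame graph view can be switched to allocation counts
```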
The above makes evident that the torch profiler is making a disproportionate number of allocations relative to the rest of the program.
Isn't that more or less the same information as in the flamegraphs in this comment: https://github.com/bloomberg/memray/issues/565#issuecomment-1991197469 ?
Maybe this could help diagnose fragmentation issues with memray? I'm seeing we already have those numbers available on the current flamegraphs when we hover with mouse
What numbers are you referring to? I guess I am failing to see what information is helping you here from the heaptrack flamegraph
With respect to the difference with your flamegraphs: the run you benchmarked indeed suggested that the profiler might be related to the memory issue. But take a look at my first flamegraph screenshot:
As you can see, it is totally impossible, using only total memory allocation by size, to see that the torch profiler is in any way related to the problem at hand. I'm dealing with a similar situation in my current benchmark: considering total memory allocation, the torch profiler uses a total of 0 to 20 MB at most out of a max usage of 700 MB, so it would be totally invisible if only considered with respect to this metric. Yet, if we look at the total allocations not by size but by the simple number of times an allocation has been made, it represents a staggering 42% of total allocations (I can send you the Dockerfile with the `heaptrack` setup if you want).
With respect to the numbers I'm referring to: the current memray flamegraph shows the number of allocations when you hover over the components of the flamegraph.
As you can see, it is totally impossible, using only total memory allocation by size, to see that the torch profiler is in any way related to the problem at hand.
To include C++ code you need to pass `--native` to `memray run` (that's how I generated my run). That will allow you to get the same information you are getting in the flamegraph that you are showing. It will include the same information as `heaptrack` is giving you, plus the Python frames instead of `_PyEval_EvalFrameDefault`.
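(For anyone following along, the invocation is roughly the following; the file names are just placeholders.)

```console
memray run --native -o bert_native.bin bert.py
memray flamegraph bert_native.bin   # HTML flamegraph with native C/C++ frames interleaved with Python ones
```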
Yet, if we look at the total allocations not by size but by the simple number of times an allocation has been made, it represents a staggering 42% of total allocations (I can send you the Dockerfile with the `heaptrack` setup if you want)
Ah, this is an interesting point
Ah my bad, I completely missed the `--native` option.
Check out https://bloomberg.github.io/memray/run.html#native-tracking
Can confirm that my new flamegraph looks a bit more similar to yours, but we have the same phenomenon I pointed out: the profiler has a max usage of about 30% of the max allocation by size, but in my run the torch profiler was doing 99% of allocations by simple number of allocations (lol!). This is using the exact setup available in my shared repo: https://github.com/nrs-status/shared
We capture that information, so it's there. You can get at it a bit better in other reporters such as `memray summary`, where you can sort by number of allocations. The flamegraph doesn't allow changing the size by number of allocations, but we can modify it to allow that.
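(A sketch of that, with a placeholder capture name; `memray summary --help` lists the available sorting options.)

```console
# Per-location table; one of its columns is the allocation count.
memray summary bert_native.bin
```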
Yeah, thanks for pointing out the `--native` option, which was the main thing I was missing. Thanks for taking the time!
Another bit of feedback (I imagine you guys are pretty busy, so I hope these are at least useful and not just a bother heheh): using the summary command, the allocations indeed show up, but not the profiler (it shows up, but not with the enormous allocation count, like it does in the flamegraph when you hover over it). I've updated the `.bin` file in my shared repo to be the last one I've been reporting on.
Is there an existing issue for this?
Current Behavior
`memray summary` shows that memory does not go above a certain threshold, yet sampling current memory usage with `memory_profiler` and watching `htop` shows there's a memory leak. I fixed my own issue: it was caused by `pytorch`'s profiler, so removing the pytorch profiler lines in the code that will be provided will fix the memory leak. But the main issue is that memray wasn't catching it.
Sorry in advance for the VERY sketchy Dockerfile and code; I put it together quickly by request from a Jupyter notebook I was working from, as well as my previous debugging attempts, and currently don't have time to clean it.
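(For the record, the RSS sampling mentioned above can be done with something like the following; `train_loop` is a placeholder for the leaking code.)

```python
# Sample the process RSS (in MiB) with memory_profiler while the suspect code runs.
from memory_profiler import memory_usage

def train_loop():
    ...  # placeholder for the code under test

samples = memory_usage((train_loop, (), {}), interval=1)
print(f"start={samples[0]:.1f} MiB, peak={max(samples):.1f} MiB, end={samples[-1]:.1f} MiB")
```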
Steps To Reproduce
Here follows a Dockerfile where the problem should be reproducible. Upon running it, run `memray run bert.py` and exit whenever you want to produce the bin, then get the bin with `docker cp`.
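(All names below are placeholders for the image, container, and paths from the Dockerfile; adjust as needed.)

```console
docker build -t memray-repro .
docker run -it --name repro memray-repro       # inside the container: memray run bert.py, then exit
docker cp repro:/path/to/the/output.bin .      # the exact path depends on the container's working directory
```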
Memray Version
1.11.0
Python Version
3.11
Operating System
Linux
Anything else?