[Open] lilanxiao opened this issue 2 years ago
It's unlikely that there's a memory leak: over 10k iterations, the memory only fluctuates between 21.2 and 21.7.
When there's a memory leak, the memory consumption CONSTANTLY increases. Can you share your log?
# sparse_quantize
0, memory used: 21.2
100, memory used: 21.2
200, memory used: 21.5
300, memory used: 21.3
400, memory used: 21.4
500, memory used: 21.4
600, memory used: 21.5
700, memory used: 21.5
800, memory used: 21.5
900, memory used: 21.5
1000, memory used: 21.4
1100, memory used: 21.4
1200, memory used: 21.6
1300, memory used: 21.7
1400, memory used: 21.4
1500, memory used: 21.5
1600, memory used: 21.5
1700, memory used: 21.6
1800, memory used: 21.7
1900, memory used: 21.6
2000, memory used: 21.7
...
9800, memory used: 21.8
9900, memory used: 21.6
10000, memory used: 21.7
10100, memory used: 21.6
10200, memory used: 21.6
@chrischoy thank you for your reply!
That is really strange, because I see different behavior on my machine. As the log below shows, the RAM usage does increase CONSTANTLY. Maybe you can try more iterations?
0, memory used: 12.6
100, memory used: 12.7
200, memory used: 12.7
300, memory used: 12.7
400, memory used: 12.8
500, memory used: 12.8
600, memory used: 12.7
700, memory used: 12.8
800, memory used: 12.9
900, memory used: 12.9
1000, memory used: 12.6
1100, memory used: 12.6
1200, memory used: 12.7
1300, memory used: 12.6
1400, memory used: 12.7
1500, memory used: 12.6
1600, memory used: 12.8
1700, memory used: 12.9
1800, memory used: 12.9
1900, memory used: 13.0
.........
16900, memory used: 13.9
17000, memory used: 13.8
17100, memory used: 13.8
17200, memory used: 13.9
17300, memory used: 13.8
17400, memory used: 13.9
17500, memory used: 13.8
17600, memory used: 13.8
17700, memory used: 13.9
17800, memory used: 14.0
17900, memory used: 13.9
18000, memory used: 13.9
18100, memory used: 14.0
18200, memory used: 13.9
18300, memory used: 13.9
18400, memory used: 13.9
18500, memory used: 13.9
18600, memory used: 14.0
18700, memory used: 14.0
18800, memory used: 13.9
18900, memory used: 14.0
19000, memory used: 13.9
19100, memory used: 13.9
19200, memory used: 14.0
19300, memory used: 14.0
19400, memory used: 14.2
19500, memory used: 14.2
19600, memory used: 14.2
19700, memory used: 14.3
19800, memory used: 14.3
19900, memory used: 14.1
20000, memory used: 14.2
20100, memory used: 14.1
20200, memory used: 14.2
20300, memory used: 14.1
....
I've never seen a memory leak that doesn't leak for the first 10k iterations but only starts to leak afterwards. This looks more like OS-level memory management and garbage collection than a memory leak, but I'll do more analysis.
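One quick way to separate allocator caching from a real leak: on glibc-based Linux, you can force the allocator to hand freed pages back to the OS and watch whether resident memory drops. A minimal sketch (the libc path and the glibc assumption are mine, not from this thread):

```python
import ctypes

# glibc's malloc_trim(0) releases freed-but-cached heap memory back to
# the OS. If resident memory drops after calling this inside the loop,
# the growth was allocator caching; if it keeps climbing, something is
# still holding the memory (a genuine leak).
libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)
```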
OK, perhaps it's not accurate to describe the problem as a memory leak.
Maybe I can provide more information. The demo code doesn't allocate more and more RAM if I remove the sparse_quantize call. To create a comparable baseline, I replaced

```python
vox, feats = ME.utils.sparse_quantize(coords, feats, quantization_size=0.02)
vox = vox.numpy()
```

with NumPy functions, which do similar things but are not as efficient as sparse_quantize.
```python
# Quantize to 0.02-sized voxels, then keep one representative
# point (and its feature) per occupied voxel.
vox = np.floor(coords / 0.02).astype(np.int32)
vox, index = np.unique(vox, axis=0, return_index=True)
feats = feats[index]
```
The RAM usage looks like this:
I ran the two versions on the same machine, so sparse_quantize indeed seems to show some strange behavior. With sparse_quantize, the RAM usage increases by 1.3 GB after 1e5 iterations, i.e. about 13 KB per iteration. With NumPy, the RAM usage doesn't increase. (I don't know why it drops, by the way; maybe due to some system activity in the background?)
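As a sanity check that the NumPy baseline is comparable: assuming sparse_quantize also floors coordinates by the quantization size (which is my understanding of its behavior), the two paths should agree on the set of occupied voxels. A quick check might look like this:

```python
import numpy as np
import MinkowskiEngine as ME

coords = np.random.rand(10_000, 3).astype(np.float32)
feats = np.random.rand(10_000, 3).astype(np.float32)

vox_me, _ = ME.utils.sparse_quantize(coords, feats, quantization_size=0.02)
vox_np = np.unique(np.floor(coords / 0.02).astype(np.int32), axis=0)

# Row order may differ between the two, so compare as sets of voxels.
assert set(map(tuple, vox_me.numpy())) == set(map(tuple, vox_np))
```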
Update: I ran even more iterations and can confirm that the RAM usage increases almost linearly.
Describe the bug
The ME.utils.sparse_quantize function seems to cause a slow memory leak. The RAM (not GPU RAM) usage increases gradually and ends in an OOM during a long training schedule.

When the following code runs, the RAM usage increases slowly. The code prints the percentage of RAM in use; you need to wait for a while to see the difference (15~20 minutes should be enough). Note that the code monitors the RAM usage of the entire system, so you should not use the computer for other tasks while it runs.

Here I use a DataLoader to accelerate the process, but you can reproduce this behavior without the DataLoader (i.e. by iterating through the dataset directly).
To Reproduce
On my machine with 32 GB of RAM, the RAM usage increases linearly while this code runs. The rate of increase is related to the variance of the data: a larger variance brings a faster increase. For instance, you see a slower increase with the two commented lines. If I use real point clouds instead of random numbers, the code can reach an OOM if it runs for an extremely long time.
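The original snippet did not survive the copy here, so the following is only a reconstruction sketched from the description above; the dataset size, the random data, and the psutil-based monitoring are assumptions rather than the reporter's exact code:

```python
import numpy as np
import psutil
import MinkowskiEngine as ME
from torch.utils.data import DataLoader, Dataset


class RandomCloud(Dataset):
    """Random point clouds standing in for real data (an assumption)."""

    def __len__(self):
        return 1_000_000  # effectively endless

    def __getitem__(self, idx):
        coords = np.random.rand(50_000, 3).astype(np.float32)
        feats = np.random.rand(50_000, 3).astype(np.float32)
        vox, feats = ME.utils.sparse_quantize(coords, feats, quantization_size=0.02)
        return vox.numpy(), feats


loader = DataLoader(RandomCloud(), batch_size=None, num_workers=4)

for i, (vox, feats) in enumerate(loader):
    if i % 100 == 0:
        # System-wide RAM usage in percent, matching the logs above.
        print(f"{i}, memory used: {psutil.virtual_memory().percent}")
```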
Expected behavior
The RAM usage is steady.
Desktop (please complete the following information):
python -c "import MinkowskiEngine as ME; ME.print_diagnostics()"
. Otherwise, paste the output of the following command.) ==========System========== Linux-5.4.0-87-generic-x86_64-with-debian-buster-sid DISTRIB_ID=Ubuntu DISTRIB_RELEASE=18.04 DISTRIB_CODENAME=bionic DISTRIB_DESCRIPTION="Ubuntu 18.04.6 LTS" 3.7.10 (default, Jun 4 2021, 14:48:32) [GCC 7.5.0] ==========Pytorch========== 1.8.1 torch.cuda.is_available(): True ==========NVIDIA-SMI========== /usr/bin/nvidia-smi Driver Version 470.63.01 CUDA Version 11.4 VBIOS Version 90.04.76.40.91 Image Version G001.0000.02.04 GSP Firmware Version N/A ==========NVCC========== /usr/local/cuda-10.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2019 NVIDIA Corporation Built on Wed_Oct_23_19:24:38_PDT_2019 Cuda compilation tools, release 10.2, V10.2.89 ==========CC========== /usr/bin/c++ c++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 Copyright (C) 2017 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. ==========MinkowskiEngine========== 0.5.4 MinkowskiEngine compiled with CUDA Support: True NVCC version MinkowskiEngine is compiled: 10020 CUDART version MinkowskiEngine is compiled: 10020Additional context The memory leak comes probably from the C/C++ extension side, as I cannot trace it with tracemalloc from the Python side.
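For reference, the kind of tracemalloc check that comes up empty here looks roughly like the sketch below (the loop body is my assumption). tracemalloc only records allocations made through Python's memory allocator, so malloc/new calls inside the C++ extension never show up in its statistics:

```python
import tracemalloc

import numpy as np
import MinkowskiEngine as ME

tracemalloc.start()
snap1 = tracemalloc.take_snapshot()

for _ in range(1000):
    coords = np.random.rand(50_000, 3).astype(np.float32)
    feats = np.random.rand(50_000, 3).astype(np.float32)
    vox, feats = ME.utils.sparse_quantize(coords, feats, quantization_size=0.02)

snap2 = tracemalloc.take_snapshot()
# Shows Python-level allocation growth only; a leak inside the C++
# extension would be invisible in these statistics.
for stat in snap2.compare_to(snap1, "lineno")[:10]:
    print(stat)
```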