Temigo opened this issue 5 years ago
Can you also share a config file? Or the exact command to run would be good! That's a kernel kill... will try to reproduce.
I just monitored the memory usage (`free -h`) when running my training on a single machine (nu-gpu) with 4 V100s at the same time. The jobs crash with this error when the machine runs out of memory. I think we already ran into this issue in the past and it is not a bug. @drinkingkazu please reopen if you think I am wrong and we should look into this more.
Config to reproduce:
* https://gist.github.com/Temigo/9a58ecacbc3b0d58cd34078a5f6c92fa (12 features)
* https://gist.github.com/Temigo/b067cd6f2b09bd3e30cd449d3a9dae72 (1 feature)
Since I never do what I promise, I write what I would try here...
* `fill_3d_pcloud` and `fill_3d_voxel` in `larcv/core/PyUtils/PyUtils.cxx` are the suspects. One can try commenting out the whole function body, then un-commenting little by little to identify the memory leak location (but if you want to try something straight, see below).
* I think we need to insert `PyArray_Free(pyarray, (void *)carray);` in line 157 and 215 anyway (and this might be the culprit); see the sketch after this list.
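A minimal C++ sketch of that pairing (a hypothetical `fill_example`, not the actual `PyUtils.cxx` source): `PyArray_AsCArray` allocates a C-style view of the NumPy array, and each call leaks that allocation unless it is released with `PyArray_Free`.

```cpp
// Hypothetical illustration of the leak pattern, not the real PyUtils.cxx.
// Assumes import_array() was called at module initialization.
#include <numpy/arrayobject.h>

void fill_example(PyObject* pyarray) {
  float** carray = nullptr;
  npy_intp dims[2];
  PyArray_Descr* descr = PyArray_DescrFromType(NPY_FLOAT32);

  // Allocates a C view of the 2-D array; dims is filled with its shape.
  if (PyArray_AsCArray(&pyarray, (void*)&carray, dims, 2, descr) < 0)
    return;  // conversion failed, nothing was allocated

  // ... write point cloud / voxel values into carray[row][col] ...

  // The suggested fix: release the buffer PyArray_AsCArray allocated.
  // Without this call, every invocation leaks the C view.
  PyArray_Free(pyarray, (void*)carray);
}
```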
Can reproduce the issue using the 12 features configuration provided by @Temigo.
Wrote a simple script to record the memory usage per iteration while reading data (using the `iotool` section from the config file linked above). Here's the plot of the record, which shows a likely memory leak increasing linearly with iteration number (and time).
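For reference, here's a minimal sketch of such a per-iteration memory recorder, assuming `psutil` is available; `load_next_batch` is a hypothetical stand-in for whatever reads one batch through the `iotool` configuration:

```python
# Minimal sketch of a per-iteration memory recorder (assumes psutil).
import csv
import time
import psutil

def record_memory(num_iterations, load_next_batch, out_csv="mem_log.csv"):
    """Read one batch per iteration and log this process's RSS to a CSV."""
    proc = psutil.Process()
    t0 = time.time()
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["iteration", "time_s", "rss_mb"])
        for it in range(num_iterations):
            load_next_batch()  # hypothetical: reads one batch via iotool
            rss_mb = proc.memory_info().rss / 1e6
            writer.writerow([it, time.time() - t0, rss_mb])
```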
... ok then it seems my guess was right... I implemented one of the suggestions I made before (see my earlier reply in this thread above):
* I think we need to insert `PyArray_Free(pyarray, (void *)carray);` in line 157 and 215 anyway (and this might be the culprit).
With that implementation, here's the same script's output:
There's little evidence for a memory leak in this plot (the memory fluctuation due to background processes dominates). We might record free RAM in the train log (will open another enhancement issue later) so that any long-running training can be used to monitor this.
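A sketch of what that could look like, assuming `psutil` (the `train_log.record` call is hypothetical):

```python
# Sketch of the proposed enhancement: log free system RAM during training
# so any long-running job doubles as a leak monitor (assumes psutil).
import psutil

def free_ram_mb():
    """Available system memory in MB, as reported by psutil."""
    return psutil.virtual_memory().available / 1e6

# Hypothetical use inside the training loop:
# train_log.record("free_ram_mb", free_ram_mb())
```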
Anyhow, seems this may be solved. Will close after making the container image available with the fix in larcv2.
... and I came back with a longer test, and I see more memory leaking! To run it longer (yet over a short period, so faster!), I changed the setup to read only 3 features and ran for 900 iterations using this config. Here's the result...
We can see a clear memory increase. This is batch size 64 with 4 workers. Next, I will try with 1 feature to see how much the memory increases. If that shows 1/3 of the memory increase of this test, which should be visible, then the leak is likely in loading the 3-channel data. If the increase is similar, that would suggest the leak is somewhere else.
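One way to make that comparison quantitative (a sketch, assuming CSV logs like the ones from the recorder sketched earlier): fit a line to RSS versus iteration and compare the slopes of the two runs.

```python
# Sketch: estimate the leak rate (MB/iteration) from a memory-log CSV.
import numpy as np

def leak_rate(csv_path):
    """Slope of a linear fit to RSS vs. iteration, in MB per iteration."""
    it, _t, rss = np.loadtxt(csv_path, delimiter=",", skiprows=1, unpack=True)
    slope, _intercept = np.polyfit(it, rss, 1)
    return slope

# If leak_rate("mem_3features.csv") is about 3x leak_rate("mem_1feature.csv"),
# the leak scales with the number of channels; similar rates would point
# elsewhere. (File names here are hypothetical.)
```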
Here's a trial with 1-channel data: the memory increase over 900 iterations. Not very conclusive. Running a test with 1800 iterations now.
Here's what the 1-channel data looks like after running 1800 iterations (twice as long). I don't understand the behavior, but it's definitely leaking, and the leak correlates with the number of channels in the input data.
Especially on V100, training UResNet (`uresnet_lonely` from Temigo/lartpc_mlreco3d, branch `temigo`) with batch size 64 and spatial size 768px.