DeepLearnPhysics / lartpc_mlreco3d


Understand why DataLoader gets killed #35

Open · Temigo opened this issue 5 years ago

Temigo commented 5 years ago

The DataLoader worker gets killed, especially on V100, when training UResNet (uresnet_lonely from Temigo/lartpc_mlreco3d, branch temigo) with batch size 64 and a spatial size of 768px.

Traceback (most recent call last):
  File "/u/ki/ldomine/lartpc_mlreco3d/bin/run.py", line 33, in <module>
    main()
  File "/u/ki/ldomine/lartpc_mlreco3d/bin/run.py", line 28, in main
    train(cfg)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/main_funcs.py", line 36, in train
    train_loop(cfg, handlers)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/main_funcs.py", line 236, in train_loop
    res = handlers.trainer.train_step(data_blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 66, in train_step
    res_combined = self.forward(data_blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 83, in forward
    res = self._forward(blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 127, in _forward
    result = self._net(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/models/uresnet_lonely.py", line 140, in forward
    x = self.input((coords, features))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/sparseconvnet/ioLayers.py", line 63, in forward
    self.mode
  File "/usr/local/lib/python3.6/dist-packages/sparseconvnet/ioLayers.py", line 184, in forward
    mode
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 93813) is killed by signal: Killed.
drinkingkazu commented 5 years ago

Can you also share a config file? Or the exact command to run would be good! That's a kernel kill... I will try to reproduce.

Temigo commented 5 years ago

I just monitored the memory usage (free -h) while running my training on a single machine (nu-gpu) with 4 V100s at the same time. The jobs crash with this error when the machine runs out of memory. I think we already ran into this issue in the past and it is not a bug. @drinkingkazu please reopen if you think I am wrong and we should look into this more.
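For reference, a minimal sketch of how one could watch the available system RAM from inside Python instead of re-running free -h by hand (this assumes psutil is installed; the watcher is illustrative and not part of lartpc_mlreco3d):

```python
import threading
import time

import psutil  # assumed to be installed; not part of lartpc_mlreco3d


def watch_memory(interval=10.0):
    """Print available system RAM every `interval` seconds, like `free -h`."""
    while True:
        avail_gb = psutil.virtual_memory().available / 1024**3
        print("available RAM: %.1f GB" % avail_gb, flush=True)
        time.sleep(interval)


# Start as a daemon thread before launching the training loop so the
# printout stops automatically when training exits (or gets killed).
threading.Thread(target=watch_memory, daemon=True).start()
```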

Temigo commented 5 years ago

Configs to reproduce:
https://gist.github.com/Temigo/9a58ecacbc3b0d58cd34078a5f6c92fa (12 features)
https://gist.github.com/Temigo/b067cd6f2b09bd3e30cd449d3a9dae72 (1 feature)

drinkingkazu commented 5 years ago

Since I never do what I promise, I'll write down what I would try here...

drinkingkazu commented 5 years ago

I can reproduce the issue using the 12-feature configuration provided by @Temigo.

Wrote a simple script to...
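(The actual script isn't reproduced in the thread; below is a minimal sketch of such a per-iteration memory recorder, assuming psutil and matplotlib, with run_one_iteration() as a hypothetical stand-in for the real training step.)

```python
import matplotlib.pyplot as plt
import psutil  # assumed to be installed


def run_one_iteration():
    """Hypothetical placeholder for one real training iteration."""
    pass


used_gb = []
for iteration in range(900):
    run_one_iteration()
    # Record system-wide used RAM after every iteration.
    used_gb.append(psutil.virtual_memory().used / 1024**3)

plt.plot(range(len(used_gb)), used_gb)
plt.xlabel("iteration")
plt.ylabel("used RAM [GB]")
plt.savefig("memory_vs_iteration.png")
```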

Here's the plot of the record, which shows a likely memory leak: usage increases linearly with the iteration number (and time).

[plot: memory usage vs. iteration before the fix]

drinkingkazu commented 5 years ago

... ok, then it seems my guess was right... I implemented one of the suggestions I made before (see my earlier reply in this thread above):

* I think we need to insert PyArray_Free(pyarray, (void *)carray); at lines 157 and 215 anyway (and this might be the culprit).

With that implementation, here's the same script's output:

[plot: memory usage vs. iteration after the fix]

There's little evidence of a memory leak in this plot (memory fluctuations due to background processes dominate). We might record the free RAM in the train log (I will open a separate enhancement issue later) so that we can use any long training run to monitor this.

Anyhow, it seems this may be solved. I will close this after making the container image with the larcv2 fix available.

drinkingkazu commented 5 years ago

... and I came back with a longer test, and see more memory leaking! To run it longer (yet over a short period, so faster!), I changed the setup to read only 3 features and ran for 900 iterations using this config. Here's the result...

[plot: memory usage vs. iteration, 3 features, 900 iterations]

We can see a clear memory increase. This is with batch size 64 and 4 workers. Next, I will try with 1 feature to see how much the memory increases. If it shows 1/3 of the memory increase of this test, which should be visible, then the leak is likely in loading the 3-channel data. If the increase is similar, that would suggest the leak is somewhere else.
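One concrete way to make that comparison would be to fit a line to each recorded memory curve and compare the slopes; a quick sketch, assuming the per-iteration memory records from a script like the one above are available as arrays (used_gb_3ch and used_gb_1ch are hypothetical names):

```python
import numpy as np


def leak_rate_mb_per_iter(used_gb):
    """Slope of a linear fit to used RAM vs. iteration, in MB per iteration."""
    iterations = np.arange(len(used_gb))
    slope_gb, _ = np.polyfit(iterations, used_gb, 1)
    return slope_gb * 1024.0


# Hypothetical arrays recorded with the 3-feature and 1-feature configs:
# rate_3ch = leak_rate_mb_per_iter(used_gb_3ch)
# rate_1ch = leak_rate_mb_per_iter(used_gb_1ch)
# If rate_1ch is roughly rate_3ch / 3, the leak scales with the input channels;
# if the rates are similar, the leak is probably elsewhere.
```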

drinkingkazu commented 5 years ago

Here's a trial with 1-channel data: the memory increase over 900 iterations.

[plot: memory usage vs. iteration, 1 feature, 900 iterations]

drinkingkazu commented 5 years ago

Not very conclusive. Running a test with 1800 iterations now.

drinkingkazu commented 5 years ago

Here's what the 1-channel data looks like after running 1800 iterations (twice as long).

[plot: memory usage vs. iteration, 1 feature, 1800 iterations]

drinkingkazu commented 5 years ago

I don't understand the behavior, but it's definitely leaking, and the leak correlates with the number of channels in the input data.