DeepLearnPhysics / lartpc_mlreco3d


Understand why DataLoader gets killed #35

Open · Temigo opened this issue 5 years ago

Temigo commented 5 years ago

The DataLoader worker gets killed, especially on V100, when training UResNet (uresnet_lonely from Temigo/lartpc_mlreco3d, branch temigo) with batch size 64 and a spatial size of 768px.

Traceback (most recent call last):
  File "/u/ki/ldomine/lartpc_mlreco3d/bin/run.py", line 33, in <module>
    main()
  File "/u/ki/ldomine/lartpc_mlreco3d/bin/run.py", line 28, in main
    train(cfg)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/main_funcs.py", line 36, in train
    train_loop(cfg, handlers)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/main_funcs.py", line 236, in train_loop
    res = handlers.trainer.train_step(data_blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 66, in train_step
    res_combined = self.forward(data_blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 83, in forward
    res = self._forward(blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 127, in _forward
    result = self._net(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/models/uresnet_lonely.py", line 140, in forward
    x = self.input((coords, features))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/sparseconvnet/ioLayers.py", line 63, in forward
    self.mode
  File "/usr/local/lib/python3.6/dist-packages/sparseconvnet/ioLayers.py", line 184, in forward
    mode
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 93813) is killed by signal: Killed.
drinkingkazu commented 5 years ago

Can you also share a config file? Or the exact command to run would be good! That's a kernel kill... I will try to reproduce.

Temigo commented 5 years ago

I just monitored the memory usage (free -h) while running my training on a single machine (nu-gpu) with 4 V100s at the same time. The jobs crash with this error when the machine runs out of memory. I think we already ran into this issue in the past and it is not a bug. @drinkingkazu please reopen if you think I am wrong and we should look into this more.
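For reference, a minimal sketch of how one could watch the available system RAM from inside Python instead of re-running free -h by hand (this assumes psutil is installed; the watcher is illustrative and not part of lartpc_mlreco3d):

```python
import threading
import time

import psutil  # assumed to be installed; not part of lartpc_mlreco3d


def watch_memory(interval=10.0):
    """Print available system RAM every `interval` seconds, like `free -h`."""
    while True:
        avail_gb = psutil.virtual_memory().available / 1024**3
        print("available RAM: %.1f GB" % avail_gb, flush=True)
        time.sleep(interval)


# Start as a daemon thread before launching the training loop so the
# printout stops automatically when training exits (or gets killed).
threading.Thread(target=watch_memory, daemon=True).start()
```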

Temigo commented 5 years ago

Configs to reproduce:
https://gist.github.com/Temigo/9a58ecacbc3b0d58cd34078a5f6c92fa (12 features)
https://gist.github.com/Temigo/b067cd6f2b09bd3e30cd449d3a9dae72 (1 feature)

drinkingkazu commented 5 years ago

Since I never do what I promise, I'll write down what I would try here...

drinkingkazu commented 5 years ago

I can reproduce the issue using the 12-feature configuration provided by @Temigo.

Wrote a simple script to...
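(The actual script isn't reproduced in the thread; below is a minimal sketch of such a per-iteration memory recorder, assuming psutil and matplotlib, with run_one_iteration() as a hypothetical stand-in for the real training step.)

```python
import matplotlib.pyplot as plt
import psutil  # assumed to be installed


def run_one_iteration():
    """Hypothetical placeholder for one real training iteration."""
    pass


used_gb = []
for iteration in range(900):
    run_one_iteration()
    # Record system-wide used RAM after every iteration.
    used_gb.append(psutil.virtual_memory().used / 1024**3)

plt.plot(range(len(used_gb)), used_gb)
plt.xlabel("iteration")
plt.ylabel("used RAM [GB]")
plt.savefig("memory_vs_iteration.png")
```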

Here's the plot of the record, which shows a likely memory leak: usage increases linearly with the iteration number (and time).

[plot: memory usage vs. iteration before the fix]

drinkingkazu commented 5 years ago

... ok, then it seems my guess was right... I implemented one of the suggestions I made before (see my earlier reply in this thread above):

* I think we need to insert PyArray_Free(pyarray, (void *)carray); at lines 157 and 215 anyway (and this might be the culprit).

With that implementation, here's the same script's output:

[plot: memory usage vs. iteration after the fix]

There's little evidence of a memory leak in this plot (memory fluctuations due to background processes dominate). We might record the free RAM in the train log (I will open a separate enhancement issue later) so that we can use any long training run to monitor this.

Anyhow, it seems this may be solved. I will close this after making the container image with the larcv2 fix available.

drinkingkazu commented 5 years ago

... and I came back with a longer test, and see more memory leaking! To run it longer (yet over a short period, so faster!), I changed the setup to read only 3 features and ran for 900 iterations using this config. Here's the result...

[plot: memory usage vs. iteration, 3 features, 900 iterations]

We can see a clear memory increase. This is with batch size 64 and 4 workers. Next, I will try with 1 feature to see how much the memory increases. If it shows 1/3 of the memory increase of this test, which should be visible, then the leak is likely in loading the 3-channel data. If the increase is similar, that would suggest the leak is somewhere else.
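One concrete way to make that comparison would be to fit a line to each recorded memory curve and compare the slopes; a quick sketch, assuming the per-iteration memory records from a script like the one above are available as arrays (used_gb_3ch and used_gb_1ch are hypothetical names):

```python
import numpy as np


def leak_rate_mb_per_iter(used_gb):
    """Slope of a linear fit to used RAM vs. iteration, in MB per iteration."""
    iterations = np.arange(len(used_gb))
    slope_gb, _ = np.polyfit(iterations, used_gb, 1)
    return slope_gb * 1024.0


# Hypothetical arrays recorded with the 3-feature and 1-feature configs:
# rate_3ch = leak_rate_mb_per_iter(used_gb_3ch)
# rate_1ch = leak_rate_mb_per_iter(used_gb_1ch)
# If rate_1ch is roughly rate_3ch / 3, the leak scales with the input channels;
# if the rates are similar, the leak is probably elsewhere.
```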

drinkingkazu commented 5 years ago

Here's a trial with 1-channel data: the memory increase over 900 iterations.

[plot: memory usage vs. iteration, 1 feature, 900 iterations]

drinkingkazu commented 5 years ago

Not very conclusive. Running a test with 1800 iterations now.

drinkingkazu commented 5 years ago

Here's what the 1-channel data looks like after running 1800 iterations (twice as long).

[plot: memory usage vs. iteration, 1 feature, 1800 iterations]

drinkingkazu commented 5 years ago

I don't understand the behavior, but it's definitely leaking, and the leak correlates with the number of channels in the input data.