PRBonn / lidar-bonnetal

Semantic and Instance Segmentation of LiDAR point clouds for autonomous driving
http://semantic-kitti.org
MIT License

DataLoader worker is killed by signal: Killed #52

Closed lihao2333 closed 4 years ago

lihao2333 commented 4 years ago

I am training from the pre-trained model. After training one epoch and saving successfully, it stalls for a while and then raises the following error.

Traceback (most recent call last):
  File "./train.py", line 115, in <module>
    trainer.train()
  File "../../tasks/semantic/modules/trainer.py", line 259, in train
    save_scans=self.ARCH["train"]["save_scans"])
  File "../../tasks/semantic/modules/trainer.py", line 414, in validate
    output = model(in_vol, proj_mask)
  File "/home/haoli/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "../../tasks/semantic/modules/segmentator.py", line 149, in forward
    y, skips = self.backbone(x)
  File "/home/haoli/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "../..//backbones/darknet.py", line 167, in forward
    x, skips, os = self.run_layer(x, self.conv1, skips, os)
  File "../..//backbones/darknet.py", line 150, in run_layer
    y = layer(x)
  File "/home/haoli/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 490, in __call__
    if torch._C._get_tracing_state():
  File "/home/haoli/.local/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 4350) is killed by signal: Killed.

How can I solve it?

tano297 commented 4 years ago

Hi,

This looks to me like a RAM problem. Can you try workers=0 in the config file?
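
To illustrate the suggestion: the "workers" value in the config ends up as the DataLoader's num_workers; with num_workers=0, batches are loaded in the main process, so no extra worker processes are forked and no per-worker copy of the loader state is kept in RAM. Below is a minimal, hedged Python sketch (a dummy dataset with illustrative tensor shapes, not the project's actual parser or config):

import torch
from torch.utils.data import Dataset, DataLoader

class DummyScans(Dataset):
    """Stand-in dataset; the real loader returns projected LiDAR tensors."""
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        # illustrative shapes only (channels x H x W projection, and a mask)
        return torch.zeros(5, 64, 2048), torch.zeros(64, 2048)

loader = DataLoader(DummyScans(), batch_size=2, shuffle=True,
                    num_workers=0)  # 0 = load in the main process, no forked workers
for in_vol, proj_mask in loader:
    pass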

lihao2333 commented 4 years ago

I tried workers=4 and batch_size=2, and it seems to run into a new issue.

Lr: 4.997e-03 | Update: 3.565e-04 mean,1.880e-04 std | Epoch: [0][9559/9565] | Time 0.684 (0.685) | Data 0.216 (0.216) | Loss 0.6257 (0.4301) | acc 0.828 (0.855) | IoU 0.287 (0.372)
Lr: 4.997e-03 | Update: 1.518e-04 mean,8.389e-05 std | Epoch: [0][9560/9565] | Time 0.678 (0.685) | Data 0.216 (0.216) | Loss 0.4195 (0.4301) | acc 0.844 (0.855) | IoU 0.347 (0.372)
Lr: 4.998e-03 | Update: 2.638e-04 mean,2.165e-04 std | Epoch: [0][9561/9565] | Time 0.675 (0.685) | Data 0.216 (0.216) | Loss 0.6339 (0.4301) | acc 0.682 (0.855) | IoU 0.241 (0.372)
Lr: 4.998e-03 | Update: 1.150e-04 mean,8.185e-05 std | Epoch: [0][9562/9565] | Time 0.680 (0.685) | Data 0.216 (0.216) | Loss 0.3510 (0.4301) | acc 0.843 (0.855) | IoU 0.309 (0.372)
Lr: 4.999e-03 | Update: 6.491e-04 mean,2.584e-04 std | Epoch: [0][9563/9565] | Time 0.681 (0.685) | Data 0.220 (0.216) | Loss 0.7378 (0.4301) | acc 0.738 (0.855) | IoU 0.237 (0.372)
Lr: 4.999e-03 | Update: 4.932e-04 mean,1.972e-04 std | Epoch: [0][9564/9565] | Time 0.680 (0.685) | Data 0.216 (0.216) | Loss 0.8861 (0.4302) | acc 0.776 (0.855) | IoU 0.263 (0.372)
Best mean iou in training set so far, save model!
********************************************************************************
Validation set:
Time avg per batch 0.195
Loss avg 1.0029
Acc avg 0.765
IoU avg 0.341
IoU class 0 [unlabeled] = 0.000
IoU class 1 [car] = 0.854
IoU class 2 [bicycle] = 0.095
IoU class 3 [motorcycle] = 0.140
IoU class 4 [truck] = 0.091
IoU class 5 [other-vehicle] = 0.149
IoU class 6 [person] = 0.191
IoU class 7 [bicyclist] = 0.366
IoU class 8 [motorcyclist] = 0.000
IoU class 9 [road] = 0.786
IoU class 10 [parking] = 0.216
IoU class 11 [sidewalk] = 0.665
IoU class 12 [other-ground] = 0.013
IoU class 13 [building] = 0.705
IoU class 14 [fence] = 0.184
IoU class 15 [vegetation] = 0.597
IoU class 16 [trunk] = 0.339
IoU class 17 [terrain] = 0.673
IoU class 18 [pole] = 0.299
IoU class 19 [traffic-sign] = 0.110
Best mean iou in validation so far, save model!
********************************************************************************
********************************************************************************
Traceback (most recent call last):
  File "./train.py", line 115, in <module>
    trainer.train()
  File "../../tasks/semantic/modules/trainer.py", line 236, in train
    show_scans=self.ARCH["train"]["show_scans"])
  File "../../tasks/semantic/modules/trainer.py", line 307, in train_epoch
    for i, (in_vol, proj_mask, proj_labels, _, path_seq, path_name, _, _, _, _, _, _, _, _, _) in enumerate(train_loader):
  File "/home/haoli/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 193, in __iter__
    return _DataLoaderIter(self)
  File "/home/haoli/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 469, in __init__
    w.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

It seems that memory is not being released correctly?
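
The OSError comes from os.fork() failing with ENOMEM when the DataLoader spawns workers for the next epoch, which points to the machine simply running low on RAM rather than a leak in the model itself. A hedged diagnostic sketch (psutil is an extra dependency, not part of lidar-bonnetal) to log free memory around epoch boundaries and confirm it shrinks toward zero before the workers are re-forked:

import psutil

def log_memory(tag=""):
    # Print currently available system RAM vs. total.
    vm = psutil.virtual_memory()
    print("{} available RAM: {:.2f} GiB / {:.2f} GiB".format(
        tag, vm.available / 2**30, vm.total / 2**30))

# e.g. call log_memory("before validation") at the end of an epoch and
# log_memory("before next epoch") right before the training DataLoader
# is iterated again, where the worker processes are forked.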

lihao2333 commented 4 years ago

I guess that loading data with multiple worker processes runs out of memory.
I solved it by adding a swap file.
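
For anyone trying the same workaround, a quick hedged check (again using psutil, an assumption, not part of the repo) that the new swap space is actually visible to the system and giving the forked DataLoader workers some headroom:

import psutil

swap = psutil.swap_memory()
print("swap total: {:.1f} GiB, used: {:.1f} GiB".format(
    swap.total / 2**30, swap.used / 2**30))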