Hi,
This looks to me like a RAM problem. Can you try workers=0 in the config file?
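For reference, that `workers` setting presumably ends up as the DataLoader's `num_workers`; with 0, batches are loaded in the main process and no child processes are forked. A minimal sketch of what the knob controls (the tensor dataset and batch size here are placeholders, not the repo's actual parser or config):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the repo's real scan parser.
dataset = TensorDataset(torch.zeros(8, 3), torch.zeros(8, dtype=torch.long))

# num_workers=0 -> batches are loaded in the main process, no os.fork() per worker.
train_loader = DataLoader(dataset,
                          batch_size=2,   # illustrative value
                          shuffle=True,
                          num_workers=0)  # raise only if RAM allows it

for in_vol, labels in train_loader:
    pass  # training step would go here
```

With `num_workers=0` the loader never calls `os.fork()`, which is the call that fails in the traceback below.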
I tried workers=4 and batch_size=2. It seems to expose a new issue:
Lr: 4.997e-03 | Update: 3.565e-04 mean,1.880e-04 std | Epoch: [0][9559/9565] | Time 0.684 (0.685) | Data 0.216 (0.216) | Loss 0.6257 (0.4301) | acc 0.828 (0.855) | IoU 0.287 (0.372)
Lr: 4.997e-03 | Update: 1.518e-04 mean,8.389e-05 std | Epoch: [0][9560/9565] | Time 0.678 (0.685) | Data 0.216 (0.216) | Loss 0.4195 (0.4301) | acc 0.844 (0.855) | IoU 0.347 (0.372)
Lr: 4.998e-03 | Update: 2.638e-04 mean,2.165e-04 std | Epoch: [0][9561/9565] | Time 0.675 (0.685) | Data 0.216 (0.216) | Loss 0.6339 (0.4301) | acc 0.682 (0.855) | IoU 0.241 (0.372)
Lr: 4.998e-03 | Update: 1.150e-04 mean,8.185e-05 std | Epoch: [0][9562/9565] | Time 0.680 (0.685) | Data 0.216 (0.216) | Loss 0.3510 (0.4301) | acc 0.843 (0.855) | IoU 0.309 (0.372)
Lr: 4.999e-03 | Update: 6.491e-04 mean,2.584e-04 std | Epoch: [0][9563/9565] | Time 0.681 (0.685) | Data 0.220 (0.216) | Loss 0.7378 (0.4301) | acc 0.738 (0.855) | IoU 0.237 (0.372)
Lr: 4.999e-03 | Update: 4.932e-04 mean,1.972e-04 std | Epoch: [0][9564/9565] | Time 0.680 (0.685) | Data 0.216 (0.216) | Loss 0.8861 (0.4302) | acc 0.776 (0.855) | IoU 0.263 (0.372)
Best mean iou in training set so far, save model!
********************************************************************************
Validation set:
Time avg per batch 0.195
Loss avg 1.0029
Acc avg 0.765
IoU avg 0.341
IoU class 0 [unlabeled] = 0.000
IoU class 1 [car] = 0.854
IoU class 2 [bicycle] = 0.095
IoU class 3 [motorcycle] = 0.140
IoU class 4 [truck] = 0.091
IoU class 5 [other-vehicle] = 0.149
IoU class 6 [person] = 0.191
IoU class 7 [bicyclist] = 0.366
IoU class 8 [motorcyclist] = 0.000
IoU class 9 [road] = 0.786
IoU class 10 [parking] = 0.216
IoU class 11 [sidewalk] = 0.665
IoU class 12 [other-ground] = 0.013
IoU class 13 [building] = 0.705
IoU class 14 [fence] = 0.184
IoU class 15 [vegetation] = 0.597
IoU class 16 [trunk] = 0.339
IoU class 17 [terrain] = 0.673
IoU class 18 [pole] = 0.299
IoU class 19 [traffic-sign] = 0.110
Best mean iou in validation so far, save model!
********************************************************************************
********************************************************************************
Traceback (most recent call last):
File "./train.py", line 115, in <module>
trainer.train()
File "../../tasks/semantic/modules/trainer.py", line 236, in train
show_scans=self.ARCH["train"]["show_scans"])
File "../../tasks/semantic/modules/trainer.py", line 307, in train_epoch
for i, (in_vol, proj_mask, proj_labels, _, path_seq, path_name, _, _, _, _, _, _, _, _, _) in enumerate(train_loader):
File "/home/haoli/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 193, in __iter__
return _DataLoaderIter(self)
File "/home/haoli/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 469, in __init__
w.start()
File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/usr/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
It seems that the memory is not being released correctly?
My guess is that loading the data with multiple worker processes runs out of memory.
I solved this by adding a swap file.
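Adding swap works as a safety net; another option is to size the worker count to the memory that is actually free before the loader forks. A minimal sketch, assuming the third-party `psutil` package and an illustrative 2 GB-per-worker budget (not a measured number):

```python
# Pick as many DataLoader workers as the currently free RAM + swap can hold,
# falling back to 0 (single-process loading) when memory is tight.
import psutil

def pick_num_workers(requested: int, per_worker_gb: float = 2.0) -> int:
    """per_worker_gb is an illustrative budget, not a measured value."""
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    free_gb = (mem.available + swap.free) / 1024 ** 3
    affordable = int(free_gb // per_worker_gb)
    return max(0, min(requested, affordable))

print("using", pick_num_workers(4), "DataLoader workers")  # 4 as tried above
```

Clamping the result to 0 means the loader degrades to single-process loading instead of hitting Errno 12.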
I train from the pretrained model. After training for one epoch and saving successfully, it gets stuck for a while and then raises the following error.
How can I solve it?