echonet / dynamic

EchoNet-Dynamic is a deep learning model for assessing cardiac function in echocardiogram videos.
https://echonet.github.io/dynamic

RuntimeError: DataLoader worker (pid 37531) is killed by signal: Killed. #4

Closed: YunhuaZhang closed this issue 4 years ago

YunhuaZhang commented 4 years ago

Hi,

When I run the trained model on test data, I always get this error:

41%|███████▍ | 525/1276 [37:39<43:27, 3.47s/it, 29.33 (4.19) / 27.63]
[... tqdm progress lines and per-video tensor outputs omitted ...]
42%|███████▏ | 537/1276 [38:30<50:27, 4.10s/it, 29.33 (27.54) / 27.65]
Traceback (most recent call last):
  File "run_ef.py", line 3, in <module>
    echonet.utils.video.run(modelname="r2plus1d_18", frames=32, period=2, pretrained=True, batch_size=8)
  File "/home/yzhang8/dynamic/echonet/utils/video.py", line 184, in run
    blocks=2)
  File "/home/yzhang8/dynamic/echonet/utils/video.py", line 266, in run_epoch
    tmp = model(X[j:(j + blocks), ...])
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torchvision/models/video/resnet.py", line 233, in forward
    x = self.layer4(x)
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torchvision/models/video/resnet.py", line 107, in forward
    out = self.conv2(out)
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 81, in forward
    exponential_average_factor, self.eps)
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1670, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
  File "/home/yzhang8/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 37531) is killed by signal: Killed.

douyang commented 4 years ago

It appears to be running appropriately for many iterations. I would check that the DataLoader is not shuffling the test dataset. If it is not, I would also check whether the particular video it stops on was corrupted on download. Rewrite your script to print out which video it is working on at inference time and verify that that video is OK.
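
Not the repository's own code, just a minimal sketch of that check. It assumes the Echo dataset accepts a `split` argument and keeps its file list in an `fnames` attribute; adjust those names to whatever the dataset class actually exposes.

```python
import torch
import echonet

# Sketch only: iterate the test split in order (shuffle=False) and print each
# video's name before running the model, so a corrupted file can be spotted.
# `split="test"` and `dataset.fnames` are assumptions about the Echo dataset;
# check echonet/datasets for the exact constructor arguments and attribute.
dataset = echonet.datasets.Echo(split="test")
loader = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False, num_workers=0)

for i, (X, y) in enumerate(loader):
    name = dataset.fnames[i] if hasattr(dataset, "fnames") else i
    print("Processing video:", name)
    # ... run the model on X here and record the prediction ...
```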

A working inference-time example is also provided in script/InitializationNotebook.ipynb and can be compared against as well.

YunhuaZhang commented 4 years ago

Thank you. In the end I set num_workers=0 and the error disappeared.
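
For reference, a rough sketch of the change, not the exact repo code; the Echo constructor arguments here are assumptions.

```python
import torch
import echonet

# Sketch of the workaround: num_workers=0 loads data in the main process, so
# there are no DataLoader worker subprocesses for the OS out-of-memory killer
# to terminate.
test_dataset = echonet.datasets.Echo(split="test")  # assumed constructor args
test_loader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=8,
    shuffle=False,
    num_workers=0,  # previously a positive value
)
```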

But in the code you set blocks=100 for testing, which causes a CUDA out-of-memory error on my GPU. How should blocks be set?

douyang commented 4 years ago

The blocks variable is a parameter you can set to optimize for your hardware setup. You can make it smaller so that inference does not overflow your GPU memory.
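
For context, the pattern visible in the traceback above (tmp = model(X[j:(j + blocks), ...])) amounts to chunked inference: the clips in a batch are pushed through the model blocks clips at a time, so a smaller blocks bounds peak GPU memory. A hedged sketch of that idea, not the exact code in video.py:

```python
import torch

def predict_in_blocks(model, X, blocks=16):
    """Run `model` on the clips in X, `blocks` clips at a time.

    Sketch of the chunked-inference pattern suggested by the traceback above,
    not the exact repo code. Lower `blocks` if you hit CUDA out-of-memory
    errors; the results are identical, only peak GPU memory changes.
    """
    outputs = []
    with torch.no_grad():
        for j in range(0, X.shape[0], blocks):
            outputs.append(model(X[j:(j + blocks), ...]))
    return torch.cat(outputs)
```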