IenLong opened this issue 5 years ago
The problem seems to be caused by the num_workers parameter of torch.utils.data.DataLoader(...), which has been discussed intensively at https://github.com/pytorch/pytorch/issues/1355. In my investigation, setting _C.DATALOADER.NUM_WORKERS > 0 may lead to the errors mentioned above. Therefore, I set _C.DATALOADER.NUM_WORKERS = 0, and training has kept going for tens of thousands of iterations without anything unusual happening. However, fewer workers mean more training time is needed.
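The workaround above can be sketched on a toy dataset; TensorDataset here is just a stand-in for the real instance-segmentation data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real training data.
dataset = TensorDataset(torch.arange(8, dtype=torch.float32))

# num_workers=0 loads batches in the main process: no worker
# subprocesses, so no shared-memory handles to exhaust, at the
# cost of slower data loading.
loader = DataLoader(dataset, batch_size=4, num_workers=0)
batches = [batch[0] for batch in loader]
```

In maskrcnn-benchmark the same effect comes from _C.DATALOADER.NUM_WORKERS = 0, which is just how this parameter reaches the DataLoader through the config.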
Yes, it looks like you're running out of shared memory. Could you try increasing it?
@hyichao the problem is probably because you are running out of shared memory, and increasing it will probably fix the issue.
Check https://github.com/pytorch/pytorch/issues/1355#issuecomment-297184037 for more details
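Before increasing shared memory (e.g. remounting /dev/shm with a larger size, or passing --shm-size to docker run when training inside a container), it can help to check how much is actually available. A minimal stdlib sketch, assuming a Linux host with a /dev/shm mount:

```python
import os

def shm_stats(path="/dev/shm"):
    """Return (total_bytes, free_bytes) for the shared-memory mount."""
    st = os.statvfs(path)
    return st.f_blocks * st.f_frsize, st.f_bavail * st.f_frsize

total, free = shm_stats()
print(f"/dev/shm: {total / 2**20:.0f} MiB total, {free / 2**20:.0f} MiB free")
```

Each DataLoader worker passes batches to the main process through this mount, so a small /dev/shm (Docker's default is 64 MB) fills up quickly with large images and masks.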
I have the same issue, the following code can solve it
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
This code is from #11201
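In context, the call belongs at the top of the training script, before any DataLoader with num_workers > 0 is created; the current strategy can be read back to confirm it took effect:

```python
import torch.multiprocessing

# Switch worker tensor sharing from file descriptors to files on disk
# (e.g. under /dev/shm), avoiding per-tensor fd exhaustion. Call this
# once, before any DataLoader workers are spawned.
torch.multiprocessing.set_sharing_strategy('file_system')

assert torch.multiprocessing.get_sharing_strategy() == 'file_system'
```

One caveat noted in the PyTorch multiprocessing docs: the file_system strategy can leak shared-memory files if processes are killed abruptly, so it trades one failure mode for another.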
Thanks, this solution saved my day. BTW, do you know why the sharing strategy causes the issue? It's weird that it only occurs when I'm using DDP and on this specific machine, while the code runs properly on my other machines with DDP.
Wow, it has been a long time since I encountered this error. I vaguely remember that the problem is related to multiprocessing and process locks, but I am not sure. I hope my reply does not mislead you.
Hi guys! I cannot use a try/except statement to catch the RuntimeError. The training program just gets stuck there and never exits the try block. What should I do?
Not working for me. I tried a lot of methods; only setting num_workers=0 in the DataLoader works, while it really slows down training (about 3x more time required...)
🐛 Bug
Thanks for the maskrcnn-benchmark project, which is really awesome! However, I ran into some problems while training on my own instance segmentation dataset, as described below.
To Reproduce
Steps to reproduce the behavior:
Everything was OK during training at the beginning. However, after several thousand iterations, the training broke down. For simplicity, I paste the final training output here:
2018-11-03 11:11:27,514 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:51:50 iter: 6840 loss: 0.8033 (1.0337) loss_classifier: 0.2025 (0.2768) loss_box_reg: 0.1245 (0.1395) loss_mask: 0.3138 (0.4098) loss_objectness: 0.0600 (0.1297) loss_rpn_box_reg: 0.0195 (0.0779) time: 0.3105 (0.3053) data: 0.0067 (0.0130) lr: 0.002500 max mem: 4887
2018-11-03 11:11:33,930 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:51:57 iter: 6860 loss: 0.9135 (1.0337) loss_classifier: 0.1846 (0.2767) loss_box_reg: 0.0630 (0.1395) loss_mask: 0.3499 (0.4097) loss_objectness: 0.0861 (0.1298) loss_rpn_box_reg: 0.0168 (0.0780) time: 0.2981 (0.3054) data: 0.0064 (0.0130) lr: 0.002500 max mem: 4887
2018-11-03 11:11:40,246 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:52:00 iter: 6880 loss: 0.7548 (1.0331) loss_classifier: 0.1516 (0.2764) loss_box_reg: 0.0880 (0.1395) loss_mask: 0.3342 (0.4095) loss_objectness: 0.0588 (0.1298) loss_rpn_box_reg: 0.0457 (0.0780) time: 0.3046 (0.3054) data: 0.0064 (0.0130) lr: 0.002500 max mem: 4887
2018-11-03 11:11:46,088 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:51:43 iter: 6900 loss: 0.5536 (1.0324) loss_classifier: 0.1185 (0.2762) loss_box_reg: 0.0669 (0.1394) loss_mask: 0.2970 (0.4092) loss_objectness: 0.0445 (0.1297) loss_rpn_box_reg: 0.0095 (0.0779) time: 0.2823 (0.3054) data: 0.0048 (0.0130) lr: 0.002500 max mem: 4887
2018-11-03 11:11:52,392 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:51:45 iter: 6920 loss: 0.7813 (1.0319) loss_classifier: 0.1759 (0.2761) loss_box_reg: 0.0824 (0.1394) loss_mask: 0.3130 (0.4090) loss_objectness: 0.0393 (0.1295) loss_rpn_box_reg: 0.0133 (0.0779) time: 0.3052 (0.3054) data: 0.0061 (0.0129) lr: 0.002500 max mem: 4887
Traceback (most recent call last):
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
  File "/home/ly/sfw/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 243, in reduce_storage
RuntimeError: unable to open shared memory object in read-write mode
Traceback (most recent call last):
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 179, in send_handle
    with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
  File "/home/ly/sfw/anaconda3/lib/python3.7/socket.py", line 463, in fromfd
    nfd = dup(fd)
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
  File "tools/train_net.py", line 172, in <module>
    main()
  File "tools/train_net.py", line 165, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 74, in train
    arguments,
  File "/home/ly/projects/MaskRCNN/maskrcnn/maskrcnn_benchmark/engine/trainer.py", line 56, in do_train
    for iteration, (images, targets, _) in enumerate(data_loader, start_iter):
  File "/home/ly/sfw/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    idx, batch = self._get_batch()
  File "/home/ly/sfw/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 610, in _get_batch
    return self.data_queue.get()
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/ly/sfw/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 204, in rebuild_storage_fd
    fd = df.detach()
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError
EOFError
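The OSError: [Errno 24] Too many open files in the second traceback points at file-descriptor exhaustion: with PyTorch's default file_descriptor sharing strategy, every tensor passed between loader workers and the main process consumes a descriptor. Besides switching strategies, one mitigation is raising the process's soft fd limit toward the hard limit (the Python equivalent of the shell's ulimit -n). A stdlib sketch, assuming a Unix-like system:

```python
import resource

def raise_fd_limit():
    """Raise the soft RLIMIT_NOFILE up to the hard limit; return the new soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]

print("soft fd limit is now", raise_fd_limit())
```

Only root can raise the hard limit itself, so if the hard limit is still too low for the number of workers and tensors in flight, the file_system strategy or num_workers=0 remains the fallback.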
Environment
How you installed PyTorch (conda, pip, source): conda install pytorch-nightly -c pytorch

What should I do to solve this problem? Thanks for your help!