ihholmes-p closed this issue 3 years ago
This may be an out-of-memory (OOM) issue. Have you tried a smaller batch_size_per_card?
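A minimal sketch of what trying smaller loader settings might look like, assuming the config follows the usual PaddleOCR layout with `batch_size_per_card` and `num_workers` under `Train.loader`. The helper name `shrink_loader` and the sample values are hypothetical, not part of PaddleOCR.

```python
import copy


def shrink_loader(config, batch_size=2, num_workers=0):
    """Return a copy of config with smaller Train.loader settings.

    Assumes a PaddleOCR-style layout: config["Train"]["loader"].
    This helper is illustrative only and is not a PaddleOCR API.
    """
    new_config = copy.deepcopy(config)
    loader = new_config.setdefault("Train", {}).setdefault("loader", {})
    # Only ever lower a setting; keep it if it is already smaller.
    loader["batch_size_per_card"] = min(
        loader.get("batch_size_per_card", batch_size), batch_size
    )
    # num_workers=0 loads data in the main process, with no worker subprocesses.
    loader["num_workers"] = min(loader.get("num_workers", num_workers), num_workers)
    return new_config


# Illustrative starting values.
config = {"Train": {"loader": {"batch_size_per_card": 4, "num_workers": 2}}}
smaller = shrink_loader(config)
print(smaller["Train"]["loader"])
```

If training gets past the point where it used to die with these values, memory pressure from the loader is a plausible cause. PaddleOCR's tools/train.py also accepts `-o` overrides, so the same change could likely be tried without editing the config file, for example `-o Train.loader.batch_size_per_card=2 Train.loader.num_workers=0`.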
Hey, my case is the same. The only difference is that mine runs for 40 iterations before it is killed. I have batch_size_per_card set to 4 and 2 workers.
W0609 07:10:35.864583 13827 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
ERROR:root:DataLoader reader thread raised an exception!
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 482, in _get_data
data = self._data_queue.get(timeout=self._timeout)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/queues.py", line 105, in get
raise Empty
queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 411, in _thread_loop
batch = self._get_data()
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 498, in _get_data
"pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 13839
Traceback (most recent call last):
File "tools/train.py", line 127, in <module>
main(config, device, logger, vdl_writer)
File "tools/train.py", line 104, in main
eval_class, pre_best_model_dict, logger, vdl_writer)
File "/home/ubuntu/PaddleOCR/tools/program.py", line 205, in train
for idx, batch in enumerate(train_dataloader):
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 585, in __next__
data = self._reader.read_next_var_list()
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
[Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:166)
Have you found any solution?
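One way to check the OOM theory is to look at how the worker process died. A minimal sketch, assuming a Linux host: Python reports a child killed by a signal as a negative return code, and the kernel OOM killer uses SIGKILL. The helper `describe_exit` is hypothetical and not part of Paddle; the child process here is only a stand-in for a DataLoader worker.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Explain a child process return code in plain words (POSIX only)."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by signal -returncode.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with error code {}".format(returncode)


# Stand-in for a worker that is SIGKILLed, as the OOM killer would do.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
print(describe_exit(child.returncode))
```

If a worker does turn out to have been killed by SIGKILL, the kernel log (for example via `dmesg`) usually records an out-of-memory kill for that pid, which would support lowering batch_size_per_card or the number of workers.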
Since you haven't replied for more than 3 months, we have closed this issue/PR. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up. It is recommended to pull and try the latest code first.
I've tried on two different machines and hit this problem on both. I followed the instructions in the guide exactly and used the same model, data, and config file.