clovaai / voxceleb_trainer

In defence of metric learning for speaker recognition
MIT License

Insufficient shared memory #83

Closed · surasakBoonkla closed this issue 3 years ago

surasakBoonkla commented 3 years ago

I tried to train by following your scripts and got these errors:

```
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3867) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./trainSpeakerNet.py", line 166, in <module>
    loss, traineer = s.train_network(loader=trainLoader);
  File "/opt/voxceleb_trainer-master/SpeakerNet.py", line 49, in train_network
    for data, data_label in loader:
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1034, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 885, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3867, 3868) exited unexpectedly
```

What shall I do?

joonson commented 3 years ago

You need to increase the shm size. If you are using Docker, there is an argument for this.
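
For context, a minimal sketch of that Docker argument (the image name below is a placeholder): `--shm-size` raises the size of `/dev/shm` inside the container, whose 64 MB default is too small for multi-worker PyTorch DataLoaders.

```bash
# Start the container with 8 GB of shared memory instead of the 64 MB default.
docker run --shm-size=8g -it <your_image> bash
```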

surasakBoonkla commented 3 years ago

I fixed the problem by rerunning the Docker container with --ipc=host.

Ref. https://github.com/tengshaofeng/ResidualAttentionNetwork-pytorch/issues/2
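
For reference, a minimal sketch of that workaround (again, the image name is a placeholder): `--ipc=host` makes the container share the host's IPC namespace, including its `/dev/shm`, so the container is no longer capped at Docker's 64 MB default.

```bash
# Share the host's IPC namespace (and its /dev/shm) with the container.
docker run --ipc=host -it <your_image> bash
```

If changing Docker flags is not an option, lowering the DataLoader's `num_workers` (down to `num_workers=0`) should also reduce shared-memory pressure, since worker processes pass tensors through shared memory, at the cost of slower data loading.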