GuoLanqing / ShadowFormer

ShadowFormer (AAAI 2023), PyTorch implementation
MIT License

Segmentation fault after 14 epochs training #18

Open hnrna opened 1 year ago

hnrna commented 1 year ago

When I try to train on a custom dataset, a segmentation fault occurs after completing 14 epochs of training.

How can I fix it?

Dataset: custom data (training set: 475 images, validation set: 53 images)

Image size: 960×480

Environment and configuration: Python 3.7.16, PyTorch 1.13.1, CUDA 11.6

Training command: python train.py --warmup --checkpoint 1 --win_size 10 --train_ps 320 --env _self_dataset --gpu 6,7
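
One quick sanity check for a custom dataset (a sketch only; the directory layout, glob pattern, and paths below are placeholders, not the repository's actual structure): verify that every image file decodes cleanly, since a truncated or corrupted file can crash a DataLoader worker only when it happens to be sampled, which makes the failure look intermittent.

    # Sanity-check sketch (assumption: the custom dataset is ordinary image
    # files readable by Pillow; paths and pattern are placeholders).
    from pathlib import Path
    from PIL import Image

    def find_bad_images(folder, pattern="*.png"):
        bad = []
        for path in sorted(Path(folder).glob(pattern)):
            try:
                with Image.open(path) as im:
                    im.verify()  # cheap integrity check, does not fully decode
            except Exception as exc:
                bad.append((path, exc))
        return bad

    # hypothetical location of the custom training/validation images
    for split in ("train/input", "train/target", "val/input", "val/target"):
        print(split, find_bad_images(f"./custom_dataset/{split}"))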

Error info:

... (1-13 epoch info)
------------------------------------------------------------------                                                                                        
Epoch: 14       Time: 61.3263   Loss: 6.0985    LearningRate 0.000200
------------------------------------------------------------------
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/multiprocessing/connection.py", line 921, in wait
    ready = selector.select(timeout)
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 4112666) is killed by signal: Segmentation fault.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 174, in <module>
    for ii, data_val in enumerate((val_loader), 0):
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 4112666, 4112989) exited unexpectedly
hnrna commented 1 year ago

It looks like the segmentation fault occurs at the point where the training loop switches over to the validation/evaluation pass (the traceback points at the val_loader iteration in train.py).
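
A minimal diagnostic sketch, not a confirmed fix. Assumptions: the validation loader in train.py is a plain torch.utils.data.DataLoader, and val_dataset below stands in for whatever dataset object the script actually builds. Rebuilding the loader with num_workers=0 runs data loading in the main process, which usually reveals whether the crash originates inside a worker subprocess; capping OpenCV's internal threads is a common mitigation when the dataset decodes images with cv2 inside forked workers.

    # Diagnostic sketch; names are placeholders, adjust to the real script.
    from torch.utils.data import DataLoader

    # 1) Run validation loading in the main process. If the segmentation
    #    fault goes away, the crash comes from a worker subprocess
    #    (often an image-decoding path that misbehaves after fork()).
    val_loader = DataLoader(
        val_dataset,       # placeholder for the script's validation dataset
        batch_size=1,
        shuffle=False,
        num_workers=0,     # no worker subprocesses while debugging
        pin_memory=True,
    )

    # 2) If the dataset decodes images with OpenCV, limiting its internal
    #    threads before the loaders are created is a common mitigation.
    try:
        import cv2
        cv2.setNumThreads(0)
    except ImportError:
        pass

If the crash disappears with num_workers=0, the usual next steps are lowering the loader's worker count or switching the multiprocessing start method to "spawn" via torch.multiprocessing.set_start_method("spawn") before the loaders are created.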