facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0
30.48k stars 7.48k forks

OSError: [Errno 121] Remote I/O error after 12 epochs ?? #1544

rds-itga closed 4 years ago

rds-itga commented 4 years ago

Hi everybody,

I launched a training run on 512x512 8-bit .png images and got this error after more than 12 epochs of training. I really don't understand why; everything was fine before this:

[06/05 04:18:15] d2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
  File "/home/appuser/detectron2_repo/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/home/appuser/detectron2_repo/detectron2/engine/train_loop.py", line 209, in run_step
    data = next(self._data_loader_iter)
  File "/home/appuser/detectron2_repo/detectron2/data/common.py", line 140, in __iter__
    for d in self.dataset:
  File "/home/appuser/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/appuser/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/appuser/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/appuser/.local/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/appuser/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/appuser/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/appuser/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/appuser/detectron2_repo/detectron2/data/common.py", line 41, in __getitem__
    data = self._map_func(self._dataset[cur_idx])
  File "/home/appuser/detectron2_repo/detectron2/utils/serialize.py", line 23, in __call__
    return self._obj(*args, **kwargs)
  File "/home/appuser/detectron2_repo/detectron2/data/dataset_mapper.py", line 77, in __call__
    image = utils.read_image(dataset_dict["file_name"], format=self.img_format)
  File "/home/appuser/detectron2_repo/detectron2/data/detection_utils.py", line 49, in read_image
    image = Image.open(f)
  File "/home/appuser/.local/lib/python3.6/site-packages/PIL/Image.py", line 2818, in open
    prefix = fp.read(16)
OSError: [Errno 121] Remote I/O error

Here is the whole log file: log.txt

Here is my config.yaml:

CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: true
  NUM_WORKERS: 4
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:

ppwwyyxx commented 4 years ago

According to the stack trace, PIL cannot read the image file: the error is raised by fp.read(16) inside Image.open, so it comes from the file read itself.
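
Since the failure happens in a plain file read, one way to narrow it down is to read every dataset file outside of training and see whether any of them fails. The following is only a minimal sketch using the Python standard library; `read_bytes_with_retry` and `find_unreadable` are hypothetical helper names, not part of detectron2, PIL, or PyTorch, and the retry count and delay are arbitrary assumptions. On Linux, errno 121 is EREMOTEIO, which tends to come from the storage layer (for example a network mount) rather than from the image content, so a short retry may be enough to tell a transient failure from a file that is permanently unreadable.

```python
import time


def read_bytes_with_retry(path, retries=3, delay=1.0):
    """Read a file fully, retrying on OSError (hypothetical helper).

    `retries` and `delay` are illustrative defaults, not tuned values.
    """
    attempts = max(1, retries)
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError as exc:
            last_error = exc
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


def find_unreadable(paths, retries=3, delay=1.0):
    """Return (path, errno) pairs for files that still fail after retrying."""
    bad = []
    for path in paths:
        try:
            read_bytes_with_retry(path, retries=retries, delay=delay)
        except OSError as exc:
            bad.append((path, exc.errno))
    return bad
```

Passing the "file_name" values from the dataset dicts to `find_unreadable` would list the files that cannot be read at all. An empty result would suggest that the error during training was transient rather than tied to one image.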