SysCV / sam-hq

Segment Anything in High Quality [NeurIPS 2023]
https://arxiv.org/abs/2306.01567
Apache License 2.0

DataLoader throwing FileNotFoundError after a few epochs of training #54

Open · ghost opened this issue 1 year ago

ghost commented 1 year ago

I've used the training command, but every time, after a random number of epochs, I get a FileNotFoundError from the dataloader. Does anyone know the solution?

Error:

```
epoch: 14  learning rate: 1e-05
[ 0/333]  eta: 0:14:51  training_loss: 0.1127 (0.1127)  loss_mask: 0.0446 (0.0446)  loss_dice: 0.0681 (0.0681)  time: 2.6786  data: 0.3379  max mem: 10103
Traceback (most recent call last):
  File "/content/drive/MyDrive/sam-hq/train/train.py", line 651, in <module>
    main(net, train_datasets, valid_datasets, args)
  File "/content/drive/MyDrive/sam-hq/train/train.py", line 360, in main
    train(args, net, optimizer, train_dataloaders, valid_dataloaders, lr_scheduler, writer)
  File "/content/drive/MyDrive/sam-hq/train/train.py", line 396, in train
    for data in metric_logger.log_every(train_dataloaders, 1000):
  File "/content/drive/MyDrive/sam-hq/train/utils/misc.py", line 237, in log_every
    for obj in iterable:
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 644, in reraise
    raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataset.py", line 243, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/content/drive/MyDrive/sam-hq/train/utils/dataloader.py", line 244, in __getitem__
  File "/usr/local/lib/python3.10/dist-packages/skimage/io/_io.py", line 53, in imread
    img = call_plugin('imread', fname, plugin=plugin, **plugin_args)
  File "/usr/local/lib/python3.10/dist-packages/skimage/io/manage_plugins.py", line 207, in call_plugin
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/skimage/io/_plugins/imageio_plugin.py", line 15, in imread
    return np.asarray(imageio_imread(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/imageio/v2.py", line 226, in imread
    with imopen(uri, "ri", **imopen_args) as file:
  File "/usr/local/lib/python3.10/dist-packages/imageio/core/imopen.py", line 113, in imopen
    request = Request(uri, io_mode, format_hint=format_hint, extension=extension)
  File "/usr/local/lib/python3.10/dist-packages/imageio/core/request.py", line 247, in __init__
    self._parse_uri(uri)
  File "/usr/local/lib/python3.10/dist-packages/imageio/core/request.py", line 407, in _parse_uri
    raise FileNotFoundError("No such file: '%s'" % fn)
FileNotFoundError: No such file: '/content/drive/MyDrive/Iris-and-Needle-Segmentation-3/train/images/SID0615_jpg.rf.8dd4aeb70ce910df9c8716e3af21b2cd.jpg'

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2600) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-01_11:51:44
  host      : 6198cb800e23
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2600)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
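Since the same path loads fine for many epochs before the crash, one possible workaround is a small retry around the image read. This is only a sketch, under the assumption that the failures are transient Google Drive (FUSE) read stalls in Colab rather than files that are actually gone; the helper name `imread_with_retry` is hypothetical and not part of sam-hq. The dataset's `__getitem__` in `train/utils/dataloader.py` could call it in place of the plain skimage read.

```python
# Hypothetical workaround sketch (not part of sam-hq): retry transient
# FileNotFoundError/OSError raised by skimage.io.imread, which can happen
# when a Colab Google Drive mount stalls mid-training.
import time
from skimage import io


def imread_with_retry(path, retries=3, delay=5.0):
    """Read an image, retrying a few times before re-raising the error."""
    for attempt in range(retries):
        try:
            return io.imread(path)
        except (FileNotFoundError, OSError):
            if attempt == retries - 1:
                raise  # still failing after the last attempt: surface the error
            time.sleep(delay)  # give the Drive mount a moment to recover
```

If the file really has been deleted or renamed on Drive, the retry still fails after the last attempt and the original error surfaces unchanged.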
ymq2017 commented 1 year ago

This looks like a data path issue. You can check whether the image at the path /content/drive/MyDrive/Iris-and-Needle-Segmentation-3/train/images/SID0615_jpg.rf.8dd4aeb70ce910df9c8716e3af21b2cd.jpg is still there. Or is your Google Drive disconnected?
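One way to run that check from the same Colab runtime is a minimal script like the sketch below; the path is copied verbatim from the traceback above.

```python
# Quick check: is the Drive mount reachable, and is the file from the
# traceback actually present right now?
import os

path = ("/content/drive/MyDrive/Iris-and-Needle-Segmentation-3/"
        "train/images/SID0615_jpg.rf.8dd4aeb70ce910df9c8716e3af21b2cd.jpg")

print("Drive reachable:", os.path.isdir("/content/drive/MyDrive"))
print("File exists:", os.path.isfile(path))
print("Files in images dir:", len(os.listdir(os.path.dirname(path))))
```

If both checks pass when run after a crash, the file was most likely unreadable only momentarily (a Drive mount hiccup). Copying the dataset onto the local Colab disk (e.g. somewhere under /content) before training avoids reading through the Drive mount during the training loop altogether.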