ghost opened this issue 1 year ago
This looks like a data path issue. You can check whether the image at the path /content/drive/MyDrive/Iris-and-Needle-Segmentation-3/train/images/SID0615_jpg.rf.8dd4aeb70ce910df9c8716e3af21b2cd.jpg
is still there. Or has your Google Drive been disconnected?
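A quick way to check both before spending hours of training again is a small Colab cell like the sketch below. It is only a sanity check, not part of sam-hq; the image path and directory are copied from your traceback, and everything else (the variable names, the mount-point check) is just illustrative.

```python
import os
from skimage import io

# Path copied from the traceback in this issue.
img_path = ("/content/drive/MyDrive/Iris-and-Needle-Segmentation-3/"
            "train/images/SID0615_jpg.rf.8dd4aeb70ce910df9c8716e3af21b2cd.jpg")
img_dir = os.path.dirname(img_path)

# 1) Is the Drive mount still alive? If not, remount it:
#    from google.colab import drive; drive.mount('/content/drive')
print("drive mounted:", os.path.ismount("/content/drive"))
print("MyDrive visible:", os.path.isdir("/content/drive/MyDrive"))

# 2) Is the specific image still there?
print("file exists:", os.path.isfile(img_path))

# 3) Try to actually read every image once. Transient Drive timeouts often
#    show up here even when the file is still listed in the directory.
bad = []
for name in sorted(os.listdir(img_dir)):
    try:
        io.imread(os.path.join(img_dir, name))
    except Exception as exc:
        bad.append((name, exc))
print(len(bad), "files failed to read")
```

If the mount check fails, remounting Drive and restarting the run should clear it; copying the dataset into local /content storage instead of reading it from Drive is another way to avoid these intermittent drops.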
I've used the training command, but every time, after a random number of epochs, I get a FileNotFoundError from the dataloader. Does anyone know the solution?

Error:

epoch: 14 learning rate: 1e-05 [ 0/333] eta: 0:14:51 training_loss: 0.1127 (0.1127) loss_mask: 0.0446 (0.0446) loss_dice: 0.0681 (0.0681) time: 2.6786 data: 0.3379 max mem: 10103
Traceback (most recent call last):
File "/content/drive/MyDrive/sam-hq/train/train.py", line 651, in <module>
main(net, train_datasets, valid_datasets, args)
File "/content/drive/MyDrive/sam-hq/train/train.py", line 360, in main
train(args, net, optimizer, train_dataloaders, valid_dataloaders, lr_scheduler,writer)
File "/content/drive/MyDrive/sam-hq/train/train.py", line 396, in train
for data in metric_logger.log_every(train_dataloaders,1000):
File "/content/drive/MyDrive/sam-hq/train/utils/misc.py", line 237, in log_every
for obj in iterable:
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 633, in next
data = self._next_data()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 644, in reraise
raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataset.py", line 243, in getitem
return self.datasets[dataset_idx][sample_idx]
File "/content/drive/MyDrive/sam-hq/train/utils/dataloader.py", line 244, in getitem
File "/usr/local/lib/python3.10/dist-packages/skimage/io/_io.py", line 53, in imread
img = call_plugin('imread', fname, plugin=plugin, **plugin_args)
File "/usr/local/lib/python3.10/dist-packages/skimage/io/manage_plugins.py", line 207, in call_plugin
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/skimage/io/_plugins/imageio_plugin.py", line 15, in imread
return np.asarray(imageio_imread(*args, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/imageio/v2.py", line 226, in imread
with imopen(uri, "ri", **imopen_args) as file:
File "/usr/local/lib/python3.10/dist-packages/imageio/core/imopen.py", line 113, in imopen
request = Request(uri, io_mode, format_hint=format_hint, extension=extension)
File "/usr/local/lib/python3.10/dist-packages/imageio/core/request.py", line 247, in init
self._parse_uri(uri)
File "/usr/local/lib/python3.10/dist-packages/imageio/core/request.py", line 407, in _parse_uri
raise FileNotFoundError("No such file: '%s'" % fn)
FileNotFoundError: No such file: '/content/drive/MyDrive/Iris-and-Needle-Segmentation-3/train/images/SID0615_jpg.rf.8dd4aeb70ce910df9c8716e3af21b2cd.jpg'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2600) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures: