Can you confirm the path /dataset/Imagenet exists?
Yes, it exists.
I'm pretty sure that's the problem, it's literally failing at checking if the dataset path exists.
I am getting an error like this: "ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1)", not one about the dataset.
I have already tried giving the full path of the dataset. The same thing happened.
It is literally failing here:
assert os.path.exists(root)
AssertionError
Also, could you clarify what the "full path of the dataset" is? Can you please ls /dataset/ImageNet and share the output?
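For reference, that check can be reproduced outside the training script with a couple of lines of Python (a minimal sketch; /dataset/Imagenet below is just the path passed in this thread):

import os

# Path exactly as it was passed to train.py; replace it with your own.
root = "/dataset/Imagenet"

# This mirrors the check in timm's create_parser that raises the AssertionError above:
print(os.path.exists(root))  # must print True for training to start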
I have created a folder named "dataset" in the classification folder and put ImageNet in the dataset folder.
In that case it should be dataset/ImageNet and not /dataset/ImageNet (no forward slash at the beginning).
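A minimal sketch of the difference, assuming train.py is launched from the classification folder:

import os

# "dataset/ImageNet" (no leading /) is resolved against the current working
# directory, i.e. the folder you launch train.py from, while "/dataset/ImageNet"
# is taken literally from the root of the filesystem.
print(os.path.abspath("dataset/ImageNet"))
print(os.path.abspath("/dataset/ImageNet"))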
Thank you for replying so fast.
Yeah, I know that; I have tried that too.
And I have also given the full path, like ~/Downloads/MyProject/Neighborhood-Attention-Transformer/classification/dataset/ImageNet
And I have also tried ./dataset/ImageNet
@Mehulk43 I can confirm that this is a path issue. It is an assertion error raised in timm's create_dataset function. You may be confused because we have left /dataset/ImageNet in as an example of where that might be; it's pretty unlikely that's where you actually have ImageNet. I suggest running readlink -f <insert ImageNet folder path here> and pasting the output into the path argument.
Also note that a path starting with ~/ is not a full path: ~/ is shorthand for the $HOME variable. Full paths start with /, which is the root directory.
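If it helps, the same resolution can be done in Python (a minimal sketch; the path below is only the example from this thread and should be replaced with wherever ImageNet actually lives):

import os

# Example path from this thread; substitute your actual ImageNet folder.
raw = "~/Downloads/MyProject/Neighborhood-Attention-Transformer/classification/dataset/ImageNet"

# expanduser turns the leading ~/ into $HOME, and realpath resolves the result
# into a full path starting with / (the root directory), roughly what readlink -f prints.
full = os.path.realpath(os.path.expanduser(raw))
print(full)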
Thank you,
I will try and upload the screenshot if I get the error again.
Closing this due to inactivity. If you still have questions feel free to open it back up.
python -m torch.distributed.launch --nproc_per_node=1 train.py -c configs/nat_mini.yml /dataset/Imagenet
/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
Training with a single process on 1 GPUs.
WARNING: Unsupported operator aten::mul encountered 52 time(s)
WARNING: Unsupported operator aten::softmax encountered 18 time(s)
WARNING: Unsupported operator aten::add encountered 70 time(s)
WARNING: Unsupported operator aten::gelu encountered 18 time(s)
WARNING: Unsupported operator aten::rand encountered 34 time(s)
WARNING: Unsupported operator aten::floor_ encountered 34 time(s)
WARNING: Unsupported operator aten::div encountered 34 time(s)
WARNING: Unsupported operator aten::adaptive_avg_pool1d encountered 1 time(s)
Model nat_mini created. 19.984M Params and 2.713GFLOPs
Data processing configuration for current model + dataset:
    input_size: (3, 224, 224)
    interpolation: bicubic
    mean: (0.485, 0.456, 0.406)
    std: (0.229, 0.224, 0.225)
    crop_pct: 0.875
Using native Torch AMP. Training in mixed precision.
Traceback (most recent call last):
  File "train.py", line 1020, in <module>
    main(args)
  File "train.py", line 517, in main
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/timm/data/dataset_factory.py", line 138, in create_dataset
    ds = ImageDataset(root, parser=name, class_map=class_map, load_bytes=load_bytes, **kwargs)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/timm/data/dataset.py", line 32, in __init__
    parser = create_parser(parser or '', root=root, class_map=class_map)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/timm/data/parsers/parser_factory.py", line 22, in create_parser
    assert os.path.exists(root)
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 48603) of binary: /home/user/anaconda3/envs/nat/bin/python
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/nat/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/anaconda3/envs/nat/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
Failures: