While running the code, I got this type of problem. Could you please tell me the solution?

Mehulk43 commented 2 years ago

python -m torch.distributed.launch --nproc_per_node=1 train.py -c configs/nat_mini.yml /dataset/Imagenet

/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
Training with a single process on 1 GPUs.
WARNING: Unsupported operator aten::mul encountered 52 time(s)
WARNING: Unsupported operator aten::softmax encountered 18 time(s)
WARNING: Unsupported operator aten::add encountered 70 time(s)
WARNING: Unsupported operator aten::gelu encountered 18 time(s)
WARNING: Unsupported operator aten::rand encountered 34 time(s)
WARNING: Unsupported operator aten::floor_ encountered 34 time(s)
WARNING: Unsupported operator aten::div encountered 34 time(s)
WARNING: Unsupported operator aten::adaptive_avg_pool1d encountered 1 time(s)
Model nat_mini created. 19.984M Params and 2.713GFLOPs

Data processing configuration for current model + dataset:
    input_size: (3, 224, 224)
    interpolation: bicubic
    mean: (0.485, 0.456, 0.406)
    std: (0.229, 0.224, 0.225)
    crop_pct: 0.875
Using native Torch AMP. Training in mixed precision.
Traceback (most recent call last):
  File "train.py", line 1020, in <module>
    main(args)
  File "train.py", line 517, in main
    dataset_train = create_dataset(
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/timm/data/dataset_factory.py", line 138, in create_dataset
    ds = ImageDataset(root, parser=name, class_map=class_map, load_bytes=load_bytes, **kwargs)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/timm/data/dataset.py", line 32, in __init__
    parser = create_parser(parser or '', root=root, class_map=class_map)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/timm/data/parsers/parser_factory.py", line 22, in create_parser
    assert os.path.exists(root)
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 48603) of binary: /home/user/anaconda3/envs/nat/bin/python

Traceback (most recent call last):
  File "/home/user/anaconda3/envs/nat/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/anaconda3/envs/nat/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================
train.py FAILED
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-11-04_13:29:38
  host      : user
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 48603)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

alihassanijr commented 2 years ago

Can you confirm the path /dataset/Imagenet exists?

Mehulk43 commented 2 years ago

Yes, it exists.


alihassanijr commented 2 years ago

I'm pretty sure that's the problem, it's literally failing at checking if the dataset path exists.

Mehulk43 commented 2 years ago


The error I am getting is "ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1)", not one about the dataset.

I have already tried giving the full path of the dataset. The same thing happens.

alihassanijr commented 2 years ago

It is literally failing here:

assert os.path.exists(root)
AssertionError

Also, could you clarify what the "full path of the dataset" is? Can you please run ls /dataset/ImageNet and share the output?
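The failing assertion only reports that the path is missing, not which part of it. A minimal Python sketch of a check that narrows this down is shown here; `diagnose_dataset_root` is a hypothetical helper written for illustration, not part of timm or this repository, and it uses only the standard library.

```python
import os


def diagnose_dataset_root(root):
    """Explain why os.path.exists(root) would be False.

    Returns (existing_ancestor, missing_part, near_matches).
    Hypothetical helper for illustration; not a timm function.
    """
    # Relative paths are resolved against the current working directory.
    target = os.path.abspath(root)
    if os.path.exists(target):
        return target, "", []

    # Walk upwards until a directory that exists is found.
    existing = target
    while not os.path.exists(existing):
        parent = os.path.dirname(existing)
        if parent == existing:  # reached the top of the filesystem
            break
        existing = parent

    missing = os.path.relpath(target, existing)
    first_missing = missing.split(os.sep)[0]

    # Names that differ from the missing component only in letter case.
    near = []
    if os.path.isdir(existing):
        near = sorted(
            name for name in os.listdir(existing)
            if name.lower() == first_missing.lower()
        )
    return existing, missing, near


if __name__ == "__main__":
    # The path from the original command; the result depends on the machine.
    print(diagnose_dataset_root("/dataset/Imagenet"))
```

If the first element of the result is `/`, then no `/dataset` directory exists at the filesystem root. A non-empty third element would point to a capitalization mismatch in the path passed on the command line.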

Mehulk43 commented 2 years ago


I have created a folder named "dataset" in the classification folder and put ImageNet in the dataset folder.

alihassanijr commented 2 years ago

In that case it should be dataset/ImageNet and not /dataset/ImageNet (no forward slash at the beginning).
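To illustrate the difference the leading slash makes, here is a small sketch using POSIX path rules from the Python standard library. The working directory below is a made-up example, and `as_seen_by_process` is a hypothetical helper name.

```python
import posixpath


def as_seen_by_process(path, cwd):
    """Return the absolute location a process started in `cwd` would look at."""
    # posixpath.join discards `cwd` when `path` is already absolute.
    return posixpath.normpath(posixpath.join(cwd, path))


# Hypothetical working directory, for illustration only.
cwd = "/home/user/Neighborhood-Attention-Transformer/classification"

# Relative paths are looked up inside the working directory.
print(as_seen_by_process("dataset/ImageNet", cwd))
print(as_seen_by_process("./dataset/ImageNet", cwd))

# A leading slash ignores the working directory and starts at the root.
print(as_seen_by_process("/dataset/ImageNet", cwd))  # /dataset/ImageNet
```

The first two calls resolve to the same location under the working directory, which is why a relative path only works when train.py is launched from the classification folder.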

Mehulk43 commented 2 years ago


Thank you for the fast reply.

Yes, I know that; I have tried that too.

I have also given the full path, like ~/Downloads/MyProject/[Neighborhood-Attention-Transformer/classification/dataset/ImageNet

And I have also tried ./dataset/ImageNet

stevenwalton commented 2 years ago

@Mehulk43 I can confirm that this is a path issue. It is an assertion error raised in timm from the create_dataset function. You may be confused because we have left /dataset/ImageNet in as an example of where the dataset might be. It's pretty unlikely that this is where your copy of ImageNet is. I suggest running readlink -f <insert ImageNet folder path here> and pasting the output into the path argument.

Also note that a path starting with ~/ is not a true absolute path. ~ is shorthand that the shell expands to the value of the $HOME variable, so it only works where that expansion happens. Full paths start with /, which is the root directory.
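A rough Python counterpart to the readlink -f suggestion, as a sketch under the assumption that the path is being handled inside a script rather than by the shell. `resolve_dataset_path` is a hypothetical helper name; `os.path.expanduser` and `os.path.realpath` are standard-library functions.

```python
import os


def resolve_dataset_path(path):
    """Expand a leading '~' and resolve symlinks and relative parts.

    Similar in spirit to `readlink -f`; hypothetical helper for illustration.
    """
    return os.path.realpath(os.path.expanduser(path))


if __name__ == "__main__":
    # An unexpanded '~' is not treated as absolute by Python.
    print(os.path.isabs("~/dataset/ImageNet"))  # False
    # After expansion the result starts at the root directory.
    print(resolve_dataset_path("~/dataset/ImageNet"))
    # Relative paths are resolved against the current working directory.
    print(resolve_dataset_path("dataset/ImageNet"))
```

The printed paths depend on the machine and on the directory the script is run from, and resolving a path does not guarantee that it exists.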

Mehulk43 commented 2 years ago

Thank you,

I will try that and upload a screenshot if I get the error again.

alihassanijr commented 1 year ago

Closing this due to inactivity. If you still have questions feel free to open it back up.

SHI-Labs / Neighborhood-Attention-Transformer

While running the code, I got this type of problem. Could you please tell me the solution? #69