Closed · erictan23 closed this issue 5 months ago
This issue is caused by version changes. The following two lines of code:

```python
import ruamel.yaml as yaml
config = yaml.load(open(args.config, 'r'), Loader=yaml.Loader)
```

can be replaced with:

```python
from ruamel.yaml import YAML
yaml = YAML(typ='safe')
config = yaml.load(open(args.config, 'r'))
```
Hello @Shuyu-XJTU ! Thank you for your advice on changing the lines of code due to version changes.
I am still experiencing a problem similar to the previous one, where Retrieval.py fails. I have checked my CUDA devices, and they are actually available. I suspect this is because torch.distributed.launch is deprecated, but I am not entirely sure how to modify the code to rectify the problem. Could you advise me further on what I should do?
The following is the failure output from the Python script:
```
NNODES, 1
NPROC_PER_NODE, 4
MASTER_ADDR, 127.0.0.1
MASTER_PORT, 3000
NODE_RANK, 0
/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
[2024-04-01 10:10:58,498] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Traceback (most recent call last):
  File "Retrieval.py", line 303, in <module>
    [...] TORCH_USE_CUDA_DSA to enable device-side assertions.
| distributed init (rank 0): env://
Traceback (most recent call last):
  File "Retrieval.py", line 303, in <module>
    [...] TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
  File "Retrieval.py", line 303, in <module>
    [...] TORCH_USE_CUDA_DSA to enable device-side assertions.
Root Cause (first observed failure):
[0]:
  time      : 2024-04-01_10:11:08
  host      : default-Pulse-15-B13VFK
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 17994)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
Sorry, maybe you can try to replace `torch.distributed.launch` with `torch.distributed.run`.
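As the deprecation warning in the log says, the main script-side change for that migration is that `torchrun` (i.e. `torch.distributed.run`) sets `LOCAL_RANK` in the environment instead of passing a `--local-rank` argument. A minimal sketch of reading it, assuming the repo's default of 4 processes per node for the launch command in the comment:

```python
# Sketch: reading the local rank the way torchrun expects, instead of
# parsing the --local-rank argument that torch.distributed.launch passed.
# A launch would then look like:
#   torchrun --nproc_per_node=4 Retrieval.py --config <config.yaml>
import os

# Falls back to 0 when the script is run outside torchrun (single process).
local_rank = int(os.environ.get('LOCAL_RANK', 0))
print(local_rank)
```

Any existing `argparse` definition of `--local-rank` can be kept for backward compatibility, as long as the environment variable takes precedence when present.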
I was running the code for evaluation only, after downloading the required datasets and checkpoints. However, I am not able to run the evaluation Python script. Does anyone have the same problem, and do you have a solution? To reproduce the error:

```
python3 run.py --task "itr_cuhk" --evaluate --dist "f4" --output_dir "output/ft_cuhk/test" --checkpoint "output/ft_cuhk/checkpoint_best.pth"
```
I am wondering whether this issue is caused by a version mismatch. I am using PyYAML 6.0.1, PyTorch 2.2.1, ruamel.yaml 0.18.6, and ruamel.yaml.clib 0.2.8.
Error
```
NNODES, 1
NPROC_PER_NODE, 4
MASTER_ADDR, 127.0.0.1
MASTER_PORT, 3000
NODE_RANK, 0
/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
[2024-03-25 15:15:00,947] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Traceback (most recent call last):
  File "Retrieval.py", line 296, in <module>
    config = yaml.load(open(args.config, 'r'), Loader=yaml.Loader)
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/ruamel/yaml/main.py", line 1085, in load
    error_deprecation('load', 'load', arg=_error_dep_arg, comment=_error_dep_comment)
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/ruamel/yaml/main.py", line 1037, in error_deprecation
    raise AttributeError(s)
AttributeError:
"load()" has been removed, use

  yaml = YAML(typ='rt')
  yaml.load(...)

and register any classes that you use, or check the tag attribute on the loaded data, instead of
  file "Retrieval.py", line 296
```

The same `AttributeError` traceback is printed once per worker process (four in total, one for each local rank), after which the launcher reports the failure:

```
[2024-03-25 15:15:11,033] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 142176) of binary: /home/default/miniconda3/envs/aptm/bin/python3
Traceback (most recent call last):
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Retrieval.py FAILED
Failures:
[1]:
  time      : 2024-03-25_15:15:11
  host      : default-Pulse-15-B13VFK
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 142177)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-25_15:15:11
  host      : default-Pulse-15-B13VFK
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 142178)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-25_15:15:11
  host      : default-Pulse-15-B13VFK
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 142179)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2024-03-25_15:15:11
  host      : default-Pulse-15-B13VFK
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 142176)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```