Closed · erictan23 closed this issue 5 months ago
This issue is caused by version changes. The following two lines of code:

```python
import ruamel.yaml as yaml
config = yaml.load(open(args.config, 'r'), Loader=yaml.Loader)
```

can be replaced with:

```python
from ruamel.yaml import YAML
yaml = YAML(typ='safe')
config = yaml.load(open(args.config, 'r'))
```
Hello @Shuyu-XJTU ! Thank you for your advice on changing the lines of code due to version changes.
I am still experiencing a problem similar to the previous one, where Retrieval.py fails. I have checked my CUDA devices, and they are actually available. I suspect this is because torch.distributed.launch is deprecated, but I am not entirely sure how to modify the code to rectify the problem. Could you advise me further on what I should do?
The following is the failure output from the Python script:
```
NNODES, 1
NPROC_PER_NODE, 4
MASTER_ADDR, 127.0.0.1
MASTER_PORT, 3000
NODE_RANK, 0
/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
[2024-04-01 10:10:58,498] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Traceback (most recent call last):
  File "Retrieval.py", line 303, in <module>
    [...] TORCH_USE_CUDA_DSA to enable device-side assertions.
| distributed init (rank 0): env://
Traceback (most recent call last):
  File "Retrieval.py", line 303, in <module>
    [...] TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
  File "Retrieval.py", line 303, in <module>
    [...] TORCH_USE_CUDA_DSA to enable device-side assertions.
Root Cause (first observed failure):
[0]:
  time      : 2024-04-01_10:11:08
  host      : default-Pulse-15-B13VFK
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 17994)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
Sorry, maybe you can try to replace `torch.distributed.launch` with `torch.distributed.run`.
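As the deprecation warning in the log says, the main script-side change for that migration is that `torchrun` (i.e. `torch.distributed.run`) sets `LOCAL_RANK` in the environment instead of passing a `--local-rank` argument. A minimal sketch of reading it, assuming the repo's default of 4 processes per node for the launch command in the comment:

```python
# Sketch: reading the local rank the way torchrun expects, instead of
# parsing the --local-rank argument that torch.distributed.launch passed.
# A launch would then look like:
#   torchrun --nproc_per_node=4 Retrieval.py --config <config.yaml>
import os

# Falls back to 0 when the script is run outside torchrun (single process).
local_rank = int(os.environ.get('LOCAL_RANK', 0))
print(local_rank)
```

Any existing `argparse` definition of `--local-rank` can be kept for backward compatibility, as long as the environment variable takes precedence when present.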
I was running the code for evaluation only, after downloading the required datasets and checkpoints. However, I am not able to run the evaluation Python script. Does anyone have the same problem, and do you have a solution? To reproduce the error:

```
python3 run.py --task "itr_cuhk" --evaluate --dist "f4" --output_dir "output/ft_cuhk/test" --checkpoint "output/ft_cuhk/checkpoint_best.pth"
```
I am wondering whether this issue is caused by a version mismatch. I am using PyYAML 6.0.1, PyTorch 2.2.1, ruamel.yaml 0.18.6, and ruamel.yaml.clib 0.2.8.
Error
```
NNODES, 1
NPROC_PER_NODE, 4
MASTER_ADDR, 127.0.0.1
MASTER_PORT, 3000
NODE_RANK, 0
/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
[2024-03-25 15:15:00,947] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Traceback (most recent call last):
  File "Retrieval.py", line 296, in <module>
    config = yaml.load(open(args.config, 'r'), Loader=yaml.Loader)
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/ruamel/yaml/main.py", line 1085, in load
    error_deprecation('load', 'load', arg=_error_dep_arg, comment=_error_dep_comment)
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/ruamel/yaml/main.py", line 1037, in error_deprecation
    raise AttributeError(s)
AttributeError:
"load()" has been removed, use

  yaml = YAML(typ='rt')
  yaml.load(...)

and register any classes that you use, or check the tag attribute on the loaded data, instead of
  file "Retrieval.py", line 296
```

The same `AttributeError` traceback is printed once per worker process (four in total, one for each local rank), after which the launcher reports the failure:

```
[2024-03-25 15:15:11,033] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 142176) of binary: /home/default/miniconda3/envs/aptm/bin/python3
Traceback (most recent call last):
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Retrieval.py FAILED
Failures:
[1]:
  time      : 2024-03-25_15:15:11
  host      : default-Pulse-15-B13VFK
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 142177)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-25_15:15:11
  host      : default-Pulse-15-B13VFK
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 142178)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-25_15:15:11
  host      : default-Pulse-15-B13VFK
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 142179)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2024-03-25_15:15:11
  host      : default-Pulse-15-B13VFK
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 142176)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```