Ah, what version of torch are you using? There was a recent update where I also got this warning, so I had to go back to an earlier version. The BasicSR package (which this code heavily relies on) has not yet been updated to work with newer torch versions.
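It may also be worth checking which torch install actually gets picked up at runtime: the traceback later in this thread shows the ssr conda environment's python importing torch from ~/.local/lib/python3.8/site-packages, so the version reported by pip and the version the training run uses can differ. A quick check from inside that environment could be, for example:

python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"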
PyTorch 2.2.0. Which version are you using?
Hm, up to 2.1.0 works for me. Can you try adding --use-env to your command?
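If a downgrade is needed, pinning torchvision alongside torch is usually necessary too, since the torchvision.transforms.functional_tensor module that produces the UserWarning in the log below is removed in torchvision 0.17. A suggested pin matching the 2.1.0 version reported working here (the exact torchvision pin is an assumption; 0.16.x is the release series paired with torch 2.1):

pip install "torch==2.1.0" "torchvision==0.16.0"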
So I checked again, and it's actually 2.1.0. I'll try adding --use-env and come back if there's still a problem.
I got the following error:
usage: train.py [-h] -opt OPT [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local-rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
train.py: error: unrecognized arguments: --use-env
usage: train.py [-h] -opt OPT [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local-rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
train.py: error: unrecognized arguments: --use-env
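This error suggests the flag most likely ended up after the script path: --use-env is an option of torch.distributed.launch itself, and anything placed after ssr/train.py is forwarded to train.py's own argument parser, which then rejects it. A sketch of the intended placement, reusing the command quoted later in this thread with the flag moved before the script:

PYTHONPATH=. python -m torch.distributed.launch --nproc_per_node=2 --master_port=1234 --use-env ssr/train.py -opt ssr/options/esrgan_s2naip_urban.yml --launcher pytorch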
Hmm okay. Let me try to replicate this. I ran into this issue a few weeks ago, but I don't remember how I solved it.
Hi again - I haven't been able to replicate this. Have you resolved it, or are you still getting the same errors?
Hey,
I've tried everything and still can't get it to train on multiple GPUs...
So weird. I tried installing the environment from scratch again, and adding the --use-env flag is what allowed it to work with torch.distributed.launch (as opposed to torchrun). I can add getting torchrun to work to my TODO list, but that may require a PR to the BasicSR repo. I'm going to close this issue for now and will reopen it if/when I get to that.
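For anyone who wants torchrun before that TODO lands, the change the FutureWarning in the log below asks for is usually just reading the local rank from the LOCAL_RANK environment variable instead of requiring a --local-rank flag. A minimal sketch of that argparse pattern, assuming a simplified parser (the real ssr/BasicSR parser has more options; the option names here follow the usage string shown below):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("-opt", type=str, required=True)
# Accept the flag injected by torch.distributed.launch (--local-rank / --local_rank),
# but fall back to the LOCAL_RANK environment variable set by torchrun and --use-env.
parser.add_argument("--local-rank", "--local_rank", dest="local_rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", 0)))
args = parser.parse_args()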
Hi, is the command for running the script on multiple GPUs correct? I am getting some problems at the beginning of the training phase. I want to train it on 2 GPUs.
After this command "PYTHONPATH=. python -m torch.distributed.launch --nproc_per_node=2 --master_port=1234 ssr/train.py -opt ssr/options/esrgan_s2naip_urban.yml --launcher pytorch" I get:
/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use-env is set by default in torchrun. If your script expects --local-rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
[2024-03-22 21:56:50,850] torch.distributed.run: [WARNING]
[2024-03-22 21:56:50,850] torch.distributed.run: [WARNING] *****************************************
[2024-03-22 21:56:50,850] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-22 21:56:50,850] torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:4321 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:4321 (errno: 97 - Address family not supported by protocol).
/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be removed in 0.17. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be removed in 0.17. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
usage: train.py [-h] -opt OPT [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local_rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
usage: train.py [-h] -opt OPT [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local_rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
train.py: error: unrecognized arguments: --local-rank=1
train.py: error: unrecognized arguments: --local-rank=0
[2024-03-22 21:57:00,889] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 4154405) of binary: /trinity/home/park.yunseok/.conda/envs/ssr/bin/python
Traceback (most recent call last):
  File "/trinity/home/park.yunseok/.conda/envs/ssr/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/trinity/home/park.yunseok/.conda/envs/ssr/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
ssr/train.py FAILED
Failures:
  [1]:
    time      : 2024-03-22_21:57:00
    host      : gn16.zhores
    rank      : 1 (local_rank: 1)
    exitcode  : 2 (pid: 4154406)
    error_file: <N/A>
    traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
  [0]:
    time      : 2024-03-22_21:57:00
    host      : gn16.zhores
    rank      : 0 (local_rank: 0)
    exitcode  : 2 (pid: 4154405)
    error_file: <N/A>
    traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html