allenai / satlas-super-resolution

Apache License 2.0
221 stars · 25 forks

Multi-gpu training #19

Closed yunseok624 closed 6 months ago

yunseok624 commented 8 months ago

Hi, is the command for running the script on multiple GPUs correct? I am getting some errors at the beginning of the training phase. I want to train on 2 GPUs.

After running the command `PYTHONPATH=. python -m torch.distributed.launch --nproc_per_node=2 --master_port=1234 ssr/train.py -opt ssr/options/esrgan_s2naip_urban.yml --launcher pytorch`, I get:

/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use-env is set by default in torchrun. If your script expects --local-rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

warnings.warn(
[2024-03-22 21:56:50,850] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:4321 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:4321 (errno: 97 - Address family not supported by protocol).
/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be removed in 0.17. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
usage: train.py [-h] -opt OPT [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local_rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
train.py: error: unrecognized arguments: --local-rank=1
train.py: error: unrecognized arguments: --local-rank=0
[2024-03-22 21:57:00,889] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 4154405) of binary: /trinity/home/park.yunseok/.conda/envs/ssr/bin/python
Traceback (most recent call last):
  File "/trinity/home/park.yunseok/.conda/envs/ssr/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/trinity/home/park.yunseok/.conda/envs/ssr/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/trinity/home/park.yunseok/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

ssr/train.py FAILED

Failures:
[1]:
  time       : 2024-03-22_21:57:00
  host       : gn16.zhores
  rank       : 1 (local_rank: 1)
  exitcode   : 2 (pid: 4154406)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2024-03-22_21:57:00
  host       : gn16.zhores
  rank       : 0 (local_rank: 0)
  exitcode   : 2 (pid: 4154405)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

piperwolters commented 8 months ago

Ah, what version of torch are you using? There was a recent update where I also got this warning, so I had to go back to an earlier version. The BasicSR package (which this code heavily relies on) has not yet been updated to work with newer torch versions.
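For anyone hitting this later: the `unrecognized arguments: --local-rank` errors happen because newer versions of `torch.distributed.launch` pass the hyphenated `--local-rank` flag to the script, while the parser used here registers the underscored `--local_rank`. A minimal sketch of a parser that tolerates both spellings and falls back to the `LOCAL_RANK` environment variable (a hypothetical illustration, not the actual BasicSR code):

```python
import argparse
import os

def parse_local_rank(argv=None):
    """Return the local rank, tolerating both launcher flag spellings.

    torch.distributed.launch passed --local_rank in older torch versions
    and --local-rank in newer ones; torchrun (and launch with --use-env)
    exports the LOCAL_RANK environment variable instead of a flag.
    """
    parser = argparse.ArgumentParser()
    # Registering both option strings under one dest covers old and new
    # launchers; the env var is the fallback for torchrun / --use-env.
    parser.add_argument(
        "--local-rank", "--local_rank",
        type=int,
        dest="local_rank",
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank
```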

yunseok624 commented 8 months ago

PyTorch 2.2.0. Which version are you using?

piperwolters commented 8 months ago

Hm, up to 2.1.0 works for me. Can you try adding --use-env to your command?
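If pinning helps, something like this should roll torch back (the torchvision pairing for 2.1.0 is my assumption; check the compatibility matrix for your setup):

```shell
pip install torch==2.1.0 torchvision==0.16.0
```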

yunseok624 commented 7 months ago

So I checked again, it's 2.1.0. I'll try adding --use-env and come back if there's still a problem.

yunseok624 commented 7 months ago

I got this error:

usage: train.py [-h] -opt OPT [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local-rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
train.py: error: unrecognized arguments: --use-env

piperwolters commented 7 months ago

Hmm, okay. Let me try to replicate this. I ran into this issue a few weeks ago, but I don't remember how I solved it.

piperwolters commented 6 months ago

Hi again - I haven't been able to replicate this. Have you resolved it, or are you still getting the same errors?

yunseok624 commented 6 months ago

Hey,

I've tried every way I could think of and still can't train on multiple GPUs...

piperwolters commented 6 months ago

So weird. I tried installing the environment from scratch again, and adding the --use-env flag is what allowed it to work with torch.distributed.launch (as opposed to torchrun). I can add a TODO to get torchrun working, but that may require a PR to the BasicSR repo. I'm going to close this issue for now and will reopen it if/when I get to that.
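For reference, the launch line from the original report with the flag added would look like this. Note that --use-env has to go before ssr/train.py: anything after the script path is forwarded to train.py's own argument parser, which is exactly the `unrecognized arguments: --use-env` error reported earlier in this thread.

```shell
PYTHONPATH=. python -m torch.distributed.launch --use-env --nproc_per_node=2 --master_port=1234 ssr/train.py -opt ssr/options/esrgan_s2naip_urban.yml --launcher pytorch
```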