FlyEgle / MAE-pytorch

Masked Autoencoders Are Scalable Vision Learners

error occur #13

Closed leeisack closed 1 year ago

leeisack commented 1 year ago

Hi~ I'm really impressed with your code. I want to restore the hidden part of a face, so I want to train with your code instead of only running inference with the pretrained model. However, when I try to train, many problems arise. Advice please.

CUDA_VISIBLE_DEVICES=0,1 python -W ignore -m torch.distributed.launch --nproc_per_node 8 train_mae.py

rank: 1 / 2
rank: 4 / 2
rank: 0 / 2
rank: 3 / 2
rank: 5 / 2
rank: 2 / 2
rank: 6 / 2
rank: 7 / 2
Traceback (most recent call last):
  File "train_mae.py", line 692, in <module>
    main_worker(args)
  File "train_mae.py", line 205, in main_worker
    torch.cuda.set_device(args.local_rank)
  File "/home/vimlab/anaconda3/envs/stylegan2_pytorch/lib/python3.6/site-packages/torch/cuda/__init__.py", line 261, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal

(the same traceback is printed by each of the other failing worker processes)

Killing subprocess 1520881
Killing subprocess 1520882
Killing subprocess 1520883
Killing subprocess 1520884
Killing subprocess 1520885
Killing subprocess 1520886
Killing subprocess 1520889
Killing subprocess 1520893
Traceback (most recent call last):
  File "/home/vimlab/anaconda3/envs/stylegan2_pytorch/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/vimlab/anaconda3/envs/stylegan2_pytorch/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/vimlab/anaconda3/envs/stylegan2_pytorch/lib/python3.6/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/vimlab/anaconda3/envs/stylegan2_pytorch/lib/python3.6/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/vimlab/anaconda3/envs/stylegan2_pytorch/lib/python3.6/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/vimlab/anaconda3/envs/stylegan2_pytorch/bin/python', '-u', 'train_mae.py', '--local_rank=7']' returned non-zero exit status 1.

FlyEgle commented 1 year ago

This is a DDP problem: you only make GPUs 0 and 1 visible (CUDA_VISIBLE_DEVICES=0,1), but --nproc_per_node is 8, so ranks 2-7 try to set device ordinals that do not exist.
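A minimal sketch of a fix, assuming the machine really has only GPUs 0 and 1 available (the script name and flags are taken from the launch command in the log above): either launch as many worker processes as there are visible GPUs, or make all eight GPUs visible to match --nproc_per_node 8.

# Option 1: match the process count to the two visible GPUs
CUDA_VISIBLE_DEVICES=0,1 python -W ignore -m torch.distributed.launch --nproc_per_node 2 train_mae.py

# Option 2: if the machine actually has eight GPUs, expose them all
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -W ignore -m torch.distributed.launch --nproc_per_node 8 train_mae.py

Either way, the number of processes per node must not exceed the number of CUDA devices visible to the job, since each rank calls torch.cuda.set_device(local_rank).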