XingangPan / GAN2Shape

Code for GAN2Shape (ICLR2021 oral)
https://arxiv.org/abs/2011.00844
MIT License

RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. #32

Open leonel-os opened 3 years ago

leonel-os commented 3 years ago

Now I'm getting the following error. I installed all the dependencies including CUDA, but when I run:

sh scripts/run_car.sh

I'm getting this:

Load config from yml file: configs/car.yml
Load config from yml file: configs/car.yml
Load config from yml file: configs/car.yml
Load config from yml file: configs/car.yml
Loading configs from configs/car.yml
Loading configs from configs/car.yml
Loading configs from configs/car.yml
Loading configs from configs/car.yml

{'checkpoint_dir': 'results/car', 'save_checkpoint_freq': 500, 'keep_num_checkpoint': 2, 'use_logger': True, 'log_freq': 100, 'joint_train': False, 'independent': False, 'reset_weight': True, 'save_results': True, 'num_stage': 4, 'flip1_cfg': [False, False, False, False], 'flip3_cfg': [False, False, False, False], 'stage_len_dict': {'step1': 700, 'step2': 700, 'step3': 600}, 'stage_len_dict2': {'step1': 200, 'step2': 500, 'step3': 400}, 'image_size': 128, 'load_gt_depth': False, 'img_list_path': 'data/car/list.txt', 'img_root': 'data/car', 'latent_root': 'data/car/latents', 'model_name': 'gan2shape_car', 'category': 'car', 'share_weight': True, 'relative_enc': False, 'use_mask': True, 'add_mean_L': True, 'add_mean_V': True, 'min_depth': 0.9, 'max_depth': 1.1, 'xyz_rotation_range': 60, 'xy_translation_range': 0.1, 'z_translation_range': 0, 'collect_iters': 100, 'batchsize': 8, 'lr': 0.0001, 'lam_perc': 0.5, 'lam_smooth': 0.01, 'lam_regular': 0.01, 'view_mvn_path': 'checkpoints/view_light/view_mvn.pth', 'light_mvn_path': 'checkpoints/view_light/light_mvn.pth', 'rand_light': [-1, 1, -0.2, 0.8, -0.1, 0.6, -0.6], 'channel_multiplier': 2, 'gan_size': 512, 'gan_ckpt': 'checkpoints/stylegan2/stylegan2-car-config-f.pt', 'F1_d': 2, 'rot_center_depth': 1.0, 'fov': 10, 'tex_cube_size': 2, 'config': 'configs/car.yml', 'seed': 0, 'num_workers': 4, 'distributed': True}
{'checkpoint_dir': 'results/car', 'save_checkpoint_freq': 500, 'keep_num_checkpoint': 2, 'use_logger': True, 'log_freq': 100, 'joint_train': False, 'independent': False, 'reset_weight': True, 'save_results': True, 'num_stage': 4, 'flip1_cfg': [False, False, False, False], 'flip3_cfg': [False, False, False, False], 'stage_len_dict': {'step1': 700, 'step2': 700, 'step3': 600}, 'stage_len_dict2': {'step1': 200, 'step2': 500, 'step3': 400}, 'image_size': 128, 'load_gt_depth': False, 'img_list_path': 'data/car/list.txt', 'img_root': 'data/car', 'latent_root': 'data/car/latents', 'model_name': 'gan2shape_car', 'category': 'car', 'share_weight': True, 'relative_enc': False, 'use_mask': True, 'add_mean_L': True, 'add_mean_V': True, 'min_depth': 0.9, 'max_depth': 1.1, 'xyz_rotation_range': 60, 'xy_translation_range': 0.1, 'z_translation_range': 0, 'collect_iters': 100, 'batchsize': 8, 'lr': 0.0001, 'lam_perc': 0.5, 'lam_smooth': 0.01, 'lam_regular': 0.01, 'view_mvn_path': 'checkpoints/view_light/view_mvn.pth', 'light_mvn_path': 'checkpoints/view_light/light_mvn.pth', 'rand_light': [-1, 1, -0.2, 0.8, -0.1, 0.6, -0.6], 'channel_multiplier': 2, 'gan_size': 512, 'gan_ckpt': 'checkpoints/stylegan2/stylegan2-car-config-f.pt', 'F1_d': 2, 'rot_center_depth': 1.0, 'fov': 10, 'tex_cube_size': 2, 'config': 'configs/car.yml', 'seed': 0, 'num_workers': 4, 'distributed': True}

{'checkpoint_dir': 'results/car', 'save_checkpoint_freq': 500, 'keep_num_checkpoint': 2, 'use_logger': True, 'log_freq': 100, 'joint_train': False, 'independent': False, 'reset_weight': True, 'save_results': True, 'num_stage': 4, 'flip1_cfg': [False, False, False, False], 'flip3_cfg': [False, False, False, False], 'stage_len_dict': {'step1': 700, 'step2': 700, 'step3': 600}, 'stage_len_dict2': {'step1': 200, 'step2': 500, 'step3': 400}, 'image_size': 128, 'load_gt_depth': False, 'img_list_path': 'data/car/list.txt', 'img_root': 'data/car', 'latent_root': 'data/car/latents', 'model_name': 'gan2shape_car', 'category': 'car', 'share_weight': True, 'relative_enc': False, 'use_mask': True, 'add_mean_L': True, 'add_mean_V': True, 'min_depth': 0.9, 'max_depth': 1.1, 'xyz_rotation_range': 60, 'xy_translation_range': 0.1, 'z_translation_range': 0, 'collect_iters': 100, 'batchsize': 8, 'lr': 0.0001, 'lam_perc': 0.5, 'lam_smooth': 0.01, 'lam_regular': 0.01, 'view_mvn_path': 'checkpoints/view_light/view_mvn.pth', 'light_mvn_path': 'checkpoints/view_light/light_mvn.pth', 'rand_light': [-1, 1, -0.2, 0.8, -0.1, 0.6, -0.6], 'channel_multiplier': 2, 'gan_size': 512, 'gan_ckpt': 'checkpoints/stylegan2/stylegan2-car-config-f.pt', 'F1_d': 2, 'rot_center_depth': 1.0, 'fov': 10, 'tex_cube_size': 2, 'config': 'configs/car.yml', 'seed': 0, 'num_workers': 4, 'distributed': True}
{'checkpoint_dir': 'results/car', 'save_checkpoint_freq': 500, 'keep_num_checkpoint': 2, 'use_logger': True, 'log_freq': 100, 'joint_train': False, 'independent': False, 'reset_weight': True, 'save_results': True, 'num_stage': 4, 'flip1_cfg': [False, False, False, False], 'flip3_cfg': [False, False, False, False], 'stage_len_dict': {'step1': 700, 'step2': 700, 'step3': 600}, 'stage_len_dict2': {'step1': 200, 'step2': 500, 'step3': 400}, 'image_size': 128, 'load_gt_depth': False, 'img_list_path': 'data/car/list.txt', 'img_root': 'data/car', 'latent_root': 'data/car/latents', 'model_name': 'gan2shape_car', 'category': 'car', 'share_weight': True, 'relative_enc': False, 'use_mask': True, 'add_mean_L': True, 'add_mean_V': True, 'min_depth': 0.9, 'max_depth': 1.1, 'xyz_rotation_range': 60, 'xy_translation_range': 0.1, 'z_translation_range': 0, 'collect_iters': 100, 'batchsize': 8, 'lr': 0.0001, 'lam_perc': 0.5, 'lam_smooth': 0.01, 'lam_regular': 0.01, 'view_mvn_path': 'checkpoints/view_light/view_mvn.pth', 'light_mvn_path': 'checkpoints/view_light/light_mvn.pth', 'rand_light': [-1, 1, -0.2, 0.8, -0.1, 0.6, -0.6], 'channel_multiplier': 2, 'gan_size': 512, 'gan_ckpt': 'checkpoints/stylegan2/stylegan2-car-config-f.pt', 'F1_d': 2, 'rot_center_depth': 1.0, 'fov': 10, 'tex_cube_size': 2, 'config': 'configs/car.yml', 'seed': 0, 'num_workers': 4, 'distributed': True}
Setting up Perceptual loss...
Setting up Perceptual loss...
Setting up Perceptual loss...
Setting up Perceptual loss...
Loading model from: /home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/weights/v0.1/vgg.pth
Traceback (most recent call last):
  File "run.py", line 31, in <module>
    trainer = Trainer(cfgs, GAN2Shape)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/trainer.py", line 23, in __init__
    self.model = model(cfgs)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/model.py", line 89, in __init__
    model='net-lin', net='vgg', use_gpu=True, gpu_ids=[torch.device(self.rank)]
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/__init__.py", line 22, in __init__
    self.model.initialize(model=model, net=net, use_gpu=use_gpu, colorspace=colorspace, spatial=self.spatial, gpu_ids=gpu_ids)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/dist_model.py", line 75, in initialize
    self.net.load_state_dict(torch.load(model_path, **kw), strict=False)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 702, in _legacy_load
Loading model from: /home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/weights/v0.1/vgg.pth
    result = unpickler.load()
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 665, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 740, in restore_location
Traceback (most recent call last):
  File "run.py", line 31, in <module>
    trainer = Trainer(cfgs, GAN2Shape)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/trainer.py", line 23, in __init__
    self.model = model(cfgs)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/model.py", line 89, in __init__
    return default_restore_location(storage, str(map_location))
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 156, in default_restore_location
    model='net-lin', net='vgg', use_gpu=True, gpu_ids=[torch.device(self.rank)]
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/__init__.py", line 22, in __init__
    result = fn(storage, location)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 132, in _cuda_deserialize
    self.model.initialize(model=model, net=net, use_gpu=use_gpu, colorspace=colorspace, spatial=self.spatial, gpu_ids=gpu_ids)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/dist_model.py", line 75, in initialize
    device = validate_cuda_device(location)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 126, in validate_cuda_device
    self.net.load_state_dict(torch.load(model_path, **kw), strict=False)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
    device, torch.cuda.device_count()))
RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device.
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 702, in _legacy_load
    result = unpickler.load()
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 665, in persistent_load
Loading model from: /home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/weights/v0.1/vgg.pth
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 740, in restore_location
    return default_restore_location(storage, str(map_location))
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 156, in default_restore_location
Loading model from: /home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/weights/v0.1/vgg.pth
    result = fn(storage, location)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 132, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 126, in validate_cuda_device
    device, torch.cuda.device_count()))
RuntimeError: Attempting to deserialize object on CUDA device 3 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device.
Traceback (most recent call last):
  File "run.py", line 31, in <module>
    trainer = Trainer(cfgs, GAN2Shape)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/trainer.py", line 23, in __init__
    self.model = model(cfgs)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/model.py", line 89, in __init__
    model='net-lin', net='vgg', use_gpu=True, gpu_ids=[torch.device(self.rank)]
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/__init__.py", line 22, in __init__
    self.model.initialize(model=model, net=net, use_gpu=use_gpu, colorspace=colorspace, spatial=self.spatial, gpu_ids=gpu_ids)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/dist_model.py", line 75, in initialize
    self.net.load_state_dict(torch.load(model_path, **kw), strict=False)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 702, in _legacy_load
    result = unpickler.load()
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 665, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 740, in restore_location
    return default_restore_location(storage, str(map_location))
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 156, in default_restore_location
    result = fn(storage, location)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 132, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 126, in validate_cuda_device
    device, torch.cuda.device_count()))
RuntimeError: Attempting to deserialize object on CUDA device 2 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device.
...[net-lin [vgg]] initialized
...Done

Please, I need help. Thanks.

XingangPan commented 3 years ago

@leonel-os It seems that your machine has only 1 GPU, while our scripts require at least 4 GPUs. To run on one GPU, you need to revise the run_car.sh script accordingly: change CUDA_VISIBLE_DEVICES=0,1,2,3 to CUDA_VISIBLE_DEVICES=0 and disable distributed training. Note that you may get sub-optimal quality on only one GPU, so I suggest running on more GPUs if possible.
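
To illustrate why ranks 1, 2 and 3 fail in the log above, here is a minimal sketch of a pre-launch check. It assumes that each process uses its rank as its CUDA device index, which is what `gpu_ids=[torch.device(self.rank)]` in the traceback suggests. The function name `ranks_without_device` is hypothetical and not part of GAN2Shape or PyTorch.

```python
def ranks_without_device(nproc_per_node, device_count):
    """Return the ranks whose CUDA index would not exist on this machine.

    Assumption: rank r is mapped to CUDA device r, so any rank greater than
    or equal to the number of available devices has no device to load onto.
    """
    return [rank for rank in range(nproc_per_node) if rank >= device_count]


if __name__ == "__main__":
    # Four processes launched on a machine where torch.cuda.device_count() is 1.
    print(ranks_without_device(4, 1))  # prints [1, 2, 3]
    # One process on the same machine: every rank has a device.
    print(ranks_without_device(1, 1))  # prints []
```

The non-empty result for four processes matches the three RuntimeErrors reported for CUDA devices 1, 2 and 3, and the empty result corresponds to launching a single process with CUDA_VISIBLE_DEVICES=0.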

leonel-os commented 3 years ago

@XingangPan thanks, changing the run_car.sh configuration fixed the error.

EXP=car
CONFIG=car
GPUS=1
PORT=${PORT:-29577}

mkdir -p results/${EXP}
CUDA_VISIBLE_DEVICES=0 \
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    run.py \
    --launcher pytorch \
    --config configs/${CONFIG}.yml \
    2>&1 | tee results/${EXP}/log.txt

but now I'm getting the following error, related to CUDA out of memory:

sh scripts/run_car.sh
Load config from yml file: configs/car.yml
Loading configs from configs/car.yml
{'checkpoint_dir': 'results/car', 'save_checkpoint_freq': 500, 'keep_num_checkpoint': 2, 'use_logger': True, 'log_freq': 100, 'joint_train': False, 'independent': False, 'reset_weight': True, 'save_results': True, 'num_stage': 4, 'flip1_cfg': [False, False, False, False], 'flip3_cfg': [False, False, False, False], 'stage_len_dict': {'step1': 700, 'step2': 700, 'step3': 600}, 'stage_len_dict2': {'step1': 200, 'step2': 500, 'step3': 400}, 'image_size': 128, 'load_gt_depth': False, 'img_list_path': 'data/car/list.txt', 'img_root': 'data/car', 'latent_root': 'data/car/latents', 'model_name': 'gan2shape_car', 'category': 'car', 'share_weight': False, 'relative_enc': False, 'use_mask': True, 'add_mean_L': True, 'add_mean_V': True, 'min_depth': 0.9, 'max_depth': 1.1, 'xyz_rotation_range': 60, 'xy_translation_range': 0.1, 'z_translation_range': 0, 'collect_iters': 100, 'batchsize': 8, 'lr': 0.0001, 'lam_perc': 0.5, 'lam_smooth': 0.01, 'lam_regular': 0.01, 'view_mvn_path': 'checkpoints/view_light/view_mvn.pth', 'light_mvn_path': 'checkpoints/view_light/light_mvn.pth', 'rand_light': [-1, 1, -0.2, 0.8, -0.1, 0.6, -0.6], 'channel_multiplier': 2, 'gan_size': 512, 'gan_ckpt': 'checkpoints/stylegan2/stylegan2-car-config-f.pt', 'F1_d': 2, 'rot_center_depth': 1.0, 'fov': 10, 'tex_cube_size': 2, 'config': 'configs/car.yml', 'seed': 0, 'num_workers': 2, 'distributed': True}
Setting up Perceptual loss...
Loading model from: /home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/weights/v0.1/vgg.pth
...[net-lin [vgg]] initialized
...Done
Loading images...
Traceback (most recent call last):
  File "run.py", line 34, in <module>
    trainer.train()
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/trainer.py", line 158, in train
    self.setup_data(epoch)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/trainer.py", line 78, in setup_data
    self.latent_list[epoch])
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/model.py", line 149, in setup_target
    self.load_latent()
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/model.py", line 248, in load_latent
    self.latent_w, self.gan_im = get_w_img(self.w_path)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/model.py", line 227, in get_w_img
    truncation=self.truncation, randomize_noise=False)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/model.py", line 595, in forward
    out = conv2(out, latent[:, i + 1], noise=noise2)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/model.py", line 350, in forward
    out = self.conv(input, style)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/model.py", line 287, in forward
    out = F.conv2d(input, weight, padding=self.padding, groups=batch)
RuntimeError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 1.95 GiB total capacity; 901.72 MiB already allocated; 99.88 MiB free; 928.00 MiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/darkayserleo/anaconda3/envs/unsup3d/bin/python', '-u', 'run.py', '--local_rank=0', '--launcher', 'pytorch', '--config', 'configs/car.yml']' returned non-zero exit status 1.

I don't know how to fix it. I read in some forums that I need to change the batch size and/or the number of workers. Is that right? What else do I need to change to run this demo?

Thanks in advance

Leonel
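
When comparing runs (for example, before and after changing batchsize), it may help to pull the memory figures out of the out-of-memory message rather than eyeballing them. The sketch below parses the message format seen in the log above. The helper name `parse_oom` is hypothetical, and the pattern is an assumption based only on this one message, so other PyTorch versions may word it differently.

```python
import re

# Pattern derived from the single OOM message in the log above (an assumption,
# not an official PyTorch format specification).
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+) (?P<requested_unit>[KMG]iB) "
    r"\(GPU (?P<gpu>\d+); (?P<total>[\d.]+) (?P<total_unit>[KMG]iB) total capacity; "
    r"(?P<allocated>[\d.]+) (?P<allocated_unit>[KMG]iB) already allocated; "
    r"(?P<free>[\d.]+) (?P<free_unit>[KMG]iB) free; "
    r"(?P<reserved>[\d.]+) (?P<reserved_unit>[KMG]iB) reserved in total by PyTorch\)"
)

# Conversion factors from each unit to MiB.
_TO_MIB = {"KiB": 1.0 / 1024.0, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the memory figures of an OOM message in MiB, or None if no match."""
    match = OOM_PATTERN.search(message)
    if match is None:
        return None
    fields = {"gpu": int(match.group("gpu"))}
    for name in ("requested", "total", "allocated", "free", "reserved"):
        value = float(match.group(name))
        fields[name] = value * _TO_MIB[match.group(name + "_unit")]
    return fields


if __name__ == "__main__":
    line = (
        "RuntimeError: CUDA out of memory. Tried to allocate 72.00 MiB "
        "(GPU 0; 1.95 GiB total capacity; 901.72 MiB already allocated; "
        "99.88 MiB free; 928.00 MiB reserved in total by PyTorch)"
    )
    print(parse_oom(line))
```

For the message above, the parsed figures show a card with under 2 GiB in total. The traceback also places the failure inside the StyleGAN2 generator forward pass called from load_latent, so whether batchsize or num_workers affects that step is not established in this thread.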

hito-Chen commented 3 years ago

Hello, I have encountered the same problem, and changing the batch size or the number of workers does not help. Have you solved it yet? Looking forward to your reply. Thanks!

dellshan commented 1 year ago

(base) [yshan@saturn12 GAN2Shape]$ sh scripts/run_car.sh
/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
/data/yshan/anaconda3/lib/python3.9/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
StyleGAN2: Optimized CUDA op FusedLeakyReLU not available, using native PyTorch fallback.
StyleGAN2: Optimized CUDA op UpFirDn2d not available, using native PyTorch fallback.
Load config from yml file: configs/car.yml
Loading configs from configs/car.yml
{'checkpoint_dir': 'results/car', 'save_checkpoint_freq': 500, 'keep_num_checkpoint': 2, 'use_logger': True, 'log_freq': 100, 'joint_train': False, 'independent': False, 'reset_weight': True, 'save_results': True, 'num_stage': 4, 'flip1_cfg': [False, False, False, False], 'flip3_cfg': [False, False, False, False], 'stage_len_dict': {'step1': 700, 'step2': 700, 'step3': 600}, 'stage_len_dict2': {'step1': 200, 'step2': 500, 'step3': 400}, 'image_size': 128, 'load_gt_depth': False, 'img_list_path': 'data/car/list.txt', 'img_root': 'data/car', 'latent_root': 'data/car/latents', 'model_name': 'gan2shape_car', 'category': 'car', 'share_weight': True, 'relative_enc': False, 'use_mask': True, 'add_mean_L': True, 'add_mean_V': True, 'min_depth': 0.9, 'max_depth': 1.1, 'xyz_rotation_range': 60, 'xy_translation_range': 0.1, 'z_translation_range': 0, 'collect_iters': 100, 'batchsize': 8, 'lr': 0.0001, 'lam_perc': 0.5, 'lam_smooth': 0.01, 'lam_regular': 0.01, 'view_mvn_path': 'checkpoints/view_light/view_mvn.pth', 'light_mvn_path': 'checkpoints/view_light/light_mvn.pth', 'rand_light': [-1, 1, -0.2, 0.8, -0.1, 0.6, -0.6], 'channel_multiplier': 2, 'gan_size': 512, 'gan_ckpt': 'checkpoints/stylegan2/stylegan2-car-config-f.pt', 'F1_d': 2, 'rot_center_depth': 1.0, 'fov': 10, 'tex_cube_size': 2, 'config': 'configs/car.yml', 'seed': 0, 'num_workers': 4, 'distributed': True}
Setting up Perceptual loss...
/data/yshan/anaconda3/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/data/yshan/anaconda3/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=VGG16_Weights.IMAGENET1K_V1. You can also use weights=VGG16_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
Loading model from: /data/yshan/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/weights/v0.1/vgg.pth
...[net-lin [vgg]] initialized
...Done
Traceback (most recent call last):
  File "/data/yshan/GAN2Shape/run.py", line 31, in <module>
    trainer = Trainer(cfgs, GAN2Shape)
  File "/data/yshan/GAN2Shape/gan2shape/trainer.py", line 23, in __init__
    self.model = model(cfgs)
  File "/data/yshan/GAN2Shape/gan2shape/model.py", line 92, in __init__
    self.renderer = Renderer(cfgs, self.image_size)
  File "/data/yshan/GAN2Shape/gan2shape/renderer/renderer.py", line 44, in __init__
    self.inv_K_origin = torch.inverse(K).unsqueeze(0)
RuntimeError: Error in dlopen: libtorch_cuda_linalg.so: cannot open shared object file: No such file or directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 209663) of binary: /data/yshan/anaconda3/bin/python
Traceback (most recent call last):
  File "/data/yshan/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/yshan/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run.py FAILED
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2023-03-10_00:31:49
  host : saturn12.ihpc.uts.edu.au
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 209663)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

dellshan commented 1 year ago

Hi, I got this error when using your config:

EXP=car
CONFIG=car
GPUS=1
PORT=${PORT:-29577}

mkdir -p results/${EXP}
CUDA_VISIBLE_DEVICES=0 \
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    run.py \
    --launcher pytorch \
    --config configs/${CONFIG}.yml \
    2>&1 | tee results/${EXP}/log.txt
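
The FutureWarning in the log above notes that torchrun passes the local rank through the LOCAL_RANK environment variable rather than a --local_rank argument. As a rough sketch of a script accepting both, assuming it only needs an integer rank: the helper name `resolve_local_rank` is hypothetical and is not taken from run.py.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank from --local_rank if given, else from LOCAL_RANK.

    argv and environ are injectable for testing; by default argparse reads
    sys.argv[1:] and the environment is os.environ.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=None)
    # parse_known_args ignores other options such as --launcher or --config.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    # Fall back to 0 when neither source is set (single-process run).
    return int(environ.get("LOCAL_RANK", 0))


if __name__ == "__main__":
    # Old launcher style: the rank arrives as a command-line argument.
    print(resolve_local_rank(["--local_rank=0", "--launcher", "pytorch"], {}))
    # torchrun style: the rank arrives through the environment.
    print(resolve_local_rank(["--launcher", "pytorch"], {"LOCAL_RANK": "0"}))
```

This only addresses the deprecation warning. The RuntimeError about libtorch_cuda_linalg.so in the same log is a separate failure that this sketch does not touch.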