leonel-os opened this issue 3 years ago (status: Open)
@leonel-os It seems that your machine has only 1 GPU, while our scripts require at least 4 GPUs. You need to revise the run_car.sh script to run on one GPU. Specifically, change CUDA_VISIBLE_DEVICES=0,1,2,3 to CUDA_VISIBLE_DEVICES=0 and disable distributed training. However, you may get sub-optimal quality on only one GPU, so I suggest running on more GPUs if possible.
@XingangPan thanks, changing the run_car.sh configuration fixed the error.
EXP=car
CONFIG=car
GPUS=1
PORT=${PORT:-29577}
mkdir -p results/${EXP}
CUDA_VISIBLE_DEVICES=0 \
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
run.py \
--launcher pytorch \
--config configs/${CONFIG}.yml \
2>&1 | tee results/${EXP}/log.txt
but now I'm getting the following error, related to CUDA out of memory:
sh scripts/run_car.sh
Load config from yml file: configs/car.yml
Loading configs from configs/car.yml
{'checkpoint_dir': 'results/car', 'save_checkpoint_freq': 500, 'keep_num_checkpoint': 2, 'use_logger': True, 'log_freq': 100, 'joint_train': False, 'independent': False, 'reset_weight': True, 'save_results': True, 'num_stage': 4, 'flip1_cfg': [False, False, False, False], 'flip3_cfg': [False, False, False, False], 'stage_len_dict': {'step1': 700, 'step2': 700, 'step3': 600}, 'stage_len_dict2': {'step1': 200, 'step2': 500, 'step3': 400}, 'image_size': 128, 'load_gt_depth': False, 'img_list_path': 'data/car/list.txt', 'img_root': 'data/car', 'latent_root': 'data/car/latents', 'model_name': 'gan2shape_car', 'category': 'car', 'share_weight': False, 'relative_enc': False, 'use_mask': True, 'add_mean_L': True, 'add_mean_V': True, 'min_depth': 0.9, 'max_depth': 1.1, 'xyz_rotation_range': 60, 'xy_translation_range': 0.1, 'z_translation_range': 0, 'collect_iters': 100, 'batchsize': 8, 'lr': 0.0001, 'lam_perc': 0.5, 'lam_smooth': 0.01, 'lam_regular': 0.01, 'view_mvn_path': 'checkpoints/view_light/view_mvn.pth', 'light_mvn_path': 'checkpoints/view_light/light_mvn.pth', 'rand_light': [-1, 1, -0.2, 0.8, -0.1, 0.6, -0.6], 'channel_multiplier': 2, 'gan_size': 512, 'gan_ckpt': 'checkpoints/stylegan2/stylegan2-car-config-f.pt', 'F1_d': 2, 'rot_center_depth': 1.0, 'fov': 10, 'tex_cube_size': 2, 'config': 'configs/car.yml', 'seed': 0, 'num_workers': 2, 'distributed': True}
Setting up Perceptual loss...
Loading model from: /home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/weights/v0.1/vgg.pth
...[net-lin [vgg]] initialized
...Done
Loading images...
Traceback (most recent call last):
File "run.py", line 34, in <module>
trainer.train()
File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/trainer.py", line 158, in train
self.setup_data(epoch)
File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/trainer.py", line 78, in setup_data
self.latent_list[epoch])
File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/model.py", line 149, in setup_target
self.load_latent()
File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/model.py", line 248, in load_latent
self.latent_w, self.gan_im = get_w_img(self.w_path)
File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/model.py", line 227, in get_w_img
truncation=self.truncation, randomize_noise=False)
File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/model.py", line 595, in forward
out = conv2(out, latent[:, i + 1], noise=noise2)
File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/model.py", line 350, in forward
out = self.conv(input, style)
File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/model.py", line 287, in forward
out = F.conv2d(input, weight, padding=self.padding, groups=batch)
RuntimeError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 1.95 GiB total capacity; 901.72 MiB already allocated; 99.88 MiB free; 928.00 MiB reserved in total by PyTorch)
Traceback (most recent call last):
File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/darkayserleo/anaconda3/envs/unsup3d/bin/python', '-u', 'run.py', '--local_rank=0', '--launcher', 'pytorch', '--config', 'configs/car.yml']' returned non-zero exit status 1.
I don't know how to fix it. I read some forums and they say I need to change the batch size and/or the number of workers. Is that right? What else do I need to change to run this demo?
Thanks in advance,
Leonel
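Note on the batch size question: the sizes quoted in the RuntimeError above can be tallied to see how the card's 1.95 GiB is split. The snippet below is only a minimal sketch. `parse_oom` is a hypothetical helper written for this thread (it is not part of GAN2Shape or PyTorch), and it assumes the message wording shown in this log, which may differ in other PyTorch versions.

```python
import re

# The message copied verbatim from the traceback in this thread.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 1.95 GiB total capacity; "
    "901.72 MiB already allocated; 99.88 MiB free; 928.00 MiB reserved in total by PyTorch)"
)

# Conversion factors to MiB.
_UNITS = {"KiB": 1.0 / 1024.0, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Hypothetical helper: return the sizes quoted in the OOM message, in MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)",
        "total": r"([\d.]+) (KiB|MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (KiB|MiB|GiB) already allocated",
        "free": r"([\d.]+) (KiB|MiB|GiB) free",
        "reserved": r"([\d.]+) (KiB|MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * _UNITS[match.group(2)]
    return sizes


sizes = parse_oom(OOM_MESSAGE)
# Memory that is neither free nor reserved by this PyTorch process.
other = sizes["total"] - sizes["reserved"] - sizes["free"]

for name, mib in sizes.items():
    print(f"{name:>9}: {mib:8.2f} MiB")
print(f"{'other':>9}: {other:8.2f} MiB (outside this process's PyTorch allocator)")
```

For this log the remainder comes to about 969 MiB (approximate, since the 1.95 GiB figure is rounded). That would suggest roughly half of the card is held outside PyTorch's allocator, for example by the CUDA context or a desktop session, but this reading is an assumption that the log does not confirm. The traceback also shows the failure inside the StyleGAN2 generator forward pass (get_w_img) right after "Loading images...", so lowering batchsize or num_workers alone may not be enough on a GPU of this size.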
Hello, I have encountered the same problem. Changing the batch size or the number of workers does not help. Have you solved it yet? Looking forward to your reply. Thanks!
(base) [yshan@saturn12 GAN2Shape]$ sh scripts/run_car.sh
/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=VGG16_Weights.IMAGENET1K_V1. You can also use weights=VGG16_Weights.DEFAULT to get the most up-to-date weights.
warnings.warn(msg)
Loading model from: /data/yshan/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/weights/v0.1/vgg.pth
...[net-lin [vgg]] initialized
...Done
Traceback (most recent call last):
File "/data/yshan/GAN2Shape/run.py", line 31, in Failures:
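Regarding the FutureWarning in the log above: it asks scripts to read the local rank from os.environ['LOCAL_RANK'] rather than from a --local_rank argument. One way to support both launchers is sketched below. `resolve_local_rank` is a hypothetical helper, not code from run.py, and how the repository actually parses its arguments is not shown in this thread.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Hypothetical helper: prefer LOCAL_RANK from the environment (set by
    torchrun), otherwise fall back to a --local_rank argument (passed by the
    older torch.distributed.launch), otherwise use 0."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    # parse_known_args ignores the script's other options, such as --config.
    args, _unknown = parser.parse_known_args(argv)
    return int(environ.get("LOCAL_RANK", args.local_rank))


if __name__ == "__main__":
    print("local rank:", resolve_local_rank())
```

With a fallback like this, the same script could in principle be started either through torch.distributed.launch or through torchrun. Whether that resolves the failure in this post cannot be told from the truncated traceback.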
Hi, I got this error when I use your config:
EXP=car
CONFIG=car
GPUS=1
PORT=${PORT:-29577}
mkdir -p results/${EXP}
CUDA_VISIBLE_DEVICES=0 \
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
run.py \
--launcher pytorch \
--config configs/${CONFIG}.yml \
2>&1 | tee results/${EXP}/log.txt
Now I'm getting the following error. I installed all the dependencies including CUDA, but when I run:
I'm getting this:
Please, I need help. Thanks.