Closed zhenghan408 closed 4 years ago
What is your command? Please provide the full log.
Traceback (most recent call last):
File "/home/lc/anaconda3/envs/deo/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/lc/anaconda3/envs/deo/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/lc/anaconda3/envs/deo/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in
thanks a lot for your reply
Sorry I still cannot locate the issue from this limited information. Could you please provide your script, the environment, and a full log?
Sorry, the above problem has been solved. But when I run the second step: sh experiments/COCOA/pcnet_c/train.sh
I encountered another problem, as follows:
/media/lc/软件/zh/deocclusion-master/experiments/COCOA/pcnet_c/config.yaml
main.py:15: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(f)
=> loading checkpoint '/media/lc/软件/zh/deocclusion-master/pretrains/partialconv_input_ch4.pth'
Traceback (most recent call last):
File "main.py", line 49, in <module>
main(args)
File "main.py", line 30, in main
trainer = Trainer(args)
File "/media/lc/软件/zh/deocclusion-master/trainer.py", line 61, in __init__
args.model, load_pretrain=args.load_pretrain, dist_model=True)
File "/media/lc/软件/zh/deocclusion-master/models/partial_completion_content_cgan.py", line 53, in __init__
self.criterion = InpaintingLoss(backbone.VGG16FeatureExtractor()).cuda()
File "/media/lc/软件/zh/deocclusion-master/models/backbone/pconv_unet.py", line 36, in __init__
vgg16 = models.vgg16(pretrained=True)
File "/home/lc/anaconda3/envs/deo/lib/python3.7/site-packages/torchvision/models/vgg.py", line 144, in vgg16
return _vgg('vgg16', 'D', False, pretrained, progress, **kwargs)
File "/home/lc/anaconda3/envs/deo/lib/python3.7/site-packages/torchvision/models/vgg.py", line 92, in _vgg
progress=progress)
File "/home/lc/anaconda3/envs/deo/lib/python3.7/site-packages/torch/hub.py", line 434, in load_state_dict_from_url
return torch.load(cached_file, map_location=map_location)
File "/home/lc/anaconda3/envs/deo/lib/python3.7/site-packages/torch/serialization.py", line 387, in load
return _load(f, map_location, pickle_module, **pickle_load_args)
File "/home/lc/anaconda3/envs/deo/lib/python3.7/site-packages/torch/serialization.py", line 564, in _load
magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
Traceback (most recent call last):
File "/home/lc/anaconda3/envs/deo/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/lc/anaconda3/envs/deo/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/lc/anaconda3/envs/deo/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
main()
File "/home/lc/anaconda3/envs/deo/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/home/lc/anaconda3/envs/deo/bin/python', '-u', 'main.py', '--local_rank=0', '--config', '/media/lc/软件/zh/deocclusion-master/experiments/COCOA/pcnet_c/config.yaml', '--launcher', 'pytorch', '--load-pretrain', '/media/lc/软件/zh/deocclusion-master/pretrains/partialconv_input_ch4.pth']' returned non-zero exit status 1.
thanks!!!
This is probably caused by an incomplete VGG pretrained file. In distributed training, when several processes download the same file over the network at the same time, the file can end up corrupted. First solution: manually download the VGG pretrained file to torch's default checkpoint location, typically ~/.cache/torch.
Second solution: run this training command with 1 GPU first. After the VGG file has been downloaded, re-run it with multiple GPUs.
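Before trying either solution, it may help to clear out the broken download, since a cached file that already exists is usually reused rather than fetched again. `EOFError: Ran out of input` is what pickle raises when it is handed empty input, so a zero-byte file in the cache is a likely culprit. Below is a minimal sketch, using only the standard library, that finds and removes zero-byte files under a cache directory. The helper names `find_empty_files` and `remove_empty_files` are hypothetical, and the sketch will not catch a file that is truncated but non-empty.

```python
import os
from pathlib import Path


def find_empty_files(cache_dir):
    """Return the zero-byte files found anywhere under cache_dir, sorted."""
    cache_dir = Path(cache_dir).expanduser()
    empty = []
    # os.walk yields nothing if the directory does not exist.
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = Path(root) / name
            if path.is_file() and path.stat().st_size == 0:
                empty.append(path)
    return sorted(empty)


def remove_empty_files(cache_dir):
    """Delete zero-byte files so a later run can download them again."""
    removed = find_empty_files(cache_dir)
    for path in removed:
        path.unlink()
    return removed
```

For example, `remove_empty_files("~/.cache/torch")` would clean the location mentioned above. The exact subdirectory that holds the checkpoints depends on the torch version, which is why the sketch walks the whole tree. After that, a single-GPU run should be able to download the VGG weights cleanly.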
OK, I will try it, thanks so much!!!
Thanks so much, I solved the problem!!!!!!
More details: subprocess.CalledProcessError: Command '['/home/lc/anaconda3/envs/deo/bin/python', '-u', 'main.py', '--local_rank=0', '--config', './config.yaml', '--launcher', 'pytorch']' returned non-zero exit status 2.