changlin31 / DNA

(CVPR 2020) Block-wisely Supervised Neural Architecture Search with Knowledge Distillation

Errors in both single-GPU and multi-GPU searching #19

Closed: lcmeng closed this issue 3 years ago

lcmeng commented 3 years ago

Hi,

I followed the steps in the README, but ran into errors during the search on both single-GPU and multi-GPU machines.

Have you encountered these issues before or have any idea how to fix them? TIA.

```
12/28 07:27:02 AM WORLD_SIZE in os.environ is 1
12/28 07:27:02 AM Namespace(amp=False, batch_size=64, color_jitter=0.4, cooldown_epochs=0, data_config=None, datadir='/home/ubuntu/workspace/datasets/ILSVRC2012/', dataset='imagenet', decay_epochs=1, decay_rate=0.9, distill_last_stage=True, distributed=False, eval_intervals=2, eval_metric='prec1', eval_mode=False, exp_dir='', feature_train=True, guide_input=True, guide_loss_fn='mse', hyperparam_config=None, img_size=224, index='', init_classifier=False, interpolation='', label_train=False, local_rank=0, log_interval=50, loss_weight=[0.5, 0.5], lr=[0.002, 0.005, 0.005, 0.005, 0.005, 0.002], mean=None, min_lr=1e-08, mixup=0.0, mixup_off_epoch=0, model_ema=False, model_ema_decay=0.9998, model_ema_force_cpu=False, model_pool='', momentum=0.9, num_classes=1000, num_gpu=1, opt='adam', opt_eps=1e-08, output='', potential_eval_times=20, prefetcher=True, pretrain=False, print_detail=True, recovery_interval=0, remode='pixel', reprob=0.5, reset_after_stage=False, reset_bn_eval=True, resume='', reverse_train=False, save_images=False, save_last_feature=True, sched='step', seed=42, separate_train=False, smoothing=0.1, stage_num=6, start_epoch=None, start_stage=None, std=None, step_epochs=20, sync_bn=False, test_dispatch='', top_model_num=3, train_mode=False, update_frequency=1, warmup_epochs=0, warmup_lr=0.001, weight_decay=0.0001, workers=4)
12/28 07:27:02 AM Training with a single process on 1 GPUs.
12/28 07:27:04 AM Data processing configuration for current model + dataset:
12/28 07:27:04 AM     input_size: (3, 224, 224)
12/28 07:27:04 AM     interpolation: bicubic
12/28 07:27:04 AM     mean: (0.485, 0.456, 0.406)
12/28 07:27:04 AM     std: (0.229, 0.224, 0.225)
12/28 07:27:04 AM     crop_pct: 0.875
12/28 07:27:06 AM NVIDIA APEX installed. AMP off.
12/28 07:27:32 AM Train: stage 0, epoch 1, step [ 0/20018]  Loss: 109.597771 (109.5978)  Time: 2.011s, 31.82/s  LR: 1.800e-03  Data & Guide Time: 1.644  GuideMean: -0.64644  GuideStd: 10.40032  OutMean: 0.00000 (0.00000)  OutStd: 0.99985 (0.99985)  Dist_Mean: 0.64644 (0.64644)  GRLoss: 1.00459 (1.00459)  CLLoss: 0.79709 (0.79709)  KLCosLoss: 0.57991 (0.57991)  FeatureLoss: 0.00000 (0.00000)  Top1Acc: 0.00000(0.00000)  Relative MSE loss: 1.01323(1.01323)
```

```
.....
12/29 06:58:47 AM Random Test: stage 0, epoch 20  Loss: 20.4754  Prec@1: 0.0000  Time: 0.216s, 74.05/s
12/29 06:58:48 AM Current checkpoints:
 ('./output/test/adam-step-ep20-lr0.002-bs64-20201228-072702/checkpoint-0-6.pth.tar', 19.889211503295897)
 ('./output/test/adam-step-ep20-lr0.002-bs64-20201228-072702/checkpoint-0-14.pth.tar', 19.960276111450195)
 ('./output/test/adam-step-ep20-lr0.002-bs64-20201228-072702/checkpoint-0-4.pth.tar', 19.97588088684082)
 ('./output/test/adam-step-ep20-lr0.002-bs64-20201228-072702/checkpoint-0-16.pth.tar', 20.030977337646483)
 ('./output/test/adam-step-ep20-lr0.002-bs64-20201228-072702/checkpoint-0-8.pth.tar', 20.106792897033692)
 ('./output/test/adam-step-ep20-lr0.002-bs64-20201228-072702/checkpoint-0-10.pth.tar', 20.107453624572752)
 ('./output/test/adam-step-ep20-lr0.002-bs64-20201228-072702/checkpoint-0-12.pth.tar', 20.242049604492188)
 ('./output/test/adam-step-ep20-lr0.002-bs64-20201228-072702/checkpoint-0-18.pth.tar', 20.277006747436523)
 ('./output/test/adam-step-ep20-lr0.002-bs64-20201228-072702/checkpoint-0-2.pth.tar', 20.39269996520996)
 ('./output/test/adam-step-ep20-lr0.002-bs64-20201228-072702/checkpoint-0-20.pth.tar', 20.47537907836914)
```

```
Traceback (most recent call last):
  File "train.py", line 273, in <module>
    main()
  File "train.py", line 268, in main
    writer=writer)
  File "/home/ubuntu/workspace/repos/DNA/searching/dna/distill_train.py", line 100, in distill_train
    reset_data=reset_data)
  File "/home/ubuntu/workspace/repos/DNA/searching/dna/distill_train.py", line 695, in _potential
    for layer in supernet.module.modules():
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 576, in __getattr__
    type(self).__name__, name))
AttributeError: 'StudentSuperNet' object has no attribute 'module'
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/pytorch_p36/bin/python', '-u', 'train.py', '--local_rank=0']' returned non-zero exit status 1.
```
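For context, this `AttributeError` arises because `.module` only exists once the supernet has been wrapped in `DataParallel`/`DistributedDataParallel`; a bare single-GPU `StudentSuperNet` has no such attribute. Below is a minimal sketch of the usual guard for this situation (the helper name is hypothetical, and this is not necessarily the fix the author later applied):

```python
import torch.nn as nn

def unwrap_model(model: nn.Module) -> nn.Module:
    """Return the bare model whether or not it has been wrapped in
    (Distributed)DataParallel, which exposes the original network as `.module`."""
    return model.module if hasattr(model, "module") else model

# The iteration that crashed above could then be written as:
#     for layer in unwrap_model(supernet).modules():
#         ...
```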


- multi-GPU: launched with `--nproc_per_node=4`, but it resulted in segfaults:

```
12/30 05:01:12 AM WORLD_SIZE in os.environ is 4
12/30 05:01:12 AM Namespace(amp=False, batch_size=64, color_jitter=0.4, cooldown_epochs=0, data_config=None, datadir='/home/ubuntu/workspace/datasets/ILSVRC2012/', dataset='imagenet', decay_epochs=1, decay_rate=0.9, distill_last_stage=True, distributed=False, eval_intervals=2, eval_metric='prec1', eval_mode=False, exp_dir='', feature_train=True, guide_input=True, guide_loss_fn='mse', hyperparam_config=None, img_size=224, index='', init_classifier=False, interpolation='', label_train=False, local_rank=0, log_interval=50, loss_weight=[0.5, 0.5], lr=[0.002, 0.005, 0.005, 0.005, 0.005, 0.002], mean=None, min_lr=1e-08, mixup=0.0, mixup_off_epoch=0, model_ema=False, model_ema_decay=0.9998, model_ema_force_cpu=False, model_pool='', momentum=0.9, num_classes=1000, num_gpu=1, opt='adam', opt_eps=1e-08, output='', potential_eval_times=20, prefetcher=True, pretrain=False, print_detail=True, recovery_interval=0, remode='pixel', reprob=0.5, reset_after_stage=False, reset_bn_eval=True, resume='', reverse_train=False, save_images=False, save_last_feature=True, sched='step', seed=42, separate_train=False, smoothing=0.1, stage_num=6, start_epoch=None, start_stage=None, std=None, step_epochs=20, sync_bn=False, test_dispatch='', top_model_num=3, train_mode=False, update_frequency=1, warmup_epochs=0, warmup_lr=0.001, weight_decay=0.0001, workers=4)
12/30 05:01:12 AM Training in distributed mode with multiple processes, 1 GPU per process. CUDA 2, Process 2, total 4.
12/30 05:01:12 AM Training in distributed mode with multiple processes, 1 GPU per process. CUDA 3, Process 3, total 4.
12/30 05:01:13 AM Training in distributed mode with multiple processes, 1 GPU per process. CUDA 1, Process 1, total 4.
12/30 05:01:13 AM Training in distributed mode with multiple processes, 1 GPU per process. CUDA 0, Process 0, total 4.
12/30 05:01:15 AM Data processing configuration for current model + dataset:
12/30 05:01:15 AM     input_size: (3, 224, 224)
12/30 05:01:15 AM     interpolation: bicubic
12/30 05:01:15 AM     mean: (0.485, 0.456, 0.406)
12/30 05:01:15 AM     std: (0.229, 0.224, 0.225)
12/30 05:01:15 AM     crop_pct: 0.875
12/30 05:01:18 AM NVIDIA APEX installed. AMP off.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
... (the same segmentation-fault line repeats many times; truncated here) ...
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    process.wait()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/subprocess.py", line 1477, in wait
    (pid, sts) = self._try_wait(0)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/subprocess.py", line 1424, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
```

changlin31 commented 3 years ago

Hi, @lcmeng

I made a minor update for single-GPU compatibility, which should solve the first issue. You can resume from the 20th epoch; evaluation and search will continue automatically.
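Concretely, resuming means pointing the pipeline at one of the checkpoints listed in the log above (e.g. `checkpoint-0-20.pth.tar`). The sketch below shows what loading such a checkpoint typically involves in PyTorch; the key names follow the common timm-style convention and are assumptions, not this repo's exact resume logic:

```python
import torch
import torch.nn as nn

def load_checkpoint(model: nn.Module, optimizer: torch.optim.Optimizer, ckpt_path: str) -> int:
    """Load model/optimizer state and return the epoch to resume from.
    The keys 'state_dict', 'optimizer' and 'epoch' are assumed, not verified
    against DNA's checkpoint format."""
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(checkpoint["state_dict"])
    if optimizer is not None and "optimizer" in checkpoint:
        optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint.get("epoch", 0) + 1

# e.g. resume from the 20th-epoch checkpoint printed in the log above:
# start_epoch = load_checkpoint(
#     supernet, optimizer,
#     "./output/test/adam-step-ep20-lr0.002-bs64-20201228-072702/checkpoint-0-20.pth.tar")
```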

The second one looks like an environment issue. Try setting `workers` to 0:
https://github.com/changlin31/DNA/blob/dea09de6dc03e3ff11d9cec162fe5e83b13898b6/searching/initialize/train_pipeline.yaml#L28
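For background, the `workers` setting normally ends up as the `num_workers` argument of PyTorch's `DataLoader` (assumed here, as in most timm-based pipelines); with 0, data loading stays in the main process, which sidesteps the worker subprocesses that are segfaulting in the log above. A generic illustration, not the repo's loader code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy ImageNet-shaped data just to illustrate the flag.
dataset = TensorDataset(torch.randn(256, 3, 224, 224),
                        torch.randint(0, 1000, (256,)))

# num_workers=4 spawns four loader subprocesses per DataLoader; a broken
# environment (shared-memory limits, conflicting OpenCV/MKL builds, ...)
# can make those subprocesses segfault, as seen above.
loader_mp = DataLoader(dataset, batch_size=64, num_workers=4)

# num_workers=0 keeps all data loading in the main process: slower, but a
# quick way to confirm whether the crash comes from the worker processes.
loader_main = DataLoader(dataset, batch_size=64, num_workers=0)
```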

lcmeng commented 3 years ago

Thanks for the update. I'm able to resume the single-GPU search at the 20th epoch by modifying the `resume` field in `train_pipeline.yaml`. However, the multi-GPU search is still stuck right after `NVIDIA APEX installed. AMP off.`:

```
01/01 06:50:55 AM NVIDIA APEX installed. AMP off.
^CTraceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    process.wait()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/subprocess.py", line 1477, in wait
    (pid, sts) = self._try_wait(0)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/subprocess.py", line 1424, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
```
changlin31 commented 3 years ago

Environment issues related to the APEX installation can sometimes cause a deadlock. You could try uninstalling APEX and rerunning; the training will automatically fall back to PyTorch DDP.
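For reference, the usual shape of such a fallback, sketched under the assumption that the training script follows the common timm/APEX pattern (not a verbatim copy of this repo's code):

```python
import torch.nn as nn

try:
    from apex.parallel import DistributedDataParallel as ApexDDP
    has_apex = True
except ImportError:
    has_apex = False

def wrap_distributed(model: nn.Module, local_rank: int) -> nn.Module:
    """Wrap a model for distributed training, preferring APEX DDP when it is
    installed and falling back to native PyTorch DDP otherwise."""
    if has_apex:
        # APEX's DDP picks up the current CUDA device on its own.
        return ApexDDP(model, delay_allreduce=True)
    return nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```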