GOATmessi8 / ASFF

yolov3 with mobilenet v2 and ASFF
GNU General Public License v3.0
1.05k stars 216 forks

Problems with training #93

Open SmallWhite-CZY opened 4 years ago

SmallWhite-CZY commented 4 years ago

When I train this project, I run into the following problem: training gets stuck after “using cuda” and makes no further progress. All the packages required by the code are installed; none are missing. My PyTorch version is 1.3.1 (Python 3.6), CUDA is 10.1, and the OS is Ubuntu 16.04. The NVIDIA driver is 418.87.01, and the machine has only four GPUs. Therefore, the command I run is:

python -m torch.distributed.launch --nproc_per_node=10 --master_port=${RANDOM+10000} main.py --cfg config/yolov3_baseline.cfg -d COCO --distributed --ngpu 4 --checkpoint weights/YOLOv3-baseline_38.8.pth --start_epoch 0 --half -s 608

Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=2, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=6, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=4, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=5, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=7, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=1, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=0, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp line=37 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "main.py", line 470, in <module>
    main()
  File "main.py", line 98, in main
    torch.cuda.set_device(args.local_rank)
  File "/home/xxx/anaconda3/envs/asff/lib/python3.6/site-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp:37
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp line=37 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "main.py", line 470, in <module>
    main()
  File "main.py", line 98, in main
    torch.cuda.set_device(args.local_rank)
  File "/home/xxx/anaconda3/envs/asff/lib/python3.6/site-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp:37
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp line=37 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "main.py", line 470, in <module>
    main()
  File "main.py", line 98, in main
    torch.cuda.set_device(args.local_rank)
  File "/home/xxx/anaconda3/envs/asff/lib/python3.6/site-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp:37
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=8, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=9, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=3, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp line=37 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "main.py", line 470, in <module>
    main()
  File "main.py", line 98, in main
    torch.cuda.set_device(args.local_rank)
  File "/home/xxx/anaconda3/envs/asff/lib/python3.6/site-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp:37

successfully loaded config file: {'MODEL': {'TYPE': 'YOLOv3', 'BACKBONE': 'darknet53'}, 'TRAIN': {'LR': 0.001, 'MOMe, 'SYBN': True, 'MIX': True, 'NO_MIXUP_EPOCHS': 30, 'LABAL_SMOOTH': True, 'BATCHSIZE': 5, 'IMGSIZE': 608, 'IGNORETH65, 'IMGSIZE': 608}}
loading annotations into memory...
successfully loaded config file: {'MODEL': {'TYPE': 'YOLOv3', 'BACKBONE': 'darknet53'}, 'TRAIN': {'LR': 0.001, 'MOMe, 'SYBN': True, 'MIX': True, 'NO_MIXUP_EPOCHS': 30, 'LABAL_SMOOTH': True, 'BATCHSIZE': 5, 'IMGSIZE': 608, 'IGNORETH65, 'IMGSIZE': 608}}
loading annotations into memory...
successfully loaded config file: {'MODEL': {'TYPE': 'YOLOv3', 'BACKBONE': 'darknet53'}, 'TRAIN': {'LR': 0.001, 'MOMe, 'SYBN': True, 'MIX': True, 'NO_MIXUP_EPOCHS': 30, 'LABAL_SMOOTH': True, 'BATCHSIZE': 5, 'IMGSIZE': 608, 'IGNORETH65, 'IMGSIZE': 608}}
loading annotations into memory...
Done (t=17.82s)
creating index...
Done (t=17.83s)
creating index...
index created!
Training YOLOv3 strong baseline!
index created!
Training YOLOv3 strong baseline!
Done (t=19.36s)
creating index...
index created!
Training YOLOv3 strong baseline!
loading pytorch ckpt... weights/YOLOv3-baseline_38.8.pth
using cuda
loading pytorch ckpt... weights/YOLOv3-baseline_38.8.pth
using cuda
loading pytorch ckpt... weights/YOLOv3-baseline_38.8.pth
using cuda
successfully loaded config file: {'MODEL': {'TYPE': 'YOLOv3', 'BACKBONE': 'darknet53'}, 'TRAIN': {'LR': 0.001, 'MOMe, 'SYBN': True, 'MIX': True, 'NO_MIXUP_EPOCHS': 30, 'LABAL_SMOOTH': True, 'BATCHSIZE': 5, 'IMGSIZE': 608, 'IGNORETH65, 'IMGSIZE': 608}}
loading annotations into memory...
Done (t=17.86s)
creating index...
index created!
Training YOLOv3 strong baseline!
loading pytorch ckpt... weights/YOLOv3-baseline_38.8.pth
using cuda

I don't know exactly where the mistake is. I hope you can give me some advice. Thank you.
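
A note on the numbers in the log: the command starts 10 processes (--nproc_per_node=10), the log shows local_rank values from 0 to 9, and main.py line 98 calls torch.cuda.set_device(args.local_rank) on a machine with four GPUs. One possible reading, not confirmed in this thread, is that ranks 4 to 9 have no GPU with a matching index, which would fit the "invalid device ordinal" errors. Below is a minimal pure-Python sketch of that arithmetic only; split_local_ranks is a hypothetical helper for illustration, not part of ASFF or PyTorch.

```python
def split_local_ranks(nproc_per_node, visible_gpus):
    """Split local ranks 0..nproc_per_node-1 by whether a GPU index matches.

    Assumption: each process selects the GPU whose index equals its
    local rank, as the traceback's set_device(args.local_rank) suggests.
    """
    ranks = range(nproc_per_node)
    with_gpu = [r for r in ranks if r < visible_gpus]
    without_gpu = [r for r in ranks if r >= visible_gpus]
    return with_gpu, without_gpu


if __name__ == "__main__":
    # Values taken from the post: 10 launched processes, 4 GPUs.
    with_gpu, without_gpu = split_local_ranks(nproc_per_node=10, visible_gpus=4)
    print("ranks with a matching GPU:", with_gpu)
    print("ranks without a matching GPU:", without_gpu)
```

If that reading is right, launching with --nproc_per_node equal to the number of GPUs (4 here) would be the first thing to try. This is an assumption, not a verified fix.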