Zzh-tju / Rotated-LD

Rotated Localization Distillation (CVPR 2022, TPAMI 2023)
Apache License 2.0

Loss becomes NaN when training on a single GPU. What is the cause? #5

Open hezheyuan opened 1 year ago

Zzh-tju commented 1 year ago

Please post your training log.

hezheyuan commented 1 year ago

2022-12-15 15:19:25,166 - mmrotate - INFO - Environment info:

sys.platform: linux Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18) [GCC 10.3.0] CUDA available: True GPU 0: NVIDIA GeForce RTX 3090 CUDA_HOME: /usr/local/cuda-11.6 NVCC: Cuda compilation tools, release 11.6, V11.6.55 GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 PyTorch: 1.12.1 PyTorch compiling details: PyTorch built with:

TorchVision: 0.13.1 OpenCV: 4.6.0 MMCV: 1.6.0 MMCV Compiler: GCC 9.3 MMCV CUDA Compiler: 11.6 MMRotate: 0.1.0+5fe611f

2022-12-15 15:19:25,678 - mmrotate - INFO - Distributed training: False 2022-12-15 15:19:26,160 - mmrotate - INFO - Config: dataset_type = 'DOTADataset' data_root = '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/' img_norm_cfg = dict( mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True) train_pipeline = [ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations', with_bbox=True), dict(type='RResize', img_scale=(1024, 1024)), dict( type='RRandomFlip', flip_ratio=[0.25, 0.25, 0.25], direction=['horizontal', 'vertical', 'diagonal']), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='DefaultFormatBundle'), dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']) ] test_pipeline = [ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1024, 1024), flip=False, transforms=[ dict(type='RResize'), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='DefaultFormatBundle'), dict(type='Collect', keys=['img']) ]) ] data = dict( samples_per_gpu=1, workers_per_gpu=2, train=dict( type='DOTADataset', ann_file= '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/annfiles/', img_prefix= '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/images/', pipeline=[ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations', with_bbox=True), dict(type='RResize', img_scale=(1024, 1024)), dict( type='RRandomFlip', flip_ratio=[0.25, 0.25, 0.25], direction=['horizontal', 'vertical', 'diagonal']), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='DefaultFormatBundle'), dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']) ], version='oc'), val=dict( type='DOTADataset', ann_file= 
'/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/annfiles/', img_prefix= '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/images/', pipeline=[ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1024, 1024), flip=False, transforms=[ dict(type='RResize'), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='DefaultFormatBundle'), dict(type='Collect', keys=['img']) ]) ], version='oc'), test=dict( type='DOTADataset', ann_file= '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/test/test_split_1024_200/images/', img_prefix= '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/test/test_split_1024_200/images/', pipeline=[ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1024, 1024), flip=False, transforms=[ dict(type='RResize'), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='DefaultFormatBundle'), dict(type='Collect', keys=['img']) ]) ], version='oc')) evaluation = dict(interval=12, metric='mAP') optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001) optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2)) lr_config = dict( policy='step', warmup='linear', warmup_iters=500, warmup_ratio=0.3333333333333333, step=[8, 11]) runner = dict(type='EpochBasedRunner', max_epochs=12) checkpoint_config = dict(interval=12) log_config = dict( interval=50, hooks=[dict(type='TextLoggerHook'), dict(type='TensorboardLoggerHook')]) dist_params = dict(backend='nccl') log_level = 'INFO' load_from = None resume_from = None workflow = [('train', 1)] opencv_num_threads = 0 mp_start_method = 'fork' angle_version = 'oc' model = dict( type='KnowledgeDistillationRotatedSingleStageDetector', backbone=dict( type='ResNet', depth=18, num_stages=4, out_indices=(0, 1, 2, 3), 
frozen_stages=1, zero_init_residual=False, norm_cfg=dict(type='BN', requires_grad=True), norm_eval=True, style='pytorch', init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet18')), neck=dict( type='FPN', in_channels=[64, 128, 256, 512], out_channels=256, start_level=1, add_extra_convs='on_input', num_outs=5), bbox_head=dict( type='LDRotatedRetinaHead', num_classes=15, in_channels=256, stacked_convs=4, feat_channels=256, assign_by_circumhbbox='oc', anchor_generator=dict( type='RotatedAnchorGenerator', octave_base_scale=4, scales_per_octave=3, ratios=[1.0, 0.5, 2.0], strides=[8, 16, 32, 64, 128]), bbox_coder=dict( type='DeltaXYWHAOBBoxCoder', angle_range='oc', norm_factor=None, edge_swap=False, proj_xy=False, target_means=(0.0, 0.0, 0.0, 0.0, 0.0), target_stds=(1.0, 1.0, 1.0, 1.0, 1.0)), loss_cls=dict( type='FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25, loss_weight=1.0), loss_bbox=dict(type='GDLoss', loss_weight=5.0, loss_type='gwd'), reg_max=8, reg_decoded_bbox=True, loss_ld=dict(type='GDLoss', loss_type='gwd', loss_weight=5.0), loss_kd=dict( type='KnowledgeDistillationKLDivLoss', loss_weight=30, T=5), loss_im=dict(type='IMLoss', loss_weight=2.0), imitation_method='finegrained'), train_cfg=dict( assigner=dict( type='MaxIoUAssigner', pos_iou_thr=0.5, neg_iou_thr=0.4, min_pos_iou=0, ignore_iof_thr=-1, iou_calculator=dict(type='RBboxOverlaps2D')), allowed_border=-1, pos_weight=-1, debug=False), test_cfg=dict( nms_pre=2000, min_bbox_size=0, score_thr=0.05, nms=dict(iou_thr=0.1), max_per_img=2000), teacher_config= './configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc.py', teacher_ckpt= '/media/kemove/B83CD2EA3CD2A324/CODE/Rotated-LD/configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc-41fd7805.pth', output_feature=True) teacher_ckpt = '/media/kemove/B83CD2EA3CD2A324/CODE/Rotated-LD/configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc-41fd7805.pth' work_dir = './work_dirs/rotated_retinanet_distribution_hbb_gwd_r18_r50_fpn_1x_dota_oc' 
auto_resume = False gpu_ids = range(0, 1)

2022-12-15 15:19:26,208 - mmrotate - INFO - Set random seed to 942796273, deterministic: False 2022-12-15 15:19:32,558 - mmrotate - INFO - initialize ResNet with init_cfg {'type': 'Pretrained', 'checkpoint': 'torchvision://resnet18'} 2022-12-15 15:19:32,622 - mmrotate - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'} Name of parameter - Initialization information

hezheyuan commented 1 year ago

{"env_info": "sys.platform: linux\nPython: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18) [GCC 10.3.0]\nCUDA available: True\nGPU 0: NVIDIA GeForce RTX 3090\nCUDA_HOME: /usr/local/cuda-11.6\nNVCC: Cuda compilation tools, release 11.6, V11.6.55\nGCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0\nPyTorch: 1.12.1\nPyTorch compiling details: PyTorch built with:\n - GCC 9.3\n - C++ Version: 201402\n - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 11.6\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37\n - CuDNN 8.3.2 (built against CUDA 11.5)\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls 
-Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n\nTorchVision: 0.13.1\nOpenCV: 4.6.0\nMMCV: 1.6.0\nMMCV Compiler: GCC 9.3\nMMCV CUDA Compiler: 11.6\nMMRotate: 0.1.0+5fe611f", "config": "dataset_type = 'DOTADataset'\ndata_root = '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/'\nimg_norm_cfg = dict(\n mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)\ntrain_pipeline = [\n dict(type='LoadImageFromFile'),\n dict(type='LoadAnnotations', with_bbox=True),\n dict(type='RResize', img_scale=(1024, 1024)),\n dict(\n type='RRandomFlip',\n flip_ratio=[0.25, 0.25, 0.25],\n direction=['horizontal', 'vertical', 'diagonal']),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])\n]\ntest_pipeline = [\n dict(type='LoadImageFromFile'),\n dict(\n type='MultiScaleFlipAug',\n img_scale=(1024, 1024),\n flip=False,\n transforms=[\n dict(type='RResize'),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img'])\n ])\n]\ndata = dict(\n samples_per_gpu=1,\n workers_per_gpu=2,\n train=dict(\n type='DOTADataset',\n ann_file=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/annfiles/',\n img_prefix=\n 
'/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/images/',\n pipeline=[\n dict(type='LoadImageFromFile'),\n dict(type='LoadAnnotations', with_bbox=True),\n dict(type='RResize', img_scale=(1024, 1024)),\n dict(\n type='RRandomFlip',\n flip_ratio=[0.25, 0.25, 0.25],\n direction=['horizontal', 'vertical', 'diagonal']),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])\n ],\n version='oc'),\n val=dict(\n type='DOTADataset',\n ann_file=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/annfiles/',\n img_prefix=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/images/',\n pipeline=[\n dict(type='LoadImageFromFile'),\n dict(\n type='MultiScaleFlipAug',\n img_scale=(1024, 1024),\n flip=False,\n transforms=[\n dict(type='RResize'),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img'])\n ])\n ],\n version='oc'),\n test=dict(\n type='DOTADataset',\n ann_file=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/test/test_split_1024_200/images/',\n img_prefix=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/test/test_split_1024_200/images/',\n pipeline=[\n dict(type='LoadImageFromFile'),\n dict(\n type='MultiScaleFlipAug',\n img_scale=(1024, 1024),\n flip=False,\n transforms=[\n dict(type='RResize'),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img'])\n ])\n ],\n version='oc'))\nevaluation = dict(interval=12, metric='mAP')\noptimizer = dict(type='SGD', lr=0.0025, momentum=0.9, 
weight_decay=0.0001)\noptimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))\nlr_config = dict(\n policy='step',\n warmup='linear',\n warmup_iters=500,\n warmup_ratio=0.3333333333333333,\n step=[8, 11])\nrunner = dict(type='EpochBasedRunner', max_epochs=12)\ncheckpoint_config = dict(interval=12)\nlog_config = dict(\n interval=50,\n hooks=[dict(type='TextLoggerHook'),\n dict(type='TensorboardLoggerHook')])\ndist_params = dict(backend='nccl')\nlog_level = 'INFO'\nload_from = None\nresume_from = None\nworkflow = [('train', 1)]\nopencv_num_threads = 0\nmp_start_method = 'fork'\nangle_version = 'oc'\nmodel = dict(\n type='KnowledgeDistillationRotatedSingleStageDetector',\n backbone=dict(\n type='ResNet',\n depth=18,\n num_stages=4,\n out_indices=(0, 1, 2, 3),\n frozen_stages=1,\n zero_init_residual=False,\n norm_cfg=dict(type='BN', requires_grad=True),\n norm_eval=True,\n style='pytorch',\n init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet18')),\n neck=dict(\n type='FPN',\n in_channels=[64, 128, 256, 512],\n out_channels=256,\n start_level=1,\n add_extra_convs='on_input',\n num_outs=5),\n bbox_head=dict(\n type='LDRotatedRetinaHead',\n num_classes=15,\n in_channels=256,\n stacked_convs=4,\n feat_channels=256,\n assign_by_circumhbbox='oc',\n anchor_generator=dict(\n type='RotatedAnchorGenerator',\n octave_base_scale=4,\n scales_per_octave=3,\n ratios=[1.0, 0.5, 2.0],\n strides=[8, 16, 32, 64, 128]),\n bbox_coder=dict(\n type='DeltaXYWHAOBBoxCoder',\n angle_range='oc',\n norm_factor=None,\n edge_swap=False,\n proj_xy=False,\n target_means=(0.0, 0.0, 0.0, 0.0, 0.0),\n target_stds=(1.0, 1.0, 1.0, 1.0, 1.0)),\n loss_cls=dict(\n type='FocalLoss',\n use_sigmoid=True,\n gamma=2.0,\n alpha=0.25,\n loss_weight=1.0),\n loss_bbox=dict(type='GDLoss', loss_weight=5.0, loss_type='gwd'),\n reg_max=8,\n reg_decoded_bbox=True,\n loss_ld=dict(type='GDLoss', loss_type='gwd', loss_weight=5.0),\n loss_kd=dict(\n type='KnowledgeDistillationKLDivLoss', 
loss_weight=30, T=5),\n loss_im=dict(type='IMLoss', loss_weight=2.0),\n imitation_method='finegrained'),\n train_cfg=dict(\n assigner=dict(\n type='MaxIoUAssigner',\n pos_iou_thr=0.5,\n neg_iou_thr=0.4,\n min_pos_iou=0,\n ignore_iof_thr=-1,\n iou_calculator=dict(type='RBboxOverlaps2D')),\n allowed_border=-1,\n pos_weight=-1,\n debug=False),\n test_cfg=dict(\n nms_pre=2000,\n min_bbox_size=0,\n score_thr=0.05,\n nms=dict(iou_thr=0.1),\n max_per_img=2000),\n teacher_config=\n './configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc.py',\n teacher_ckpt=\n '/media/kemove/B83CD2EA3CD2A324/CODE/Rotated-LD/configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc-41fd7805.pth',\n output_feature=True)\nteacher_ckpt = '/media/kemove/B83CD2EA3CD2A324/CODE/Rotated-LD/configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc-41fd7805.pth'\nwork_dir = './work_dirs/rotated_retinanet_distribution_hbb_gwd_r18_r50_fpn_1x_dota_oc'\nauto_resume = False\ngpu_ids = range(0, 1)\n", "seed": 884529645, "exp_name": "rotated_retinanet_distribution_hbb_gwd_r18_r50_fpn_1x_dota_oc.py"} {"mode": "train", "epoch": 1, "iter": 50, "lr": 0.001, "memory": 4366, "data_time": 0.05191, "loss_cls": NaN, "loss_bbox": NaN, "loss_ld": NaN, "loss_kd": NaN, "loss_im": NaN, "loss": NaN, "grad_norm": NaN, "time": 0.2035}

hezheyuan commented 1 year ago

> Please post your training log.

Could it be because I am training on a single GPU? I have not found the cause yet.

Zzh-tju commented 1 year ago

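This is most likely a dataset problem: either the preprocessing steps or the labels. Please follow mmrotate's instructions for downloading and preprocessing the dataset.

You can first turn off the distillation losses: in the config file, set the loss weights for KD, LD, and feature imitation to 0, then check whether NaN still occurs.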
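As a minimal sketch of that debugging step, the three distillation entries from the `bbox_head` config posted above can be copied and their `loss_weight` values set to 0. The helper name `disable_distillation` is hypothetical and not part of mmrotate or Rotated-LD, and the sketch assumes a weight of 0 is enough to switch each term off:

```python
import copy

# Distillation-related entries copied from the bbox_head config in the log above.
bbox_head = dict(
    loss_ld=dict(type='GDLoss', loss_type='gwd', loss_weight=5.0),
    loss_kd=dict(type='KnowledgeDistillationKLDivLoss', loss_weight=30, T=5),
    loss_im=dict(type='IMLoss', loss_weight=2.0),
)


def disable_distillation(head_cfg, keys=('loss_ld', 'loss_kd', 'loss_im')):
    """Hypothetical helper: return a copy of head_cfg with the listed
    loss weights set to 0, leaving every other field untouched."""
    cfg = copy.deepcopy(head_cfg)
    for key in keys:
        cfg[key]['loss_weight'] = 0.0
    return cfg


# Config to use for the debugging run; the original dict is not modified.
debug_head = disable_distillation(bbox_head)
```

In practice the same change can be made by editing the three `loss_weight` values directly in the config file; the helper only makes the change explicit and reversible.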

hezheyuan commented 1 year ago

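> This is most likely a dataset problem: either the preprocessing steps or the labels. Please follow mmrotate's instructions for downloading and preprocessing the dataset.
>
> You can first turn off the distillation losses: in the config file, set the loss weights for KD, LD, and feature imitation to 0, then check whether NaN still occurs.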

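Thanks! The dataset should not be the problem: I have used it for a long time, and it trains without issues in mmrotate. I will try turning off the feature distillation loss and see what happens. The feature imitation method I wrote myself also produces NaN losses.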

hezheyuan commented 1 year ago

After turning off the distillation losses, training no longer produces NaN. Why is that?

Zzh-tju commented 1 year ago

Which teacher model are you using, and which config file?

hezheyuan commented 1 year ago

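> Which teacher model are you using, and which config file?

1. The teacher model is rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc-41fd7805.pth, downloaded from the official mmrotate release.
2. The config file is ./configs/ld/rotated_retinanet_distribution_hbb_gwd_r18_r50_fpn_1x_dota_oc.py
3. The teacher's config file is ./configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc.py

I have now found where the NaN comes from: the network's bounding box predictions are NaN, so the losses cannot be computed. Adjusting the learning rate or the warmup ratio does not help. However, training configs/gwd/rotated_retinanet_distribution_hbb_gwd_r50_fpn_2x_dota_oc.py runs without errors.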
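The NaN values can also be read straight from the JSON log posted earlier in this thread. The following is a small sketch, not part of mmrotate: the function name `non_finite_keys` is hypothetical, and it relies on Python's `json` module accepting the bare `NaN` token by default.

```python
import json
import math


def non_finite_keys(log_line):
    """Hypothetical helper: return the keys of one JSON log record
    whose float values are NaN or infinite."""
    record = json.loads(log_line)  # json.loads accepts bare NaN by default
    return [
        key for key, value in record.items()
        if isinstance(value, float) and not math.isfinite(value)
    ]


# Training record copied from the log above.
line = (
    '{"mode": "train", "epoch": 1, "iter": 50, "lr": 0.001, "memory": 4366, '
    '"data_time": 0.05191, "loss_cls": NaN, "loss_bbox": NaN, "loss_ld": NaN, '
    '"loss_kd": NaN, "loss_im": NaN, "loss": NaN, "grad_norm": NaN, "time": 0.2035}'
)
bad = non_finite_keys(line)
print(bad)
```

Every loss term and `grad_norm` are already non-finite at the first logged iteration (iter 50), which is consistent with the predictions themselves being NaN rather than one loss slowly diverging.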

Zzh-tju commented 1 year ago

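You definitely cannot use the official mmrotate weights as the teacher. Its box representation is 4 numbers, not a probability distribution over 4n numbers.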
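To illustrate the difference between the two representations, here is a minimal sketch in plain Python. It assumes the common distribution-style decoding in which each of the 4 box quantities is the expectation of a softmax over n bins, and it assumes n = reg_max + 1 = 9 from `reg_max=8` in the config above; the exact layout used by Rotated-LD may differ, and `decode_distribution` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_distribution(logits_4n, n):
    """Hypothetical decoder: turn 4*n logits into 4 box quantities by
    taking the expected bin index of each n-bin distribution."""
    if len(logits_4n) != 4 * n:
        raise ValueError(f'expected {4 * n} values, got {len(logits_4n)}')
    values = []
    for i in range(4):
        probs = softmax(logits_4n[i * n:(i + 1) * n])
        values.append(sum(k * p for k, p in enumerate(probs)))
    return values


n = 9  # assumption: reg_max=8 in the config gives reg_max + 1 bins

# A distribution-style head emits 4*n logits per box; uniform logits
# decode to the middle bin for each of the 4 quantities.
uniform = decode_distribution([0.0] * (4 * n), n)

# A plain regression head emits only 4 numbers per box, which cannot be
# interpreted as 4 distributions of n bins each.
try:
    decode_distribution([0.0] * 4, n)
    plain_head_rejected = False
except ValueError:
    plain_head_rejected = True
```

The sketch only shows why the two output formats are not interchangeable: a teacher trained with a plain 4-number regression head has no per-bin distribution to distill from.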