Adamdad / ConsistentTeacher

[CVPR2023 Highlight] Consistent-Teacher: Towards Reducing Inconsistent Pseudo-targets in Semi-supervised Object Detection
Apache License 2.0

Why is Loss==0? #7

Closed joeyslv closed 1 year ago

joeyslv commented 1 year ago

I installed the environment on Windows and ran `python tools/train.py configs/consistent-teacher/consistent_teacher_r50_fpn_coco_180k_1p.py`. How can I solve the problem of loss==0 when running the project?

```
2023-04-23 12:49:22,302 - mmdet.ssod - INFO - [<StreamHandler (INFO)>, <FileHandler E:\Object-Detection\Github\consist\ConsistentTeacher\work_dirs\consistent_teacher_r50_fpn_coco_180k_1p\20230423_124922.log (INFO)>]
2023-04-23 12:49:22,303 - mmdet.ssod - INFO - Environment info:

sys.platform: win32
Python: 3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3070 Laptop GPU
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1
NVCC: Build cuda_11.1.relgpu_drvr455TC455_06.29190527_0
GCC: gcc (Rev2, Built by MSYS2 project) 10.3.0
PyTorch: 1.9.0+cu111
PyTorch compiling details: PyTorch built with:

TorchVision: 0.10.0+cu111
OpenCV: 4.5.5
MMCV: 1.4.2
MMCV Compiler: MSVC 192930137
MMCV CUDA Compiler: 11.1
MMDetection: 2.25.0+1fa6477

2023-04-23 12:49:24,106 - mmdet.ssod - INFO - Distributed training: False 2023-04-23 12:49:25,727 - mmdet.ssod - INFO - Config: dataset_type = 'CocoDataset' data_root = 'data/coco/' img_norm_cfg = dict( mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True) train_pipeline = [ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations', with_bbox=True), dict( type='Sequential', transforms=[ dict( type='RandResize', img_scale=[(1333, 400), (1333, 1200)], multiscale_mode='range', keep_ratio=True), dict(type='RandFlip', flip_ratio=0.5), dict( type='OneOf', transforms=[ dict(type='Identity'), dict(type='AutoContrast'), dict(type='RandEqualize'), dict(type='RandSolarize'), dict(type='RandColor'), dict(type='RandContrast'), dict(type='RandBrightness'), dict(type='RandSharpness'), dict(type='RandPosterize') ]) ], record=True), dict(type='Pad', size_divisor=32), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='ExtraAttrs', tag='sup'), dict(type='DefaultFormatBundle'), dict( type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'], meta_keys=('filename', 'ori_shape', 'img_shape', 'img_norm_cfg', 'pad_shape', 'scale_factor', 'tag')) ] test_pipeline = [ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1333, 800), flip=False, transforms=[ dict(type='Resize', keep_ratio=True), dict(type='RandomFlip'), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ] data = dict( samples_per_gpu=5, workers_per_gpu=1, train=dict( type='SemiDataset', sup=dict( type='CocoDataset', ann_file='droot_4classes\json\voc07_train0.3.json', img_prefix='droot_4classes', pipeline=[ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations', with_bbox=True), dict( type='Sequential', transforms=[ dict( type='RandResize', img_scale=[(1333, 400), (1333, 1200)], multiscale_mode='range', keep_ratio=True), dict(type='RandFlip', flip_ratio=0.5), dict( type='OneOf', transforms=[ dict(type='Identity'), dict(type='AutoContrast'), dict(type='RandEqualize'), dict(type='RandSolarize'), dict(type='RandColor'), dict(type='RandContrast'), dict(type='RandBrightness'), dict(type='RandSharpness'), dict(type='RandPosterize') ]) ], record=True), dict(type='Pad', size_divisor=32), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='ExtraAttrs', tag='sup'), dict(type='DefaultFormatBundle'), dict( type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'], meta_keys=('filename', 'ori_shape', 'img_shape', 'img_norm_cfg', 'pad_shape', 'scale_factor', 'tag')) ]), unsup=dict( type='CocoDataset', ann_file='droot_4classes\json\voc07_train_unsup0.3.json', img_prefix='droot_4classes', pipeline=[ dict(type='LoadImageFromFile'), dict(type='PseudoSamples', with_bbox=True), dict( type='MultiBranch', unsup_teacher=[ dict( type='Sequential', transforms=[ dict( type='RandResize', img_scale=[(1333, 400), (1333, 1200)], multiscale_mode='range', keep_ratio=True), dict(type='RandFlip', flip_ratio=0.5), dict( type='ShuffledSequential', transforms=[ dict( type='OneOf', transforms=[ dict(type='Identity'), dict(type='AutoContrast'), dict(type='RandEqualize'), dict(type='RandSolarize'), dict(type='RandColor'), dict(type='RandContrast'), dict(type='RandBrightness'), dict(type='RandSharpness'), dict(type='RandPosterize') ]), dict( type='OneOf', transforms=[{ 
'type': 'RandTranslate', 'x': (-0.1, 0.1) }, { 'type': 'RandTranslate', 'y': (-0.1, 0.1) }, { 'type': 'RandRotate', 'angle': (-30, 30) }, [{ 'type': 'RandShear', 'x': (-30, 30) }, { 'type': 'RandShear', 'y': (-30, 30) }]]) ]), dict( type='RandErase', n_iterations=(1, 5), size=[0, 0.2], squared=True) ], record=True), dict(type='Pad', size_divisor=32), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='ExtraAttrs', tag='unsup_student'), dict(type='DefaultFormatBundle'), dict( type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'], meta_keys=('filename', 'ori_shape', 'img_shape', 'img_norm_cfg', 'pad_shape', 'scale_factor', 'tag', 'transform_matrix')) ], unsup_student=[ dict( type='Sequential', transforms=[ dict( type='RandResize', img_scale=[(1333, 400), (1333, 1200)], multiscale_mode='range', keep_ratio=True), dict(type='RandFlip', flip_ratio=0.5) ], record=True), dict(type='Pad', size_divisor=32), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='ExtraAttrs', tag='unsup_teacher'), dict(type='DefaultFormatBundle'), dict( type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'], meta_keys=('filename', 'ori_shape', 'img_shape', 'img_norm_cfg', 'pad_shape', 'scale_factor', 'tag', 'transform_matrix')) ]) ], filter_empty_gt=False)), val=dict( type='CocoDataset', ann_file='droot_4classes\json\voc07_val_unsup1.json', img_prefix='droot_4classes', pipeline=[ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1333, 800), flip=False, transforms=[ dict(type='Resize', keep_ratio=True), dict(type='RandomFlip'), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ]), test=dict( type='CocoDataset', ann_file='droot_4classes\json\voc07_val_unsup1.json', img_prefix='data/coco/val2017/', pipeline=[ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1333, 800), flip=False, transforms=[ dict(type='Resize', keep_ratio=True), dict(type='RandomFlip'), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ]), sampler=dict( train=dict( type='SemiBalanceSampler', sample_ratio=[1, 5], by_prob=False, epoch_length=500))) evaluation = dict(interval=1000, metric='bbox', type='SubModulesDistEvalHook') optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001) optimizer_config = dict(grad_clip=dict(max_norm=20, norm_type=2)) lr_config = dict( policy='step', warmup='linear', warmup_iters=500, warmup_ratio=0.001, step=[2000, 12000]) runner = dict(type='IterBasedRunner', max_iters=20000) checkpoint_config = dict(interval=1000, by_epoch=False, max_keep_ckpts=2) log_config = dict( interval=50, hooks=[ dict(type='TextLoggerHook', by_epoch=False), dict( type='WandbLoggerHook', init_kwargs=dict( project='consistent-teacher', name='consistent_teacher_r50_fpn_coco_180k_1p', config=dict( fold=1, percent=1, work_dirs= './work_dirs\consistent_teacher_r50_fpn_coco_180k_1p', total_step=20000)), by_epoch=False) ]) custom_hooks = [ dict(type='NumClassCheckHook'), dict(type='WeightSummary'), dict(type='SetIterInfoHook'), dict(type='MeanTeacher', momentum=0.9995, interval=1, warm_up=0) ] dist_params = dict(backend='nccl') log_level = 'INFO' load_from = 
None resume_from = None workflow = [('train', 1)] opencv_num_threads = 0 mp_start_method = 'fork' auto_scale_lr = dict(enable=False, base_batch_size=16) mmdet_base = '../../../mmdetection/configs/base' model = dict( type='ConsistentTeacher', model=dict( type='RetinaNet', backbone=dict( type='ResNet', depth=50, num_stages=4, out_indices=(0, 1, 2, 3), frozen_stages=1, norm_cfg=dict(type='BN', requires_grad=True), norm_eval=True, style='pytorch', init_cfg=dict( type='Pretrained', checkpoint='torchvision://resnet50')), neck=dict( type='FPN', in_channels=[256, 512, 1024, 2048], out_channels=256, start_level=1, add_extra_convs='on_output', num_outs=5), bbox_head=dict( type='FAM3DHead', num_classes=4, in_channels=256, stacked_convs=4, feat_channels=256, anchor_type='anchor_based', anchor_generator=dict( type='AnchorGenerator', ratios=[1.0], octave_base_scale=8, scales_per_octave=1, strides=[8, 16, 32, 64, 128]), bbox_coder=dict( type='DeltaXYWHBBoxCoder', target_means=[0.0, 0.0, 0.0, 0.0], target_stds=[0.1, 0.1, 0.2, 0.2]), loss_cls=dict( type='FocalLoss', use_sigmoid=True, activated=True, gamma=2.0, alpha=0.25, loss_weight=1.0), loss_bbox=dict(type='GIoULoss', loss_weight=2.0)), train_cfg=dict( assigner=dict( type='DynamicSoftLabelAssigner', topk=13, iou_factor=2.0), alpha=1, beta=6, allowed_border=-1, pos_weight=-1, debug=False), test_cfg=dict( nms_pre=1000, min_bbox_size=0, score_thr=0.05, nms=dict(type='nms', iou_threshold=0.6), max_per_img=100)), train_cfg=dict( num_scores=100, dynamic_ratio=1.0, warmup_step=500, min_pseduo_box_size=0, unsup_weight=2.0), test_cfg=dict(inference_on='teacher')) strong_pipeline = [ dict( type='Sequential', transforms=[ dict( type='RandResize', img_scale=[(1333, 400), (1333, 1200)], multiscale_mode='range', keep_ratio=True), dict(type='RandFlip', flip_ratio=0.5), dict( type='ShuffledSequential', transforms=[ dict( type='OneOf', transforms=[ dict(type='Identity'), dict(type='AutoContrast'), dict(type='RandEqualize'), dict(type='RandSolarize'), dict(type='RandColor'), dict(type='RandContrast'), dict(type='RandBrightness'), dict(type='RandSharpness'), dict(type='RandPosterize') ]), dict( type='OneOf', transforms=[{ 'type': 'RandTranslate', 'x': (-0.1, 0.1) }, { 'type': 'RandTranslate', 'y': (-0.1, 0.1) }, { 'type': 'RandRotate', 'angle': (-30, 30) }, [{ 'type': 'RandShear', 'x': (-30, 30) }, { 'type': 'RandShear', 'y': (-30, 30) }]]) ]), dict( type='RandErase', n_iterations=(1, 5), size=[0, 0.2], squared=True) ], record=True), dict(type='Pad', size_divisor=32), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='ExtraAttrs', tag='unsup_student'), dict(type='DefaultFormatBundle'), dict( type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'], meta_keys=('filename', 'ori_shape', 'img_shape', 'img_norm_cfg', 'pad_shape', 'scale_factor', 'tag', 'transform_matrix')) ] weak_pipeline = [ dict( type='Sequential', transforms=[ dict( type='RandResize', img_scale=[(1333, 400), (1333, 1200)], multiscale_mode='range', keep_ratio=True), dict(type='RandFlip', flip_ratio=0.5) ], record=True), dict(type='Pad', size_divisor=32), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='ExtraAttrs', tag='unsup_teacher'), dict(type='DefaultFormatBundle'), dict( type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'], meta_keys=('filename', 'ori_shape', 'img_shape', 'img_norm_cfg', 'pad_shape', 'scale_factor', 'tag', 'transform_matrix')) ] unsup_pipeline = [ 
dict(type='LoadImageFromFile'), dict(type='PseudoSamples', with_bbox=True), dict( type='MultiBranch', unsup_teacher=[ dict( type='Sequential', transforms=[ dict( type='RandResize', img_scale=[(1333, 400), (1333, 1200)], multiscale_mode='range', keep_ratio=True), dict(type='RandFlip', flip_ratio=0.5), dict( type='ShuffledSequential', transforms=[ dict( type='OneOf', transforms=[ dict(type='Identity'), dict(type='AutoContrast'), dict(type='RandEqualize'), dict(type='RandSolarize'), dict(type='RandColor'), dict(type='RandContrast'), dict(type='RandBrightness'), dict(type='RandSharpness'), dict(type='RandPosterize') ]), dict( type='OneOf', transforms=[{ 'type': 'RandTranslate', 'x': (-0.1, 0.1) }, { 'type': 'RandTranslate', 'y': (-0.1, 0.1) }, { 'type': 'RandRotate', 'angle': (-30, 30) }, [{ 'type': 'RandShear', 'x': (-30, 30) }, { 'type': 'RandShear', 'y': (-30, 30) }]]) ]), dict( type='RandErase', n_iterations=(1, 5), size=[0, 0.2], squared=True) ], record=True), dict(type='Pad', size_divisor=32), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='ExtraAttrs', tag='unsup_student'), dict(type='DefaultFormatBundle'), dict( type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'], meta_keys=('filename', 'ori_shape', 'img_shape', 'img_norm_cfg', 'pad_shape', 'scale_factor', 'tag', 'transform_matrix')) ], unsup_student=[ dict( type='Sequential', transforms=[ dict( type='RandResize', img_scale=[(1333, 400), (1333, 1200)], multiscale_mode='range', keep_ratio=True), dict(type='RandFlip', flip_ratio=0.5) ], record=True), dict(type='Pad', size_divisor=32), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='ExtraAttrs', tag='unsup_teacher'), dict(type='DefaultFormatBundle'), dict( type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'], meta_keys=('filename', 'ori_shape', 'img_shape', 'img_norm_cfg', 'pad_shape', 'scale_factor', 'tag', 'transform_matrix')) ]) ] fold = 1 percent = 1 classes = ['loose_l', 'loose_s', 'poor_l', 'porous'] fp16 = None work_dir = './work_dirs\consistent_teacher_r50_fpn_coco_180k_1p' cfg_name = 'consistent_teacher_r50_fpn_coco_180k_1p' gpu_ids = range(0, 1)

2023-04-23 12:49:26,187 - mmdet.ssod - INFO - initialize ResNet with init_cfg {'type': 'Pretrained', 'checkpoint': 'torchvision://resnet50'}
2023-04-23 12:49:26,410 - mmdet.ssod - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}
2023-04-23 12:49:26,478 - mmdet.ssod - INFO - initialize ResNet with init_cfg {'type': 'Pretrained', 'checkpoint': 'torchvision://resnet50'}
2023-04-23 12:49:26,638 - mmdet.ssod - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}
Name of parameter - Initialization information
2023-04-23 12:50:24,305 - mmdet.ssod - INFO - Iter [50/20000] lr: 9.890e-04, eta: 4:02:58, time: 0.731, data_time: 0.019, memory: 3454, ema_momentum: 0.9800, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0240, unsup_gmm_thr: 0.0015, loss: 0.0000, grad_norm: 0.0000
2023-04-23 12:50:51,673 - mmdet.ssod - INFO - Iter [100/20000] lr: 1.988e-03, eta: 3:31:56, time: 0.547, data_time: 0.014, memory: 3454, ema_momentum: 0.9900, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0027, loss: 0.0000, grad_norm: 0.0000
2023-04-23 12:51:18,728 - mmdet.ssod - INFO - Iter [150/20000] lr: 2.987e-03, eta: 3:20:36, time: 0.541, data_time: 0.014, memory: 3454, ema_momentum: 0.9933, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0027, loss: 0.0000, grad_norm: 0.0000
2023-04-23 12:51:46,522 - mmdet.ssod - INFO - Iter [200/20000] lr: 3.986e-03, eta: 3:15:56, time: 0.556, data_time: 0.015, memory: 3454, ema_momentum: 0.9950, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0040, unsup_gmm_thr: 0.0067, loss: 0.0000, grad_norm: 0.0000
2023-04-23 12:52:15,543 - mmdet.ssod - INFO - Iter [250/20000] lr: 4.985e-03, eta: 3:14:34, time: 0.580, data_time: 0.015, memory: 3454, ema_momentum: 0.9960, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0064, loss: 0.0000, grad_norm: 0.0000
2023-04-23 12:52:43,742 - mmdet.ssod - INFO - Iter [300/20000] lr: 5.984e-03, eta: 3:12:35, time: 0.564, data_time: 0.015, memory: 3454, ema_momentum: 0.9967, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0013, loss: 0.0000, grad_norm: 0.0000
2023-04-23 12:53:12,320 - mmdet.ssod - INFO - Iter [350/20000] lr: 6.983e-03, eta: 3:11:24, time: 0.572, data_time: 0.015, memory: 3454, ema_momentum: 0.9971, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0040, unsup_gmm_thr: 0.0131, loss: 0.0000, grad_norm: 0.0000
2023-04-23 12:53:40,363 - mmdet.ssod - INFO - Iter [400/20000] lr: 7.982e-03, eta: 3:09:57, time: 0.561, data_time: 0.015, memory: 3454, ema_momentum: 0.9975, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0040, unsup_gmm_thr: 0.0101, loss: 0.0000, grad_norm: 0.0000
2023-04-23 12:54:08,155 - mmdet.ssod - INFO - Iter [450/20000] lr: 8.981e-03, eta: 3:08:32, time: 0.556, data_time: 0.015, memory: 3454, ema_momentum: 0.9978, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0021, loss: 0.0000, grad_norm: 0.0000
2023-04-23 12:54:35,883 - mmdet.ssod - INFO - Iter [500/20000] lr: 9.980e-03, eta: 3:07:16, time: 0.555, data_time: 0.015, memory: 3454, ema_momentum: 0.9980, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0013, loss: 0.0000, grad_norm: 0.0000
2023-04-23 12:55:14,925 - mmdet.ssod - INFO - Iter [550/20000] lr: 1.000e-02, eta: 3:12:49, time: 0.781, data_time: 0.229, memory: 3454, ema_momentum: 0.9982, unsup_loss_cls: 0.0004, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0000, loss: 0.0004, grad_norm: 0.0186
2023-04-23 12:55:42,908 - mmdet.ssod - INFO - Iter [600/20000] lr: 1.000e-02, eta: 3:11:22, time: 0.560, data_time: 0.015, memory: 3454, ema_momentum: 0.9983, unsup_loss_cls: 0.0001, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0000, loss: 0.0001, grad_norm: 0.0020
2023-04-23 12:56:10,335 - mmdet.ssod - INFO - Iter [650/20000] lr: 1.000e-02, eta: 3:09:48, time: 0.549, data_time: 0.015, memory: 3454, ema_momentum: 0.9985, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0000, loss: 0.0000, grad_norm: 0.0015
2023-04-23 12:56:38,082 - mmdet.ssod - INFO - Iter [700/20000] lr: 1.000e-02, eta: 3:08:32, time: 0.555, data_time: 0.014, memory: 3454, ema_momentum: 0.9986, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0000, loss: 0.0000, grad_norm: 0.0012
2023-04-23 12:57:05,840 - mmdet.ssod - INFO - Iter [750/20000] lr: 1.000e-02, eta: 3:07:23, time: 0.555, data_time: 0.015, memory: 3454, ema_momentum: 0.9987, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0000, loss: 0.0000, grad_norm: 0.0010
2023-04-23 12:57:33,505 - mmdet.ssod - INFO - Iter [800/20000] lr: 1.000e-02, eta: 3:06:17, time: 0.553, data_time: 0.014, memory: 3454, ema_momentum: 0.9988, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0000, loss: 0.0000, grad_norm: 0.0008
2023-04-23 12:58:01,648 - mmdet.ssod - INFO - Iter [850/20000] lr: 1.000e-02, eta: 3:05:26, time: 0.563, data_time: 0.015, memory: 3454, ema_momentum: 0.9988, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0000, loss: 0.0000, grad_norm: 0.0008
2023-04-23 12:58:29,195 - mmdet.ssod - INFO - Iter [900/20000] lr: 1.000e-02, eta: 3:04:25, time: 0.551, data_time: 0.014, memory: 3454, ema_momentum: 0.9989, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0000, loss: 0.0000, grad_norm: 0.0007
2023-04-23 12:58:57,751 - mmdet.ssod - INFO - Iter [950/20000] lr: 1.000e-02, eta: 3:03:48, time: 0.571, data_time: 0.014, memory: 3454, ema_momentum: 0.9989, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0000, loss: 0.0000, grad_norm: 0.0006
2023-04-23 12:59:25,490 - mmdet.ssod - INFO - Saving checkpoint at 1000 iterations
2023-04-23 12:59:27,535 - mmdet.ssod - INFO - Exp name: consistent_teacher_r50_fpn_coco_180k_1p.py
2023-04-23 12:59:27,536 - mmdet.ssod - INFO - Iter [1000/20000] lr: 1.000e-02, eta: 3:03:35, time: 0.596, data_time: 0.014, memory: 3454, ema_momentum: 0.9990, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0000, unsup_gmm_thr: 0.0000, loss: 0.0000, grad_norm: 0.0005
```

joeyslv commented 1 year ago

I used a self-made dataset, and it runs successfully with both the supervised model and Soft-Teacher, so the dataset itself should be fine.

joeyslv commented 1 year ago

I logged into wandb and checked the loss: it is not 0, just very small. How can I amplify such a loss? Is there any problem if I don't amplify it?

joeyslv commented 1 year ago

Also, an error occurred during evaluation in the training phase. It seems that distributed evaluation is used, but I only have one graphics card. How can I modify it so that evaluation runs on a single GPU?

 File "tools/train.py", line 193, in main
    meta=meta,
  File "e:\object-detection\github\consist\consistentteacher\ssod\apis\train.py", line 209, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "D:\App\anaconda\envs\arknights\lib\site-packages\mmcv\runner\iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "D:\App\anaconda\envs\arknights\lib\site-packages\mmcv\runner\iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "D:\App\anaconda\envs\arknights\lib\site-packages\mmcv\runner\base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "e:\object-detection\github\consist\consistentteacher\ssod\utils\hooks\submodules_evaluation.py", line 37, in after_train_iter
    self._do_evaluate(runner)
  File "e:\object-detection\github\consist\consistentteacher\ssod\utils\hooks\submodules_evaluation.py", line 51, in _do_evaluate
    dist.broadcast(module.running_var, 0)
  File "D:\App\anaconda\envs\arknights\lib\site-packages\torch\distributed\distributed_c10d.py", line 1075, in broadcast
    default_pg = _get_default_group()
  File "D:\App\anaconda\envs\arknights\lib\site-packages\torch\distributed\distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Adamdad commented 1 year ago

Hello @joeyslv,

I'm uncertain about the details of your case, but it seems that the sup_loss_cls and sup_loss_bbox terms are missing from your training. Without these supervised losses, the model cannot be trained. I recommend checking whether these losses actually appear in your run.

Best regards,
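
One quick way to confirm whether the supervised losses ever appear is to scan the training log for their keys; a minimal sketch (the helper function is hypothetical, and the log path is the one from the run above):

```python
# Hypothetical check: scan a training log for the supervised-loss keys.
# If sup_loss_cls / sup_loss_bbox never appear, no labeled images are
# reaching the student, and the total loss collapses toward zero.
def has_sup_losses(log_path):
    with open(log_path, encoding='utf-8') as f:
        text = f.read()
    return {key: key in text for key in ('sup_loss_cls', 'sup_loss_bbox')}

print(has_sup_losses(
    r'work_dirs\consistent_teacher_r50_fpn_coco_180k_1p\20230423_124922.log'))
```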

Adamdad commented 1 year ago

> Also, an error occurred during evaluation in the training phase. It seems that distributed evaluation is used, but I only have one graphics card. How can I modify it so that evaluation runs on a single GPU?
>
> RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

I haven't personally tested the single-card case, but I believe the issue lies in how your labeled and unlabeled data are batched. In my experiments, I select M labeled and N unlabeled images for each GPU, at a 1:4 ratio.

However, it seems that you have set the sample ratio incorrectly. To resolve this, update your config as follows:

`sample_ratio=[1, 5]` -> `sample_ratio=[1, 4]`

With `samples_per_gpu=5`, a batch of 5 cannot be split at a 1:5 ratio, so your model may never receive any labeled samples. Switching to 1 labeled + 4 unlabeled per batch lets training proceed correctly; see the sketch below.
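
For reference, a minimal sketch of the corrected data settings, assuming the field names and values from the config dump earlier in this thread:

```python
# Minimal sketch of the corrected data/sampler settings. With
# samples_per_gpu=5, one "unit" of a 1:5 ratio needs 1 + 5 = 6 images
# and cannot fit in a batch of 5, so batches end up all-unlabeled and
# the supervised losses never fire. A 1:4 ratio fills the batch
# exactly: 1 labeled + 4 unlabeled = 5.
data = dict(
    samples_per_gpu=5,
    workers_per_gpu=1,
    sampler=dict(
        train=dict(
            type='SemiBalanceSampler',
            sample_ratio=[1, 4],  # was [1, 5]
            by_prob=False,
            epoch_length=500)))
```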

joeyslv commented 1 year ago

> However, it seems that you have set the sample ratio incorrectly. To resolve this, update your config as follows:
>
> `sample_ratio=[1, 5]` -> `sample_ratio=[1, 4]`

Thank you very much for your help. Training now proceeds normally, but the following error still occurs during evaluation:

D:\App\anaconda\envs\arknights\lib\site-packages\numpy\_distributor_init.py:32: UserWarning: loaded more than 1 DLL from .libs:
D:\App\anaconda\envs\arknights\lib\site-packages\numpy\.libs\libopenblas.JPIJNSWNNAN3CE6LLI5FWSPHUT2VXMTH.gfortran-win_amd64.dll
D:\App\anaconda\envs\arknights\lib\site-packages\numpy\.libs\libopenblas.XWYDX2IKJW2NMTWSFYNGFUWKQU3LYTCZ.gfortran-win_amd64.dll
  stacklevel=1)
e:\object-detection\github\consist\consistentteacher\thirdparty\mmdetection\mmdet\datasets\pipelines\formating.py:7: UserWarning: DeprecationWarning: mmdet.datasets.pipelines.formating will be deprecated, please replace it with mmdet.datasets.pipelines.formatting.
  warnings.warn('DeprecationWarning: mmdet.datasets.pipelines.formating will be '
D:\App\anaconda\envs\arknights\lib\site-packages\torch\nn\functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  ..\c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
2023-04-23 14:54:36,856 - mmdet.ssod - INFO - Iter [50/20000]   lr: 9.890e-06, eta: 4:06:43, time: 0.742, data_time: 0.010, memory: 3440, ema_momentum: 0.9800, sup_loss_cls: 3.0652, sup_loss_bbox: 1.6344, sup_num_gts: 2.5000, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.0900, unsup_gmm_thr: 0.0615, loss: 4.6996, grad_norm: 85.8169
2023-04-23 14:55:11,478 - mmdet.ssod - INFO - Iter [100/20000]  lr: 1.988e-05, eta: 3:57:53, time: 0.693, data_time: 0.008, memory: 3440, ema_momentum: 0.9900, sup_loss_cls: 2.1009, sup_loss_bbox: 1.5306, sup_num_gts: 2.8800, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 0.5250, unsup_gmm_thr: 0.0881, loss: 3.6315, grad_norm: 100.6657
2023-04-23 14:55:46,758 - mmdet.ssod - INFO - Iter [150/20000]  lr: 2.987e-05, eta: 3:55:59, time: 0.706, data_time: 0.008, memory: 3440, ema_momentum: 0.9933, sup_loss_cls: 1.4099, sup_loss_bbox: 1.1134, sup_num_gts: 2.8000, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 1.1500, unsup_gmm_thr: 0.1296, loss: 2.5233, grad_norm: 72.2084
2023-04-23 14:56:21,544 - mmdet.ssod - INFO - Iter [200/20000]  lr: 3.986e-05, eta: 3:53:57, time: 0.696, data_time: 0.008, memory: 3440, ema_momentum: 0.9950, sup_loss_cls: 1.2012, sup_loss_bbox: 1.0114, sup_num_gts: 2.6600, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 1.7400, unsup_gmm_thr: 0.1389, loss: 2.2126, grad_norm: 69.8378
2023-04-23 14:56:56,071 - mmdet.ssod - INFO - Iter [250/20000]  lr: 4.985e-05, eta: 3:52:09, time: 0.691, data_time: 0.007, memory: 3440, ema_momentum: 0.9960, sup_loss_cls: 1.0765, sup_loss_bbox: 0.9455, sup_num_gts: 2.5800, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 2.2400, unsup_gmm_thr: 0.1610, loss: 2.0220, grad_norm: 82.8436
2023-04-23 14:57:31,668 - mmdet.ssod - INFO - Iter [300/20000]  lr: 5.984e-05, eta: 3:51:55, time: 0.712, data_time: 0.008, memory: 3440, ema_momentum: 0.9967, sup_loss_cls: 0.9920, sup_loss_bbox: 0.9144, sup_num_gts: 2.8600, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 2.6700, unsup_gmm_thr: 0.1808, loss: 1.9064, grad_norm: 66.3868
2023-04-23 14:58:06,046 - mmdet.ssod - INFO - Iter [350/20000]  lr: 6.983e-05, eta: 3:50:27, time: 0.688, data_time: 0.007, memory: 3440, ema_momentum: 0.9971, sup_loss_cls: 0.9652, sup_loss_bbox: 0.8521, sup_num_gts: 2.7000, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 2.4400, unsup_gmm_thr: 0.2018, loss: 1.8174, grad_norm: 86.9026
2023-04-23 14:58:40,951 - mmdet.ssod - INFO - Iter [400/20000]  lr: 7.982e-05, eta: 3:49:38, time: 0.698, data_time: 0.007, memory: 3440, ema_momentum: 0.9975, sup_loss_cls: 0.9636, sup_loss_bbox: 0.8859, sup_num_gts: 3.0400, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 2.3850, unsup_gmm_thr: 0.2087, loss: 1.8496, grad_norm: 75.2607
2023-04-23 14:59:16,361 - mmdet.ssod - INFO - Iter [450/20000]  lr: 8.981e-05, eta: 3:49:14, time: 0.708, data_time: 0.007, memory: 3440, ema_momentum: 0.9978, sup_loss_cls: 0.9696, sup_loss_bbox: 0.8790, sup_num_gts: 3.0600, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 2.4650, unsup_gmm_thr: 0.2191, loss: 1.8486, grad_norm: 76.9708
2023-04-23 14:59:51,275 - mmdet.ssod - INFO - Iter [500/20000]  lr: 9.980e-05, eta: 3:48:29, time: 0.698, data_time: 0.007, memory: 3440, ema_momentum: 0.9980, sup_loss_cls: 0.9265, sup_loss_bbox: 0.8568, sup_num_gts: 2.9200, unsup_loss_cls: 0.0000, unsup_loss_bbox: 0.0000, unsup_num_gts: 2.6850, unsup_gmm_thr: 0.2163, loss: 1.7833, grad_norm: 79.8449
2023-04-23 15:00:25,902 - mmdet.ssod - INFO - Iter [550/20000]  lr: 1.000e-04, eta: 3:47:35, time: 0.693, data_time: 0.007, memory: 3440, ema_momentum: 0.9982, sup_loss_cls: 0.7877, sup_loss_bbox: 0.8173, sup_num_gts: 3.2200, unsup_loss_cls: 1.3093, unsup_loss_bbox: 1.1311, unsup_num_gts: 2.2550, unsup_gmm_thr: 0.2148, loss: 4.0453, grad_norm: 104.0142
2023-04-23 15:01:01,351 - mmdet.ssod - INFO - Iter [600/20000]  lr: 1.000e-04, eta: 3:47:11, time: 0.709, data_time: 0.008, memory: 3440, ema_momentum: 0.9983, sup_loss_cls: 0.8378, sup_loss_bbox: 0.7998, sup_num_gts: 2.3800, unsup_loss_cls: 1.2027, unsup_loss_bbox: 1.0569, unsup_num_gts: 2.4200, unsup_gmm_thr: 0.2243, loss: 3.8973, grad_norm: 107.2583
2023-04-23 15:01:36,434 - mmdet.ssod - INFO - Iter [650/20000]  lr: 1.000e-04, eta: 3:46:34, time: 0.702, data_time: 0.007, memory: 3440, ema_momentum: 0.9985, sup_loss_cls: 0.8000, sup_loss_bbox: 0.7546, sup_num_gts: 2.4200, unsup_loss_cls: 1.2029, unsup_loss_bbox: 1.0836, unsup_num_gts: 2.3900, unsup_gmm_thr: 0.2338, loss: 3.8412, grad_norm: 109.9931
2023-04-23 15:02:11,913 - mmdet.ssod - INFO - Iter [700/20000]  lr: 1.000e-04, eta: 3:46:09, time: 0.710, data_time: 0.007, memory: 3440, ema_momentum: 0.9986, sup_loss_cls: 0.7600, sup_loss_bbox: 0.7552, sup_num_gts: 2.5000, unsup_loss_cls: 1.2578, unsup_loss_bbox: 1.1244, unsup_num_gts: 2.4250, unsup_gmm_thr: 0.2345, loss: 3.8974, grad_norm: 94.4281
2023-04-23 15:02:47,359 - mmdet.ssod - INFO - Iter [750/20000]  lr: 1.000e-04, eta: 3:45:41, time: 0.709, data_time: 0.008, memory: 3440, ema_momentum: 0.9987, sup_loss_cls: 0.7542, sup_loss_bbox: 0.8021, sup_num_gts: 2.9000, unsup_loss_cls: 1.1399, unsup_loss_bbox: 1.0759, unsup_num_gts: 2.2600, unsup_gmm_thr: 0.2369, loss: 3.7720, grad_norm: 96.5684
2023-04-23 15:03:22,560 - mmdet.ssod - INFO - Iter [800/20000]  lr: 1.000e-04, eta: 3:45:07, time: 0.704, data_time: 0.008, memory: 3440, ema_momentum: 0.9988, sup_loss_cls: 0.7590, sup_loss_bbox: 0.7876, sup_num_gts: 3.2000, unsup_loss_cls: 1.0558, unsup_loss_bbox: 1.0497, unsup_num_gts: 2.4350, unsup_gmm_thr: 0.2503, loss: 3.6520, grad_norm: 88.1772
2023-04-23 15:03:58,182 - mmdet.ssod - INFO - Iter [850/20000]  lr: 1.000e-04, eta: 3:44:42, time: 0.712, data_time: 0.007, memory: 3440, ema_momentum: 0.9988, sup_loss_cls: 0.6540, sup_loss_bbox: 0.7134, sup_num_gts: 3.2400, unsup_loss_cls: 1.3332, unsup_loss_bbox: 1.0061, unsup_num_gts: 2.4150, unsup_gmm_thr: 0.2490, loss: 3.7067, grad_norm: 131.4299
2023-04-23 15:04:33,755 - mmdet.ssod - INFO - Iter [900/20000]  lr: 1.000e-04, eta: 3:44:14, time: 0.711, data_time: 0.007, memory: 3440, ema_momentum: 0.9989, sup_loss_cls: 0.7472, sup_loss_bbox: 0.7279, sup_num_gts: 2.5200, unsup_loss_cls: 0.9732, unsup_loss_bbox: 0.9590, unsup_num_gts: 2.6350, unsup_gmm_thr: 0.2472, loss: 3.4072, grad_norm: 96.6955
2023-04-23 15:05:09,830 - mmdet.ssod - INFO - Iter [950/20000]  lr: 1.000e-04, eta: 3:43:56, time: 0.722, data_time: 0.008, memory: 3440, ema_momentum: 0.9989, sup_loss_cls: 0.6705, sup_loss_bbox: 0.7444, sup_num_gts: 3.1800, unsup_loss_cls: 0.9783, unsup_loss_bbox: 0.9558, unsup_num_gts: 2.5000, unsup_gmm_thr: 0.2619, loss: 3.3491, grad_norm: 74.7907
2023-04-23 15:05:45,591 - mmdet.ssod - INFO - Exp name: consistent_teacher_r50_fpn_coco_180k_1p.py
2023-04-23 15:05:45,592 - mmdet.ssod - INFO - Iter [1000/20000] lr: 1.000e-04, eta: 3:43:30, time: 0.715, data_time: 0.008, memory: 3440, ema_momentum: 0.9990, sup_loss_cls: 0.6728, sup_loss_bbox: 0.7353, sup_num_gts: 3.0800, unsup_loss_cls: 0.8120, unsup_loss_bbox: 0.8506, unsup_num_gts: 2.2550, unsup_gmm_thr: 0.2650, loss: 3.0707, grad_norm: 71.1792
Traceback (most recent call last):
  File "tools/train.py", line 198, in <module>
    main()
  File "tools/train.py", line 193, in main
    meta=meta,
  File "e:\object-detection\github\consist\consistentteacher\ssod\apis\train.py", line 209, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "D:\App\anaconda\envs\arknights\lib\site-packages\mmcv\runner\iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "D:\App\anaconda\envs\arknights\lib\site-packages\mmcv\runner\iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "D:\App\anaconda\envs\arknights\lib\site-packages\mmcv\runner\base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "e:\object-detection\github\consist\consistentteacher\ssod\utils\hooks\submodules_evaluation.py", line 37, in after_train_iter
    self._do_evaluate(runner)
  File "e:\object-detection\github\consist\consistentteacher\ssod\utils\hooks\submodules_evaluation.py", line 51, in _do_evaluate
    dist.broadcast(module.running_var, 0)
  File "D:\App\anaconda\envs\arknights\lib\site-packages\torch\distributed\distributed_c10d.py", line 1075, in broadcast
    default_pg = _get_default_group()
  File "D:\App\anaconda\envs\arknights\lib\site-packages\torch\distributed\distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

wandb: Waiting for W&B process to finish... (failed 1). Press Ctrl-C to abort syncing.
wandb:
wandb: 
wandb: Run history:
wandb:           bbox_gt_num ▂▁▄▃▅▃▃▅▆█▅▄▄▄▂▄▂▄▅▂
wandb:               gmm_thr ▁▃▃▅▄▅▆▆▆▇▆▆▇▇▇██▇▇█
wandb:         learning_rate ▁▂▃▃▄▅▆▆▇███████████
wandb:              momentum ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:            sup_gt_num ▂▂▃▅▃▇▃▄▂▂▁▂▁▁▇▃█▁▃▂
wandb:    train/ema_momentum ▁▅▆▇▇▇▇▇████████████
wandb:       train/grad_norm ▃▅▂▁▃▁▃▂▂▂▅▅▆▄▄▃█▄▂▂
wandb:            train/loss █▅▃▂▂▁▁▁▁▁▆▆▆▆▆▅▆▅▅▄
wandb:   train/sup_loss_bbox █▇▄▃▃▃▂▂▂▂▂▂▁▁▂▂▁▁▁▁
wandb:    train/sup_loss_cls █▅▃▃▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁
wandb:     train/sup_num_gts ▂▅▄▃▃▅▄▆▇▅█▁▁▂▅██▂█▇
wandb:   train/unsup_gmm_thr ▁▂▃▄▄▅▆▆▆▆▆▇▇▇▇▇▇▇██
wandb: train/unsup_loss_bbox ▁▁▁▁▁▁▁▁▁▁█████▇▇▇▇▆
wandb:  train/unsup_loss_cls ▁▁▁▁▁▁▁▁▁▁█▇▇█▇▇█▆▆▅
wandb:   train/unsup_num_gts ▁▂▄▅▇█▇▇▇█▇▇▇▇▇▇▇██▇
wandb:
wandb: Run summary:
wandb:           bbox_gt_num 1.0
wandb:               gmm_thr 0.26834
wandb:         learning_rate 0.0001
wandb:              momentum 0.9
wandb:            sup_gt_num 2.0
wandb:    train/ema_momentum 0.999
wandb:       train/grad_norm 71.17918
wandb:            train/loss 3.07069
wandb:   train/sup_loss_bbox 0.73531
wandb:    train/sup_loss_cls 0.67276
wandb:     train/sup_num_gts 3.08
wandb:   train/unsup_gmm_thr 0.26501
wandb: train/unsup_loss_bbox 0.8506
wandb:  train/unsup_loss_cls 0.81202
wandb:   train/unsup_num_gts 2.255
wandb:
wandb: Synced consistent_teacher_r50_fpn_coco_180k_1p: https://wandb.ai/joeyslvs/consistent-teacher/runs/1hfumpr6
wandb: Synced 6 W&B file(s), 4 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: .\wandb\run-20230423_145341-1hfumpr6\logs
Adamdad commented 1 year ago

Dear @joeyslv, this is a DDP problem, which I cannot figure out right now. You may want to provide more details about your run and environment. As I mentioned, I have never trained it with 1 GPU.

Best
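
For anyone who wants to try a single-GPU workaround, one option (untested here, and only a sketch) is to guard the broadcasts in `ssod/utils/hooks/submodules_evaluation.py` so they run only when a process group exists; the broadcasts appear to exist just to sync BatchNorm buffers across ranks, which is unnecessary on one card. The `running_mean` line is an assumption about the surrounding code, since the traceback only shows `running_var`:

```python
# Hypothetical guard around the failing call in submodules_evaluation.py.
# torch.distributed broadcasts only sync BN statistics across ranks;
# with a single GPU there is nothing to sync, so the calls can be
# skipped when no process group has been initialized.
import torch.distributed as dist

def maybe_broadcast_bn(module):
    if dist.is_available() and dist.is_initialized():
        dist.broadcast(module.running_var, 0)
        dist.broadcast(module.running_mean, 0)  # assumed to sit alongside running_var
    # else: single-process run, leave the local buffers as they are
```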

yuan738 commented 1 year ago

> Also, an error occurred during evaluation in the training phase. It seems that distributed evaluation is used, but I only have one graphics card. How can I modify it so that evaluation runs on a single GPU?
>
> RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Hello, I also met this error. Have you solved it? Thank you!

joeyslv commented 1 year ago

> Hello, I also met this error. Have you solved it? Thank you!

Yes, I eventually ran it successfully, but on a Linux system. This issue seems to be caused by mmdet's DDP support not working properly on Windows, so I recommend running the project on Linux and installing the mmdet and mmcv versions specified by the author:

    git clone https://github.com/open-mmlab/mmdetection.git
    git clone https://github.com/Adamdad/ConsistentTeacher.git
    cd ConsistentTeacher/
    make install

Directly installing the latest mmdet instead will result in many errors.

yuan738 commented 1 year ago

Thank you very much for your help. I am also running this code on Windows; I will try it on Ubuntu later. Thanks again!