fundamentalvision / Deformable-DETR

Deformable DETR: Deformable Transformers for End-to-End Object Detection.
Apache License 2.0

wrong results: ap=0 #115

Closed HanWangSJTU closed 2 years ago

HanWangSJTU commented 2 years ago

After several epochs, the AP is still close to 0.

Almost all settings are at their defaults, training on the COCO dataset.
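The run was presumably launched with the repo's standard distributed scripts, along these lines (a reconstruction inferred from `world_size=2` and `coco_path` in the log below, not the verbatim command):

```bash
# Reconstructed 2-GPU launch (matches world_size=2 in the log below);
# not the verbatim command used by the author
GPUS_PER_NODE=2 ./tools/run_dist_launch.sh 2 ./configs/r50_deformable_detr.sh \
    --coco_path /dataset/public/coco
```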

```
bash: /usr/local/miniconda3/lib/libtinfo.so.6: no version information available (required by bash)
Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, cache_mode=False,
    clip_max_norm=0.1, cls_loss_coef=2, coco_panoptic_path=None, coco_path='/dataset/public/coco',
    dataset_file='coco', dec_layers=6, dec_n_points=4, device='cuda', dice_loss_coef=1, dilation=False,
    dim_feedforward=1024, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1,
    enc_layers=6, enc_n_points=4, epochs=50, eval=False, focal_alpha=0.25, frozen_weights=None,
    giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0002, lr_backbone=2e-05,
    lr_backbone_names=['backbone.0'], lr_drop=40, lr_drop_epochs=None, lr_linear_proj_mult=0.1,
    lr_linear_proj_names=['reference_points', 'sampling_offsets'], mask_loss_coef=1, masks=False,
    nheads=8, num_feature_levels=4, num_queries=300, num_workers=2,
    output_dir='exps/r50_deformable_detr', position_embedding='sine',
    position_embedding_scale=6.283185307179586, rank=0, remove_difficult=False, resume='', seed=42,
    set_cost_bbox=5, set_cost_class=2, set_cost_giou=2, sgd=False, start_epoch=0, two_stage=False,
    weight_decay=0.0001, with_box_refine=False, world_size=2)
DeformableDETR(
  (transformer): DeformableTransformer(
    (encoder): DeformableTransformerEncoder(
      (layers): ModuleList(
        (0): DeformableTransformerEncoderLayer(
          (self_attn): MSDeformAttn(
            (sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
            (attention_weights): Linear(in_features=256, out_features=128, bias=True)
            (value_proj): Linear(in_features=256, out_features=256, bias=True)
            (output_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (dropout1): Dropout(p=0.1, inplace=False)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (linear1): Linear(in_features=256, out_features=1024, bias=True)
          (dropout2): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=1024, out_features=256, bias=True)
          (dropout3): Dropout(p=0.1, inplace=False)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        )
        (1)-(5): five more DeformableTransformerEncoderLayer blocks, identical to (0)
      )
    )
    (decoder): DeformableTransformerDecoder(
      (layers): ModuleList(
        (0): DeformableTransformerDecoderLayer(
          (cross_attn): MSDeformAttn(
            (sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
            (attention_weights): Linear(in_features=256, out_features=128, bias=True)
            (value_proj): Linear(in_features=256, out_features=256, bias=True)
            (output_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (dropout1): Dropout(p=0.1, inplace=False)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (dropout2): Dropout(p=0.1, inplace=False)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (linear1): Linear(in_features=256, out_features=1024, bias=True)
          (dropout3): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=1024, out_features=256, bias=True)
          (dropout4): Dropout(p=0.1, inplace=False)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        )
        (1)-(5): five more DeformableTransformerDecoderLayer blocks, identical to (0)
      )
    )
    (reference_points): Linear(in_features=256, out_features=2, bias=True)
  )
  (class_embed): ModuleList(
    (0)-(5): Linear(in_features=256, out_features=91, bias=True)
  )
  (bbox_embed): ModuleList(
    (0)-(5): MLP(
      (layers): ModuleList(
        (0): Linear(in_features=256, out_features=256, bias=True)
        (1): Linear(in_features=256, out_features=256, bias=True)
        (2): Linear(in_features=256, out_features=4, bias=True)
      )
    )
  )
  (query_embed): Embedding(300, 512)
  (input_proj): ModuleList(
    (0): Sequential(
      (0): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      (1): GroupNorm(32, 256, eps=1e-05, affine=True)
    )
    (1): Sequential(
      (0): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1))
      (1): GroupNorm(32, 256, eps=1e-05, affine=True)
    )
    (2): Sequential(
      (0): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
      (1): GroupNorm(32, 256, eps=1e-05, affine=True)
    )
    (3): Sequential(
      (0): Conv2d(2048, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (1): GroupNorm(32, 256, eps=1e-05, affine=True)
    )
  )
  (backbone): Joiner(
    (0): Backbone(
      (body): IntermediateLayerGetter(
        standard ResNet-50 trunk with FrozenBatchNorm2d
        (conv1/bn1/relu/maxpool, layer1-layer4 Bottleneck stacks; full printout elided)
      )
    )
    (1): PositionEmbeddingSine()
  )
)
number of params: 39847265
loading annotations into memory...
Done (t=14.77s)
creating index...
index created!
loading annotations into memory...
Done (t=0.50s)
creating index...
index created!

named parameters (condensed with brace expansion; the original log lists each name individually):
  transformer.level_embed
  transformer.encoder.layers.{0..5}.self_attn.{sampling_offsets,attention_weights,value_proj,output_proj}.{weight,bias}
  transformer.encoder.layers.{0..5}.{norm1,linear1,linear2,norm2}.{weight,bias}
  transformer.decoder.layers.{0..5}.cross_attn.{sampling_offsets,attention_weights,value_proj,output_proj}.{weight,bias}
  transformer.decoder.layers.{0..5}.self_attn.{in_proj_weight,in_proj_bias,out_proj.weight,out_proj.bias}
  transformer.decoder.layers.{0..5}.{norm1,norm2,linear1,linear2,norm3}.{weight,bias}
  transformer.reference_points.{weight,bias}
  class_embed.0.{weight,bias}
  bbox_embed.0.layers.{0,1,2}.{weight,bias}
  query_embed.weight
  input_proj.{0..3}.{0,1}.{weight,bias}
  backbone.0.body.conv1.weight
  backbone.0.body.layer{1..4}.*.{conv1,conv2,conv3}.weight (plus the layer*.0.downsample.0.weight convs)

Start training
Epoch: [0]  [    0/29572]  eta: 4:32:47  lr: 0.000200  class_error: 100.00  grad_norm: 78.99
  loss: 40.0816 (40.0816)
  loss_bbox: 2.8369 (2.8369)  loss_bbox_0: 2.9144 (2.9144)  loss_bbox_1: 2.8846 (2.8846)
  loss_bbox_2: 2.9020 (2.9020)  loss_bbox_3: 2.8461 (2.8461)  loss_bbox_4: 2.8331 (2.8331)
  loss_ce: 2.2777 (2.2777)  loss_ce_0: 2.0421 (2.0421)  loss_ce_1: 2.1291 (2.1291)
  loss_ce_2: 2.0680 (2.0680)  loss_ce_3: 2.2875 (2.2875)  loss_ce_4: 2.2155 (2.2155)
  loss_giou: 1.6408 (1.6408)  loss_giou_0: 1.6408 (1.6408)  loss_giou_1: 1.6408 (1.6408)
  loss_giou_2: 1.6408 (1.6408)  loss_giou_3: 1.6408 (1.6408)  loss_giou_4: 1.6408 (1.6408)
  cardinality_error_unscaled: 296.2500 (296.2500)  cardinality_error_0_unscaled: 295.5000 (295.5000)
  cardinality_error_1_unscaled: 296.2500 (296.2500)  cardinality_error_2_unscaled: 296.2500 (296.2500)
  cardinality_error_3_unscaled: 296.2500 (296.2500)  cardinality_error_4_unscaled: 296.2500 (296.2500)
  class_error_unscaled: 100.0000 (100.0000)
  loss_bbox_unscaled: 0.5674 (0.5674)  loss_bbox_0_unscaled: 0.5829 (0.5829)
  loss_bbox_1_unscaled: 0.5769 (0.5769)  loss_bbox_2_unscaled: 0.5804 (0.5804)
  loss_bbox_3_unscaled: 0.5692 (0.5692)  loss_bbox_4_unscaled: 0.5666 (0.5666)
  loss_ce_unscaled: 1.1388 (1.1388)  loss_ce_0_unscaled: 1.0210 (1.0210)
  loss_ce_1_unscaled: 1.0646 (1.0646)  loss_ce_2_unscaled: 1.0340 (1.0340)
  loss_ce_3_unscaled: 1.1438 (1.1438)  loss_ce_4_unscaled: 1.1078 (1.1078)
  loss_giou_unscaled: 0.8204 (0.8204)  loss_giou_0_unscaled: 0.8204 (0.8204)
  loss_giou_1_unscaled: 0.8204 (0.8204)  loss_giou_2_unscaled: 0.8204 (0.8204)
  loss_giou_3_unscaled: 0.8204 (0.8204)  loss_giou_4_unscaled: 0.8204 (0.8204)
  time: 0.5535  data: 0.0000  max mem: 4316
```
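Note that the log shows world_size=2 and batch_size=2 (total batch 4) with the default lr=2e-4, while the released configs pair that lr with an 8-GPU run. If the learning rate follows the usual linear scaling rule, a smaller total batch would call for a proportionally smaller lr. A minimal sketch, assuming a reference total batch of 16 (8 GPUs x 2 images; that reference size is an assumption, not something stated in this log):

```python
def scaled_lr(base_lr: float, total_batch: int, reference_batch: int = 16) -> float:
    """Linear scaling rule: lr proportional to the total (summed) batch size.

    reference_batch=16 assumes the reference setup is 8 GPUs x batch_size 2;
    that is an assumption about the released config, not read from this log.
    """
    return base_lr * total_batch / reference_batch

# This run: world_size=2, batch_size=2 per GPU -> total batch 4
print(scaled_lr(2e-4, total_batch=4))  # 5e-05
```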

HanWangSJTU commented 2 years ago

```
Accumulating evaluation results...
DONE (t=8.49s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.008
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.010
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.010
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.017
```
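For anyone reproducing this check: evaluation can be re-run on a saved checkpoint with the --eval and --resume flags that appear in the Namespace above (the checkpoint path below is a placeholder):

```bash
# Re-run COCO evaluation on a saved checkpoint (path is a placeholder)
./configs/r50_deformable_detr.sh --eval \
    --resume exps/r50_deformable_detr/checkpoint.pth \
    --coco_path /dataset/public/coco
```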

HanWangSJTU commented 2 years ago

I fixed it by dropping the learning rate.
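In this repo that just means passing smaller --lr / --lr_backbone values at launch, for example (the values below are illustrative; the exact ones used here are not stated):

```bash
# Illustrative values only; the issue does not record the exact lr used
GPUS_PER_NODE=2 ./tools/run_dist_launch.sh 2 ./configs/r50_deformable_detr.sh \
    --lr 5e-5 --lr_backbone 5e-6
```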

kinredon commented 2 years ago

Why does this happen?

zgw7297 commented 1 year ago

I met the same problem when training on my own dataset. I found that class_error stays at 100 during training; I think that may be why AP = 0, but I don't know why it happens.
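For context, class_error in DETR-style training logs is 100 minus the top-1 classification accuracy over the decoder queries that the Hungarian matcher paired with ground-truth boxes, so a constant 100 means no matched query ever predicts the correct class. A simplified sketch of that computation (identifiers are illustrative, not the repo's exact code):

```python
import torch

def class_error(matched_logits: torch.Tensor, target_classes: torch.Tensor) -> float:
    """100 - top-1 accuracy on matched queries (DETR-style logging metric).

    matched_logits: [num_matched, num_classes] logits for the queries the
    Hungarian matcher assigned to ground-truth objects.
    target_classes: [num_matched] ground-truth class indices.
    """
    pred = matched_logits.argmax(dim=-1)           # top-1 predicted class
    acc = (pred == target_classes).float().mean()  # fraction correct
    return 100.0 * (1.0 - acc.item())
```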

bkkm78 commented 1 year ago

@HanWangSJTU Hi, would you mind sharing the learning rates you used in your experiments? Were they scaled linearly with the batch size? Thanks!