czczup / ViT-Adapter

[ICLR 2023 Spotlight] Vision Transformer Adapter for Dense Predictions
https://arxiv.org/abs/2205.08534
Apache License 2.0

Training VIT-Adapter on custom dataset #100

Open goodmayonnaise opened 1 year ago

goodmayonnaise commented 1 year ago

I'm training the mask2former_beit_adapter_large_896_80k_ms model on a custom dataset, and I have two issues that I'm having a hard time figuring out.

  1. When I run train.py, validation after each epoch is much slower than training, which seemed odd to me; I'm not sure what the problem is.
  2. To work around the first problem I made some changes to the config file, and now validation is skipped entirely during training, even though I did not pass the --no-validate option on the CLI.

P.S. I'm training on A100 GPUs.
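For context on the first issue: the test pipeline in the config below uses MultiScaleFlipAug with seven img_ratios and flip=True, i.e. 14 augmented views per image, each evaluated with sliding-window inference (896x896 crops, stride 512), so validation being much slower than training is expected. A minimal single-scale sketch of a lighter validation pipeline (all values are copied from the posted config; whether single-scale accuracy is acceptable for your dataset is an assumption):

```python
# Hypothetical single-scale validation pipeline: drops the six extra
# scales and the horizontal flip, cutting forward passes per image
# roughly 14x at some cost in mIoU. Values mirror the posted config.
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(3678, 1110),
        flip=False,  # single scale, no flip: one view per image
        transforms=[
            dict(
                type='SETR_Resize',
                keep_ratio=True,
                crop_size=(896, 896),
                setr_multi_scale=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
```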

Here's my code, please help me out!

[screenshot: validation (test.py) output]

Log file (I did not pass the --no-validate option on the CLI):

{"env_info": "sys.platform: linux\nPython: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]\nCUDA available: True\nGPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB\nCUDA_HOME: /usr/local/cuda\nNVCC: Build cuda_11.1.TC455_06.29190527_0\nGCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0\nPyTorch: 1.9.0+cu111\nPyTorch compiling details: PyTorch built with:\n  - GCC 7.3\n  - C++ Version: 201402\n  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications\n  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)\n  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n  - NNPACK is enabled\n  - CPU capability usage: AVX2\n  - CUDA Runtime 11.1\n  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n  - CuDNN 8.0.5\n  - Magma 2.5.2\n  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, 
LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, \n\nTorchVision: 0.10.0+cu111\nOpenCV: 4.7.0\nMMCV: 1.4.2\nMMCV Compiler: GCC 7.3\nMMCV CUDA Compiler: 11.1\nMMSegmentation: 0.20.2+", "seed": 1493696795, "exp_name": "mask2former_beit_adapter_large_896_80k_kitti_ms.py", "mmseg_version": "0.20.2+", "config": "num_things_classes = 0\nnum_stuff_classes = 20\nnum_classes = 20\nnorm_cfg = dict(type='SyncBN', requires_grad=True)\nmodel = dict(\n    type='EncoderDecoderMask2FormerAug',\n    pretrained='pretrained/beit_large_patch16_224_pt22k_ft22k.pth',\n    backbone=dict(\n        type='BEiTAdapter',\n        patch_size=16,\n        embed_dim=1024,\n        depth=24,\n        num_heads=16,\n        mlp_ratio=4,\n        qkv_bias=True,\n        use_abs_pos_emb=False,\n        use_rel_pos_bias=True,\n        img_size=896,\n        init_values=1e-06,\n        drop_path_rate=0.3,\n        conv_inplane=64,\n        n_points=4,\n        deform_num_heads=16,\n        cffn_ratio=0.25,\n        deform_ratio=0.5,\n        with_cp=True,\n        interaction_indexes=[[0, 5], [6, 11], [12, 17], [18, 23]],\n        pretrained='pretrained/beit_large_patch16_224_pt22k_ft22k.pth'),\n    decode_head=dict(\n        type='Mask2FormerHead',\n        in_channels=[1024, 1024, 1024, 1024],\n        feat_channels=1024,\n        out_channels=1024,\n        in_index=[0, 1, 2, 3],\n        num_things_classes=0,\n        num_stuff_classes=20,\n        num_queries=100,\n        num_transformer_feat_level=3,\n        pixel_decoder=dict(\n            type='MSDeformAttnPixelDecoder',\n            num_outs=3,\n            norm_cfg=dict(type='GN', num_groups=32),\n            act_cfg=dict(type='ReLU'),\n            encoder=dict(\n                type='DetrTransformerEncoder',\n                
num_layers=6,\n                transformerlayers=dict(\n                    type='BaseTransformerLayer',\n                    attn_cfgs=dict(\n                        type='MultiScaleDeformableAttention',\n                        embed_dims=1024,\n                        num_heads=32,\n                        num_levels=3,\n                        num_points=4,\n                        im2col_step=64,\n                        dropout=0.0,\n                        batch_first=False,\n                        norm_cfg=None,\n                        init_cfg=None),\n                    ffn_cfgs=dict(\n                        type='FFN',\n                        embed_dims=1024,\n                        feedforward_channels=4096,\n                        num_fcs=2,\n                        ffn_drop=0.0,\n                        act_cfg=dict(type='ReLU', inplace=True),\n                        with_cp=True),\n                    operation_order=('self_attn', 'norm', 'ffn', 'norm')),\n                init_cfg=None),\n            positional_encoding=dict(\n                type='SinePositionalEncoding', num_feats=512, normalize=True),\n            init_cfg=None),\n        enforce_decoder_input_project=False,\n        positional_encoding=dict(\n            type='SinePositionalEncoding', num_feats=512, normalize=True),\n        transformer_decoder=dict(\n            type='DetrTransformerDecoder',\n            return_intermediate=True,\n            num_layers=9,\n            transformerlayers=dict(\n                type='DetrTransformerDecoderLayer',\n                attn_cfgs=dict(\n                    type='MultiheadAttention',\n                    embed_dims=1024,\n                    num_heads=32,\n                    attn_drop=0.0,\n                    proj_drop=0.0,\n                    dropout_layer=None,\n                    batch_first=False),\n                ffn_cfgs=dict(\n                    embed_dims=1024,\n                    feedforward_channels=4096,\n        
            num_fcs=2,\n                    act_cfg=dict(type='ReLU', inplace=True),\n                    ffn_drop=0.0,\n                    dropout_layer=None,\n                    add_identity=True,\n                    with_cp=True),\n                feedforward_channels=4096,\n                operation_order=('cross_attn', 'norm', 'self_attn', 'norm',\n                                 'ffn', 'norm')),\n            init_cfg=None),\n        loss_cls=dict(\n            type='CrossEntropyLoss',\n            use_sigmoid=False,\n            loss_weight=2.0,\n            reduction='mean',\n            class_weight=[\n                1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,\n                1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.1\n            ]),\n        loss_mask=dict(\n            type='CrossEntropyLoss',\n            use_sigmoid=True,\n            reduction='mean',\n            loss_weight=5.0),\n        loss_dice=dict(\n            type='DiceLoss',\n            use_sigmoid=True,\n            activate=True,\n            reduction='mean',\n            naive_dice=True,\n            eps=1.0,\n            loss_weight=5.0),\n        train_cfg=dict(\n            num_points=12544,\n            oversample_ratio=3.0,\n            importance_sample_ratio=0.75,\n            assigner=dict(\n                type='MaskHungarianAssigner',\n                cls_cost=dict(type='ClassificationCost', weight=2.0),\n                mask_cost=dict(\n                    type='CrossEntropyLossCost', weight=5.0, use_sigmoid=True),\n                dice_cost=dict(\n                    type='DiceCost', weight=5.0, pred_act=True, eps=1.0)),\n            sampler=dict(type='MaskPseudoSampler')),\n        test_cfg=dict(\n            panoptic_on=False,\n            semantic_on=True,\n            instance_on=False,\n            max_per_image=100,\n            iou_thr=0.8,\n            filter_low_score=True,\n            mode='slide',\n            crop_size=(896, 896),\n       
     stride=(512, 512))),\n    train_cfg=dict(\n        num_points=12544,\n        oversample_ratio=3.0,\n        importance_sample_ratio=0.75,\n        assigner=dict(\n            type='MaskHungarianAssigner',\n            cls_cost=dict(type='ClassificationCost', weight=2.0),\n            mask_cost=dict(\n                type='CrossEntropyLossCost', weight=5.0, use_sigmoid=True),\n            dice_cost=dict(\n                type='DiceCost', weight=5.0, pred_act=True, eps=1.0)),\n        sampler=dict(type='MaskPseudoSampler')),\n    test_cfg=dict(\n        panoptic_on=False,\n        semantic_on=True,\n        instance_on=False,\n        max_per_image=100,\n        iou_thr=0.8,\n        filter_low_score=True,\n        mode='slide',\n        crop_size=(896, 896),\n        stride=(512, 512)),\n    init_cfg=None)\ndataset_type = 'KITTIDataset'\ndata_root = 'data/kitti/'\nimg_norm_cfg = dict(\n    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)\ncrop_size = (896, 896)\ntrain_pipeline = [\n    dict(type='LoadImageFromFile'),\n    dict(type='LoadAnnotations'),\n    dict(type='Resize', img_scale=(3678, 1110), ratio_range=(0.5, 2.0)),\n    dict(type='RandomCrop', crop_size=(896, 896), cat_max_ratio=0.75),\n    dict(type='RandomFlip', prob=0.5),\n    dict(type='PhotoMetricDistortion'),\n    dict(\n        type='Normalize',\n        mean=[123.675, 116.28, 103.53],\n        std=[58.395, 57.12, 57.375],\n        to_rgb=True),\n    dict(type='Pad', size=(896, 896), pad_val=0, seg_pad_val=255),\n    dict(type='ToMask'),\n    dict(type='DefaultFormatBundle'),\n    dict(\n        type='Collect',\n        keys=['img', 'gt_semantic_seg', 'gt_masks', 'gt_labels'])\n]\ntest_pipeline = [\n    dict(type='LoadImageFromFile'),\n    dict(\n        type='MultiScaleFlipAug',\n        img_scale=(3678, 1110),\n        img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0],\n        flip=True,\n        transforms=[\n            dict(\n                type='SETR_Resize',\n  
              keep_ratio=True,\n                crop_size=(896, 896),\n                setr_multi_scale=True),\n            dict(type='RandomFlip'),\n            dict(\n                type='Normalize',\n                mean=[123.675, 116.28, 103.53],\n                std=[58.395, 57.12, 57.375],\n                to_rgb=True),\n            dict(type='ImageToTensor', keys=['img']),\n            dict(type='Collect', keys=['img'])\n        ])\n]\ndata = dict(\n    samples_per_gpu=4,\n    workers_per_gpu=4,\n    train=dict(\n        type='KITTIDataset',\n        data_root='data/kitti/',\n        img_dir='img_dir/train',\n        ann_dir='ann_dir/train',\n        pipeline=[\n            dict(type='LoadImageFromFile'),\n            dict(type='LoadAnnotations'),\n            dict(\n                type='Resize', img_scale=(3678, 1110), ratio_range=(0.5, 2.0)),\n            dict(type='RandomCrop', crop_size=(896, 896), cat_max_ratio=0.75),\n            dict(type='RandomFlip', prob=0.5),\n            dict(type='PhotoMetricDistortion'),\n            dict(\n                type='Normalize',\n                mean=[123.675, 116.28, 103.53],\n                std=[58.395, 57.12, 57.375],\n                to_rgb=True),\n            dict(type='Pad', size=(896, 896), pad_val=0, seg_pad_val=255),\n            dict(type='ToMask'),\n            dict(type='DefaultFormatBundle'),\n            dict(\n                type='Collect',\n                keys=['img', 'gt_semantic_seg', 'gt_masks', 'gt_labels'])\n        ],\n        img_suffix='.png',\n        seg_map_suffix='.png'),\n    val=dict(\n        type='KITTIDataset',\n        data_root='data/kitti/',\n        img_dir='img_dir/val',\n        ann_dir='ann_dir/val',\n        pipeline=[\n            dict(type='LoadImageFromFile'),\n            dict(\n                type='MultiScaleFlipAug',\n                img_scale=(3678, 1110),\n                img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0],\n                flip=True,\n            
    transforms=[\n                    dict(\n                        type='SETR_Resize',\n                        keep_ratio=True,\n                        crop_size=(896, 896),\n                        setr_multi_scale=True),\n                    dict(type='RandomFlip'),\n                    dict(\n                        type='Normalize',\n                        mean=[123.675, 116.28, 103.53],\n                        std=[58.395, 57.12, 57.375],\n                        to_rgb=True),\n                    dict(type='ImageToTensor', keys=['img']),\n                    dict(type='Collect', keys=['img'])\n                ])\n        ]),\n    test=dict(\n        type='KITTIDataset',\n        data_root='data/kitti/',\n        img_dir='img_dir/val',\n        ann_dir='ann_dir/val',\n        pipeline=[\n            dict(type='LoadImageFromFile'),\n            dict(\n                type='MultiScaleFlipAug',\n                img_scale=(3678, 1110),\n                img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0],\n                flip=True,\n                transforms=[\n                    dict(\n                        type='SETR_Resize',\n                        keep_ratio=True,\n                        crop_size=(896, 896),\n                        setr_multi_scale=True),\n                    dict(type='RandomFlip'),\n                    dict(\n                        type='Normalize',\n                        mean=[123.675, 116.28, 103.53],\n                        std=[58.395, 57.12, 57.375],\n                        to_rgb=True),\n                    dict(type='ImageToTensor', keys=['img']),\n                    dict(type='Collect', keys=['img'])\n                ])\n        ]))\nlog_config = dict(\n    interval=50,\n    hooks=[\n        dict(type='TextLoggerHook', by_epoch=True),\n        dict(type='TensorboardLoggerHook')\n    ])\ndist_params = dict(backend='nccl')\nlog_level = 'INFO'\nload_from = 
'pretrained/beit_large_patch16_224_pt22k_ft22k.pth'\nresume_from = None\nworkflow = [('train', 1)]\ncudnn_benchmark = True\noptimizer = dict(\n    type='AdamW',\n    lr=2e-05,\n    betas=(0.9, 0.999),\n    weight_decay=0.05,\n    constructor='LayerDecayOptimizerConstructor',\n    paramwise_cfg=dict(num_layers=24, layer_decay_rate=0.9))\noptimizer_config = dict()\nlr_config = dict(\n    policy='poly',\n    warmup='linear',\n    warmup_iters=1500,\n    warmup_ratio=1e-06,\n    power=1.0,\n    min_lr=0.0,\n    by_epoch=True)\nrunner = dict(type='EpochBasedRunner', max_epochs=500)\ncheckpoint_config = dict(by_epoch=True, interval=1000, max_keep_ckpts=1)\nevaluation = dict(\n    interval=1000, metric='mIoU', pre_eval=True, save_best='mIoU')\npretrained = 'pretrained/beit_large_patch16_224_pt22k_ft22k.pth'\nwork_dir = './work_dirs/mask2former_beit_adapter_large_896_80k_kitti_ms'\ngpu_ids = range(0, 8)\nauto_resume = False\nseed = 1493696795\n", "CLASSES": ["unlabeled", "car", "bicycle", "motorcycle", "truck", "other-vehicle", "person", "bicyclist", "motorcyclist", "road", "parking", "sidewalk", "other-ground", "building", "fence", "vegetation", "trunk", "terrain", "pole", "traffic-sign"], "PALETTE": [[0, 0, 0], [245, 150, 100], [245, 230, 100], [150, 60, 30], [180, 30, 80], [255, 0, 0], [30, 30, 255], [200, 40, 255], [90, 30, 150], [255, 0, 255], [255, 150, 255], [75, 0, 75], [75, 0, 175], [0, 200, 255], [50, 120, 255], [0, 175, 0], [0, 60, 135], [80, 240, 150], [150, 240, 255], [0, 0, 255]], "hook_msgs": {}}
{"mode": "train", "epoch": 1, "iter": 50, "lr": 0.0, "memory": 49619, "data_time": 0.74881, "decode.loss_cls": 5.78103, "decode.loss_mask": 2.65498, "decode.loss_dice": 4.36656, "decode.d0.loss_cls": 5.93533, "decode.d0.loss_mask": 1.84931, "decode.d0.loss_dice": 4.02757, "decode.d1.loss_cls": 6.42408, "decode.d1.loss_mask": 1.83262, "decode.d1.loss_dice": 4.08452, "decode.d2.loss_cls": 6.05425, "decode.d2.loss_mask": 2.08087, "decode.d2.loss_dice": 4.14527, "decode.d3.loss_cls": 5.82976, "decode.d3.loss_mask": 2.30131, "decode.d3.loss_dice": 4.17073, "decode.d4.loss_cls": 5.95528, "decode.d4.loss_mask": 2.03955, "decode.d4.loss_dice": 4.48296, "decode.d5.loss_cls": 5.27095, "decode.d5.loss_mask": 2.69255, "decode.d5.loss_dice": 4.27344, "decode.d6.loss_cls": 4.78678, "decode.d6.loss_mask": 2.18379, "decode.d6.loss_dice": 4.48807, "decode.d7.loss_cls": 4.71009, "decode.d7.loss_mask": 2.44317, "decode.d7.loss_dice": 4.37645, "decode.d8.loss_cls": 5.08187, "decode.d8.loss_mask": 2.65238, "decode.d8.loss_dice": 4.35125, "loss": 121.32677, "time": 5.81762}
{"mode": "train", "epoch": 1, "iter": 100, "lr": 0.0, "memory": 49619, "data_time": 0.18719, "decode.loss_cls": 3.89394, "decode.loss_mask": 2.11898, "decode.loss_dice": 4.4035, "decode.d0.loss_cls": 5.89782, "decode.d0.loss_mask": 1.70889, "decode.d0.loss_dice": 4.02032, "decode.d1.loss_cls": 4.76334, "decode.d1.loss_mask": 1.6848, "decode.d1.loss_dice": 4.06056, "decode.d2.loss_cls": 4.13349, "decode.d2.loss_mask": 1.71326, "decode.d2.loss_dice": 4.12486, "decode.d3.loss_cls": 3.81226, "decode.d3.loss_mask": 1.78622, "decode.d3.loss_dice": 4.17997, "decode.d4.loss_cls": 3.94365, "decode.d4.loss_mask": 1.85984, "decode.d4.loss_dice": 4.28701, "decode.d5.loss_cls": 3.84702, "decode.d5.loss_mask": 1.94485, "decode.d5.loss_dice": 4.3224, "decode.d6.loss_cls": 3.87003, "decode.d6.loss_mask": 2.0017, "decode.d6.loss_dice": 4.37135, "decode.d7.loss_cls": 3.81229, "decode.d7.loss_mask": 2.07599, "decode.d7.loss_dice": 4.39882, "decode.d8.loss_cls": 3.72994, "decode.d8.loss_mask": 2.08757, "decode.d8.loss_dice": 4.39755, "loss": 103.25221, "time": 4.36444}
{"mode": "train", "epoch": 1, "iter": 150, "lr": 0.0, "memory": 49619, "data_time": 0.18641, "decode.loss_cls": 3.59433, "decode.loss_mask": 2.02743, "decode.loss_dice": 4.39346, "decode.d0.loss_cls": 5.85289, "decode.d0.loss_mask": 1.6623, "decode.d0.loss_dice": 4.00734, "decode.d1.loss_cls": 3.74703, "decode.d1.loss_mask": 1.61941, "decode.d1.loss_dice": 4.03069, "decode.d2.loss_cls": 3.63682, "decode.d2.loss_mask": 1.61777, "decode.d2.loss_dice": 4.0665, "decode.d3.loss_cls": 3.51041, "decode.d3.loss_mask": 1.59669, "decode.d3.loss_dice": 4.12162, "decode.d4.loss_cls": 3.5813, "decode.d4.loss_mask": 1.59959, "decode.d4.loss_dice": 4.19146, "decode.d5.loss_cls": 3.58616, "decode.d5.loss_mask": 1.66779, "decode.d5.loss_dice": 4.24156, "decode.d6.loss_cls": 3.61807, "decode.d6.loss_mask": 1.80559, "decode.d6.loss_dice": 4.30017, "decode.d7.loss_cls": 3.60496, "decode.d7.loss_mask": 1.9178, "decode.d7.loss_dice": 4.36065, "decode.d8.loss_cls": 3.56945, "decode.d8.loss_mask": 1.96623, "decode.d8.loss_dice": 4.37825, "loss": 97.87369, "time": 4.36769}
{"mode": "train", "epoch": 1, "iter": 200, "lr": 0.0, "memory": 49619, "data_time": 0.1961, "decode.loss_cls": 3.51842, "decode.loss_mask": 1.78264, "decode.loss_dice": 4.30783, "decode.d0.loss_cls": 5.86351, "decode.d0.loss_mask": 1.64749, "decode.d0.loss_dice": 3.98966, "decode.d1.loss_cls": 3.56313, "decode.d1.loss_mask": 1.62145, "decode.d1.loss_dice": 4.00288, "decode.d2.loss_cls": 3.4678, "decode.d2.loss_mask": 1.61147, "decode.d2.loss_dice": 4.0307, "decode.d3.loss_cls": 3.3953, "decode.d3.loss_mask": 1.58056, "decode.d3.loss_dice": 4.06868, "decode.d4.loss_cls": 3.46396, "decode.d4.loss_mask": 1.55564, "decode.d4.loss_dice": 4.11005, "decode.d5.loss_cls": 3.47837, "decode.d5.loss_mask": 1.53882, "decode.d5.loss_dice": 4.14166, "decode.d6.loss_cls": 3.49903, "decode.d6.loss_mask": 1.54475, "decode.d6.loss_dice": 4.1846, "decode.d7.loss_cls": 3.50915, "decode.d7.loss_mask": 1.58132, "decode.d7.loss_dice": 4.23413, "decode.d8.loss_cls": 3.50819, "decode.d8.loss_mask": 1.67311, "decode.d8.loss_dice": 4.26321, "loss": 94.73751, "time": 4.36105}
{"mode": "train", "epoch": 1, "iter": 250, "lr": 0.0, "memory": 49619, "data_time": 0.20021, "decode.loss_cls": 3.42857, "decode.loss_mask": 1.54597, "decode.loss_dice": 4.15327, "decode.d0.loss_cls": 5.84628, "decode.d0.loss_mask": 1.65803, "decode.d0.loss_dice": 3.95176, "decode.d1.loss_cls": 3.33193, "decode.d1.loss_mask": 1.64655, "decode.d1.loss_dice": 3.94677, "decode.d2.loss_cls": 3.20524, "decode.d2.loss_mask": 1.61957, "decode.d2.loss_dice": 3.96398, "decode.d3.loss_cls": 3.14096, "decode.d3.loss_mask": 1.61315, "decode.d3.loss_dice": 3.96361, "decode.d4.loss_cls": 3.23839, "decode.d4.loss_mask": 1.60125, "decode.d4.loss_dice": 3.9655, "decode.d5.loss_cls": 3.28644, "decode.d5.loss_mask": 1.58352, "decode.d5.loss_dice": 4.01075, "decode.d6.loss_cls": 3.32505, "decode.d6.loss_mask": 1.56798, "decode.d6.loss_dice": 4.06173, "decode.d7.loss_cls": 3.38733, "decode.d7.loss_mask": 1.52764, "decode.d7.loss_dice": 4.10279, "decode.d8.loss_cls": 3.41044, "decode.d8.loss_mask": 1.5248, "decode.d8.loss_dice": 4.12979, "loss": 91.73903, "time": 4.35106}
{"mode": "train", "epoch": 1, "iter": 300, "lr": 0.0, "memory": 49619, "data_time": 0.19636, "decode.loss_cls": 3.06032, "decode.loss_mask": 1.57261, "decode.loss_dice": 4.03596, "decode.d0.loss_cls": 5.82999, "decode.d0.loss_mask": 1.65725, "decode.d0.loss_dice": 3.92071, "decode.d1.loss_cls": 3.09874, "decode.d1.loss_mask": 1.65198, "decode.d1.loss_dice": 3.86976, "decode.d2.loss_cls": 2.8274, "decode.d2.loss_mask": 1.64628, "decode.d2.loss_dice": 3.8778, "decode.d3.loss_cls": 2.66191, "decode.d3.loss_mask": 1.66284, "decode.d3.loss_dice": 3.88709, "decode.d4.loss_cls": 2.7283, "decode.d4.loss_mask": 1.65725, "decode.d4.loss_dice": 3.8937, "decode.d5.loss_cls": 2.69812, "decode.d5.loss_mask": 1.65295, "decode.d5.loss_dice": 3.91637, "decode.d6.loss_cls": 2.71295, "decode.d6.loss_mask": 1.6479, "decode.d6.loss_dice": 3.93254, "decode.d7.loss_cls": 2.84047, "decode.d7.loss_mask": 1.61654, "decode.d7.loss_dice": 3.96706, "decode.d8.loss_cls": 2.97224, "decode.d8.loss_mask": 1.58976, "decode.d8.loss_dice": 3.96488, "loss": 87.05168, "time": 4.3693}
{"mode": "train", "epoch": 1, "iter": 350, "lr": 0.0, "memory": 49619, "data_time": 0.1893, "decode.loss_cls": 2.46089, "decode.loss_mask": 1.6875, "decode.loss_dice": 3.93899, "decode.d0.loss_cls": 5.83859, "decode.d0.loss_mask": 1.69471, "decode.d0.loss_dice": 3.87156, "decode.d1.loss_cls": 2.77507, "decode.d1.loss_mask": 1.69211, "decode.d1.loss_dice": 3.80459, "decode.d2.loss_cls": 2.2749, "decode.d2.loss_mask": 1.71314, "decode.d2.loss_dice": 3.85385, "decode.d3.loss_cls": 1.98652, "decode.d3.loss_mask": 1.73945, "decode.d3.loss_dice": 3.87934, "decode.d4.loss_cls": 1.94735, "decode.d4.loss_mask": 1.74849, "decode.d4.loss_dice": 3.89417, "decode.d5.loss_cls": 1.93345, "decode.d5.loss_mask": 1.74649, "decode.d5.loss_dice": 3.90114, "decode.d6.loss_cls": 1.93596, "decode.d6.loss_mask": 1.75009, "decode.d6.loss_dice": 3.92255, "decode.d7.loss_cls": 2.08849, "decode.d7.loss_mask": 1.74349, "decode.d7.loss_dice": 3.9268, "decode.d8.loss_cls": 2.1817, "decode.d8.loss_mask": 1.72321, "decode.d8.loss_dice": 3.92333, "loss": 81.57794, "time": 4.35731}
{"mode": "train", "epoch": 1, "iter": 400, "lr": 0.0, "memory": 49619, "data_time": 0.19871, "decode.loss_cls": 1.65818, "decode.loss_mask": 1.74892, "decode.loss_dice": 3.87689, "decode.d0.loss_cls": 5.84071, "decode.d0.loss_mask": 1.69882, "decode.d0.loss_dice": 3.84848, "decode.d1.loss_cls": 2.33555, "decode.d1.loss_mask": 1.72077, "decode.d1.loss_dice": 3.81969, "decode.d2.loss_cls": 1.68194, "decode.d2.loss_mask": 1.7731, "decode.d2.loss_dice": 3.85606, "decode.d3.loss_cls": 1.33154, "decode.d3.loss_mask": 1.80092, "decode.d3.loss_dice": 3.87345, "decode.d4.loss_cls": 1.22856, "decode.d4.loss_mask": 1.80271, "decode.d4.loss_dice": 3.8726, "decode.d5.loss_cls": 1.24241, "decode.d5.loss_mask": 1.80771, "decode.d5.loss_dice": 3.87252, "decode.d6.loss_cls": 1.30358, "decode.d6.loss_mask": 1.79703, "decode.d6.loss_dice": 3.88261, "decode.d7.loss_cls": 1.43455, "decode.d7.loss_mask": 1.77436, "decode.d7.loss_dice": 3.86279, "decode.d8.loss_cls": 1.46029, "decode.d8.loss_mask": 1.76547, "decode.d8.loss_dice": 3.87178, "loss": 75.84399, "time": 4.35396}
{"mode": "train", "epoch": 1, "iter": 450, "lr": 0.0, "memory": 49619, "data_time": 0.19515, "decode.loss_cls": 1.13367, "decode.loss_mask": 1.7836, "decode.loss_dice": 3.84949, "decode.d0.loss_cls": 5.83513, "decode.d0.loss_mask": 1.7058, "decode.d0.loss_dice": 3.81891, "decode.d1.loss_cls": 1.90011, "decode.d1.loss_mask": 1.75705, "decode.d1.loss_dice": 3.81178, "decode.d2.loss_cls": 1.22854, "decode.d2.loss_mask": 1.81106, "decode.d2.loss_dice": 3.83813, "decode.d3.loss_cls": 0.84495, "decode.d3.loss_mask": 1.83282, "decode.d3.loss_dice": 3.843, "decode.d4.loss_cls": 0.79268, "decode.d4.loss_mask": 1.83178, "decode.d4.loss_dice": 3.84802, "decode.d5.loss_cls": 0.79619, "decode.d5.loss_mask": 1.82053, "decode.d5.loss_dice": 3.85381, "decode.d6.loss_cls": 0.83358, "decode.d6.loss_mask": 1.8142, "decode.d6.loss_dice": 3.86756, "decode.d7.loss_cls": 0.95894, "decode.d7.loss_mask": 1.7954, "decode.d7.loss_dice": 3.86283, "decode.d8.loss_cls": 1.00795, "decode.d8.loss_mask": 1.78504, "decode.d8.loss_dice": 3.85568, "loss": 71.71824, "time": 4.35241}
{"mode": "train", "epoch": 1, "iter": 500, "lr": 0.0, "memory": 49619, "data_time": 0.19427, "decode.loss_cls": 0.91727, "decode.loss_mask": 1.80694, "decode.loss_dice": 3.82205, "decode.d0.loss_cls": 5.79506, "decode.d0.loss_mask": 1.70454, "decode.d0.loss_dice": 3.80227, "decode.d1.loss_cls": 1.5722, "decode.d1.loss_mask": 1.78672, "decode.d1.loss_dice": 3.78333, "decode.d2.loss_cls": 0.95792, "decode.d2.loss_mask": 1.82964, "decode.d2.loss_dice": 3.7957, "decode.d3.loss_cls": 0.72391, "decode.d3.loss_mask": 1.84386, "decode.d3.loss_dice": 3.79724, "decode.d4.loss_cls": 0.69161, "decode.d4.loss_mask": 1.84781, "decode.d4.loss_dice": 3.80864, "decode.d5.loss_cls": 0.6812, "decode.d5.loss_mask": 1.834, "decode.d5.loss_dice": 3.81587, "decode.d6.loss_cls": 0.68505, "decode.d6.loss_mask": 1.83409, "decode.d6.loss_dice": 3.82263, "decode.d7.loss_cls": 0.76717, "decode.d7.loss_mask": 1.82135, "decode.d7.loss_dice": 3.81748, "decode.d8.loss_cls": 0.81296, "decode.d8.loss_mask": 1.82246, "decode.d8.loss_dice": 3.82328, "loss": 69.82427, "time": 4.3532}
{"mode": "train", "epoch": 1, "iter": 550, "lr": 0.0, "memory": 49619, "data_time": 0.19614, "decode.loss_cls": 0.76731, "decode.loss_mask": 1.83225, "decode.loss_dice": 3.78087, "decode.d0.loss_cls": 5.77251, "decode.d0.loss_mask": 1.71428, "decode.d0.loss_dice": 3.77406, "decode.d1.loss_cls": 1.29271, "decode.d1.loss_mask": 1.80925, "decode.d1.loss_dice": 3.74104, "decode.d2.loss_cls": 0.79953, "decode.d2.loss_mask": 1.8454, "decode.d2.loss_dice": 3.74579, "decode.d3.loss_cls": 0.63494, "decode.d3.loss_mask": 1.85712, "decode.d3.loss_dice": 3.75057, "decode.d4.loss_cls": 0.63249, "decode.d4.loss_mask": 1.85275, "decode.d4.loss_dice": 3.75243, "decode.d5.loss_cls": 0.62802, "decode.d5.loss_mask": 1.84887, "decode.d5.loss_dice": 3.75891, "decode.d6.loss_cls": 0.61935, "decode.d6.loss_mask": 1.84458, "decode.d6.loss_dice": 3.76096, "decode.d7.loss_cls": 0.65088, "decode.d7.loss_mask": 1.84147, "decode.d7.loss_dice": 3.77099, "decode.d8.loss_cls": 0.71004, "decode.d8.loss_mask": 1.83635, "decode.d8.loss_dice": 3.77018, "loss": 68.39587, "time": 4.33886}

...

{"mode": "train", "epoch": 34, "iter": 1050, "lr": 0.0, "memory": 49891, "data_time": 0.19045, "decode.loss_cls": 0.09864, "decode.loss_mask": 1.43425, "decode.loss_dice": 2.26739, "decode.d0.loss_cls": 0.32443, "decode.d0.loss_mask": 1.44056, "decode.d0.loss_dice": 2.30322, "decode.d1.loss_cls": 0.11411, "decode.d1.loss_mask": 1.43477, "decode.d1.loss_dice": 2.27325, "decode.d2.loss_cls": 0.10755, "decode.d2.loss_mask": 1.43497, "decode.d2.loss_dice": 2.27118, "decode.d3.loss_cls": 0.10076, "decode.d3.loss_mask": 1.43365, "decode.d3.loss_dice": 2.26565, "decode.d4.loss_cls": 0.10561, "decode.d4.loss_mask": 1.43499, "decode.d4.loss_dice": 2.26601, "decode.d5.loss_cls": 0.10718, "decode.d5.loss_mask": 1.43432, "decode.d5.loss_dice": 2.26739, "decode.d6.loss_cls": 0.10188, "decode.d6.loss_mask": 1.43455, "decode.d6.loss_dice": 2.2661, "decode.d7.loss_cls": 0.09854, "decode.d7.loss_mask": 1.4344, "decode.d7.loss_dice": 2.26784, "decode.d8.loss_cls": 0.1018, "decode.d8.loss_mask": 1.43442, "decode.d8.loss_dice": 2.26845, "loss": 38.32789, "time": 12.6123}
{"mode": "train", "epoch": 34, "iter": 1100, "lr": 0.0, "memory": 49891, "data_time": 0.18627, "decode.loss_cls": 0.10317, "decode.loss_mask": 1.45173, "decode.loss_dice": 2.27186, "decode.d0.loss_cls": 0.32378, "decode.d0.loss_mask": 1.46159, "decode.d0.loss_dice": 2.30297, "decode.d1.loss_cls": 0.11509, "decode.d1.loss_mask": 1.45471, "decode.d1.loss_dice": 2.27793, "decode.d2.loss_cls": 0.11197, "decode.d2.loss_mask": 1.45283, "decode.d2.loss_dice": 2.27298, "decode.d3.loss_cls": 0.10239, "decode.d3.loss_mask": 1.45338, "decode.d3.loss_dice": 2.2716, "decode.d4.loss_cls": 0.10432, "decode.d4.loss_mask": 1.45317, "decode.d4.loss_dice": 2.27359, "decode.d5.loss_cls": 0.10288, "decode.d5.loss_mask": 1.45181, "decode.d5.loss_dice": 2.27396, "decode.d6.loss_cls": 0.10173, "decode.d6.loss_mask": 1.45216, "decode.d6.loss_dice": 2.27185, "decode.d7.loss_cls": 0.10116, "decode.d7.loss_mask": 1.45116, "decode.d7.loss_dice": 2.27168, "decode.d8.loss_cls": 0.09726, "decode.d8.loss_mask": 1.45186, "decode.d8.loss_dice": 2.27409, "loss": 38.56064, "time": 12.62414}
{"mode": "train", "epoch": 34, "iter": 1150, "lr": 0.0, "memory": 49891, "data_time": 0.19274, "decode.loss_cls": 0.09726, "decode.loss_mask": 1.44713, "decode.loss_dice": 2.25939, "decode.d0.loss_cls": 0.31481, "decode.d0.loss_mask": 1.45476, "decode.d0.loss_dice": 2.29257, "decode.d1.loss_cls": 0.11057, "decode.d1.loss_mask": 1.44864, "decode.d1.loss_dice": 2.26464, "decode.d2.loss_cls": 0.11208, "decode.d2.loss_mask": 1.44779, "decode.d2.loss_dice": 2.25961, "decode.d3.loss_cls": 0.10269, "decode.d3.loss_mask": 1.44675, "decode.d3.loss_dice": 2.26108, "decode.d4.loss_cls": 0.10136, "decode.d4.loss_mask": 1.4467, "decode.d4.loss_dice": 2.26004, "decode.d5.loss_cls": 0.09715, "decode.d5.loss_mask": 1.44715, "decode.d5.loss_dice": 2.25922, "decode.d6.loss_cls": 0.10055, "decode.d6.loss_mask": 1.44683, "decode.d6.loss_dice": 2.25765, "decode.d7.loss_cls": 0.09495, "decode.d7.loss_mask": 1.44697, "decode.d7.loss_dice": 2.25745, "decode.d8.loss_cls": 0.09382, "decode.d8.loss_mask": 1.4476, "decode.d8.loss_dice": 2.25681, "loss": 38.33401, "time": 12.62331}

config.py

num_things_classes = 0
num_stuff_classes = 20
num_classes = 20
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
    type='EncoderDecoderMask2FormerAug',
    pretrained='pretrained/beit_large_patch16_224_pt22k_ft22k.pth',
    backbone=dict(
        type='BEiTAdapter',
        patch_size=16,
        embed_dim=1024,
        depth=24,
        num_heads=16,
        mlp_ratio=4,
        qkv_bias=True,
        use_abs_pos_emb=False,
        use_rel_pos_bias=True,
        img_size=896,
        init_values=1e-06,
        drop_path_rate=0.3,
        conv_inplane=64,
        n_points=4,
        deform_num_heads=16,
        cffn_ratio=0.25,
        deform_ratio=0.5,
        with_cp=True,
        interaction_indexes=[[0, 5], [6, 11], [12, 17], [18, 23]]),
    decode_head=dict(
        type='Mask2FormerHead',
        in_channels=[1024, 1024, 1024, 1024],
        feat_channels=1024,
        out_channels=1024,
        in_index=[0, 1, 2, 3],
        num_things_classes=0,
        num_stuff_classes=20,
        num_queries=100,
        num_transformer_feat_level=3,
        pixel_decoder=dict(
            type='MSDeformAttnPixelDecoder',
            num_outs=3,
            norm_cfg=dict(type='GN', num_groups=32),
            act_cfg=dict(type='ReLU'),
            encoder=dict(
                type='DetrTransformerEncoder',
                num_layers=6,
                transformerlayers=dict(
                    type='BaseTransformerLayer',
                    attn_cfgs=dict(
                        type='MultiScaleDeformableAttention',
                        embed_dims=1024,
                        num_heads=32,
                        num_levels=3,
                        num_points=4,
                        im2col_step=64,
                        dropout=0.0,
                        batch_first=False,
                        norm_cfg=None,
                        init_cfg=None),
                    ffn_cfgs=dict(
                        type='FFN',
                        embed_dims=1024,
                        feedforward_channels=4096,
                        num_fcs=2,
                        ffn_drop=0.0,
                        act_cfg=dict(type='ReLU', inplace=True),
                        with_cp=True),
                    operation_order=('self_attn', 'norm', 'ffn', 'norm')),
                init_cfg=None),
            positional_encoding=dict(
                type='SinePositionalEncoding', num_feats=512, normalize=True),
            init_cfg=None),
        enforce_decoder_input_project=False,
        positional_encoding=dict(
            type='SinePositionalEncoding', num_feats=512, normalize=True),
        transformer_decoder=dict(
            type='DetrTransformerDecoder',
            return_intermediate=True,
            num_layers=9,
            transformerlayers=dict(
                type='DetrTransformerDecoderLayer',
                attn_cfgs=dict(
                    type='MultiheadAttention',
                    embed_dims=1024,
                    num_heads=32,
                    attn_drop=0.0,
                    proj_drop=0.0,
                    dropout_layer=None,
                    batch_first=False),
                ffn_cfgs=dict(
                    embed_dims=1024,
                    feedforward_channels=4096,
                    num_fcs=2,
                    act_cfg=dict(type='ReLU', inplace=True),
                    ffn_drop=0.0,
                    dropout_layer=None,
                    add_identity=True,
                    with_cp=True),
                feedforward_channels=4096,
                operation_order=('cross_attn', 'norm', 'self_attn', 'norm',
                                 'ffn', 'norm')),
            init_cfg=None),
        loss_cls=dict(
            type='CrossEntropyLoss',
            use_sigmoid=False,
            loss_weight=2.0,
            reduction='mean',
            class_weight=[
                1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
                1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.1
            ]),
        loss_mask=dict(
            type='CrossEntropyLoss',
            use_sigmoid=True,
            reduction='mean',
            loss_weight=5.0),
        loss_dice=dict(
            type='DiceLoss',
            use_sigmoid=True,
            activate=True,
            reduction='mean',
            naive_dice=True,
            eps=1.0,
            loss_weight=5.0)),
    train_cfg=dict(
        num_points=12544,
        oversample_ratio=3.0,
        importance_sample_ratio=0.75,
        assigner=dict(
            type='MaskHungarianAssigner',
            cls_cost=dict(type='ClassificationCost', weight=2.0),
            mask_cost=dict(
                type='CrossEntropyLossCost', weight=5.0, use_sigmoid=True),
            dice_cost=dict(
                type='DiceCost', weight=5.0, pred_act=True, eps=1.0)),
        sampler=dict(type='MaskPseudoSampler')),
    test_cfg=dict(
        panoptic_on=False,
        semantic_on=True,
        instance_on=False,
        max_per_image=100,
        iou_thr=0.8,
        filter_low_score=True,
        mode='slide',
        crop_size=(896, 896),
        stride=(512, 512)),
    init_cfg=None)
dataset_type = 'KITTIDataset'
data_root = 'data/kitti/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (896, 896)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(3678, 1110), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=(896, 896), cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size=(896, 896), pad_val=0, seg_pad_val=255),
    dict(type='ToMask'),
    dict(type='DefaultFormatBundle'),
    dict(
        type='Collect',
        keys=['img', 'gt_semantic_seg', 'gt_masks', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(3678, 1110),
        img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0],
        flip=True,
        transforms=[
            dict(
                type='SETR_Resize',
                keep_ratio=True,
                crop_size=(896, 896),
                setr_multi_scale=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type='KITTIDataset',
        data_root='data/kitti/',
        img_dir='img_dir/train',
        ann_dir='ann_dir/train',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(
                type='Resize', img_scale=(3678, 1110), ratio_range=(0.5, 2.0)),
            dict(type='RandomCrop', crop_size=(896, 896), cat_max_ratio=0.75),
            dict(type='RandomFlip', prob=0.5),
            dict(type='PhotoMetricDistortion'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size=(896, 896), pad_val=0, seg_pad_val=255),
            dict(type='ToMask'),
            dict(type='DefaultFormatBundle'),
            dict(
                type='Collect',
                keys=['img', 'gt_semantic_seg', 'gt_masks', 'gt_labels'])
        ]),
    val=dict(
        type='KITTIDataset',
        data_root='data/kitti/',
        img_dir='img_dir/val',
        ann_dir='ann_dir/val',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(3678, 1110),
                img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0],
                flip=True,
                transforms=[
                    dict(
                        type='SETR_Resize',
                        keep_ratio=True,
                        crop_size=(896, 896),
                        setr_multi_scale=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='KITTIDataset',
        data_root='data/kitti/',
        img_dir='img_dir/val',
        ann_dir='ann_dir/val',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(3678, 1110),
                img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0],
                flip=True,
                transforms=[
                    dict(
                        type='SETR_Resize',
                        keep_ratio=True,
                        crop_size=(896, 896),
                        setr_multi_scale=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=True),
        dict(type='TensorboardLoggerHook')
    ])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = 'pretrained/beit_large_patch16_224_pt22k_ft22k.pth'
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(
    type='AdamW',
    lr=2e-05,
    betas=(0.9, 0.999),
    weight_decay=0.05,
    constructor='LayerDecayOptimizerConstructor',
    paramwise_cfg=dict(num_layers=24, layer_decay_rate=0.9))
optimizer_config = dict()
lr_config = dict(
    policy='poly',
    warmup='linear',
    warmup_iters=1500,
    warmup_ratio=1e-06,
    power=1.0,
    min_lr=0.0,
    by_epoch=True)
runner = dict(type='EpochBasedRunner', max_epochs=500)
checkpoint_config = dict(by_epoch=True, interval=1000, max_keep_ckpts=1)
evaluation = dict(
    interval=1000, metric='mIoU', pre_eval=True, save_best='mIoU')
pretrained = 'pretrained/beit_large_patch16_224_pt22k_ft22k.pth'
work_dir = './work_dirs/mask2former_beit_adapter_large_896_80k_kitti_ms'
gpu_ids = range(0, 8)
auto_resume = False
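One likely cause of the skipped validation: `checkpoint_config` and `evaluation` both use `interval=1000`, but the runner is `EpochBasedRunner` with `max_epochs=500`. If those intervals are counted in epochs (which is typical when `by_epoch=True`), the evaluation hook never fires before training ends. A hedged sketch of the fix, with illustrative values rather than the repository's defaults:

```python
# Assumption: with EpochBasedRunner, hook intervals are counted in epochs,
# so interval=1000 against max_epochs=500 never triggers.
runner = dict(type='EpochBasedRunner', max_epochs=500)
checkpoint_config = dict(by_epoch=True, interval=1, max_keep_ckpts=1)
evaluation = dict(interval=1, metric='mIoU', pre_eval=True, save_best='mIoU')
```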

```shell
python -m torch.distributed.launch --nproc_per_node 8 segmentation/train.py \
    segmentation/configs/kitti/mask2former_beit_adapter_large_896_80k_kitti_ms.py \
    --launcher pytorch \
    --load-from pretrained/beit_large_patch16_224_pt22k_ft22k.pth
```

huxycn commented 1 year ago

I had the same problem on both a custom dataset and ADE20K. Have you figured out why, or solved it?

czczup commented 1 year ago

Hi, please train models using a config with the `ss` suffix. For example, use `mask2former_beit_adapter_large_896_80k_ss` instead of `mask2former_beit_adapter_large_896_80k_ms`. The `ms` suffix means multi-scale testing is performed during evaluation, which is why validation is so slow.
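Concretely, the `ms` config's `test_pipeline` above runs seven scales plus flipping per image. A hedged sketch of what a single-scale (`ss`-style) test pipeline looks like; the exact values in the shipped `ss` config files may differ:

```python
# Assumption: the `ss` configs disable multi-scale and flip test-time
# augmentation roughly like this (illustrative, not copied from the repo).
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(3678, 1110),
        img_ratios=[1.0],   # single scale instead of seven
        flip=False,         # no flip augmentation at test time
        transforms=[
            dict(type='SETR_Resize', keep_ratio=True,
                 crop_size=(896, 896), setr_multi_scale=True),
            dict(type='RandomFlip'),
            dict(type='Normalize',
                 mean=[123.675, 116.28, 103.53],
                 std=[58.395, 57.12, 57.375], to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
```

Each validation image is then processed once instead of fourteen times (7 scales × 2 flips), which brings validation speed back in line with training.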