ViTAE-Transformer / Remote-Sensing-RVSA

The official repo for [TGRS'22] "Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model"
MIT License

I reproduced the code with an accuracy of only 54 #19

Open facias914 opened 1 year ago

facias914 commented 1 year ago

The original code version is quite old, so I ported it to the newer mmrotate version. I loaded the weights you provided and the loading went fine, but the result was an mAP of only 68 on the validation set and 54 on the test set.

I don't know where the problem is. The weight file loads smoothly and I have also checked the configuration file parameters, but I still can't tell what went wrong. If my reproduction were completely broken, the final result should be close to 0, not as high as 54.

dataset_type = 'DOTADataset'
data_root = '/data/facias/DOTA/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

angle_version = 'le90'
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RResize', img_scale=(1024, 1024)),
    dict(
        type='RRandomFlip',
        flip_ratio=[0.25, 0.25, 0.25],
        direction=['horizontal', 'vertical', 'diagonal'],
        version=angle_version),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1024, 1024),
        flip=False,
        transforms=[
            dict(type='RResize', img_scale=(1024, 1024)),
            dict(type='RRandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img'])
        ])
]

data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'train_split/labelTxt/',
        img_prefix=data_root + 'train_split/images/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'val_split/labelTxt/',
        img_prefix=data_root + 'val_split/images/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        test_mode=True,   # set to True if the test set has no annotations
        ann_file=data_root + 'test_split/images/',
        img_prefix=data_root + 'test_split/images/',
        pipeline=test_pipeline))

model = dict(
    type='OrientedRCNN',
    backbone=dict(
        type='ViT_Win_RVSA_V3_WSZ7',
        img_size=1024,
        embed_dim=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4,
        qkv_bias=True,
        drop_rate=0.,
        attn_drop_rate=0.,
        drop_path_rate=0.15,
        use_abs_pos_emb=True),
    neck=dict(
        type='FPN',
        in_channels=[768, 768, 768, 768],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='OrientedRPNHead',
        in_channels=256,
        feat_channels=256,
        version=angle_version,
        anchor_generator=dict(
            type='AnchorGenerator',
            scales=[8],
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64]),
        bbox_coder=dict(
            type='MidpointOffsetCoder',
            angle_range=angle_version,
            target_means=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
            target_stds=[1.0, 1.0, 1.0, 1.0, 0.5, 0.5]),
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(
            type='SmoothL1Loss', beta=0.1111111111111111, loss_weight=1.0)),
    roi_head=dict(
        type='OrientedStandardRoIHead',
        bbox_roi_extractor=dict(
            type='RotatedSingleRoIExtractor',
            roi_layer=dict(
                type='RoIAlignRotated',
                out_size=7,
                sample_num=2,
                clockwise=True),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='RotatedShared2FCBBoxHead',
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=15,
            bbox_coder=dict(
                type='DeltaXYWHAOBBoxCoder',
                angle_range=angle_version,
                norm_factor=None,
                edge_swap=True,
                proj_xy=True,
                target_means=(.0, .0, .0, .0, .0),
                target_stds=(0.1, 0.1, 0.2, 0.2, 0.1)),
            reg_class_agnostic=True,
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0))),
    train_cfg=dict(
        rpn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.7,
                neg_iou_thr=0.3,
                min_pos_iou=0.3,
                match_low_quality=True,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=256,
                pos_fraction=0.5,
                neg_pos_ub=-1,
                add_gt_as_proposals=False),
            allowed_border=0,
            pos_weight=-1,
            debug=False),
        rpn_proposal=dict(
            nms_pre=2000,
            max_per_img=2000,
            nms=dict(type='nms', iou_threshold=0.8),
            min_bbox_size=0),
        rcnn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.5,
                neg_iou_thr=0.5,
                min_pos_iou=0.5,
                match_low_quality=False,
                iou_calculator=dict(type='RBboxOverlaps2D'),
                ignore_iof_thr=-1),
            sampler=dict(
                type='RRandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            pos_weight=-1,
            debug=False)),
    test_cfg=dict(
        rpn=dict(
            nms_pre=2000,
            max_per_img=2000,
            nms=dict(type='nms', iou_threshold=0.8),
            min_bbox_size=0),
        rcnn=dict(
            nms_pre=2000,
            min_bbox_size=0,
            score_thr=0.05,
            nms=dict(iou_thr=0.1),
            max_per_img=2000)))
# evaluation
evaluation = dict(interval=1, metric='mAP')
# optimizer
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=1.0 / 3,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=1)

# yapf:disable
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])
# yapf:enable

dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]

# disable opencv multithreading to avoid system being overloaded
opencv_num_threads = 0
# set multi-process start method as `fork` to speed up the training
mp_start_method = 'fork'
DotWang commented 1 year ago

@facias914 It is suggested to use AdamW for training vision transformer networks, and to reproduce the method with our parameter settings. The detailed settings can be found in the config files.

facias914 commented 1 year ago

Thank you for your reply. I just used the checkpoint file you provided. I compared the parameter list and it is consistent with yours; the specific content is above.

DotWang commented 1 year ago

@facias914 I mean the hyperparameters, such as the optimizer, scheduler, and so on. Please refer to this config: https://github.com/ViTAE-Transformer/Remote-Sensing-RVSA/blob/main/Object%20Detection/configs/obb/oriented_rcnn/vit_base_win/faster_rcnn_orpn_our_rsp_vit-base-win-rvsa_v3_wsz7_fpn_1x_dota10_lr1e-4_ldr75_dpr15.py
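
For reference, the optimizer part of that config is roughly of the following form. This is a sketch only: the constructor name and the exact betas/weight decay below are assumptions; the authoritative values are in the linked file, whose name encodes lr1e-4, ldr75 (layer decay rate 0.75), and dpr15 (drop path rate 0.15).

# Sketch only: AdamW with layer-wise lr decay for the ViT backbone.
optimizer = dict(
    type='AdamW',
    lr=1e-4,                       # "lr1e-4" in the config name
    betas=(0.9, 0.999),            # assumed defaults
    weight_decay=0.05,             # assumed
    constructor='LayerDecayOptimizerConstructor',               # assumed constructor name
    paramwise_cfg=dict(num_layers=12, layer_decay_rate=0.75))   # "ldr75"
optimizer_config = dict(grad_clip=None)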

facias914 commented 1 year ago

I know what you mean, but I only used the checkpoint file for inference, not for training.

facias914 commented 1 year ago

Now I want to reproduce the code with the old mmrotate version, but the old version needs the DOTA dataset's annotations in pkl format. Could you send me a DOTA dataset with pkl annotations? Thank you!

DotWang commented 1 year ago

@facias914 Which checkpoint did you use? Did you use this one: https://1drv.ms/u/s!AimBgYV7JjTlgVJM4Znng50US8KD?e=o4MRMQ ?

DotWang commented 1 year ago

@facias914 The DOTA dataset needs to be clipped with BboxToolkit; then the pkl can be obtained.
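
For reference, the clipping is done with BboxToolkit's img_split tool, roughly as below. The json paths are placeholders and may differ in your checkout, so point them at the split configs you actually use.

# Sketch: clip the DOTA images and labels with BboxToolkit (paths are placeholders).
cd BboxToolkit/tools
python img_split.py --base_json split_configs/dota1_0/ss_trainval.json
python img_split.py --base_json split_configs/dota1_0/ss_test.json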

facias914 commented 1 year ago

> @facias914 Which checkpoint did you use? Did you use this one: https://1drv.ms/u/s!AimBgYV7JjTlgVJM4Znng50US8KD?e=o4MRMQ ?

Yes, I used this one.

[screenshot: 2023-07-13 21:32:47]

facias914 commented 1 year ago

> @facias914 The DOTA dataset needs to be clipped with BboxToolkit; then the pkl can be obtained.

Thank you!

DotWang commented 1 year ago

> @facias914 Which checkpoint did you use? Did you use this one: https://1drv.ms/u/s!AimBgYV7JjTlgVJM4Znng50US8KD?e=o4MRMQ ?
>
> Yes, I used this one.
>
> [screenshot: 2023-07-13 21:32:47]

So the inference is conducted on unclipped images?

facias914 commented 1 year ago

> @facias914 Which checkpoint did you use? Did you use this one: https://1drv.ms/u/s!AimBgYV7JjTlgVJM4Znng50US8KD?e=o4MRMQ ?
>
> Yes, I used this one. [screenshot: 2023-07-13 21:32:47]
>
> So the inference is conducted on unclipped images?

No, I use clipped val images and clipped test images with size=1024 and gap = 200, obtaining mAP=68 and 54 respectively.

DotWang commented 1 year ago

@facias914 OK. In fact, we didn't conduct local validation. We directly train the model on the merged train+val set and submit the results on the test set to the evaluation website. You can follow the same evaluation procedure. The test set also needs to be clipped with BboxToolkit.

facias914 commented 1 year ago

> @facias914 OK. In fact, we didn't conduct local validation. We directly train the model on the merged train+val set and submit the results on the test set to the evaluation website. You can follow the same evaluation procedure. The test set also needs to be clipped with BboxToolkit.

Thank you! I get mAP 87 on the val set using the old mmrotate version. By the way, I got the test set pkl file, but I can't find the code to convert the pkl to txt. I wrote my own pkl2txt code, but failed to get a result from the DOTA website. Can you tell me where the pkl2txt code is?

DotWang commented 1 year ago

@facias914 Shouldn't OBBDetection automatically convert the pkl? (mmrotate is built on top of OBBDetection.) https://mmrotate.readthedocs.io/en/latest/get_started.html#test-a-model

facias914 commented 1 year ago

> @facias914 Shouldn't OBBDetection automatically convert the pkl? (mmrotate is built on top of OBBDetection.) https://mmrotate.readthedocs.io/en/latest/get_started.html#test-a-model

Yes, I used OBBDetection built from https://github.com/jbwang1997/OBBDetection, and I have got the pkl file. But the DOTA website requires txt files, so the pkl-to-txt code is what I need now.

DotWang commented 1 year ago

@facias914 For DOTA-V1.0, when --format-only is used, OBBDetection automatically produces the required submission format; please refer to our readme.
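
Concretely, following the mmrotate docs linked above, the submission files can be generated with a command along these lines. The config and checkpoint paths are placeholders, and the flag names may differ slightly in OBBDetection, so the readme is authoritative.

# Sketch: produce DOTA Task1 submission files (placeholder paths).
python ./tools/test.py my_rvsa_dota_config.py vit_rvsa_dota_checkpoint.pth \
    --format-only \
    --eval-options submission_dir=work_dirs/Task1_results
# Zip the generated Task1_*.txt files and upload them to the DOTA evaluation server.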

facias914 commented 1 year ago

> @facias914 For DOTA-V1.0, when --format-only is used, OBBDetection automatically produces the required submission format; please refer to our readme.

Thank you very much for your reply. I also reached 78.74 on the test set. Next, I will investigate why there is such a big gap between my reproduced code and the official code.

HuangShiqi128 commented 1 year ago

Hi,

May I ask which config file in BboxToolkit you used to split the DOTA dataset? Is it ss_trainval.json?

{ "nproc": 20, "load_type": "dota", "img_dirs": [ "data/DOTA1_0/train/images/", "data/DOTA1_0/val/images/" ], "ann_dirs": [ "data/DOTA1_0/train/labelTxt/", "data/DOTA1_0/val/labelTxt/" ], "classes": null, "prior_annfile": null, "merge_type": "addition", "sizes": [ 1024 ], "gaps": [ 200 ], "rates": [ 1.0 ], "img_rate_thr": 0.6, "iof_thr": 0.7, "no_padding": false, "padding_value": [ 104, 116, 124 ], "filter_empty": true, "save_dir": "data/split_ss_dota1_0/trainval/", "save_ext": ".png" }

PP-explore commented 4 months ago

> @facias914 For DOTA-V1.0, when --format-only is used, OBBDetection automatically produces the required submission format; please refer to our readme.
>
> Thank you very much for your reply. I also reached 78.74 on the test set. Next, I will investigate why there is such a big gap between my reproduced code and the official code.

@facias914
Hello, I also encountered the same issue. I used the official ViTAE-B + RVSA model to run inference on the test set, but the mAP I obtained is only 0.394. I would like to ask how you achieved an mAP similar to the authors'. Could you provide some help? I have shared some of my configuration information in this issue: https://github.com/ViTAE-Transformer/Remote-Sensing-RVSA/issues/39#issue-2417829111.