MediaBrain-SJTU / CoCa3D


Training Issues about DAIR-V2X & V2X-ViT #3

Open · Hu-Shi-Chao opened this issue 1 year ago

Hu-Shi-Chao commented 1 year ago

Thank you for releasing the code for CoCa3D and the related dataset APIs in opencood.

Issue: when I trained the dair_v2xvit.yaml config on the DAIR-V2X-C_Example dataset with complemented annotations (as in CoCa3D), I got 0.0 mAP, as shown below. What might I be doing wrong? I have no idea what causes this. (A quick confidence-threshold check is sketched after the numbers.)

The Average Precision at IOU 0.3 is 0.00, The Average Precision at IOU 0.5 is 0.00, The Average Precision at IOU 0.7 is 0.00
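To narrow this down, the first thing worth checking is whether the confidence head ever clears the post-processing threshold (score_threshold: 0.4 in the yaml below). A minimal sketch, assuming an OpenCOOD-style model whose forward pass returns a dict with a 'psm' classification map; the key name, `model`, and `batch` are my assumptions/placeholders, not verified API:

import torch

# Placeholders: `model` is the trained point_pillar_v2xvit network and `batch`
# is one preprocessed validation batch from your own dataloader.
model.eval()
with torch.no_grad():
    out = model(batch)
    scores = torch.sigmoid(out['psm'])  # 'psm' = probability score map (assumed output key)
    print('max confidence:', scores.max().item())
    print('cells above 0.4:', (scores > 0.4).sum().item())

If the maximum confidence stays far below 0.4, every box is filtered out before NMS, which would explain 0 AP even though the loss keeps decreasing.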

Some relevant packages in my environment are listed below (a quick compatibility probe follows the list):

cudatoolkit               11.6.0              hecad31d_10
cudnn                     8.8.0.121            h0800d71_0
python                    3.7.11               h12debd9_0
pytorch                   1.12.0          py3.7_cuda11.6_cudnn8.3.2_0
spconv                    1.2.1                    pypi_0
torchvision               0.13.0               py37_cu116
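As a side note, a small probe (nothing OpenCOOD-specific, just the libraries listed above) can rule out a silently broken CUDA/cuDNN/spconv install:

import torch
import spconv  # spconv 1.x imports at the top level; 2.x would use spconv.pytorch

print('torch:', torch.__version__, '| cuda:', torch.version.cuda,
      '| cuda available:', torch.cuda.is_available())
print('cudnn:', torch.backends.cudnn.version())
print('spconv:', getattr(spconv, '__version__', 'unknown (1.x build)'))  # 1.2.1 may not expose __version__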

The training loss over 70 epochs (46 examples in the example dataset, batch_size = 2) is shown below. Does this loss look correct, i.e. similar to yours? I also tried 20 epochs and likewise got 0 AP. (A small script to plot these curves is sketched after the log.)

At epoch 68, the validation loss is 1.237085
learning rate 0.000002
[epoch 69][1/23], || Loss: 1.8701 || Conf Loss: 0.5903 || Loc Loss: 1.2799
[epoch 69][2/23], || Loss: 1.4890 || Conf Loss: 0.5439 || Loc Loss: 0.9451
[epoch 69][3/23], || Loss: 1.4851 || Conf Loss: 0.5335 || Loc Loss: 0.9516
[epoch 69][4/23], || Loss: 1.3804 || Conf Loss: 0.5347 || Loc Loss: 0.8456
[epoch 69][5/23], || Loss: 2.0780 || Conf Loss: 0.6309 || Loc Loss: 1.4471
[epoch 69][6/23], || Loss: 1.4323 || Conf Loss: 0.5277 || Loc Loss: 0.9046
[epoch 69][7/23], || Loss: 1.8251 || Conf Loss: 0.6316 || Loc Loss: 1.1935
[epoch 69][8/23], || Loss: 1.4602 || Conf Loss: 0.5261 || Loc Loss: 0.9341
[epoch 69][9/23], || Loss: 1.6557 || Conf Loss: 0.5603 || Loc Loss: 1.0954
[epoch 69][10/23], || Loss: 1.6165 || Conf Loss: 0.6315 || Loc Loss: 0.9850
[epoch 69][11/23], || Loss: 1.6900 || Conf Loss: 0.5894 || Loc Loss: 1.1006
[epoch 69][12/23], || Loss: 1.5556 || Conf Loss: 0.5379 || Loc Loss: 1.0177
[epoch 69][13/23], || Loss: 1.4303 || Conf Loss: 0.5791 || Loc Loss: 0.8512
[epoch 69][14/23], || Loss: 1.3266 || Conf Loss: 0.5066 || Loc Loss: 0.8200
[epoch 69][15/23], || Loss: 1.3096 || Conf Loss: 0.5032 || Loc Loss: 0.8063
[epoch 69][16/23], || Loss: 1.3690 || Conf Loss: 0.5369 || Loc Loss: 0.8322
[epoch 69][17/23], || Loss: 1.3448 || Conf Loss: 0.5229 || Loc Loss: 0.8218
[epoch 69][18/23], || Loss: 1.4010 || Conf Loss: 0.5511 || Loc Loss: 0.8498
[epoch 69][19/23], || Loss: 1.3839 || Conf Loss: 0.5693 || Loc Loss: 0.8146
[epoch 69][20/23], || Loss: 1.3652 || Conf Loss: 0.4989 || Loc Loss: 0.8663
[epoch 69][21/23], || Loss: 1.4203 || Conf Loss: 0.5290 || Loc Loss: 0.8913
[epoch 69][22/23], || Loss: 1.3576 || Conf Loss: 0.5159 || Loc Loss: 0.8417
[epoch 69][23/23], || Loss: 1.3644 || Conf Loss: 0.5601 || Loc Loss: 0.8043
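For reference, this is how the curves from a log like the one above can be summarized (a minimal sketch; 'train.log' is a hypothetical path to the captured stdout, and the regex assumes the exact line format printed above):

import re
import matplotlib.pyplot as plt

# Matches lines like: [epoch 69][1/23], || Loss: 1.8701 || Conf Loss: 0.5903 || Loc Loss: 1.2799
pattern = re.compile(r'Loss: ([\d.]+) \|\| Conf Loss: ([\d.]+) \|\| Loc Loss: ([\d.]+)')

conf_losses, loc_losses = [], []
with open('train.log') as f:  # hypothetical path to the saved training output
    for line in f:
        m = pattern.search(line)
        if m:
            conf_losses.append(float(m.group(2)))
            loc_losses.append(float(m.group(3)))

plt.plot(conf_losses, label='conf loss')
plt.plot(loc_losses, label='loc loss')
plt.xlabel('iteration')
plt.legend()
plt.savefig('loss_curves.png')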

The dairv2x_v2xvit.yaml file is as follows. I only changed the xxx_dir params in the yaml file and in inference.py to point to my home directory. (A path sanity check is sketched after the yaml.)

name: dair_npj_v2xvit_w
#root_dir: "/GPFS/rhome/quanhaoli/DAIR-V2X/data/split_datas/train.json"
#validate_dir: "/GPFS/rhome/quanhaoli/DAIR-V2X/data/split_datas/val.json"
#test_dir: "/GPFS/rhome/quanhaoli/DAIR-V2X/data/split_datas/val.json"
root_dir: "~/CoCa3D/data/dair-v2x/split_datas/train.json"
validate_dir: "~/CoCa3D/data/dair-v2x/split_datas/val.json"
test_dir: "~/CoCa3D/data/dair-v2x/split_datas/val.json"
data_dir: "~/CoCa3D/data/dair-v2x/"

noise_setting:
  add_noise: True
  args: 
    pos_std: 0.2
    rot_std: 0.2
    pos_mean: 0
    rot_mean: 0

comm_range: 100

yaml_parser: "load_point_pillar_params"
train_params:
  batch_size: &batch_size 2
  epoches: 70
  eval_freq: 2
  save_freq: 2
  max_cav: &max_cav 2

fusion:
  core_method: 'IntermediateFusionDatasetDAIR' # LateFusionDataset, EarlyFusionDataset, IntermediateFusionDataset supported
  args:
    proj_first: false
    clip_pc: false

# preprocess-related
preprocess:
  # options: BasePreprocessor, VoxelPreprocessor, BevPreprocessor
  core_method: 'SpVoxelPreprocessor'
  args:
    voxel_size: &voxel_size [0.4, 0.4, 4]
    max_points_per_voxel: 32
    max_voxel_train: 32000
    max_voxel_test: 70000
  # lidar range for each individual cav. Format: xyzxyz minmax
  cav_lidar_range: &cav_lidar [-102.4, -38.4, -3, 102.4, 38.4, 1]

data_augment:
  - NAME: random_world_flip
    ALONG_AXIS_LIST: [ 'x' ]

  - NAME: random_world_rotation
    WORLD_ROT_ANGLE: [ -0.78539816, 0.78539816 ]

  - NAME: random_world_scaling
    WORLD_SCALE_RANGE: [ 0.95, 1.05 ]

# anchor box related
postprocess:
  core_method: 'VoxelPostprocessor' # VoxelPostprocessor, BevPostprocessor supported
  gt_range: *cav_lidar
  anchor_args:
    cav_lidar_range: *cav_lidar
    l: 4.5
    w: 2
    h: 1.56
    r: [0, 90]
    feature_stride: 2
    num: &anchor_num 2
  target_args:
    pos_threshold: 0.6
    neg_threshold: 0.45
    score_threshold: 0.4
  order: 'hwl' # hwl or lwh
  max_num: 100 # maximum number of objects in a single frame; ensures different frames have the same dimensions in the same batch
  nms_thresh: 0.15

# model related
model:
  core_method: point_pillar_v2xvit
  args:
    voxel_size: *voxel_size
    lidar_range: *cav_lidar
    anchor_number: *anchor_num
    max_cav: *max_cav
    compression: 0 # compression rate
    backbone_fix: false

    pillar_vfe:
      use_norm: true
      with_distance: false
      use_absolute_xyz: true
      num_filters: [64]
    point_pillar_scatter:
      num_features: 64

    base_bev_backbone:
      layer_nums: [3, 5, 8]
      layer_strides: [2, 2, 2]
      num_filters: [64, 128, 256]
      upsample_strides: [1, 2, 4]
      num_upsample_filter: [128, 128, 128]
    shrink_header:
      kernal_size: [3]
      stride: [1]
      padding: [1]
      dim: [256]
      input_dim: 384 # 128 * 3

    transformer:
      encoder: &encoder
        # number of fusion blocks per encoder layer
        num_blocks: 1
        # number of encoder layers
        depth: 3
        use_roi_mask: true
        use_RTE: &use_RTE False # NOTE: we intentionally set this to false
        RTE_ratio: &RTE_ratio 2 # 2 means dt has a 100 ms interval; 1 means 50 ms
        # agent-wise attention
        cav_att_config: &cav_att_config
          dim: 256
          use_hetero: true
          use_RTE: *use_RTE
          RTE_ratio: *RTE_ratio
          heads: 8
          dim_head: 32
          dropout: 0.3
        # spatial-wise attention
        pwindow_att_config: &pwindow_att_config
          dim: 256
          heads: [16, 8, 4]
          dim_head: [16, 32, 64]
          dropout: 0.3
          window_size: [4, 8, 16]
          relative_pos_embedding: true
          fusion_method: 'split_attn'
        # feedforward condition
        feed_forward: &feed_forward
          mlp_dim: 256
          dropout: 0.3
        sttf: &sttf
          voxel_size: *voxel_size
          downsample_rate: 2

      # add decoder later

loss:
  core_method: point_pillar_loss
  args:
    cls_weight: 1.0
    reg: 2.0

optimizer:
  core_method: Adam
  lr: 0.002
  args:
    eps: 1e-10
    weight_decay: 1e-4

lr_scheduler:
  core_method: multistep # step, multistep, and exponential supported
  gamma: 0.1
  step_size: [10, 25, 40]
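One thing I am not sure about (an assumption on my side, not something verified in the loader code): Python's open() does not expand '~', so if these paths reach open() or os.listdir() unmodified, the '~/CoCa3D/...' entries above could resolve to nothing. A quick sanity check:

import os

# Verify that every path from the yaml exists once '~' is expanded.
for p in ['~/CoCa3D/data/dair-v2x/split_datas/train.json',
          '~/CoCa3D/data/dair-v2x/split_datas/val.json',
          '~/CoCa3D/data/dair-v2x/']:
    full = os.path.expanduser(p)
    print(full, '->', 'OK' if os.path.exists(full) else 'MISSING')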
sidiangongyuan commented 8 months ago

Have you solved this problem?