IDEA-Research / Grounded-Segment-Anything

Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything
https://arxiv.org/abs/2401.14159
Apache License 2.0

[Bug] BertLayer should be used as a decoder model if cross attention is added #503


rexainn commented 1 month ago

Thanks for your great work! This error has had me stuck for a week; please tell me how to fix it!

Command:

export CUDA_VISIBLE_DEVICES=0
python automatic_label_ram_demo.py \
  --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
  --ram_checkpoint ram_swin_large_14m.pth \
  --grounded_checkpoint groundingdino_swint_ogc.pth \
  --sam_checkpoint sam_vit_h_4b8939.pth \
  --input_image assets/demo9.jpg \
  --output_dir "outputs" \
  --box_threshold 0.25 \
  --text_threshold 0.2 \
  --iou_threshold 0.5 \
  --device "cuda"

Log:

final text_encoder_type: bert-base-uncased
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
_IncompatibleKeys(missing_keys=[], unexpected_keys=['label_enc.weight'])
Traceback (most recent call last):
  File "/DATA/DATA1/renxiaoyu/202405_asd/Grounded-Segment-Anything/automatic_label_ram_demo.py", line 248, in <module>
    ram_model = ram(pretrained=ram_checkpoint,
  File "/DATA/DATA1/renxiaoyu/202405_asd/Grounded-Segment-Anything/recognize-anything/ram/models/ram.py", line 399, in ram
    model = RAM(**kwargs)
  File "/DATA/DATA1/renxiaoyu/202405_asd/Grounded-Segment-Anything/recognize-anything/ram/models/ram.py", line 143, in __init__
    self.tag_encoder = BertModel(config=encoder_config,
  File "/home/huiyu/.conda/envs/intern_clean/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py", line 891, in __init__
    self.encoder = BertEncoder(config)
  File "/home/huiyu/.conda/envs/intern_clean/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py", line 559, in __init__
    self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])
  File "/home/huiyu/.conda/envs/intern_clean/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py", line 559, in <listcomp>
    self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])
  File "/home/huiyu/.conda/envs/intern_clean/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py", line 479, in __init__
    raise ValueError(f"{self} should be used as a decoder model if cross attention is added")
ValueError: BertLayer(
  (attention): BertAttention(
    (self): BertSelfAttention(
      (query): Linear(in_features=768, out_features=768, bias=True)
      (key): Linear(in_features=768, out_features=768, bias=True)
      (value): Linear(in_features=768, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (output): BertSelfOutput(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
) should be used as a decoder model if cross attention is added
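
From reading the transformers source, the ValueError comes from BertLayer.__init__, which rejects a config that enables cross attention (add_cross_attention=True) unless the model is also built in decoder mode (is_decoder=True). A minimal reproduction outside of RAM (my own sketch, assuming the check behaves the same way in other transformers versions):

from transformers import BertConfig, BertModel

# The same combination the RAM tag encoder apparently ends up with:
# cross attention requested, but the model built as a plain encoder.
config = BertConfig(add_cross_attention=True, is_decoder=False)
model = BertModel(config)  # raises the same ValueError as in the log above

Setting encoder_config.is_decoder = True before the BertModel(...) call in ram/models/ram.py lets construction succeed in a quick test, but I don't know whether that is semantically correct for the tag encoder, or whether my transformers version simply doesn't match what recognize-anything expects.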