fundamentalvision / Deformable-DETR

Deformable DETR: Deformable Transformers for End-to-End Object Detection.
Apache License 2.0

Deformable DETR Issue when training with custom Data #235

Open spatiallysaying opened 3 months ago

spatiallysaying commented 3 months ago

I started with this DETR notebook as a base.

Training appears to succeed, as I get: INFO:pytorch_lightning.utilities.rank_zero:Trainer.fit stopped: max_steps=50 reached.

I successfully pushed the model to a Hugging Face repo:

model.model.push_to_hub("xyz/ddetr-finetuned-balloon-v2")
processor.push_to_hub("xyz/ddetr-finetuned-balloon-v2")

However, when I try to load the model with the following code, I get an error:

from transformers import AutoImageProcessor, DetrForObjectDetection
import torch

model = DetrForObjectDetection.from_pretrained("xyz/ddetr-finetuned-balloon-v2", id2label={0:"balloon"})
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
processor = AutoImageProcessor.from_pretrained("xyz/ddetr-finetuned-balloon-v2")

RuntimeError: Error(s) in loading state_dict for DetrForObjectDetection: size mismatch for model.query_position_embeddings.weight: copying a param with shape torch.Size([100, 512]) from checkpoint, the shape in current model is torch.Size([100, 256]). You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.
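
The two shapes in this error are consistent with a Deformable DETR checkpoint being loaded into a plain DETR model. The sketch below is only an illustration: `query_embedding_shape` is a hypothetical helper, not a library function, and it assumes that the Hugging Face implementations size the query embedding as `d_model` columns for DETR and `2 * d_model` columns for single-stage Deformable DETR, with the defaults of 100 queries and `d_model=256` seen in the message.

```python
def query_embedding_shape(
    model_type: str, num_queries: int = 100, d_model: int = 256
) -> tuple[int, int]:
    """Expected shape of model.query_position_embeddings.weight (illustrative)."""
    # Assumption: single-stage Deformable DETR stores a positional part and a
    # content part per query, so its embedding is twice as wide as DETR's.
    width = 2 * d_model if model_type == "deformable_detr" else d_model
    return (num_queries, width)


checkpoint_shape = query_embedding_shape("deformable_detr")  # what was saved
model_shape = query_embedding_shape("detr")                  # what DetrForObjectDetection builds
print(checkpoint_shape, model_shape)  # (100, 512) (100, 256)
```

Under that assumption, passing ignore_mismatched_sizes=True would only silence the error by re-initializing the mismatched weight; it would not make the DETR class compatible with the saved architecture.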

I suspect the cause is related to this warning, which appears after running the following:

model = Detr(lr=1e-4, lr_backbone=1e-5, weight_decay=1e-4)
outputs = model(pixel_values=batch['pixel_values'], pixel_mask=batch['pixel_mask'])

config.json: 100%  6.60k/6.60k [00:00<00:00, 176kB/s]
You are using a model of type detr to instantiate a model of type deformable_detr. This is not supported for all configurations of models and can yield errors.
pytorch_model.bin: 100%  167M/167M [00:01<00:00, 98.1MB/s]
Some weights of DeformableDetrForObjectDetection were not initialized from the model checkpoint at facebook/detr-resnet-50 and are newly initialized: ['bbox_embed.0.layers.0.bias', 'bbox_embed.0.layers.0.weight', 'bbox_embed.0.layers.1.bias', 'bbox_embed.0.layers.1.weight', 'bbox_embed.0.layers.2.bias', 'bbox_embed.0.layers.2.weight', 'bbox_embed.1.layers.0.bias', 'bbox_embed.1.layers.0.weight', 'bbox_embed.1.layers.1.bias', 'bbox_embed.1.layers.1.weight', 'bbox_embed.1.layers.2.bias', 'bbox_embed.1.layers.2.weight', 'bbox_embed.2.layers.0.bias', 'bbox_embed.2.layers.0.weight', 'bbox_embed.2.layers.1.bias', 'bbox_embed.2.layers.1.weight', 'bbox_embed.2.layers.2.bias', 'bbox_embed.2.layers.2.weight', 'bbox_embed.3.layers.0.bias', 'bbox_embed.3.layers.0.weight', 'bbox_embed.3.layers.1.bias', 'bbox_embed.3.layers.1.weight', 'bbox_embed.3.layers.2.bias', 'bbox_embed.3.layers.2.weight', 'bbox_embed.4.layers.0.bias', 'bbox_embed.4.layers.0.weight', 'bbox_embed.4.layers.1.bias', 'bbox_embed.4.layers.1.weight', 'bbox_embed.4.layers.2.bias', 'bbox_embed.4.layers.2.weight', 'bbox_embed.5.layers.0.bias', 'bbox_embed.5.layers.0.weight', 'bbox_embed.5.layers.1.bias', 'bbox_embed.5.layers.1.weight', 'bbox_embed.5.layers.2.bias', 'bbox_embed.5.layers.2.weight', 'class_embed.0.bias', 'class_embed.0.weight', 'class_embed.1.bias', 'class_embed.1.weight', 'class_embed.2.bias', 'class_embed.2.weight', 'class_embed.3.bias', 'class_embed.3.weight', 'class_embed.4.bias', 'class_embed.4.weight', 'class_embed.5.bias', 'class_embed.5.weight', 'model.decoder.layers.0.encoder_attn.attention_weights.bias', 'model.decoder.layers.0.encoder_attn.attention_weights.weight', 'model.decoder.layers.0.encoder_attn.output_proj.bias',
'model.decoder.layers.0.encoder_attn.output_proj.weight', 'model.decoder.layers.0.encoder_attn.sampling_offsets.bias', 'model.decoder.layers.0.encoder_attn.sampling_offsets.weight', 'model.decoder.layers.0.encoder_attn.value_proj.bias', 'model.decoder.layers.0.encoder_attn.value_proj.weight', 'model.decoder.layers.1.encoder_attn.attention_weights.bias', 'model.decoder.layers.1.encoder_attn.attention_weights.weight', 'model.decoder.layers.1.encoder_attn.output_proj.bias', 'model.decoder.layers.1.encoder_attn.output_proj.weight', 'model.decoder.layers.1.encoder_attn.sampling_offsets.bias', 'model.decoder.layers.1.encoder_attn.sampling_offsets.weight', 'model.decoder.layers.1.encoder_attn.value_proj.bias', 'model.decoder.layers.1.encoder_attn.value_proj.weight', 'model.decoder.layers.2.encoder_attn.attention_weights.bias', 'model.decoder.layers.2.encoder_attn.attention_weights.weight', 'model.decoder.layers.2.encoder_attn.output_proj.bias', 'model.decoder.layers.2.encoder_attn.output_proj.weight', 'model.decoder.layers.2.encoder_attn.sampling_offsets.bias', 'model.decoder.layers.2.encoder_attn.sampling_offsets.weight', 'model.decoder.layers.2.encoder_attn.value_proj.bias', 'model.decoder.layers.2.encoder_attn.value_proj.weight', 'model.decoder.layers.3.encoder_attn.attention_weights.bias', 'model.decoder.layers.3.encoder_attn.attention_weights.weight', 'model.decoder.layers.3.encoder_attn.output_proj.bias', 'model.decoder.layers.3.encoder_attn.output_proj.weight', 'model.decoder.layers.3.encoder_attn.sampling_offsets.bias', 'model.decoder.layers.3.encoder_attn.sampling_offsets.weight', 'model.decoder.layers.3.encoder_attn.value_proj.bias', 'model.decoder.layers.3.encoder_attn.value_proj.weight', 'model.decoder.layers.4.encoder_attn.attention_weights.bias', 'model.decoder.layers.4.encoder_attn.attention_weights.weight', 'model.decoder.layers.4.encoder_attn.output_proj.bias', 'model.decoder.layers.4.encoder_attn.output_proj.weight', 
'model.decoder.layers.4.encoder_attn.sampling_offsets.bias', 'model.decoder.layers.4.encoder_attn.sampling_offsets.weight', 'model.decoder.layers.4.encoder_attn.value_proj.bias', 'model.decoder.layers.4.encoder_attn.value_proj.weight', 'model.decoder.layers.5.encoder_attn.attention_weights.bias', 'model.decoder.layers.5.encoder_attn.attention_weights.weight', 'model.decoder.layers.5.encoder_attn.output_proj.bias', 'model.decoder.layers.5.encoder_attn.output_proj.weight', 'model.decoder.layers.5.encoder_attn.sampling_offsets.bias', 'model.decoder.layers.5.encoder_attn.sampling_offsets.weight', 'model.decoder.layers.5.encoder_attn.value_proj.bias', 'model.decoder.layers.5.encoder_attn.value_proj.weight', 'model.encoder.layers.0.self_attn.attention_weights.bias', 'model.encoder.layers.0.self_attn.attention_weights.weight', 'model.encoder.layers.0.self_attn.output_proj.bias', 'model.encoder.layers.0.self_attn.output_proj.weight', 'model.encoder.layers.0.self_attn.sampling_offsets.bias', 'model.encoder.layers.0.self_attn.sampling_offsets.weight', 'model.encoder.layers.0.self_attn.value_proj.bias', 'model.encoder.layers.0.self_attn.value_proj.weight', 'model.encoder.layers.1.self_attn.attention_weights.bias', 'model.encoder.layers.1.self_attn.attention_weights.weight', 'model.encoder.layers.1.self_attn.output_proj.bias', 'model.encoder.layers.1.self_attn.output_proj.weight', 'model.encoder.layers.1.self_attn.sampling_offsets.bias', 'model.encoder.layers.1.self_attn.sampling_offsets.weight', 'model.encoder.layers.1.self_attn.value_proj.bias', 'model.encoder.layers.1.self_attn.value_proj.weight', 'model.encoder.layers.2.self_attn.attention_weights.bias', 'model.encoder.layers.2.self_attn.attention_weights.weight', 'model.encoder.layers.2.self_attn.output_proj.bias', 'model.encoder.layers.2.self_attn.output_proj.weight', 'model.encoder.layers.2.self_attn.sampling_offsets.bias', 'model.encoder.layers.2.self_attn.sampling_offsets.weight', 
'model.encoder.layers.2.self_attn.value_proj.bias', 'model.encoder.layers.2.self_attn.value_proj.weight', 'model.encoder.layers.3.self_attn.attention_weights.bias', 'model.encoder.layers.3.self_attn.attention_weights.weight', 'model.encoder.layers.3.self_attn.output_proj.bias', 'model.encoder.layers.3.self_attn.output_proj.weight', 'model.encoder.layers.3.self_attn.sampling_offsets.bias', 'model.encoder.layers.3.self_attn.sampling_offsets.weight', 'model.encoder.layers.3.self_attn.value_proj.bias', 'model.encoder.layers.3.self_attn.value_proj.weight', 'model.encoder.layers.4.self_attn.attention_weights.bias', 'model.encoder.layers.4.self_attn.attention_weights.weight', 'model.encoder.layers.4.self_attn.output_proj.bias', 'model.encoder.layers.4.self_attn.output_proj.weight', 'model.encoder.layers.4.self_attn.sampling_offsets.bias', 'model.encoder.layers.4.self_attn.sampling_offsets.weight', 'model.encoder.layers.4.self_attn.value_proj.bias', 'model.encoder.layers.4.self_attn.value_proj.weight', 'model.encoder.layers.5.self_attn.attention_weights.bias', 'model.encoder.layers.5.self_attn.attention_weights.weight', 'model.encoder.layers.5.self_attn.output_proj.bias', 'model.encoder.layers.5.self_attn.output_proj.weight', 'model.encoder.layers.5.self_attn.sampling_offsets.bias', 'model.encoder.layers.5.self_attn.sampling_offsets.weight', 'model.encoder.layers.5.self_attn.value_proj.bias', 'model.encoder.layers.5.self_attn.value_proj.weight', 'model.input_proj.0.0.bias', 'model.input_proj.0.0.weight', 'model.input_proj.0.1.bias', 'model.input_proj.0.1.weight', 'model.input_proj.1.0.bias', 'model.input_proj.1.0.weight', 'model.input_proj.1.1.bias', 'model.input_proj.1.1.weight', 'model.input_proj.2.0.bias', 'model.input_proj.2.0.weight', 'model.input_proj.2.1.bias', 'model.input_proj.2.1.weight', 'model.input_proj.3.0.bias', 'model.input_proj.3.0.weight', 'model.input_proj.3.1.bias', 'model.input_proj.3.1.weight', 'model.level_embed', 'model.reference_points.bias', 
'model.reference_points.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Some weights of DeformableDetrForObjectDetection were not initialized from the model checkpoint at facebook/detr-resnet-50 and are newly initialized because the shapes did not match:

  • model.query_position_embeddings.weight: found shape torch.Size([100, 256]) in the checkpoint and torch.Size([100, 512]) in the model instantiated

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

NielsRogge commented 2 days ago

Hi @spatiallysaying,

At inference time, you need to load DeformableDetrForObjectDetection rather than DetrForObjectDetection.
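
A minimal sketch of that change, reusing the repo name and label mapping from the original post. The `load_for_inference` wrapper and its parameters are hypothetical names introduced here for illustration; the only substantive difference from the failing snippet is the model class.

```python
def load_for_inference(repo_id: str, id2label: dict, device: str = "cpu"):
    """Load a fine-tuned Deformable DETR checkpoint and its image processor."""
    # Imported lazily so that defining this helper does not require
    # transformers to be importable until it is actually called.
    from transformers import AutoImageProcessor, DeformableDetrForObjectDetection

    # The checkpoint was saved from a Deformable DETR model, so it is loaded
    # with the matching class instead of DetrForObjectDetection.
    model = DeformableDetrForObjectDetection.from_pretrained(
        repo_id, id2label=id2label
    )
    processor = AutoImageProcessor.from_pretrained(repo_id)
    model.to(device)
    model.eval()  # inference mode: disables dropout
    return model, processor


# Example call (requires network access and the repo from the issue):
# model, processor = load_for_inference(
#     "xyz/ddetr-finetuned-balloon-v2", id2label={0: "balloon"}
# )
```

If the config.json pushed with the checkpoint records the model type as deformable_detr, which the warning in the original post suggests but does not confirm, `AutoModelForObjectDetection.from_pretrained` should resolve to the same class without naming it explicitly.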