However, when I try to load the model using the following code, I am seeing this issue:
from transformers import AutoImageProcessor, DetrForObjectDetection
import torch

model = DetrForObjectDetection.from_pretrained("xyz/ddetr-finetuned-balloon-v2", id2label={0: "balloon"})
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
processor = AutoImageProcessor.from_pretrained("xyz/ddetr-finetuned-balloon-v2")
RuntimeError: Error(s) in loading state_dict for DetrForObjectDetection:
size mismatch for model.query_position_embeddings.weight: copying a param with shape torch.Size([100, 512]) from checkpoint, the shape in current model is torch.Size([100, 256]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.
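For context, the mismatched widths (512 vs. 256) line up with a known difference between the two architectures: DETR's query position embeddings have width d_model, while Deformable DETR's have width 2 * d_model. A minimal sketch of that arithmetic, using only the shapes from the error above:

```python
# Shapes reported in the error message above
checkpoint_shape = (100, 512)     # stored in the fine-tuned checkpoint
current_model_shape = (100, 256)  # built by DetrForObjectDetection

d_model = 256  # default hidden size in both DETR and Deformable DETR configs
# DETR allocates d_model per query; Deformable DETR allocates 2 * d_model
assert current_model_shape == (100, d_model)
assert checkpoint_shape == (100, 2 * d_model)
```

So the checkpoint looks like it was saved from a Deformable DETR model, while the loading code instantiates plain DETR.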
My suspicion is that the cause is this warning, which appeared after executing:
model = Detr(lr=1e-4, lr_backbone=1e-5, weight_decay=1e-4)
outputs = model(pixel_values=batch['pixel_values'], pixel_mask=batch['pixel_mask'])
You are using a model of type detr to instantiate a model of type deformable_detr. This is not supported for all configurations of models and can yield errors.
Some weights of DeformableDetrForObjectDetection were not initialized from the model checkpoint at facebook/detr-resnet-50 and are newly initialized: ['bbox_embed.0.layers.0.bias', 'bbox_embed.0.layers.0.weight', 'bbox_embed.0.layers.1.bias', 'bbox_embed.0.layers.1.weight', 'bbox_embed.0.layers.2.bias', 'bbox_embed.0.layers.2.weight', 'bbox_embed.1.layers.0.bias', 'bbox_embed.1.layers.0.weight', 'bbox_embed.1.layers.1.bias', 'bbox_embed.1.layers.1.weight', 'bbox_embed.1.layers.2.bias', 'bbox_embed.1.layers.2.weight', 'bbox_embed.2.layers.0.bias', 'bbox_embed.2.layers.0.weight', 'bbox_embed.2.layers.1.bias', 'bbox_embed.2.layers.1.weight', 'bbox_embed.2.layers.2.bias', 'bbox_embed.2.layers.2.weight', 'bbox_embed.3.layers.0.bias', 'bbox_embed.3.layers.0.weight', 'bbox_embed.3.layers.1.bias', 'bbox_embed.3.layers.1.weight', 'bbox_embed.3.layers.2.bias', 'bbox_embed.3.layers.2.weight', 'bbox_embed.4.layers.0.bias', 'bbox_embed.4.layers.0.weight', 'bbox_embed.4.layers.1.bias', 'bbox_embed.4.layers.1.weight', 'bbox_embed.4.layers.2.bias', 'bbox_embed.4.layers.2.weight', 'bbox_embed.5.layers.0.bias', 'bbox_embed.5.layers.0.weight', 'bbox_embed.5.layers.1.bias', 'bbox_embed.5.layers.1.weight', 'bbox_embed.5.layers.2.bias', 'bbox_embed.5.layers.2.weight', 'class_embed.0.bias', 'class_embed.0.weight', 'class_embed.1.bias', 'class_embed.1.weight', 'class_embed.2.bias', 'class_embed.2.weight', 'class_embed.3.bias', 'class_embed.3.weight', 'class_embed.4.bias', 'class_embed.4.weight', 'class_embed.5.bias', 'class_embed.5.weight', 'model.decoder.layers.0.encoder_attn.attention_weights.bias', 'model.decoder.layers.0.encoder_attn.attention_weights.weight', 'model.decoder.layers.0.encoder_attn.output_proj.bias', 'model.decoder.layers.0.encoder_attn.output_proj.weight', 'model.decoder.layers.0.encoder_attn.sampling_offsets.bias', 'model.decoder.layers.0.encoder_attn.sampling_offsets.weight', 'model.decoder.layers.0.encoder_attn.value_proj.bias', 
'model.decoder.layers.0.encoder_attn.value_proj.weight', 'model.decoder.layers.1.encoder_attn.attention_weights.bias', 'model.decoder.layers.1.encoder_attn.attention_weights.weight', 'model.decoder.layers.1.encoder_attn.output_proj.bias', 'model.decoder.layers.1.encoder_attn.output_proj.weight', 'model.decoder.layers.1.encoder_attn.sampling_offsets.bias', 'model.decoder.layers.1.encoder_attn.sampling_offsets.weight', 'model.decoder.layers.1.encoder_attn.value_proj.bias', 'model.decoder.layers.1.encoder_attn.value_proj.weight', 'model.decoder.layers.2.encoder_attn.attention_weights.bias', 'model.decoder.layers.2.encoder_attn.attention_weights.weight', 'model.decoder.layers.2.encoder_attn.output_proj.bias', 'model.decoder.layers.2.encoder_attn.output_proj.weight', 'model.decoder.layers.2.encoder_attn.sampling_offsets.bias', 'model.decoder.layers.2.encoder_attn.sampling_offsets.weight', 'model.decoder.layers.2.encoder_attn.value_proj.bias', 'model.decoder.layers.2.encoder_attn.value_proj.weight', 'model.decoder.layers.3.encoder_attn.attention_weights.bias', 'model.decoder.layers.3.encoder_attn.attention_weights.weight', 'model.decoder.layers.3.encoder_attn.output_proj.bias', 'model.decoder.layers.3.encoder_attn.output_proj.weight', 'model.decoder.layers.3.encoder_attn.sampling_offsets.bias', 'model.decoder.layers.3.encoder_attn.sampling_offsets.weight', 'model.decoder.layers.3.encoder_attn.value_proj.bias', 'model.decoder.layers.3.encoder_attn.value_proj.weight', 'model.decoder.layers.4.encoder_attn.attention_weights.bias', 'model.decoder.layers.4.encoder_attn.attention_weights.weight', 'model.decoder.layers.4.encoder_attn.output_proj.bias', 'model.decoder.layers.4.encoder_attn.output_proj.weight', 'model.decoder.layers.4.encoder_attn.sampling_offsets.bias', 'model.decoder.layers.4.encoder_attn.sampling_offsets.weight', 'model.decoder.layers.4.encoder_attn.value_proj.bias', 'model.decoder.layers.4.encoder_attn.value_proj.weight', 
'model.decoder.layers.5.encoder_attn.attention_weights.bias', 'model.decoder.layers.5.encoder_attn.attention_weights.weight', 'model.decoder.layers.5.encoder_attn.output_proj.bias', 'model.decoder.layers.5.encoder_attn.output_proj.weight', 'model.decoder.layers.5.encoder_attn.sampling_offsets.bias', 'model.decoder.layers.5.encoder_attn.sampling_offsets.weight', 'model.decoder.layers.5.encoder_attn.value_proj.bias', 'model.decoder.layers.5.encoder_attn.value_proj.weight', 'model.encoder.layers.0.self_attn.attention_weights.bias', 'model.encoder.layers.0.self_attn.attention_weights.weight', 'model.encoder.layers.0.self_attn.output_proj.bias', 'model.encoder.layers.0.self_attn.output_proj.weight', 'model.encoder.layers.0.self_attn.sampling_offsets.bias', 'model.encoder.layers.0.self_attn.sampling_offsets.weight', 'model.encoder.layers.0.self_attn.value_proj.bias', 'model.encoder.layers.0.self_attn.value_proj.weight', 'model.encoder.layers.1.self_attn.attention_weights.bias', 'model.encoder.layers.1.self_attn.attention_weights.weight', 'model.encoder.layers.1.self_attn.output_proj.bias', 'model.encoder.layers.1.self_attn.output_proj.weight', 'model.encoder.layers.1.self_attn.sampling_offsets.bias', 'model.encoder.layers.1.self_attn.sampling_offsets.weight', 'model.encoder.layers.1.self_attn.value_proj.bias', 'model.encoder.layers.1.self_attn.value_proj.weight', 'model.encoder.layers.2.self_attn.attention_weights.bias', 'model.encoder.layers.2.self_attn.attention_weights.weight', 'model.encoder.layers.2.self_attn.output_proj.bias', 'model.encoder.layers.2.self_attn.output_proj.weight', 'model.encoder.layers.2.self_attn.sampling_offsets.bias', 'model.encoder.layers.2.self_attn.sampling_offsets.weight', 'model.encoder.layers.2.self_attn.value_proj.bias', 'model.encoder.layers.2.self_attn.value_proj.weight', 'model.encoder.layers.3.self_attn.attention_weights.bias', 'model.encoder.layers.3.self_attn.attention_weights.weight', 
'model.encoder.layers.3.self_attn.output_proj.bias', 'model.encoder.layers.3.self_attn.output_proj.weight', 'model.encoder.layers.3.self_attn.sampling_offsets.bias', 'model.encoder.layers.3.self_attn.sampling_offsets.weight', 'model.encoder.layers.3.self_attn.value_proj.bias', 'model.encoder.layers.3.self_attn.value_proj.weight', 'model.encoder.layers.4.self_attn.attention_weights.bias', 'model.encoder.layers.4.self_attn.attention_weights.weight', 'model.encoder.layers.4.self_attn.output_proj.bias', 'model.encoder.layers.4.self_attn.output_proj.weight', 'model.encoder.layers.4.self_attn.sampling_offsets.bias', 'model.encoder.layers.4.self_attn.sampling_offsets.weight', 'model.encoder.layers.4.self_attn.value_proj.bias', 'model.encoder.layers.4.self_attn.value_proj.weight', 'model.encoder.layers.5.self_attn.attention_weights.bias', 'model.encoder.layers.5.self_attn.attention_weights.weight', 'model.encoder.layers.5.self_attn.output_proj.bias', 'model.encoder.layers.5.self_attn.output_proj.weight', 'model.encoder.layers.5.self_attn.sampling_offsets.bias', 'model.encoder.layers.5.self_attn.sampling_offsets.weight', 'model.encoder.layers.5.self_attn.value_proj.bias', 'model.encoder.layers.5.self_attn.value_proj.weight', 'model.input_proj.0.0.bias', 'model.input_proj.0.0.weight', 'model.input_proj.0.1.bias', 'model.input_proj.0.1.weight', 'model.input_proj.1.0.bias', 'model.input_proj.1.0.weight', 'model.input_proj.1.1.bias', 'model.input_proj.1.1.weight', 'model.input_proj.2.0.bias', 'model.input_proj.2.0.weight', 'model.input_proj.2.1.bias', 'model.input_proj.2.1.weight', 'model.input_proj.3.0.bias', 'model.input_proj.3.0.weight', 'model.input_proj.3.1.bias', 'model.input_proj.3.1.weight', 'model.level_embed', 'model.reference_points.bias', 'model.reference_points.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of DeformableDetrForObjectDetection were not initialized from the model checkpoint at facebook/detr-resnet-50 and are newly initialized because the shapes did not match:
model.query_position_embeddings.weight: found shape torch.Size([100, 256]) in the checkpoint and torch.Size([100, 512]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
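In case it helps with the investigation: the warning above flags that the checkpoint's config declares one model_type while a different class is used to instantiate it. The two classes register distinct model_type values in transformers, which can be checked locally without downloading anything (a sketch, assuming transformers is installed):

```python
from transformers import DetrConfig, DeformableDetrConfig

# Each architecture registers its own model_type string; from_pretrained
# warns when the checkpoint's declared type and the target class disagree.
detr_cfg = DetrConfig()
ddetr_cfg = DeformableDetrConfig()

assert DetrConfig.model_type == "detr"
assert DeformableDetrConfig.model_type == "deformable_detr"
# Both default to the same hidden size, so the shape mismatch comes from
# how the query position embeddings use d_model, not from d_model itself.
assert detr_cfg.d_model == ddetr_cfg.d_model == 256
```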
I started with this DETR notebook as a base. Training seems successful, as I get:

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=50` reached.

I successfully pushed the model to a Hugging Face repo.