Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

Size mismatch #1476

Closed rgund26 closed 9 months ago

rgund26 commented 1 year ago

💡 Your Question

I have trained a custom YOLO-NAS model. Now I want to run that model in real time, but I am getting the error shown below. Please help me. I can't retrain the whole model; I am in the middle of my MS submission.

raise ValueError(f"ckpt layer {ckpt_key} with shape {ckpt_val.shape} does not match {model_key}" f" with shape {model_val.shape} in the model")
ValueError: ckpt layer backbone.stage1.blocks.conv1.conv.weight with shape torch.Size([64, 96, 1, 1]) does not match backbone.stage1.blocks.conv1.conv.weight with shape torch.Size([32, 96, 1, 1]) in the model

My python script

import cv2
import torch.cuda
from super_gradients.training import models
from super_gradients.common.object_names import Models
import torch

CLASSES = ['rice']

model = models.get('yolo_nas_m', num_classes= 1, checkpoint_path="ckpt_best.pth")

# Pick the device once, report it, then move the model onto it.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
model = model.to(device)

model.predict_webcam()

Versions

S:\Python\Python310\python.exe S:\CV_programming\YOLO-NAS\webcam.py

size mismatch for neck.neck3.blocks.bottlenecks.1.cv2.bn.running_var: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.stem.seq.conv.weight: copying a param with shape torch.Size([96, 96, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 96, 1, 1]).
size mismatch for heads.head1.stem.seq.bn.weight: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.stem.seq.bn.bias: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.stem.seq.bn.running_mean: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.stem.seq.bn.running_var: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.cls_convs.0.seq.conv.weight: copying a param with shape torch.Size([96, 96, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for heads.head1.cls_convs.0.seq.bn.weight: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.cls_convs.0.seq.bn.bias: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.cls_convs.0.seq.bn.running_mean: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.cls_convs.0.seq.bn.running_var: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.reg_convs.0.seq.conv.weight: copying a param with shape torch.Size([96, 96, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for heads.head1.reg_convs.0.seq.bn.weight: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.reg_convs.0.seq.bn.bias: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.reg_convs.0.seq.bn.running_mean: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.reg_convs.0.seq.bn.running_var: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.cls_pred.weight: copying a param with shape torch.Size([80, 96, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 128, 1, 1]).
size mismatch for heads.head1.cls_pred.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for heads.head1.reg_pred.weight: copying a param with shape torch.Size([68, 96, 1, 1]) from checkpoint, the shape in current model is torch.Size([68, 128, 1, 1]).
size mismatch for heads.head2.stem.seq.conv.weight: copying a param with shape torch.Size([192, 192, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 192, 1, 1]).
size mismatch for heads.head2.stem.seq.bn.weight: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for heads.head2.stem.seq.bn.bias: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for heads.head2.stem.seq.bn.running_mean: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for heads.head2.stem.seq.bn.running_var: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for heads.head2.cls_convs.0.seq.conv.weight: copying a param with shape torch.Size([192, 192, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
size mismatch for heads.head2.cls_convs.0.seq.bn.weight: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for heads.head2.cls_convs.0.seq.bn.bias: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for heads.head2.cls_convs.0.seq.bn.running_mean: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for heads.head2.cls_convs.0.seq.bn.running_var: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for heads.head2.reg_convs.0.seq.conv.weight: copying a param with shape torch.Size([192, 192, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
size mismatch for heads.head2.reg_convs.0.seq.bn.weight: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for heads.head2.reg_convs.0.seq.bn.bias: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for heads.head2.reg_convs.0.seq.bn.running_mean: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for heads.head2.reg_convs.0.seq.bn.running_var: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for heads.head2.cls_pred.weight: copying a param with shape torch.Size([80, 192, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
size mismatch for heads.head2.cls_pred.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for heads.head2.reg_pred.weight: copying a param with shape torch.Size([68, 192, 1, 1]) from checkpoint, the shape in current model is torch.Size([68, 256, 1, 1]).
size mismatch for heads.head3.stem.seq.conv.weight: copying a param with shape torch.Size([384, 384, 1, 1]) from checkpoint, the shape in current model is torch.Size([512, 384, 1, 1]).
size mismatch for heads.head3.stem.seq.bn.weight: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for heads.head3.stem.seq.bn.bias: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for heads.head3.stem.seq.bn.running_mean: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for heads.head3.stem.seq.bn.running_var: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for heads.head3.cls_convs.0.seq.conv.weight: copying a param with shape torch.Size([384, 384, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
size mismatch for heads.head3.cls_convs.0.seq.bn.weight: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for heads.head3.cls_convs.0.seq.bn.bias: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for heads.head3.cls_convs.0.seq.bn.running_mean: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for heads.head3.cls_convs.0.seq.bn.running_var: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for heads.head3.reg_convs.0.seq.conv.weight: copying a param with shape torch.Size([384, 384, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
size mismatch for heads.head3.reg_convs.0.seq.bn.weight: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for heads.head3.reg_convs.0.seq.bn.bias: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for heads.head3.reg_convs.0.seq.bn.running_mean: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for heads.head3.reg_convs.0.seq.bn.running_var: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for heads.head3.cls_pred.weight: copying a param with shape torch.Size([80, 384, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 512, 1, 1]).
size mismatch for heads.head3.cls_pred.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for heads.head3.reg_pred.weight: copying a param with shape torch.Size([68, 384, 1, 1]) from checkpoint, the shape in current model is torch.Size([68, 512, 1, 1]).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "S:\CV_programming\YOLO-NAS\webcam.py", line 11, in <module>
    model = models.get('yolo_nas_l', num_classes= 1, checkpoint_path="ckpt_best.pth")
  File "S:\Python\Python310\lib\site-packages\super_gradients\training\models\modelfactory.py", line 208, in get
    = load_checkpoint_to_model(
  File "S:\Python\Python310\lib\site-packages\super_gradients\training\utils\checkpoint_utils.py", line 229, in load_checkpoint_to_model
    adaptive_load_state_dict(net, checkpoint, strict)
  File "S:\Python\Python310\lib\site-packages\super_gradients\training\utils\checkpoint_utils.py", line 61, in adaptive_load_state_dict
    adapted_state_dict = adapt_state_dict_to_fit_model_layer_names(net.state_dict(), state_dict, solver=solver)
  File "S:\Python\Python310\lib\site-packages\super_gradients\training\utils\checkpoint_utils.py", line 159, in adapt_state_dict_to_fit_model_layer_names
    raise ValueError(f"ckpt layer {ckpt_key} with shape {ckpt_val.shape} does not match {model_key}" f" with shape {model_val.shape} in the model")
ValueError: ckpt layer backbone.stage1.blocks.conv1.conv.weight with shape torch.Size([64, 96, 1, 1]) does not match backbone.stage1.blocks.conv1.conv.weight with shape torch.Size([96, 96, 1, 1]) in the model

Process finished with exit code 1

BloodAxe commented 1 year ago

The error suggests that the architecture stored in your checkpoint does not match the model you are trying to load it into. For example, you trained the M variant but are trying to load the checkpoint into the S or L version. Please double-check which model architecture you actually trained and make sure it matches the model you instantiate with models.get.

Even your traceback has a discrepancy.

At the start of the post you provide the snippet model = models.get('yolo_nas_m', num_classes= 1, checkpoint_path="ckpt_best.pth") and further down it becomes model = models.get('yolo_nas_l', num_classes= 1, checkpoint_path="ckpt_best.pth")
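One way to see which dimensions disagree is to read the "size mismatch" lines from the log mechanically. The following is only a minimal sketch: the helper name parse_size_mismatches is hypothetical, and the regular expression assumes the exact message wording shown in the log in this thread.

```python
import re

# Matches one "size mismatch" line in the wording shown in the log above.
_MISMATCH = re.compile(
    r"size mismatch for (?P<name>[\w.]+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]*)\]\) from checkpoint, the shape in "
    r"current model is torch\.Size\(\[(?P<model>[\d, ]*)\]\)"
)


def _dims(text):
    """Turn '96, 96, 1, 1' into (96, 96, 1, 1)."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def parse_size_mismatches(log_text):
    """Return {param_name: (checkpoint_shape, model_shape)} for each mismatch line."""
    return {
        m.group("name"): (_dims(m.group("ckpt")), _dims(m.group("model")))
        for m in _MISMATCH.finditer(log_text)
    }


# Two lines copied from the log in this thread.
log = """
size mismatch for heads.head1.stem.seq.bn.weight: copying a param with shape torch.Size([96]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for heads.head1.cls_pred.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
"""

mismatches = parse_size_mismatches(log)
for name, (ckpt_shape, model_shape) in mismatches.items():
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

Read this way, the posted log shows two separate differences: the layer widths (96 in the checkpoint, 128 in the instantiated model) and the size of the classification head (80 in the checkpoint, 1 in the model). That would be consistent with the checkpoint file holding a different variant and class count than the script requests.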

Louis-Dupont commented 9 months ago

@rgund26 I'm closing this issue due to inactivity. If the proposed solution did not solve your issue feel free to reopen it.

bhautik-pithadiya commented 8 months ago

Same here. I fine-tuned a yolo_nas_l model and now I am not able to load best_ckpt.pth. This was during fine-tuning: [Screenshot from 2024-02-06 13-12-27]

Now I am loading the weights and getting this error: [image] [image]

Can you help me?

rgund26 commented 8 months ago

Hello Bhautik,

This mostly happens when the model you trained does not match the model in your real-time Python script.

Sorry, I'm not able to recall the details now; it's been a while.

electroendjneer commented 6 months ago

Try loading the model with the "checkpoint_num_classes" argument, replacing OLD_CLASS_NUMBER with the number of classes the checkpoint was trained with:

model = models.get('yolo_nas_m', num_classes=1, checkpoint_num_classes=OLD_CLASS_NUMBER, checkpoint_path="ckpt_best.pth")
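If the old class count is not known, it can possibly be read off the checkpoint's classification-head shapes. This is a rough sketch under assumptions: the helper infer_checkpoint_num_classes is hypothetical, the "cls_pred.bias" suffix is taken from the log earlier in this thread, and how you obtain the name-to-shape mapping from your own checkpoint file is left out.

```python
def infer_checkpoint_num_classes(shapes):
    """Guess the class count from the first dimension of every '*cls_pred.bias' entry."""
    counts = {
        shape[0] for name, shape in shapes.items() if name.endswith("cls_pred.bias")
    }
    if len(counts) != 1:
        raise ValueError(f"expected one consistent class count, found {sorted(counts)}")
    return counts.pop()


# Checkpoint-side shapes copied from the log earlier in this thread.
checkpoint_shapes = {
    "heads.head1.cls_pred.bias": (80,),
    "heads.head2.cls_pred.bias": (80,),
    "heads.head3.cls_pred.bias": (80,),
}

old_num_classes = infer_checkpoint_num_classes(checkpoint_shapes)
print(old_num_classes)

# Hypothetical follow-up using the call suggested above (not executed here):
# model = models.get('yolo_nas_m', num_classes=1,
#                    checkpoint_num_classes=old_num_classes,
#                    checkpoint_path="ckpt_best.pth")
```

Note that this only addresses a difference in the number of classes. If the layer widths also differ, as they do in the log above, the variant name passed to models.get still has to match the one that was trained.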