Open charlescwwang opened 3 months ago
Hi @charlescwwang, I tried to train my model with 1 label. However, I'm unable to load the pretrained 80-class weights into my 1-class model. From your error logs, it looks like you got past that layer. Can you help me figure out how to do that?
`[07/16 10:00:40] INFO | Created log folder: runs/train/v9-dev
[07/16 10:00:40] INFO | Loaded train cache
[07/16 10:00:40] INFO | Building YOLO
[07/16 10:00:40] INFO | Building backbone
[07/16 10:00:40] INFO | Building neck
[07/16 10:00:41] INFO | Building head
[07/16 10:00:41] INFO | Building detection
[07/16 10:00:41] INFO | Building auxiliary
[07/16 10:00:41] INFO | Weight weights/v9-c.pt not found, try downloading
Downloading v9-c.pt... 100.0% • 102895262/102895262 bytes • 0:00:00
[07/16 10:00:42] INFO | Download completed.
Error executing job with overrides: ['task=train', 'task.data.batch_size=8', 'task.epoch=10', 'model=v9-c', 'class_num=1', 'dataset=dev.yaml', 'device=cuda']
Traceback (most recent call last):
File "/content/drive/MyDrive/Colab-Notebooks/yolov9/YOLO/yolo/lazy.py", line 27, in main
model = create_model(cfg.model, class_num=cfg.class_num, weight_path=cfg.weight)
File "/usr/local/lib/python3.10/dist-packages/yolo/model/yolo.py", line 136, in create_model
model.model.load_state_dict(torch.load(weight_path, map_location=torch.device("cpu")), strict=False)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2189, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ModuleList:
size mismatch for 22.heads.0.class_conv.2.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
size mismatch for 22.heads.0.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for 22.heads.1.class_conv.2.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
size mismatch for 22.heads.1.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for 22.heads.2.class_conv.2.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
size mismatch for 22.heads.2.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for 38.heads.0.class_conv.2.weight: copying a param with shape torch.Size([80, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 512, 1, 1]).
size mismatch for 38.heads.0.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for 38.heads.1.class_conv.2.weight: copying a param with shape torch.Size([80, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 512, 1, 1]).
size mismatch for 38.heads.1.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
size mismatch for 38.heads.2.class_conv.2.weight: copying a param with shape torch.Size([80, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 512, 1, 1]).
size mismatch for 38.heads.2.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.`
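For anyone hitting the same size mismatch: `strict=False` in `load_state_dict` only tolerates missing or unexpected keys, not tensors whose shapes differ, which is why the 80-class `class_conv` entries still raise. One common workaround, sketched below, is to drop the mismatched entries from the checkpoint before loading. The helper name `filter_compatible` is hypothetical and not part of this repository, and the commented usage assumes the checkpoint is a plain state dict, as the traceback suggests.

```python
def filter_compatible(checkpoint, model_state):
    """Split a checkpoint into entries that fit the model and entries that do not.

    Both arguments map parameter names to tensor-like objects with a
    ``.shape`` attribute (e.g. the result of ``torch.load(...)`` and
    ``model.state_dict()``).
    """
    kept, skipped = {}, []
    for name, tensor in checkpoint.items():
        target = model_state.get(name)
        if target is not None and tuple(tensor.shape) == tuple(target.shape):
            kept[name] = tensor
        else:
            skipped.append(name)  # key missing in the model, or shape differs
    return kept, skipped


# Hypothetical usage with PyTorch (not tested against this repository):
#   checkpoint = torch.load("weights/v9-c.pt", map_location="cpu")
#   kept, skipped = filter_compatible(checkpoint, model.model.state_dict())
#   model.model.load_state_dict(kept, strict=False)
#   print("left at initial values:", skipped)
```

With this approach the skipped layers (here, the class heads) keep their freshly initialized values and are learned during fine-tuning, while the backbone and neck still start from the pretrained weights.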
Sorry, I have no idea. This is my package list; maybe it helps.
aiofiles==24.1.0
antlr4-python3-runtime==4.9.3
anyio==4.4.0
argcomplete==3.4.0
attrs==23.2.0
beautifulsoup4==4.12.3
boto3==1.34.135
botocore==1.34.135
Brotli==1.1.0
cachetools==5.3.3
certifi==2024.6.2
charset-normalizer==3.3.2
click==8.1.7
contourpy==1.2.1
cycler==0.12.1
dacite==1.7.0
Deprecated==1.2.14
dill==0.3.8
dnspython==2.6.1
docker-pycreds==0.4.0
einops==0.8.0
exceptiongroup==1.2.1
fiftyone==0.24.1
fiftyone-brain==0.16.1
fiftyone_db==1.1.4
filelock==3.15.4
fonttools==4.53.0
fsspec==2024.6.0
ftfy==6.2.0
future==1.0.0
gitdb==4.0.11
GitPython==3.1.43
glob2==0.7
graphql-core==3.2.3
graphviz==0.20.3
h11==0.14.0
h2==4.1.0
hpack==4.0.0
httpcore==1.0.5
httpx==0.27.0
humanize==4.9.0
hydra-core==1.3.2
Hypercorn==0.17.3
hyperframe==6.0.1
idna==3.7
imageio==2.34.2
importlib_resources==6.4.0
inflate64==1.0.0
iniconfig==2.0.0
Jinja2==3.0.3
jmespath==1.0.1
joblib==1.4.2
jsonlines==4.0.0
kaleido==0.2.1
kiwisolver==1.4.5
lazy_loader==0.4
loguru==0.7.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.0
mdurl==0.1.2
mongoengine==0.24.2
motor==3.5.0
mpmath==1.3.0
multivolumefile==0.2.3
networkx==3.2.1
numpy==2.0.0
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
opencv-python==4.10.0.84
opencv-python-headless==4.10.0.84
packaging==24.1
pandas==2.2.2
pillow==10.3.0
platformdirs==4.2.2
plotly==5.22.0
pluggy==1.5.0
pprintpp==0.4.0
priority==2.0.0
protobuf==5.27.2
psutil==6.0.0
py7zr==0.21.0
pybcj==1.0.2
pycocotools==2.0.8
pycryptodomex==3.20.0
Pygments==2.18.0
pymongo==4.8.0
pyparsing==3.1.2
pyppmd==1.1.0
pytest==8.2.2
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
pyzstd==0.16.0
rarfile==4.2
regex==2024.5.15
requests==2.32.3
retrying==1.3.4
rich==13.7.1
s3transfer==0.10.2
scikit-image==0.24.0
scikit-learn==1.5.0
scipy==1.13.1
sentry-sdk==2.7.0
setproctitle==1.3.3
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
sortedcontainers==2.4.0
soupsieve==2.5
sse-starlette==0.10.3
sseclient-py==1.8.0
starlette==0.37.2
strawberry-graphql==0.138.1
sympy==1.12.1
tabulate==0.9.0
taskgroup==0.0.0a4
tenacity==8.4.2
texttable==1.7.0
threadpoolctl==3.5.0
tifffile==2024.6.18
tomli==2.0.1
torch==2.3.1
torchvision==0.18.1
tqdm==4.66.4
triton==2.3.1
typing_extensions==4.12.2
tzdata==2024.1
tzlocal==5.2
universal-analytics-python3==1.1.1
urllib3==1.26.19
voxel51-eta==0.12.6
wandb==0.17.3
wcwidth==0.2.13
wrapt==1.16.0
wsproto==1.2.0
xmltodict==0.13.0
zipp==3.19.2
@charlescwwang same issue as #67. In short, your image file names probably contain characters other than just numbers.
The root cause is the way the `calculate_ap` function is written. It should be something like this instead: https://lightning.ai/docs/torchmetrics/stable/detection/mean_average_precision.html
If the AP calculation didn't need image ids to begin with, the data loader would not need to return image paths at every step.
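To check whether a dataset is likely affected, something like the following sketch can list the image files whose names are not purely numeric, and build integer ids that do not depend on the file name at all. Both function names are hypothetical, and the exact id handling in the repository may differ; this only illustrates the idea.

```python
from pathlib import Path


def non_numeric_stems(image_paths):
    """Return the paths whose file stem is not made of ASCII digits only."""
    flagged = []
    for path in map(str, image_paths):
        stem = Path(path).stem
        if not (stem.isascii() and stem.isdigit()):
            flagged.append(path)
    return flagged


def assign_image_ids(image_paths):
    """Give every image an integer id based on sorted order, not on its name."""
    return {path: idx for idx, path in enumerate(sorted(map(str, image_paths)))}
```

For example, `non_numeric_stems(["images/000001.jpg", "images/cat_01.jpg"])` flags only the second path, while `assign_image_ids` works for any naming scheme as long as the same file list is used for predictions and ground truth.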
PR https://github.com/WongKinYiu/YOLO/pull/79 should fix this. @charlescwwang can you please try the branch https://github.com/Abdul-Mukit/YOLO/tree/67-fix-image-id-usage-consistency and let me know if you still face the same problem?
@Abdul-Mukit I tried the branch, and the training was successfully completed.
Issue Description I tried to train a model on my own data with 12 labels (COCO dataset format). When I try to train the model, the following error occurs.
Additional Context This is my command
This is the log