How to train model using custom data?

charlescwwang commented 3 months ago

Issue Description I tried to train the model on my own data, which has 12 labels and is in COCO dataset format. During training, the following error occurs.

Additional Context This is my command:

python yolo/lazy.py task=train task.epoch=10 task.data.batch_size=8 model=v9-m dataset=data device=cuda name=test-2

This is the log:

[06/28 18:19:05]   INFO  | 📄 Created log folder: runs/train/test-2
[06/28 18:19:05]   INFO  | 📦 Loaded train cache
[06/28 18:19:05]   INFO  | 🚜 Building YOLO
[06/28 18:19:05]   INFO  |   🏗️  Building backbone
[06/28 18:19:05]   INFO  |   🏗️  Building neck
[06/28 18:19:05]   INFO  |   🏗️  Building head
[06/28 18:19:05]   INFO  |   🏗️  Building detection
[06/28 18:19:05]   INFO  |   🏗️  Building auxiliary
[06/28 18:19:05]   INFO  | ✅ Success load model & weight
[06/28 18:19:06]   INFO  | 🧸 Found no stride of model, performed a dummy test for auto-anchor size
[06/28 18:19:08]   INFO  | ✅ Success load loss function
                             Model Layers                             
┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Index ┃     Layer Type     ┃ Tags ┃    Params ┃ Channels (IN->OUT) ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│   1   │        Conv        │      │       928 │       3 ->   32    │
│   2   │        Conv        │      │    18,560 │      32 ->   64    │
│   3   │    RepNCSPELAN     │      │   171,648 │      64 ->  128    │
│   4   │       AConv        │      │   276,960 │     128 ->  240    │
│   5   │    RepNCSPELAN     │  B3  │   629,520 │     240 ->  240    │
│   6   │       AConv        │      │   778,320 │     240 ->  360    │
│   7   │    RepNCSPELAN     │  B4  │ 1,414,080 │     360 ->  360    │
│   8   │       AConv        │      │ 1,556,160 │     360 ->  480    │
│   9   │    RepNCSPELAN     │  B5  │ 2,511,840 │     480 ->  480    │
│  10   │      SPPELAN       │  N3  │   577,440 │     480 ->  480    │
│  11   │      UpSample      │      │         0 │         -          │
│  12   │       Concat       │      │         0 │         -          │
│  13   │    RepNCSPELAN     │  N4  │ 1,586,880 │     840 ->  360    │
│  14   │      UpSample      │      │         0 │         -          │
│  15   │       Concat       │      │         0 │         -          │
│  16   │    RepNCSPELAN     │  P3  │   715,920 │     600 ->  240    │
│  17   │       AConv        │      │   397,808 │     240 ->  184    │
│  18   │       Concat       │      │         0 │         -          │
│  19   │    RepNCSPELAN     │  P4  │ 1,480,320 │     544 ->  360    │
│  20   │       AConv        │      │   778,080 │     360 ->  240    │
│  21   │       Concat       │      │         0 │         -          │
│  22   │    RepNCSPELAN     │  P5  │ 2,627,040 │     720 ->  480    │
│  23   │ MultiheadDetection │ Main │ 4,602,528 │       M -> 1080    │
│  24   │      CBLinear      │  R3  │    57,840 │     240 ->    M    │
│  25   │      CBLinear      │  R4  │   216,600 │     360 ->    M    │
│  26   │      CBLinear      │  R5  │   519,480 │     480 ->    M    │
│  27   │        Conv        │      │       928 │       3 ->   32    │
│  28   │        Conv        │      │    18,560 │      32 ->   64    │
│  29   │    RepNCSPELAN     │      │   171,648 │      64 ->  128    │
│  30   │       AConv        │      │   276,960 │     128 ->  240    │
│  31   │       CBFuse       │      │         0 │         -          │
│  32   │    RepNCSPELAN     │  A3  │   629,520 │     240 ->  240    │
│  33   │       AConv        │      │   778,320 │     240 ->  360    │
│  34   │       CBFuse       │      │         0 │         -          │
│  35   │    RepNCSPELAN     │  A4  │ 1,414,080 │     360 ->  360    │
│  36   │       AConv        │      │ 1,556,160 │     360 ->  480    │
│  37   │       CBFuse       │      │         0 │         -          │
│  38   │    RepNCSPELAN     │  A5  │ 2,511,840 │     480 ->  480    │
│  39   │ MultiheadDetection │ AUX  │ 4,602,528 │       M -> 1080    │
└───────┴────────────────────┴──────┴───────────┴────────────────────┘
[06/28 18:19:08] WARNING | ⚠️ Could not find graphviz backend, continue without drawing the model architecture
[06/28 18:19:08]   INFO  | 📦 Loaded validation cache
[06/28 18:19:08]   INFO  | 🚄 Start Training!
/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:143: UserWarning: Detected call of 
`lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before 
`lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at 
https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
⠧ Validate |  mAP.5  |mAP.5:.95| ━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/10 0:01:02
⠧ Run pycocotools                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1  -:--:--
💾 success save at runs/train/test-2/weights/E000.pt
/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:156: UserWarning: The epoch parameter in 
`scheduler.step()` was not necessary and is being deprecated where possible. Please use `scheduler.step()` to step the scheduler. During the 
deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you 
are unable to replicate your use case: https://github.com/pytorch/pytorch/issues/new/choose.
  warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
⠸ Validate |  mAP.5  |mAP.5:.95| ━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/10 0:01:05
⠸ Run pycocotools                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1  -:--:--
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Epoch ┃ Avg. Precision ┃       ┃ Avg. Recall    ┃       ┃
💾 success save at runs/train/test-2/weights/E001.pt
⠙ Validate |  mAP.5  |mAP.5:.95| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/10 0:01:00
⠙ Run pycocotools                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1  -:--:--
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Epoch ┃ Avg. Precision ┃       ┃ Avg. Recall    ┃       ┃
Error executing job with overrides: ['task=train', 'task.epoch=10', 'task.data.batch_size=8', 'model=v9-m', 'dataset=data', 'device=cuda', 
'name=test-2']
Traceback (most recent call last):
  File "/home/localadmin/YOLO/yolo/lazy.py", line 42, in <module>
    main()
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/localadmin/YOLO/yolo/lazy.py", line 38, in main
    solver.solve(dataloader)
  File "/home/localadmin/YOLO/yolo/tools/solver.py", line 145, in solve
    mAPs = self.validator.solve(self.validation_dataloader, epoch_idx=epoch_idx)
  File "/home/localadmin/YOLO/yolo/tools/solver.py", line 256, in solve
    result = calculate_ap(self.coco_gt, predict_json)
  File "/home/localadmin/YOLO/yolo/utils/solver_utils.py", line 12, in calculate_ap
    coco_dt = coco_gt.loadRes(pd_path)
  File "/home/localadmin/anaconda3/envs/yolo-MIT/lib/python3.9/site-packages/pycocotools/coco.py", line 332, in loadRes
    assert set(annsImgIds) == (set(annsImgIds) & set(self.getImgIds())), \
AssertionError: Results do not correspond to current coco set
⠙ Validate |  mAP.5  |mAP.5:.95| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/10 0:01:00
⠙ Run pycocotools                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1  -:--:--
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Epoch ┃ Avg. Precision ┃       ┃ Avg. Recall    ┃       ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━┩
│    0  │ AP @ .5:.95    │  0.00 │ AP @        .5 │  0.00 │
│       │                │       │                │       │
│    1  │ AP @ .5:.95    │  0.00 │ AR maxDets   1 │  0.00 │
│    1  │ AP @     .5    │  0.00 │ AR maxDets  10 │  0.00 │
│    1  │ AP @    .75    │  0.00 │ AR maxDets 100 │  0.00 │
│    1  │ AP  (small)    │  0.00 │ AR     (small) │  0.00 │
│    1  │ AP (medium)    │  0.00 │ AR    (medium) │  0.00 │
│    1  │ AP  (large)    │  0.00 │ AR     (large) │  0.00 │
└───────┴────────────────┴───────┴────────────────┴───────┘
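
The assertion at the end of the traceback comes from pycocotools `loadRes`, which requires every `image_id` in the prediction file to also appear in the ground-truth annotations. A minimal sketch for finding the offending ids, assuming both inputs follow the standard COCO JSON layout (the helper name `find_unmatched_image_ids` and the sample records are hypothetical, not part of the YOLO repo):

```python
def find_unmatched_image_ids(gt_images, predictions):
    """Return prediction image ids that are missing from the ground truth.

    gt_images:   the "images" list of a COCO annotation file
    predictions: the list of detections in COCO results format
    """
    gt_ids = {img["id"] for img in gt_images}
    pred_ids = {det["image_id"] for det in predictions}
    # loadRes asserts that the prediction ids are a subset of the ground-truth ids.
    return sorted(pred_ids - gt_ids, key=str)


if __name__ == "__main__":
    # Made-up records; in practice load both lists from disk with json.load(...).
    gt_images = [
        {"id": 1, "file_name": "0001.jpg"},
        {"id": 2, "file_name": "0002.jpg"},
    ]
    predictions = [
        {"image_id": 1, "category_id": 0, "bbox": [0, 0, 10, 10], "score": 0.9},
        {"image_id": 7, "category_id": 0, "bbox": [5, 5, 20, 20], "score": 0.4},
    ]
    print(find_unmatched_image_ids(gt_images, predictions))
```

Any id this returns is one that would trigger "Results do not correspond to current coco set".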

prithivi1 commented 2 months ago

Hi @charlescwwang, I tried to train my model with 1 label. However, I'm unable to load the pretrained weights, which have 80 classes, into my 1-class model. I can see from your logs that you got past that step. Can you help me figure out how to do that?

`[07/16 10:00:40] INFO | 📄 Created log folder: runs/train/v9-dev
[07/16 10:00:40] INFO | 📦 Loaded train cache
[07/16 10:00:40] INFO | 🚜 Building YOLO
[07/16 10:00:40] INFO | 🏗️ Building backbone
[07/16 10:00:40] INFO | 🏗️ Building neck
[07/16 10:00:41] INFO | 🏗️ Building head
[07/16 10:00:41] INFO | 🏗️ Building detection
[07/16 10:00:41] INFO | 🏗️ Building auxiliary
[07/16 10:00:41] INFO | 🌐 Weight weights/v9-c.pt not found, try downloading
📥 Downloading v9-c.pt... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 102895262/102895262 bytes • 0:00:00
[07/16 10:00:42] INFO | ✅ Download completed.
Error executing job with overrides: ['task=train', 'task.data.batch_size=8', 'task.epoch=10', 'model=v9-c', 'class_num=1', 'dataset=dev.yaml', 'device=cuda']
Traceback (most recent call last):
  File "/content/drive/MyDrive/Colab-Notebooks/yolov9/YOLO/yolo/lazy.py", line 27, in main
    model = create_model(cfg.model, class_num=cfg.class_num, weight_path=cfg.weight)
  File "/usr/local/lib/python3.10/dist-packages/yolo/model/yolo.py", line 136, in create_model
    model.model.load_state_dict(torch.load(weight_path, map_location=torch.device("cpu")), strict=False)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2189, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ModuleList:
    size mismatch for 22.heads.0.class_conv.2.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
    size mismatch for 22.heads.0.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
    size mismatch for 22.heads.1.class_conv.2.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
    size mismatch for 22.heads.1.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
    size mismatch for 22.heads.2.class_conv.2.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 256, 1, 1]).
    size mismatch for 22.heads.2.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
    size mismatch for 38.heads.0.class_conv.2.weight: copying a param with shape torch.Size([80, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 512, 1, 1]).
    size mismatch for 38.heads.0.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
    size mismatch for 38.heads.1.class_conv.2.weight: copying a param with shape torch.Size([80, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 512, 1, 1]).
    size mismatch for 38.heads.1.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).
    size mismatch for 38.heads.2.class_conv.2.weight: copying a param with shape torch.Size([80, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([1, 512, 1, 1]).
    size mismatch for 38.heads.2.class_conv.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([1]).

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.`
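
Every mismatch in this log is in a `class_conv` layer, whose first dimension is the class count (80 in the checkpoint, 1 in the new model). One common workaround, sketched here under assumptions rather than taken from the repo, is to drop checkpoint entries whose shape differs from the model's before loading, so that only those class layers keep their fresh initialization. The helper name `filter_matching_shapes` is hypothetical; it only needs values that expose `.shape`, as `torch.Tensor` does, and the demo uses a stand-in type so it runs without PyTorch:

```python
from collections import namedtuple


def filter_matching_shapes(checkpoint_state, model_state):
    """Split checkpoint entries into those that fit the model and those that do not."""
    kept, skipped = {}, []
    for name, value in checkpoint_state.items():
        target = model_state.get(name)
        if target is not None and tuple(target.shape) == tuple(value.shape):
            kept[name] = value
        else:
            # Missing in the model, or same name but a different shape.
            skipped.append(name)
    return kept, skipped


# Stand-in for a tensor; real code would pass two torch state dicts instead.
FakeTensor = namedtuple("FakeTensor", ["shape"])

# Illustrative entries only: the first key name is made up, the second is from the log.
checkpoint = {
    "0.conv.weight": FakeTensor((32, 3, 3, 3)),
    "22.heads.0.class_conv.2.weight": FakeTensor((80, 256, 1, 1)),
}
model = {
    "0.conv.weight": FakeTensor((32, 3, 3, 3)),
    "22.heads.0.class_conv.2.weight": FakeTensor((1, 256, 1, 1)),
}

kept, skipped = filter_matching_shapes(checkpoint, model)
print(sorted(kept))
print(skipped)
```

With PyTorch, the two arguments would be the dict returned by `torch.load(weight_path, map_location="cpu")` and the target module's `state_dict()`, followed by `load_state_dict(kept, strict=False)` on that module. Note that `strict=False` on its own only tolerates missing or unexpected keys; it still raises on size mismatches, which is what the log shows.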

charlescwwang commented 2 months ago

Sorry, I have no idea. Here is my package list; maybe it helps.

aiofiles==24.1.0
antlr4-python3-runtime==4.9.3
anyio==4.4.0
argcomplete==3.4.0
attrs==23.2.0
beautifulsoup4==4.12.3
boto3==1.34.135
botocore==1.34.135
Brotli==1.1.0
cachetools==5.3.3
certifi==2024.6.2
charset-normalizer==3.3.2
click==8.1.7
contourpy==1.2.1
cycler==0.12.1
dacite==1.7.0
Deprecated==1.2.14
dill==0.3.8
dnspython==2.6.1
docker-pycreds==0.4.0
einops==0.8.0
exceptiongroup==1.2.1
fiftyone==0.24.1
fiftyone-brain==0.16.1
fiftyone_db==1.1.4
filelock==3.15.4
fonttools==4.53.0
fsspec==2024.6.0
ftfy==6.2.0
future==1.0.0
gitdb==4.0.11
GitPython==3.1.43
glob2==0.7
graphql-core==3.2.3
graphviz==0.20.3
h11==0.14.0
h2==4.1.0
hpack==4.0.0
httpcore==1.0.5
httpx==0.27.0
humanize==4.9.0
hydra-core==1.3.2
Hypercorn==0.17.3
hyperframe==6.0.1
idna==3.7
imageio==2.34.2
importlib_resources==6.4.0
inflate64==1.0.0
iniconfig==2.0.0
Jinja2==3.0.3
jmespath==1.0.1
joblib==1.4.2
jsonlines==4.0.0
kaleido==0.2.1
kiwisolver==1.4.5
lazy_loader==0.4
loguru==0.7.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.0
mdurl==0.1.2
mongoengine==0.24.2
motor==3.5.0
mpmath==1.3.0
multivolumefile==0.2.3
networkx==3.2.1
numpy==2.0.0
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
opencv-python==4.10.0.84
opencv-python-headless==4.10.0.84
packaging==24.1
pandas==2.2.2
pillow==10.3.0
platformdirs==4.2.2
plotly==5.22.0
pluggy==1.5.0
pprintpp==0.4.0
priority==2.0.0
protobuf==5.27.2
psutil==6.0.0
py7zr==0.21.0
pybcj==1.0.2
pycocotools==2.0.8
pycryptodomex==3.20.0
Pygments==2.18.0
pymongo==4.8.0
pyparsing==3.1.2
pyppmd==1.1.0
pytest==8.2.2
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
pyzstd==0.16.0
rarfile==4.2
regex==2024.5.15
requests==2.32.3
retrying==1.3.4
rich==13.7.1
s3transfer==0.10.2
scikit-image==0.24.0
scikit-learn==1.5.0
scipy==1.13.1
sentry-sdk==2.7.0
setproctitle==1.3.3
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
sortedcontainers==2.4.0
soupsieve==2.5
sse-starlette==0.10.3
sseclient-py==1.8.0
starlette==0.37.2
strawberry-graphql==0.138.1
sympy==1.12.1
tabulate==0.9.0
taskgroup==0.0.0a4
tenacity==8.4.2
texttable==1.7.0
threadpoolctl==3.5.0
tifffile==2024.6.18
tomli==2.0.1
torch==2.3.1
torchvision==0.18.1
tqdm==4.66.4
triton==2.3.1
typing_extensions==4.12.2
tzdata==2024.1
tzlocal==5.2
universal-analytics-python3==1.1.1
urllib3==1.26.19
voxel51-eta==0.12.6
wandb==0.17.3
wcwidth==0.2.13
wrapt==1.16.0
wsproto==1.2.0
xmltodict==0.13.0
zipp==3.19.2

Abdul-Mukit commented 1 month ago

@charlescwwang This is the same issue as #67. In short, your image file names probably contain characters other than digits. The root cause is the way the calculate_ap function is written. It should work more like this instead: https://lightning.ai/docs/torchmetrics/stable/detection/mean_average_precision.html If the AP calculation didn't need image ids to begin with, the data loader would not need to return image paths at every step.
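
A quick way to test this diagnosis on your own dataset is to check whether each image file name reduces to a pure integer. The sketch below assumes, as the comment above suggests, that the image id is derived from the file name stem; the function name is hypothetical and the file names are made up:

```python
from pathlib import Path


def non_numeric_file_names(file_names):
    """Return the file names whose stem (name without extension) is not all digits."""
    return [name for name in file_names if not Path(name).stem.isdigit()]


# Made-up examples: COCO-style numeric names pass, anything with letters or dashes is flagged.
names = ["000000000139.jpg", "img_0001.jpg", "cat-12.png", "42.jpeg"]
print(non_numeric_file_names(names))
```

If this returns a non-empty list, those images are the likely source of ids that do not match the ground-truth annotations.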

Abdul-Mukit commented 1 month ago

PR https://github.com/WongKinYiu/YOLO/pull/79 should fix this. @charlescwwang can you please try the branch https://github.com/Abdul-Mukit/YOLO/tree/67-fix-image-id-usage-consistency and let me know if you still face the same problem?

charlescwwang commented 1 month ago

@Abdul-Mukit I tried the branch, and the training was successfully completed.

WongKinYiu / YOLO

How to train model using custom data? #36