Open danieltog opened 10 months ago
Usually, these issues arise from the pre-processing or post-processing. If either does not match exactly what your model was trained to expect, you will get correctness bugs like this. Object detection has no clear standard that every model follows, so you may have to adapt the processing to your model. Compare the code used during training with the DJL Translator you are using. You could even run both on the same input to verify that they produce the same result.
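One way to check this is to dump the pre-processed tensor from the training pipeline and from the DJL Translator for the same image, flatten both to float arrays, and compare them element by element. The sketch below is only an illustration: `TensorDiff`, its method names, and the sample values are hypothetical, and how you obtain the arrays (for example via `NDArray.toFloatArray()` on the DJL side) depends on your setup.

```java
/** Hypothetical helper: compares two pre-processed tensors flattened to float arrays. */
final class TensorDiff {

    /** Returns the largest absolute element-wise difference; throws if the lengths differ. */
    static float maxAbsDiff(float[] reference, float[] candidate) {
        if (reference.length != candidate.length) {
            throw new IllegalArgumentException(
                    "Length mismatch: " + reference.length + " vs " + candidate.length);
        }
        float max = 0f;
        for (int i = 0; i < reference.length; i++) {
            max = Math.max(max, Math.abs(reference[i] - candidate[i]));
        }
        return max;
    }

    /** True when every element agrees within the given tolerance. */
    static boolean matches(float[] reference, float[] candidate, float tolerance) {
        return maxAbsDiff(reference, candidate) <= tolerance;
    }

    public static void main(String[] args) {
        // Illustrative values only: pixels scaled to [0, 1] vs. pixels left in [0, 255].
        float[] fromTraining = {0.0f, 0.5f, 1.0f};
        float[] fromTranslator = {0.0f, 127.5f, 255.0f};
        System.out.println("max abs diff: " + maxAbsDiff(fromTraining, fromTranslator));
        System.out.println("matches: " + matches(fromTraining, fromTranslator, 1e-4f));
    }
}
```

A large difference usually points to a scaling, normalization, or layout mismatch, while a length mismatch points to a different resize or channel count.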
@danieltog A few issues with your code:
Hi @zachgk and @frankfliu,
I've relocated the model to the "onnx_custom_model" directory at the root. During my testing, I identified an issue related to the translator. The problem stems from using the translator initially defined for TensorFlow, resulting in an error:
ERROR:
Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is ai.djl.translate.TranslateException: ai.djl.engine.EngineException: ai.onnxruntime.OrtException: Error code - ORT_INVALID_ARGUMENT - message: Got invalid dimensions for input: images for the following indices
index: 1 Got: 640 Expected: 3
index: 3 Got: 3 Expected: 640
Please fix either the inputs or the model.
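Reading the dimensions in this error, the tensor appears to arrive as [1, 640, 640, 3] (channels last, NHWC) while the model expects [1, 3, 640, 640] (channels first, NCHW). Index 2 is not reported because it is 640 in both layouts. The plain-Java sketch below shows what the layout conversion does for a single image. `LayoutConverter` is a hypothetical name, and the division by 255 assumes the model expects pixel values in [0, 1], as YOLOv5's own inference code does. In DJL, the `ToTensor` transform performs this conversion inside a translator pipeline.

```java
import java.util.Arrays;

/** Hypothetical helper showing the HWC to CHW conversion for one image. */
final class LayoutConverter {

    /**
     * Converts interleaved HWC pixel values (0..255) into planar CHW floats.
     * Scaling to [0, 1] is an assumption about what the model expects.
     */
    static float[] hwcToChw(int[] hwc, int height, int width, int channels) {
        if (hwc.length != height * width * channels) {
            throw new IllegalArgumentException("Array length does not match the given shape");
        }
        float[] chw = new float[hwc.length];
        for (int h = 0; h < height; h++) {
            for (int w = 0; w < width; w++) {
                for (int c = 0; c < channels; c++) {
                    int src = (h * width + w) * channels + c; // position in HWC order
                    int dst = (c * height + h) * width + w;   // position in CHW order
                    chw[dst] = hwc[src] / 255f;
                }
            }
        }
        return chw;
    }

    public static void main(String[] args) {
        // A 1x2 RGB image: one pure red pixel followed by one pure green pixel.
        int[] hwc = {255, 0, 0, 0, 255, 0};
        System.out.println(Arrays.toString(hwcToChw(hwc, 1, 2, 3)));
    }
}
```

After the conversion, the red plane comes first, then the green plane, then the blue plane, which is the order a [3, 640, 640] input expects.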
I trained the model with YOLOv5 using the following parameters:
!python train.py --data data.yaml --weights yolov5s.pt --img 640 --batch-size 8 --name Model --epochs 15
Here is the output of the model training on YOLOv5:
Overriding model.yaml nc=80 with nc=12
    from          n  params   module                                arguments
0   -1            1  3520     models.common.Conv                    [3, 32, 6, 2, 2]
1   -1            1  18560    models.common.Conv                    [32, 64, 3, 2]
2   -1            1  18816    models.common.C3                      [64, 64, 1]
3   -1            1  73984    models.common.Conv                    [64, 128, 3, 2]
4   -1            2  115712   models.common.C3                      [128, 128, 2]
5   -1            1  295424   models.common.Conv                    [128, 256, 3, 2]
6   -1            3  625152   models.common.C3                      [256, 256, 3]
7   -1            1  1180672  models.common.Conv                    [256, 512, 3, 2]
8   -1            1  1182720  models.common.C3                      [512, 512, 1]
9   -1            1  656896   models.common.SPPF                    [512, 512, 5]
10  -1            1  131584   models.common.Conv                    [512, 256, 1, 1]
11  -1            1  0        torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
12  [-1, 6]       1  0        models.common.Concat                  [1]
13  -1            1  361984   models.common.C3                      [512, 256, 1, False]
14  -1            1  33024    models.common.Conv                    [256, 128, 1, 1]
15  -1            1  0        torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
16  [-1, 4]       1  0        models.common.Concat                  [1]
17  -1            1  90880    models.common.C3                      [256, 128, 1, False]
18  -1            1  147712   models.common.Conv                    [128, 128, 3, 2]
19  [-1, 14]      1  0        models.common.Concat                  [1]
20  -1            1  296448   models.common.C3                      [256, 256, 1, False]
21  -1            1  590336   models.common.Conv                    [256, 256, 3, 2]
22  [-1, 10]      1  0        models.common.Concat                  [1]
23  -1            1  1182720  models.common.C3                      [512, 512, 1, False]
24  [17, 20, 23]  1  45849    models.yolo.Detect                    [12, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model summary: 214 layers, 7051993 parameters, 7051993 gradients, 16.0 GFLOPs
Transferred 343/349 items from yolov5s.pt
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 60 weight(decay=0.0005), 60 bias
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning /content/drive/MyDrive/Yolo_training/yolov5/data_images/train.cache... 166 images, 0 backgrounds, 0 corrupt: 100% 166/166 [00:00<?, ?it/s]
val: Scanning /content/drive/MyDrive/Yolo_training/yolov5/data_images/test.cache... 41 images, 0 backgrounds, 0 corrupt: 100% 41/41 [00:00<?, ?it/s]
AutoAnchor: 4.41 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/Model3/labels.jpg...
Image sizes 640 train, 640 val
Using 2 dataloader workers
Logging results to runs/train/Model3
Starting training for 50 epochs...
Could you please provide some guidance on creating a custom translator? I've uploaded the project to this repository: GitHub Repository.
Thank you!
Hello,
I've trained a YOLOv5 model to recognize various networking device ports like Ethernet and RJ-45. After training, I obtained good predictions and accuracy results in a Colab notebook, and I also generated images with correct bounding boxes. I then exported the model as an .onnx file to DJL's local Spring Boot directory.
Here's the problem: When I attempt to make predictions with the model, it returns thousands of seemingly random predictions that lack any logical structure. Furthermore, some predictions yield object probabilities exceeding 1, often in the thousands, e.g., "probability": 21474.83594.
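Symptoms like these are consistent with post-processing that reads the raw output in a different layout than the model produces, though that is only one plausible cause. A YOLOv5 ONNX export with a 640x640 input typically emits one tensor of shape [1, 25200, 5 + nc], where each row is [cx, cy, w, h, objectness, class scores...] with coordinates in input pixels. The sketch below decodes such rows under that assumption. `YoloRowDecoder`, `Detection`, and the threshold value are hypothetical, and the layout should be verified against the actual exported model.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical decoder for rows laid out as [cx, cy, w, h, objectness, class scores...]. */
final class YoloRowDecoder {

    /** One decoded candidate box in corner format, in input-image pixels. */
    static final class Detection {
        final int classId;
        final float confidence;
        final float x1, y1, x2, y2;

        Detection(int classId, float confidence, float x1, float y1, float x2, float y2) {
            this.classId = classId;
            this.confidence = confidence;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /** Keeps rows whose objectness * best class score reaches the threshold. */
    static List<Detection> decode(float[] flat, int numClasses, float threshold) {
        int stride = 5 + numClasses;
        if (flat.length % stride != 0) {
            throw new IllegalArgumentException("Output length is not a multiple of " + stride);
        }
        List<Detection> kept = new ArrayList<>();
        for (int base = 0; base < flat.length; base += stride) {
            float objectness = flat[base + 4];
            int bestClass = 0;
            float bestScore = flat[base + 5];
            for (int c = 1; c < numClasses; c++) {
                if (flat[base + 5 + c] > bestScore) {
                    bestScore = flat[base + 5 + c];
                    bestClass = c;
                }
            }
            // Assumes both scores are already in [0, 1], so the product is too.
            float confidence = objectness * bestScore;
            if (confidence < threshold) {
                continue;
            }
            float cx = flat[base], cy = flat[base + 1];
            float w = flat[base + 2], h = flat[base + 3];
            kept.add(new Detection(bestClass, confidence,
                    cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2));
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two illustrative rows with 2 classes; only the first passes a 0.25 threshold.
        float[] flat = {
            100, 100, 20, 40, 0.9f, 0.1f, 0.8f,
            50, 50, 10, 10, 0.1f, 0.5f, 0.5f
        };
        System.out.println("kept: " + decode(flat, 2, 0.25f).size());
    }
}
```

Most of the rows describe low-confidence candidates, so thresholding removes the bulk of them. The surviving boxes usually still overlap and need non-maximum suppression before they are drawn.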
The expected behavior is for the model to return accurate predictions with appropriate bounding boxes on the images I provide.
Though there isn't an explicit error message, here's a snippet of the returned data as an example:
To reproduce this issue, I can provide you with the model.
The code snippet below is the class I used for running predictions:
If you need more information or specific steps for debugging, please don't hesitate to ask.
Thank you for your attention to this matter.