deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

Model trained for OnnxRuntime is returning thousands of random predictions #2834

Open danieltog opened 10 months ago

danieltog commented 10 months ago

Hello,

I've trained a YOLOv5 model to recognize various networking device ports, such as Ethernet and RJ-45. After training, I got good predictions and accuracy results in a Colab notebook, and I generated images with correct bounding boxes. I then exported the model as an .onnx file and placed it in the directory of my local DJL Spring Boot project.

Here's the problem: When I attempt to make predictions with the model, it returns thousands of seemingly random predictions that lack any logical structure. Furthermore, some predictions yield object probabilities exceeding 1, often in the thousands, e.g., "probability": 21474.83594.

The expected behavior is for the model to return accurate predictions with appropriate bounding boxes on the images I provide.

Though there isn't an explicit error message, here's a snippet of the returned data as an example:

{"class": "ethernet", "probability": 21474.83594, "bounds": {"x"=0.000, "y"=0.000, "width"=0.000, "height"=0.000}}
{"class": "ethernet", "probability": 21474.83594, "bounds": {"x"=0.000, "y"=0.000, "width"=0.000, "height"=0.000}}
{"class": "ethernet", "probability": 21474.83594, "bounds": {"x"=0.000, "y"=0.000, "width"=0.000, "height"=0.000}}
{"class": "ethernet", "probability": 21474.83594, "bounds": {"x"=0.000, "y"=0.000, "width"=0.000, "height"=0.000}}
...
{"class": "energy", "probability": 0.20193, "bounds": {"x"=0.000, "y"=0.000, "width"=0.000, "height"=0.000}}
{"class": "fiber-3", "probability": 0.20158, "bounds": {"x"=0.000, "y"=0.001, "width"=0.000, "height"=0.000}}
{"class": "VGA", "probability": 0.20140, "bounds": {"x"=0.000, "y"=0.000, "width"=0.000, "height"=0.000}}
{"class": "VGA", "probability": 0.20133, "bounds": {"x"=0.000, "y"=0.000, "width"=0.000, "height"=0.000}}

To reproduce this issue, I can provide you with the model.

The snippet below is the controller method I use to run predictions:

@GetMapping("/predict_custom_object_onnx")
public static String predictCustomObjectOnnx() throws IOException, ModelException, TranslateException {
    Path imageFile = Paths.get("input/object_recognition/device4.jpg");
    Image img = ImageFactory.getInstance().fromFile(imageFile);

    Criteria<Image, DetectedObjects> criteria =
            Criteria.builder()
                    .optApplication(Application.CV.OBJECT_DETECTION)
                    .setTypes(Image.class, DetectedObjects.class)
                    .optEngine("OnnxRuntime")
                    .optProgress(new ProgressBar())
                    .build();

    try (ZooModel<Image, DetectedObjects> model = criteria.loadModel();
         Predictor<Image, DetectedObjects> predictor = model.newPredictor()) {
        DetectedObjects detection = predictor.predict(img);
        examplesService.saveBoundingBoxImage(img, detection);
        return detection.toString();
    }
}

If you need more information or specific steps for debugging, please don't hesitate to ask.

Thank you for your attention to this matter.

zachgk commented 10 months ago

These issues usually come from the pre-processing or post-processing. If they do not exactly match what your model was trained to expect, you will get correctness bugs like this. Object detection has no single standard that every model follows, so you may have to adapt the processing to your model. Compare the code used during training with the DJL Translator you are using. It may also help to run both on the same input and verify that they produce the same result.
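One way to carry out the side-by-side check suggested above is to dump the pre-processed input tensor from the Python pipeline and from the DJL translator as flat float arrays and compare them element by element. The following is a minimal sketch under that assumption; `TensorDiff` and its methods are hypothetical helper names, not part of DJL.

```java
public final class TensorDiff {

    private TensorDiff() {}

    /** Returns the largest absolute element-wise difference between two tensors. */
    public static float maxAbsDiff(float[] expected, float[] actual) {
        if (expected.length != actual.length) {
            // A length mismatch usually means the two pipelines resize differently.
            throw new IllegalArgumentException(
                    "Length mismatch: " + expected.length + " vs " + actual.length);
        }
        float max = 0f;
        for (int i = 0; i < expected.length; i++) {
            max = Math.max(max, Math.abs(expected[i] - actual[i]));
        }
        return max;
    }

    /** True when every element agrees within the given tolerance. */
    public static boolean matches(float[] expected, float[] actual, float tolerance) {
        return maxAbsDiff(expected, actual) <= tolerance;
    }
}
```

A large difference (for example, values in 0..255 on one side and 0..1 on the other) would point at a missing normalization step, while a length mismatch would point at a different resize or layout.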

frankfliu commented 10 months ago

@danieltog A few issues with your code:

  1. The code above is not using your model. Can you create a minimal reproduction project on GitHub?
  2. You are not using a translator that matches your pre-processing and post-processing.
  3. Can you try your model with Python and check whether it gives the correct result?

danieltog commented 10 months ago

Hi @zachgk and @frankfliu,

I've relocated the model to the "onnx_custom_model" directory at the root. During my testing, I identified an issue related to the translator. The problem stems from using the translator initially defined for TensorFlow, resulting in an error:

ERROR:

Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is ai.djl.translate.TranslateException: ai.djl.engine.EngineException: ai.onnxruntime.OrtException: Error code - ORT_INVALID_ARGUMENT - message: Got invalid dimensions for input: images for the following indices
index: 1 Got: 640 Expected: 3
index: 3 Got: 3 Expected: 640
Please fix either the inputs or the model.
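The error above reads as a layout mismatch: the model expects 3 at index 1 and 640 at index 3, i.e. a channels-first input of shape [batch, 3, 640, 640], but received a channels-last tensor of shape [batch, 640, 640, 3]. The sketch below shows the reordering in plain Java. `LayoutConverter` is a hypothetical helper, and the division by 255 is an assumption based on the usual YOLOv5 pre-processing. In a DJL translator this step is normally done by a pipeline transform such as `ToTensor` rather than by hand.

```java
public final class LayoutConverter {

    private LayoutConverter() {}

    /**
     * Converts interleaved HWC pixel bytes (RGBRGB...) into a planar CHW
     * float array scaled to the range [0, 1].
     */
    public static float[] hwcToChw(byte[] hwc, int height, int width, int channels) {
        if (hwc.length != height * width * channels) {
            throw new IllegalArgumentException("Unexpected buffer size: " + hwc.length);
        }
        float[] chw = new float[hwc.length];
        int plane = height * width; // number of pixels in one channel plane
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int pixel = y * width + x;
                for (int c = 0; c < channels; c++) {
                    int value = hwc[pixel * channels + c] & 0xFF; // read as unsigned byte
                    chw[c * plane + pixel] = value / 255f;
                }
            }
        }
        return chw;
    }
}
```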

I trained the model with YOLOv5 using the following command:

!python train.py --data data.yaml --weights yolov5s.pt --img 640 --batch-size 8 --name Model --epochs 15

Here is the YOLOv5 training output:

Overriding model.yaml nc=80 with nc=12

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1     45849  models.yolo.Detect                      [12, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model summary: 214 layers, 7051993 parameters, 7051993 gradients, 16.0 GFLOPs

Transferred 343/349 items from yolov5s.pt
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 60 weight(decay=0.0005), 60 bias
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning /content/drive/MyDrive/Yolo_training/yolov5/data_images/train.cache... 166 images, 0 backgrounds, 0 corrupt: 100% 166/166 [00:00<?, ?it/s]
val: Scanning /content/drive/MyDrive/Yolo_training/yolov5/data_images/test.cache... 41 images, 0 backgrounds, 0 corrupt: 100% 41/41 [00:00<?, ?it/s]

AutoAnchor: 4.41 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/Model3/labels.jpg... 
Image sizes 640 train, 640 val
Using 2 dataloader workers
Logging results to runs/train/Model3
Starting training for 50 epochs...
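The model summary above also hints at what the raw ONNX output looks like. The Detect layer has 12 classes and three anchor sets of three anchors each, fed from layers 17, 20 and 23, which in this architecture sit at strides 8, 16 and 32. The sketch below works out the resulting number of candidate boxes for a 640x640 input. It assumes a standard YOLOv5 export where each candidate carries 4 box values, 1 objectness score and one score per class; `YoloOutputShape` is a hypothetical name.

```java
public final class YoloOutputShape {

    private YoloOutputShape() {}

    /** Number of candidate boxes the detection head emits for a square input. */
    public static int candidateCount(int imageSize, int[] strides, int anchorsPerCell) {
        int total = 0;
        for (int stride : strides) {
            int grid = imageSize / stride; // grid cells per side at this scale
            total += grid * grid * anchorsPerCell;
        }
        return total;
    }

    /** Values per candidate: 4 box coordinates + 1 objectness + one score per class. */
    public static int valuesPerCandidate(int numClasses) {
        return 4 + 1 + numClasses;
    }

    public static void main(String[] args) {
        int[] strides = {8, 16, 32};
        System.out.println(candidateCount(640, strides, 3)); // 25200
        System.out.println(valuesPerCandidate(12));          // 17
    }
}
```

The 45849 parameters reported for the Detect layer are consistent with this: (128 + 256 + 512) x 51 weights plus 3 x 51 biases, where 51 = 3 anchors x 17 values. A raw output of 25200 rows would explain thousands of returned boxes if the post-processing does not filter by confidence, or reads the rows with a different layout.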

Could you please provide some guidance on creating a custom translator? I've uploaded the project to this repository: GitHub Repository.

Thank you!
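For the custom translator asked about above, the post-processing side typically has to decode the raw rows, drop low-confidence candidates and apply non-maximum suppression. The following plain-Java sketch illustrates those steps under stated assumptions: each row is laid out as [cx, cy, w, h, objectness, class scores...] in pixels of the 640x640 model input, scores are already in [0, 1], and the thresholds are illustrative. `YoloV5PostProcessor` and `Box` are hypothetical names. DJL also ships a `YoloV5Translator` in `ai.djl.modality.cv.translator`, which may already cover this; whether its defaults match this particular export would need to be verified.

```java
import java.util.ArrayList;
import java.util.List;

public final class YoloV5PostProcessor {

    /** One detection: class index, score, and top-left corner plus size in pixels. */
    public static final class Box {
        public final int classId;
        public final float score, x, y, w, h;

        public Box(int classId, float score, float x, float y, float w, float h) {
            this.classId = classId;
            this.score = score;
            this.x = x;
            this.y = y;
            this.w = w;
            this.h = h;
        }
    }

    private YoloV5PostProcessor() {}

    /** Keeps candidates whose objectness * best class score reaches the threshold. */
    public static List<Box> decode(float[][] rows, float scoreThreshold) {
        List<Box> boxes = new ArrayList<>();
        for (float[] r : rows) {
            int best = 5; // index of the highest class score
            for (int i = 6; i < r.length; i++) {
                if (r[i] > r[best]) {
                    best = i;
                }
            }
            float score = r[4] * r[best];
            if (score >= scoreThreshold) {
                // Convert center-based coordinates to a top-left corner.
                boxes.add(new Box(best - 5, score, r[0] - r[2] / 2, r[1] - r[3] / 2, r[2], r[3]));
            }
        }
        return boxes;
    }

    /** Intersection over union of two boxes. */
    public static float iou(Box a, Box b) {
        float x1 = Math.max(a.x, b.x);
        float y1 = Math.max(a.y, b.y);
        float x2 = Math.min(a.x + a.w, b.x + b.w);
        float y2 = Math.min(a.y + a.h, b.y + b.h);
        float inter = Math.max(0f, x2 - x1) * Math.max(0f, y2 - y1);
        float union = a.w * a.h + b.w * b.h - inter;
        return union <= 0f ? 0f : inter / union;
    }

    /** Greedy per-class non-maximum suppression, highest score first. */
    public static List<Box> nms(List<Box> boxes, float iouThreshold) {
        List<Box> sorted = new ArrayList<>(boxes);
        sorted.sort((a, b) -> Float.compare(b.score, a.score));
        List<Box> kept = new ArrayList<>();
        for (Box candidate : sorted) {
            boolean suppressed = false;
            for (Box k : kept) {
                if (k.classId == candidate.classId && iou(k, candidate) > iouThreshold) {
                    suppressed = true;
                    break;
                }
            }
            if (!suppressed) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

In a translator, the surviving boxes would then be rescaled from the 640x640 model input to the original image size before being returned.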