Onnx trained model not working in inference

davidgonzaleztreelogic commented 4 years ago

Hi Everybody,

I am trying to run the Padchest example. We have trained the model padchest_VGG16_adam_lr-0.0001.onnx. The structure of the folder is:

sh-4.4# pwd
/data/padchest/pipeline
sh-4.4# ls
__pycache__        model_ARF_eddl.py  padchest_VGG16_adam_lr-0.0001.onnx  padchest_train.py
data_generator.py  models.py          padchest_inference.py               src
sh-4.4#

The output of training the model is:

CS with low memory setup 
Selecting GPU device 0 
EDDLL is running on GPU device 0, Tesla T4 
CuBlas initialized on GPU device 0, Tesla T4 
CuRand initialized on GPU device 0, Tesla T4 
Starting training: 
Epoch 0/50 (batch    0/112) - Generating Random Table 
--------------------------------------------- 
input1                        |  (1, 256, 256)=>      (1, 256, 256) 
conv1                         |  (1, 256, 256)=>      (32, 256, 256) 
relu1                         |  (32, 256, 256)=>      (32, 256, 256) 
maxpool2                      |  (32, 256, 256)=>      (32, 127, 127) 
conv2                         |  (32, 127, 127)=>      (64, 127, 127) 
relu2                         |  (64, 127, 127)=>      (64, 127, 127) 
maxpool4                      |  (64, 127, 127)=>      (64, 63, 63) 
conv3                         |  (64, 63, 63)=>      (128, 63, 63) 
relu3                         |  (128, 63, 63)=>      (128, 63, 63) 
maxpool6                      |  (128, 63, 63)=>      (128, 31, 31) 
conv4                         |  (128, 31, 31)=>      (128, 31, 31) 
relu4                         |  (128, 31, 31)=>      (128, 31, 31) 
maxpool8                      |  (128, 31, 31)=>      (128, 15, 15) 
conv5                         |  (128, 15, 15)=>      (32, 15, 15) 
relu5                         |  (32, 15, 15)=>      (32, 15, 15) 
maxpool10                     |  (32, 15, 15)=>      (32, 7, 7) 
reshape1                      |  (32, 7, 7)=>      (1568) 
dropout1                      |  (1568)    =>      (1568) 
dense1                        |  (1568)    =>      (512) 
relu6                         |  (512)     =>      (512) 
dense2                        |  (512)     =>      (2) 
softmax7                      |  (2)       =>      (2) 
--------------------------------------------- 

Batch 0 softmax7(cross_entropy=1.395,categorical_accuracy=0.470) - Elapsed time: 3.93 seconds 
Epoch 0/50 (batch    1/112) - Batch 1 softmax7(cross_entropy=1.396,categorical_accuracy=0.495) - Elapsed time: 3.91 seconds 
Epoch 0/50 (batch    2/112) - Batch 2 softmax7(cross_entropy=1.470,categorical_accuracy=0.473) - Elapsed time: 4.04 seconds 
Epoch 0/50 (batch    3/112) - Batch 3 softmax7(cross_entropy=1.451,categorical_accuracy=0.490) - Elapsed time: 4.06 seconds 
Epoch 0/50 (batch    4/112) - Batch 4 softmax7(cross_entropy=1.466,categorical_accuracy=0.492) - Elapsed time: 3.82 seconds 
Epoch 0/50 (batch    5/112) - Batch 5 softmax7(cross_entropy=1.453,categorical_accuracy=0.502) - Elapsed time: 3.88 seconds 
....
Validation - Epoch 49/50 (batch   30/37) Batch 30 softmax7(cross_entropy=0.773,categorical_accuracy=0.843) 
Validation - Epoch 49/50 (batch   31/37) Batch 31 softmax7(cross_entropy=0.787,categorical_accuracy=0.839) 
Validation - Epoch 49/50 (batch   32/37) Batch 32 softmax7(cross_entropy=0.797,categorical_accuracy=0.838) 
Validation - Epoch 49/50 (batch   33/37) Batch 33 softmax7(cross_entropy=0.802,categorical_accuracy=0.836) 
Validation - Epoch 49/50 (batch   34/37) Batch 34 softmax7(cross_entropy=0.810,categorical_accuracy=0.835) 
Validation - Epoch 49/50 (batch   35/37) Batch 35 softmax7(cross_entropy=0.814,categorical_accuracy=0.833) 
Validation - Epoch 49/50 (batch   36/37) Batch 36 softmax7(cross_entropy=0.818,categorical_accuracy=0.833)
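
As a sanity check on the layer table in the training log above, the spatial sizes can be reproduced with simple arithmetic. This is only a sketch: the log does not state the pooling parameters, so the 3x3 window, stride of 2 and lack of padding used here are assumptions, chosen because they are one setting consistent with the reported shapes (256 => 127 => 63 => 31 => 15 => 7). It also assumes the convolutions preserve spatial size, as the table shows.

```python
def pool_out(size, kernel=3, stride=2):
    """Output length along one axis of an unpadded pooling window.

    kernel=3 and stride=2 are assumptions inferred from the logged shapes,
    not values taken from the model definition.
    """
    return (size - kernel) // stride + 1


side = 256          # input is (1, 256, 256)
sides = []
for _ in range(5):  # maxpool2, maxpool4, maxpool6, maxpool8, maxpool10
    side = pool_out(side)
    sides.append(side)

# reshape1 flattens the (32, 7, 7) output of maxpool10
flat = 32 * side * side

print(sides)  # [127, 63, 31, 15, 7]
print(flat)   # 1568
```

The result matches the 1568 inputs that dense1 receives, so the shapes printed during training are internally consistent.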

We then run the following command: cd /data/padchest/pipeline; python3 padchest_inference.py

The log of this execution is below; every test batch reports cross_entropy=-0.000 and categorical_accuracy=-0.000:

Producer_name: EDDL 
Producer_version: 0.1 
Domain: 
Model_version: 0 
CS with low memory setup 
Selecting GPU device 0 
EDDLL is running on GPU device 0, Tesla T4 
CuBlas initialized on GPU device 0, Tesla T4 
CuRand initialized on GPU device 0, Tesla T4 
Starting test: 
Test: batch    0/374 - Generating Random Table 
--------------------------------------------- 
input1                        |  (1, 256, 256)=>      (1, 256, 256) 
conv1                         |  (1, 256, 256)=>      (32, 256, 256) 
relu1                         |  (32, 256, 256)=>      (32, 256, 256) 
maxpool2                      |  (32, 256, 256)=>      (32, 127, 127) 
conv2                         |  (32, 127, 127)=>      (64, 127, 127) 
relu2                         |  (64, 127, 127)=>      (64, 127, 127) 
maxpool4                      |  (64, 127, 127)=>      (64, 63, 63) 
conv3                         |  (64, 63, 63)=>      (128, 63, 63) 
relu3                         |  (128, 63, 63)=>      (128, 63, 63) 
maxpool6                      |  (128, 63, 63)=>      (128, 31, 31) 
conv4                         |  (128, 31, 31)=>      (128, 31, 31) 
relu4                         |  (128, 31, 31)=>      (128, 31, 31) 
maxpool8                      |  (128, 31, 31)=>      (128, 15, 15) 
conv5                         |  (128, 15, 15)=>      (32, 15, 15) 
relu5                         |  (32, 15, 15)=>      (32, 15, 15) 
maxpool10                     |  (32, 15, 15)=>      (32, 7, 7) 
reshape1                      |  (32, 7, 7)=>      (1568) 
dropout1                      |  (1568)    =>      (1568) 
dense1                        |  (1568)    =>      (512) 
relu6                         |  (512)     =>      (512) 
dense2                        |  (512)     =>      (2) 
softmax7                      |  (2)       =>      (2) 
--------------------------------------------- 

Batch 0 softmax7(cross_entropy=-0.000,categorical_accuracy=-0.000) - Elapsed time: 0.29 seconds 
Test: batch    1/374 - Batch 1 softmax7(cross_entropy=-0.000,categorical_accuracy=-0.000) - Elapsed time: 0.27 seconds 
Test: batch    2/374 - Batch 2 softmax7(cross_entropy=-0.000,categorical_accuracy=-0.000) - Elapsed time: 0.29 seconds 
Test: batch    3/374 - Batch 3 softmax7(cross_entropy=-0.000,categorical_accuracy=-0.000) - Elapsed time: 0.30 seconds 
Test: batch    4/374 - Batch 4 softmax7(cross_entropy=-0.000,categorical_accuracy=-0.000) - Elapsed time: 0.30 seconds 
Test: batch    5/374 - Batch 5 softmax7(cross_entropy=-0.000,categorical_accuracy=-0.000) - Elapsed time: 0.31 seconds 
Test: batch    6/374 - Batch 6 softmax7(cross_entropy=-0.000,categorical_accuracy=-0.000) - Elapsed time: 0.30 seconds 
Test: batch    7/374 - Batch 7 softmax7(cross_entropy=-0.000,categorical_accuracy=-0.000) - Elapsed time: 0.32 seconds 
Test: batch    8/374 - Batch 8 softmax7(cross_entropy=-0.000,categorical_accuracy=-0.000) - Elapsed time: 0.30 seconds 
Test: batch    9/374 - Batch 9 softmax7(cross_entropy=-0.000,categorical_accuracy=-0.000) - Elapsed time: 0.31 seconds 
Test: batch   10/374 - Batch 10 softmax7(cross_entropy=-0.000,categorical_accuracy=-0.000) - Elapsed time: 0.30 seconds
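
For anyone wanting to flag this symptom automatically, here is a minimal sketch that scans a captured log for batches whose metrics are all zero. It assumes the log line format shown above; the function name degenerate_batches and the sample text are hypothetical and only for illustration, and nothing here is part of EDDL or PyEddl.

```python
import re

# Matches the metric fields as they appear in the logs above.
METRIC = re.compile(r"(cross_entropy|categorical_accuracy)=(-?\d+\.\d+)")


def degenerate_batches(log_text):
    """Count log lines whose reported metrics are all zero (including -0.000)."""
    count = 0
    for line in log_text.splitlines():
        values = [float(v) for _, v in METRIC.findall(line)]
        # float("-0.000") is -0.0, which compares equal to 0.0
        if values and all(v == 0.0 for v in values):
            count += 1
    return count


sample = """\
Test: batch    1/374 - Batch 1 softmax7(cross_entropy=-0.000,categorical_accuracy=-0.000) - Elapsed time: 0.27 seconds
Validation - Epoch 49/50 (batch   36/37) Batch 36 softmax7(cross_entropy=0.818,categorical_accuracy=0.833)
"""

print(degenerate_batches(sample))  # 1
```

A non-zero count on a test run, as in the log above, suggests the metrics are not being computed from meaningful outputs, although it does not by itself say why.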

Running the "pip list" command gives the following:

Package            Version
------------------ -------
attrs              19.3.0
importlib-metadata 1.6.0
more-itertools     8.3.0
numpy              1.18.4
packaging          20.4
pip                20.1.1
pluggy             0.13.1
py                 1.8.1
pybind11           2.5.0
pyparsing          2.4.7
pytest             5.4.2
setuptools         46.4.0
six                1.15.0
wcwidth            0.1.9
zipp               3.1.0

All of this runs in a container that uses the library's image, inside a Kubernetes pod:

containers:
    - name: dhealth-pylibs
      image: dhealth/pylibs:latest

Does anyone know what's going on?

Thanks in advance.

Seraphid commented 4 years ago

Hi, we have reproduced your problem; it also happens on our machines. We will fix it as soon as possible.

Thanks for your patience.

Seraphid commented 4 years ago

It seems that you are training the models on GPU. The current PyEddl release is built against an Eddl version without ONNX support for models trained on GPU. We will make a new Eddl release soon, and PyEddl will then release a new version with this feature available.

Sorry for the inconvenience.

davidgonzaleztreelogic commented 4 years ago

Thanks, @Seraphid, for your answer.

deephealthproject / eddl

Onnx trained model not working in inference #179