N2D2 is an open source CAD framework for Deep Neural Network simulation and full DNN-based applications building.

ONNX accuracy discrepancy with respect to Pytorch and calibration error #78

Closed andreistoian closed 3 years ago

andreistoian commented 3 years ago


I'm trying to compile a sound classification network that uses 1D convolutions with the CPP export. With PyTorch I get 81% accuracy on a small subset of data, which I also use to test and calibrate the N2D2 export.

232.00/343 (67.64%)
233.00/344 (67.73%)
233.00/345 (67.54%)
234.00/346 (67.63%)
235.00/347 (67.72%)
236.00/348 (67.82%)
236.00/349 (67.62%)

Score: 67.62%

The input WAV files contain floating point values between -1 and 1 (mostly in the -0.05 to 0.05 range), loaded from FLOAT32 WAV files using the code given in #77.

Here is the code that exports the dataset, computes accuracy for the pytorch model and exports the ONNX. sound_demo.zip

olivierbichler-cea commented 3 years ago

Hi, regarding the accuracy: what accuracy do you get in N2D2 before the export (with ./n2d2 onnx.ini -test)?

andreistoian commented 3 years ago

I ran it with:

bin/n2d2 ~sound1d-onnx.ini -test -seed 1 -w /dev/null

and I get

Testing #348   38.40% 
Final recognition rate: 38.40%    (error rate: 61.60%)
    Sensitivity: 55.95% / Specificity: 94.37% / Precision: 44.80%
    Accuracy: 89.73% / F1-score: 47.24% / Informedness: 50.32%

What is the recognition rate and how is it different from 'Accuracy'?

olivierbichler-cea commented 3 years ago


The accuracy problem comes from a bad label mapping of the output of the network. The label mapping should be the following:

/down 0
/go 1
/left 2
/no 3
/off 4
/on 5
/right 6
/silence 7
/stop 8
/unknown 9
/up 10
/yes 11

But since the /silence folder is empty, no stimulus with the label /silence is loaded by the database driver and this label is never created (this is the current behaviour of N2D2, which does not create a label for an empty folder). As a result, the classes that follow /silence are shifted and mapped to the wrong outputs.
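
The shift can be illustrated with a minimal sketch. It assumes labels are numbered in folder order and that empty folders are skipped, as described above; the function names and the sample counts are hypothetical, and only the emptiness of /silence matters.

```python
# Class folders in the order of the expected mapping listed above.
CLASS_FOLDERS = ["/down", "/go", "/left", "/no", "/off", "/on",
                 "/right", "/silence", "/stop", "/unknown", "/up", "/yes"]


def expected_mapping(folders):
    """Label index the trained network expects: one index per folder."""
    return {name: idx for idx, name in enumerate(folders)}


def loaded_mapping(folders, sample_counts):
    """Label index when a label is only created for a non-empty folder."""
    non_empty = [name for name in folders if sample_counts.get(name, 0) > 0]
    return {name: idx for idx, name in enumerate(non_empty)}


# Hypothetical sample counts (assumption): every folder has data except /silence.
counts = {name: 30 for name in CLASS_FOLDERS}
counts["/silence"] = 0

expected = expected_mapping(CLASS_FOLDERS)
loaded = loaded_mapping(CLASS_FOLDERS, counts)

# Classes whose loaded label no longer matches the network output index.
shifted = {name: (loaded[name], expected[name])
           for name in loaded if loaded[name] != expected[name]}

for name, (got, want) in shifted.items():
    print(f"{name}: loaded as label {got}, network output is {want}")
```

Every class after /silence ends up one index too low, which is why adding files to the empty folder restores the intended mapping.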

Regarding the score metrics, some remarks:
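
One reading that reproduces the figures logged in this thread, offered as an assumption and not taken from N2D2's source: the recognition rate is the fraction of stimuli whose predicted class is correct, while 'Accuracy' is the one-vs-rest accuracy (TP + TN) / total averaged over the 12 output classes. Under that reading each misclassified stimulus counts as one false negative and one false positive, so Accuracy = 1 - 2 * (1 - rate) / 12, which stays high even when most predictions are wrong. The sketch below checks this on a small hypothetical confusion matrix and against the logged values; the function names are illustrative.

```python
def recognition_rate(confusion):
    """Fraction of stimuli on the diagonal (rows: target, columns: predicted)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def mean_one_vs_rest_accuracy(confusion):
    """Average over classes of (TP + TN) / total, each class taken one-vs-rest."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    per_class = []
    for c in range(k):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(k)) - tp
        tn = total - tp - fn - fp
        per_class.append((tp + tn) / total)
    return sum(per_class) / k


def accuracy_from_rate(rate, num_classes):
    """Closed form: every error is one FN and one FP across the classes."""
    return 1.0 - 2.0 * (1.0 - rate) / num_classes


# (recognition rate %, Accuracy %) pairs copied from the logs in this thread.
LOGGED = [(38.40, 89.73), (75.43, 95.91), (9.48, 84.91)]

for rate_pct, logged_acc in LOGGED:
    predicted = 100.0 * accuracy_from_rate(rate_pct / 100.0, 12)
    print(f"rate {rate_pct:.2f}% -> predicted {predicted:.3f}%, logged {logged_acc:.2f}%")
```

The predicted values agree with the logged 'Accuracy' to within rounding for all three runs, so the recognition rate is the figure to compare with the PyTorch accuracy.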

Finally, the calibration issue is the same as the one explained in issue #80. We are still thinking about possible solutions in this case that would not cause precision loss.

olivierbichler-cea commented 3 years ago

Actually, I just tested the CPP export and it works fine, using the command:

./n2d2 sound1d-onnx.ini -seed 1 -w /dev/null -test -export CPP -calib -1

The average recall is 80% in INT8 vs. 83% before quantization. There is no calibration issue here (and none should occur for a mono-branch network).

andreistoian commented 3 years ago

I'm sorry but I'm not able to fully reproduce the working behavior:

To fix the 'silence' class issue I added 60 wavs with silence to the directory.


The PyTorch model has 81% accuracy, while running N2D2 with -test -seed 1 -w /dev/null gives:

Testing database size: 871 images
Notice: stimuli depth is 64F (according to database first stimulus)
[LOG] Stimuli transformations flow (transformations.png)
[LOG] Network graph (sound1d-onnx.ini.png)
Warning: using box for unknown shape cylinder
[LOG] Network SVG graph (sound1d-onnx.ini.svg)
[LOG] Network stats (stats/*)
[LOG] Solvers scheduling (schedule/*)
[LOG] Layer's receptive fields (receptive_fields.log)
[LOG] Labels mapping (*.Target/labels_mapping.log)
[LOG] Labels legend (*.Target/labels_legend.png)
[LOG] Learn frame samples (frames/frame*)
[LOG] Test frame samples (frames/test_frame*)
[10:17.89 4:7.73 7:7.62 5:2.84 9:1.87 ]
Testing #100   93.07% 
Testing #200   94.53% 
Testing #300   95.02% 
Testing #400   86.03% 
Testing #500   81.84% 
Testing #600   82.70% 
Testing #700   79.46% 
Testing #800   76.65% 
Testing #870   75.43% 
Final recognition rate: 75.43%    (error rate: 24.57%)
    Sensitivity: 83.67% / Specificity: 97.79% / Precision: 72.44%
    Accuracy: 95.91% / F1-score: 75.57% / Informedness: 81.46%

I export the model to float32 with: models/ONNX/sound1d-onnx.ini -test -seed 1 -export CPP -nbbits -32 -w /dev/null. When I run 'run_export' (note that I need to change the Makefile flags to -O0 -g so it does not crash), I get:

649.000000/866 (74.942263%)
650.000000/867 (74.971165%)
651.000000/868 (75.000000%)
651.000000/869 (74.913694%)
651.000000/870 (74.827586%)
652.000000/871 (74.856487%)

Score: 74.856487%

I export to int8 with calibration on the whole validation set: models/ONNX/sound1d-onnx.ini -test -seed 1 -export CPP -calib -1 -w /dev/null. N2D2 takes 312 stimuli for calibration (I guess Nclasses * min(card(class_i))?),

and I get, after a long time:

Notice: stimuli depth is 64F (according to database first stimulus)
Remove Dropout...
Fuse BatchNorm with Conv...
export_CPP_int8/stimuli_stats processing 312 stimuli
Fuse Padding...
  Cross-layer equalization:
    - eq. 35 and 33
    - eq. 37 and 35
    quant. range delta = 0.491025
export_CPP_int8/stimuli_stats processing 312 stimuli
Calculating calibration data range and histogram...
Calibration data 100/312
Calibration data 200/312
Calibration data 300/312
Quantization (8 bits)...
  Quantizing free parameters:
  - 17: 1.57456
  - 19: 1.57456
  - 20: 1.10175
  - 22: 1.10175
  - 23: 0.77638
  - 25: 0.77638
  - 26: 0.445755
  - 28: 0.445755
  - 29: 0.24595
  - 31: 0.24595
  - 33: 0.107331
  - 35: 0.0528017
  - 37: 0.0259759
  Fuse scaling cells:
  Quantizing activations:
  - 17: prev=1, act=605.467, bias=1.57456
      quant=63.251, global scaling=384.532 -> cell scaling=4.1115e-05
  - 20: prev=384.532, act=885.995, bias=1.10175
      quant=127, global scaling=804.168 -> cell scaling=0.00376515
  - 23: prev=804.168, act=1939.13, bias=0.77638
      quant=127, global scaling=2497.66 -> cell scaling=0.00253519
  - 26: prev=2497.66, act=3749.77, bias=0.445755
      quant=127, global scaling=8412.17 -> cell scaling=0.00233787
  - 29: prev=8412.17, act=2682.97, bias=0.24595
      quant=127, global scaling=10908.6 -> cell scaling=0.00607205
  - 33: prev=10908.6, act=10430.1, bias=0.107331
      quant=127, global scaling=97176.4 -> cell scaling=0.000883903
  - 35: prev=97176.4, act=2751.02, bias=0.0528017
      quant=127, global scaling=52100.9 -> cell scaling=0.0146863
  - 37: prev=52100.9, act=3834.95, bias=0.0259759
      quant=255, global scaling=147635 -> cell scaling=0.00138393
  Fuse scaling cells:
  - fuse: 17_rescale_act
  - fuse: 20_rescale_act
  - fuse: 23_rescale_act
  - fuse: 26_rescale_act
  - fuse: 29_rescale_act
  - fuse: 33_rescale_act
  - fuse: 35_rescale_act
  - fuse: 37_rescale_act
  Scaling approximation [3]:
  - 17: 4.1115e-05
    SINGLE_SHIFT: 2 ^ [- 14]
  - 20: 0.00376515
    SINGLE_SHIFT: 2 ^ [- 8]
  - 23: 0.00253519
    SINGLE_SHIFT: 2 ^ [- 8]
  - 26: 0.00233787
    SINGLE_SHIFT: 2 ^ [- 8]
  - 29: 0.00607205
    SINGLE_SHIFT: 2 ^ [- 7]
  - 33: 0.000883903
    SINGLE_SHIFT: 2 ^ [- 10]
  - 35: 0.0146863
    SINGLE_SHIFT: 2 ^ [- 6]
  - 37: 0.00138393
    SINGLE_SHIFT: 2 ^ [- 9]
  Inputs quantization


[3:0.00 4:0.00 1:0.00 0:0.00 2:0.00 ]
Testing #100   8.91% 
Testing #200   7.46% 
Testing #300   12.96% 
Testing #400   11.22% 
Testing #500   10.18% 
Testing #558   9.48% 
Final recognition rate: 9.48%    (error rate: 90.52%)
    Sensitivity: 12.37% / Specificity: 91.86% / Precision: 11.20%
    Accuracy: 84.91% / F1-score: 7.73% / Informedness: 4.23%

Time elapsed: 17281.58 s
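
As a reading aid for the quantization log above, the sketch below checks two relations that the logged numbers satisfy: cell scaling = prev / (global scaling * quant), and the SINGLE_SHIFT exponent equals the ceiling of log2(cell scaling). These relations are inferred from the log values only; whether N2D2 computes them this way internally is an assumption, and the function names are illustrative.

```python
import math

# (layer, prev, quant, global scaling, cell scaling, logged shift n in 2^[-n]),
# copied from the "Quantizing activations" and "Scaling approximation" sections.
LAYERS = [
    ("17", 1.0, 63.251, 384.532, 4.1115e-05, 14),
    ("20", 384.532, 127, 804.168, 0.00376515, 8),
    ("23", 804.168, 127, 2497.66, 0.00253519, 8),
    ("26", 2497.66, 127, 8412.17, 0.00233787, 8),
    ("29", 8412.17, 127, 10908.6, 0.00607205, 7),
    ("33", 10908.6, 127, 97176.4, 0.000883903, 10),
    ("35", 97176.4, 127, 52100.9, 0.0146863, 6),
    ("37", 52100.9, 255, 147635, 0.00138393, 9),
]


def cell_scaling(prev, quant, global_scaling):
    """Rescaling factor of a cell, as inferred from the logged values."""
    return prev / (global_scaling * quant)


def single_shift(scaling):
    """Exponent n such that 2^-n is the power of two just above `scaling`."""
    return -math.ceil(math.log2(scaling))


for name, prev, quant, glob, logged_scaling, logged_shift in LAYERS:
    scaling = cell_scaling(prev, quant, glob)
    shift = single_shift(logged_scaling)
    ratio = 2.0 ** -shift / logged_scaling  # how much the shift over-scales
    print(f"{name}: scaling={scaling:.6g} shift=2^-{shift} (x{ratio:.2f} of exact)")
```

Because the exponent is rounded up, the power-of-two factor is never smaller than the exact scaling, and for some layers it is noticeably larger.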

When I compile and run 'run_export' I get

Score: 14.625000%

olivierbichler-cea commented 3 years ago

Please don't forget to delete the export_CPP_int8 folder before running a new export whenever the dataset partitioning or pre-processing has changed. The problem was due to faulty stimuli in the dataset and a partitioning that differed from the one used in PyTorch. I consider the issue solved and am closing it.
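
A minimal sketch of that clean-up step, assuming the export is launched from the directory that contains export_CPP_int8 and using the command line quoted earlier in this thread. The helper names, the default paths and the dry_run switch are hypothetical and not part of N2D2.

```python
import shutil
import subprocess
from pathlib import Path


def export_command(ini_file="sound1d-onnx.ini", n2d2="./n2d2"):
    """Command line quoted in this thread for the INT8 CPP export."""
    return [n2d2, ini_file, "-seed", "1", "-w", "/dev/null",
            "-test", "-export", "CPP", "-calib", "-1"]


def fresh_export(export_dir="export_CPP_int8", dry_run=True, **kwargs):
    """Delete a stale export folder, then (unless dry_run) run the export.

    Returns the command so it can be inspected or logged.
    """
    path = Path(export_dir)
    if path.is_dir():
        # Stale calibration statistics from a previous run live here.
        shutil.rmtree(path)
    command = export_command(**kwargs)
    if not dry_run:
        subprocess.run(command, check=True)
    return command
```

Calling fresh_export(dry_run=False) after any change to the dataset removes the old folder and reruns the export; the default dry_run=True only cleans up and returns the command.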