CEA-LIST / N2D2

N2D2 is an open source CAD framework for Deep Neural Network simulation and full DNN-based applications building.

Import ONNX gives a segfault #109

Closed rattokiller closed 1 year ago

rattokiller commented 2 years ago

Hi,

I am trying to run n2d2 resnet-18-v1-onnx.ini -seed 1 -w /dev/null -export CPP -nbbits -32

but after compiling, the network's accuracy is 0%,

while if I test the ONNX directly I get 70%:

n2d2 resnet-18-v1-onnx.ini -seed 1 -w /dev/null -test

If I run

sudo ./n2d2 ResNet_ONNX.ini -seed 1 -w /dev/null -export CPP -nbbits -32 -act-rescaling-mode Floating-point -no-unsigned

I get a segfault.

With MobileNet v2, the generated executable core dumps.

Can anyone tell me what I'm doing wrong?

Cheers Filippo Ferrandino

davidbriand-cea commented 2 years ago

Hi Filippo, Thank you very much for the report. It is indeed an important point that needs to be fixed. I can reproduce your error (even worse): the command n2d2 resnet-18-v1-onnx.ini -seed 1 -w /dev/null -export CPP -nbbits -32 now gives me a segmentation fault.

The error slipped through our continuous integration system… We will let you know once the issue is resolved! Cheers, David

rattokiller commented 2 years ago

Hi,

can you tell me if it is possible to export a ResNet (already trained on ILSVRC2012) to C/CPP, and with which parameters?

I would need the code of a working network.

Cheers, Filippo Ferrandino

davidbriand-cea commented 2 years ago

Hi, The error was due to an out-of-bounds array access and has been fixed in the latest commit. The CPP export now works with the ResNet topology. The C export is less maintained, so I cannot guarantee that ResNet works there (residual connections are often a source of errors...)

It will be pushed to GitHub shortly, once our internal CI passes, so please be patient.

Let me know if all is ok

Cheers,

David

rattokiller commented 2 years ago

Hi,

the segfault with ResNet-18 is solved.

I am trying to export my ResNet-8 ONNX network and I get:

Segmentation fault
backtrace() returned 9 addresses
./n2d2(_ZN4N2D216exceptionHandlerEiP9siginfo_tPv+0x35)[0x556c5d229545]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fd48754d980]
./n2d2(_ZN4N2D216DeepNetGenerator17ONNX_processGraphESt10shared_ptrINS_7DeepNetEERKSt6vectorIS1_INS_4CellEESaIS6_EERKN4onnx10GraphProtoEiRNS_9IniParserE+0xde06)[0x556c5d083ab6]
./n2d2(_ZN4N2D216DeepNetGenerator16generateFromONNXERNS_7NetworkERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_9IniParserESt10shared_ptrINS_7DeepNetEERKSt6vectorISD_INS_4CellEESaISI_EE+0x4c6)[0x556c5d089896]
./n2d2(_ZN4N2D216DeepNetGenerator15generateFromINIERNS_7NetworkERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1383)[0x556c5d08b203]
./n2d2(_ZN4N2D216DeepNetGenerator8generateERNS_7NetworkERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xb9)[0x556c5d08ceb9]
./n2d2(main+0x89)[0x556c5ca842e9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fd481675bf7]
./n2d2(_start+0x2a)[0x556c5cabd8ba]

Attached are the necessary files: resnet8.zip

Cheers, Filippo

vtemplier commented 2 years ago

Hi @rattokiller,

Thank you for notifying us of this ONNX import problem. We are currently working to resolve this issue. It seems that the BiasAdd layer provides an unexpected tensor format to the biases.

Moreover, it seems you used tf2onnx to convert your network to the ONNX format. If so, could you re-convert your network with the --inputs-as-nchw option? Indeed, the standard format used by TensorFlow is NHWC. This format requires the creation of Transpose layers, which are not supported by the C/CPP exports. Can you also add the --opset 13 option? It guarantees that you use recent operators supported by N2D2 and the other main frameworks.
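For context, the NHWC/NCHW layout difference can be seen with a quick sketch (illustrative shapes; assumes NumPy is available):

```python
import numpy as np

# TensorFlow tensors are laid out NHWC (batch, height, width, channels),
# while ONNX/N2D2 expect NCHW (batch, channels, height, width).
x_nhwc = np.zeros((1, 32, 32, 3))            # e.g. one CIFAR-10-sized image
x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))  # the Transpose tf2onnx inserts

print(x_nhwc.shape)  # (1, 32, 32, 3)
print(x_nchw.shape)  # (1, 3, 32, 32)
```

With --inputs-as-nchw, tf2onnx takes the input directly in NCHW layout, so these extra Transpose nodes are not needed at the network inputs.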

rattokiller commented 2 years ago

Hi,

I exported with the new parameters: resnet8_fix.zip

Cheers, Filippo

rattokiller commented 2 years ago

Hi,

@vtemplier is there any news?

vtemplier commented 2 years ago

Hi @rattokiller,

It seems that the --inputs-as-nchw option you used to generate the ONNX hasn't removed the Transpose layer before the FC layers. This means we have to work on a generic method that handles this case automatically, should it occur again.

[screenshot: the ONNX graph, with a Transpose layer remaining before the FC layer]

We are trying to solve this problem as soon as possible.

rattokiller commented 2 years ago

hi @vtemplier

unfortunately this Transpose layer is generated by the Flatten() layer, which is needed in TensorFlow to connect layers with different shapes (even if the number of parameters is equal).

If I removed this layer, it would no longer be possible to run net.fit()
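The incompatibility can be made concrete with a small NumPy sketch (illustrative values): flattening the same tensor in NHWC and NCHW order produces different element orders, so the converter cannot simply drop the Transpose before Flatten without also permuting the FC weights.

```python
import numpy as np

# A tiny 1x2x2x2 NHWC tensor with distinct values 0..7.
x_nhwc = np.arange(8).reshape(1, 2, 2, 2)
x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))  # same data, NCHW layout

# Flattening the two layouts yields different element orders, which is
# why tf2onnx must keep a Transpose before Flatten so that the FC
# weights of the converted model still line up with their inputs.
print(x_nhwc.reshape(1, -1)[0])  # [0 1 2 3 4 5 6 7]
print(x_nchw.reshape(1, -1)[0])  # [0 2 4 6 1 3 5 7]
```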

vtemplier commented 2 years ago

I do understand your problem and why it is important to keep this layer in the model. In any case, thank you for bringing the problem to our attention.

rattokiller commented 2 years ago

hi @vtemplier @davidbriand-cea,

is there any news?

rattokiller commented 2 years ago

hi @vtemplier @davidbriand-cea,

I tried again to export the network with sudo n2d2 resnet-8-onnx.ini -seed 1 -w /dev/null -export CPP -nbbits -32 -calib 500 -db-export 500. It no longer segfaults, but it returns this new error:

...
Notice: Unused section model_3/average_pooling2d_14/AvgPool:0 in INI file
Notice: Unused section model_3/dense_7/MatMul:0 in INI file
Notice: Unused section onnx:Fc_def in INI file
Time elapsed: 4.99935 s
Error: Unsupported operation : Add for constant size = nbOutputs.

cmoineau commented 2 years ago

Hi @rattokiller,

It looks like this error comes from the fact that the bias is not fused into the fully-connected layer at the end of your network.

To fuse it, you can update your INI file with the parameter CNTK=1:

[onnx]
Input=trans
Type=ONNX
File=resnet8.onnx
Transpose=1
CNTK=1
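For intuition, what the fusion does is treat the trailing MatMul + Add pair as a single fully-connected layer with a bias, rather than leaving a standalone Add on a constant of size nbOutputs. A rough sketch of the equivalence (illustrative shapes, not N2D2 code):

```python
import numpy as np

def fc_fused(x, W, b):
    # A single fully-connected node carrying both weights and bias,
    # which is the form the CPP export can handle.
    return x @ W + b

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64))   # flattened features
W = rng.standard_normal((64, 10))  # MatMul weights
b = rng.standard_normal(10)        # the separate BiasAdd constant (size = nbOutputs)

# Unfused ONNX graph: a MatMul node followed by a separate Add node,
# the pattern the importer rejects without CNTK=1.
y_unfused = np.matmul(x, W) + b

assert np.allclose(fc_fused(x, W, b), y_unfused)  # numerically identical
```

The numbers are identical either way; only the graph structure changes, which is what matters to the exporter.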

Let me know if this fixes your issue.

A lot of work has been done on N2D2 and some of our examples are not up to date. We are currently updating them and adding them to our CI environment to avoid this kind of issue.

Thanks for your interest in our project.

Cheers, Cyril

rattokiller commented 2 years ago

Hi @cmoineau, now it gets further, but the segfault returns:

Layer: model_1/dense_3/BiasAdd:0 [Add]
  model_1/dense_3/BiasAdd:0 -> model_1/dense_3/MatMul:0
Layer: dense_3 [Softmax]
  Added transpose before: 2 1 0 3 
  Added transpose after: 2 1 0 3 
  # Inputs dims: 10 1 1 
  # Outputs dims: 1 1 10 
Segmentation fault
backtrace() returned 9 addresses
/home/uc/Documenti/N2D2/build/bin/n2d2(_ZN4N2D216exceptionHandlerEiP9siginfo_tPv+0x35)[0x5622fa71f1e5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fcb07551980]
/home/uc/Documenti/N2D2/build/bin/n2d2(_ZN4N2D27DeepNet20removeExtraTransposeEv+0x422)[0x5622fa3f4292]
/home/uc/Documenti/N2D2/build/bin/n2d2(_ZN4N2D216DeepNetGenerator16generateFromONNXERNS_7NetworkERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_9IniParserESt10shared_ptrINS_7DeepNetEERKSt6vectorISD_INS_4CellEESaISI_EE+0x4de)[0x5622fa558a2e]
/home/uc/Documenti/N2D2/build/bin/n2d2(_ZN4N2D216DeepNetGenerator15generateFromINIERNS_7NetworkERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x1383)[0x5622fa55a383]
/home/uc/Documenti/N2D2/build/bin/n2d2(_ZN4N2D216DeepNetGenerator8generateERNS_7NetworkERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xb9)[0x5622fa55c089]
/home/uc/Documenti/N2D2/build/bin/n2d2(main+0x89)[0x5622f9f49299]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fcb01679c87]
/home/uc/Documenti/N2D2/build/bin/n2d2(_start+0x2a)[0x5622f9f838aa]

rattokiller commented 2 years ago

Hi @cmoineau @vtemplier @davidbriand-cea @olivierbichler-cea

Going into debug mode, I found that the segfault is generated by cellFrame->getActivation()->getType():

(gdb) continue
Continuing.

Thread 1 "n2d2" received signal SIGSEGV, Segmentation fault.
0x0000555556b2190c in N2D2::DeepNet::removeExtraTranspose (this=0x555559ae9430)
    at /home/uc/Documenti/N2D2/src/DeepNet.cpp:1630
1630                 if (cellFrame->getActivation()->getType() != LinearActivation::Type){
(gdb)

This is because, just before, the generic Cell object is cast to Cell_Frame_Top.

Reading the code, the conditions for reaching line 1630 are:
1) the cell must have a child, and that child must be of type Transpose (otherwise there is nothing to simplify);
2) the permutation regenerates the starting combination.

Excluding the conditions on the permutations, what remains is: if a Cell has a child, then that Cell is a Cell_Frame_Top (and therefore I could always perform the conversion without problems).

Is this proposition correct?


void N2D2::DeepNet::removeExtraTranspose() {
    std::map<std::string, std::shared_ptr<Cell> > cloneCells;
    cloneCells.insert(mCells.begin(), mCells.end());

       (remove code)

                const std::vector<std::shared_ptr<Cell> > heads = getParentCells(cell->getName());
                const std::vector<std::shared_ptr<Cell> > tails = getChildCells(childs[0]->getName());
                std::shared_ptr<Cell_Frame_Top> cellFrame = std::dynamic_pointer_cast<Cell_Frame_Top>(cell);

                // When imported from ONNX the activation is placed on the Transpose layer instead of the head
                // So we move the activation back to the head !
                if (cellFrame->getActivation()->getType() != LinearActivation::Type){

Cheers, Filippo Ferrandino

cmoineau commented 1 year ago

Hi @rattokiller,

The latest commit should fix the error you encountered.

I made some changes in your INI file in order to generate a CPP export:

The Softmax layer is not supported by the export, so I ignored it. I also ignored the Transpose layer at the beginning of your ONNX file; this way we no longer need to add a Transpose layer at the beginning of the network.

However, I found something odd: when I load the latest ONNX you gave us, I can only get 10% accuracy (did you train your model before generating the ONNX?).

I can investigate this further if I get access to the script you used to train your Keras model.

Thanks again for notifying us, and sorry for the delay in the responses!

Cheers, Cyril

rattokiller commented 1 year ago

Hi @cmoineau

thank you very much for the support (^_^)

I'm using Kaggle; here is the ONNX with 80% accuracy, and the notebook.

cmoineau commented 1 year ago

The drop in accuracy seems to come from the loading of the database.

I downloaded CIFAR-10 using this repo: https://github.com/YoongiKim/CIFAR-10-images instead of this one: http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz. I think Keras applies some preprocessing we are not aware of.

Then I declared the database a little differently (You can find more information here):

[database]
Type=DIR_Database
DataPath=${N2D2_DATA}/CIFAR-10-images/test
RandomPartitioning=1
Depth=1
ValidExtensions=jpg
; 40% learn, 20% validation, the remaining 40% for test
Learn=.4
Validation=.2

; Environment
[sp]
SizeX=${SIZE}
SizeY=${SIZE}
NbChannels=3
BatchSize=${BATCH_SIZE}

[sp.Transformation-1]
Type=ColorSpaceTransformation
ColorSpace=RGB

[sp.Transformation-2]
Type=RangeAffineTransformation
FirstOperator=Divides
FirstValue=255.0

(You can find the new ini file here : model.txt)

With these transformations I get:

$ n2d2 model.ini -seed 1 -w /dev/null -test
...
Final recognition rate: 77.4%    (error rate: 22.6%)
    Sensitivity: 77.4% / Specificity: 97.4889% / Precision: 78.2173%
    Accuracy: 95.48% / F1-score: 77.4917% / Informedness: 74.8889%

Finally, to generate the CPP export of the resnet, it will be necessary to deactivate the memory optimizer of the export generator which currently has a multi-branch bug. (Thanks @vtemplier 😄)

To do so, create a file param.ini containing :

OptimizeBufferMemory=0

You can then generate your CPP export using the command:

n2d2 model.ini -seed 1 -w /dev/null -export CPP -nbbits -32 -db-export 500 -export-parameters param.ini

To test your export:

cd export_CPP_float32/ && make && ./bin/run_export

You should see an accuracy of ~77%!

Let me know if you have any issue to reproduce these steps.

Cheers, Cyril

rattokiller commented 1 year ago

Hi @cmoineau, did you use the ONNX I sent, or did you train another one?

To make it work I added an ignore rule: Ignore=dense_3 model_1/conv2d_7/BiasAdd__6:0

This is my model.txt. The test works fine, the export generates the code, and the latter compiles and runs (^_^)

But when I run sudo n2d2 model.ini -seed 1 -w /dev/null -export CPP -nbbits 8 -db-export 1000 -export-parameters param.ini -calib -1, I get:

  - model_1/conv2d_13/BiasAdd:0: prev=5476.95, act=66.2797, bias=0.0023597
      quant=127, global scaling=28088.2 -> cell scaling=0.00153536
  - model_1/concatenate_5/concat:0: prev=28088.2, act=54.2452, bias=0.00901091
      quant=1, global scaling=6019.95 -> cell scaling=4.66585
Time elapsed: 24.2798 s
Error: Quantization of cell 'model_1/flatten_1/Reshape:0' of type 'Reshape' is not supported yet.

Should I report it as a new issue?

Cheers, Filippo Ferrandino

cmoineau commented 1 year ago

Sorry for the late answer!

Yes, I used another ONNX that I trained with your script; I forgot to mention it.

I see that you have reported the Reshape error in a new issue, so I am closing this one!