CEA-LIST / N2D2

N2D2 is an open source CAD framework for Deep Neural Network simulation and full DNN-based applications building.

Segfault with CPP exported model [wrong total memory] #83

Closed stephaneburel-cea closed 3 years ago

stephaneburel-cea commented 3 years ago

Hello. I'd like to report a bug with the exported CPP model. Using ResNet_ONNX.ini from the repository and the ONNX model available at https://s3.amazonaws.com/onnx-model-zoo/resnet/resnet18v1/resnet18v1.onnx, the following command exports the model to CPP:

n2d2.sh "$N2D2_MODELS/ResNet_ONNX.ini" -seed 1 -w /dev/null -export CPP -calib -1 -nbbits 8 -act-rescaling-mode Floating-point -no-unsigned

The export succeeds, and so does the compilation of the exported model.

But a wild segfault appears when running the executable:

./run_export

which gives Segmentation fault (core dumped).

Best regards, Stéphane Burel

olivierbichler-cea commented 3 years ago

Hello, I cannot reproduce this issue with the latest version of N2D2. Could you check which commit you used?

Cheers, Olivier

stephaneburel-cea commented 3 years ago

Hello. Sadly, the problem persists even with the latest commit. Regards, Stéphane Burel

olivierbichler-cea commented 3 years ago

Could you send me (or give me the location of) the full generated export_CPP_int8 folder that causes the segfault?

andreistoian commented 3 years ago

I have the same problem: the -O3 or -O2 build crashes when run from the terminal. strace gives:

clone(child_stack=0x7f9f3345df30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f9f3345e9d0, tls=0x7f9f3345e700, child_tidptr=0x7f9f3345e9d0) = 2492
mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f9f3245d000
mprotect(0x7f9f3245d000, 4096, PROT_NONE) = 0
clone(child_stack=0x7f9f32c5cf30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f9f32c5d9d0, tls=0x7f9f32c5d700, child_tidptr=0x7f9f32c5d9d0) = 2493
futex(0x26cb5a4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x26cbfc4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x26cb5a4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x26cb5a4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x26cb5a4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x26cb5a4, FUTEX_WAKE_PRIVATE, 2147483647) = ? <unavailable>

There are 7 clone syscalls (on an 8-core machine), so I suspect the crash happens immediately after the OpenMP threads start.

It is very tricky: when running the release build under gdb there is no crash, only 0% correct classification. When compiling without optimization and with debug info (-O0 -g), it works fine.

olivierbichler-cea commented 3 years ago

It seems that the total memory reported by the export's memory manager is wrong in some cases. In the generated export_CPP_int8/src/NetworkPropagate.cpp, can you check the value of #define MEMORY_SIZE? It should be 1017856 for this network. If it is less, you are encountering the same issue as @stephaneburel-cea.

andreistoian commented 3 years ago

Sorry, I forgot to mention: I have this problem on another network (1D sound), with a float32 export.

I compiled without OpenMP and without -march=native, with -O2 and -g, then dumped the core and loaded it in gdb. The error seems to be in:

#0  N2D2::Network::poolcellPropagate<32, 1, 1990, 32, 1, 995, 0, 0, 1, 2, 1, 2, (Pooling_T)0, (ActivationFunction_T)7, 71768, 56072, 0, 7608, 32, 7608, 31840, 0, 0, 32, float, float> (
    outputs=0x25158e0 <mem+30432>, inputs=0x2554360 <mem+287072>, this=<optimized out>)
    at ./include/Network.hpp:988
988                             if (inputs[iOffset + output + sx * INPUT_MEM_STRIDE]
(gdb) bt
#0  N2D2::Network::poolcellPropagate<32, 1, 1990, 32, 1, 995, 0, 0, 1, 2, 1, 2, (Pooling_T)0, (ActivationFunction_T)7, 71768, 56072, 0, 7608, 32, 7608, 31840, 0, 0, 32, float, float> (
    outputs=0x25158e0 <mem+30432>, inputs=0x2554360 <mem+287072>, this=<optimized out>)
    at ./include/Network.hpp:988
#1  N2D2::Network::propagate<float> (this=<optimized out>, inputs=<optimized out>, 
    outputs=0x3769950) at src/NetworkPropagate.cpp:275

olivierbichler-cea commented 3 years ago

Hi, could you send me the exported project? There is probably an issue with the memory manager, which is the trickiest part of the export...

andreistoian commented 3 years ago

As you mentioned, the memory size seems to be computed incorrectly: it is around 100 bytes less than needed. That would explain the random crashes; without optimization, the memory allocations are most likely padded somehow in the heap, or the area around the allocated buffer is simply not used. I'll send you the exported project.

olivierbichler-cea commented 3 years ago

I think I have a fix for your issue, which is due to bad handling of memory wrapping. Here is a patch you can try. N2D2 needs to be patched and recompiled, and the export must be regenerated.

mem_manager_fix.log

Note that this issue is not related to the original issue of this thread.

olivierbichler-cea commented 3 years ago

Actually, I have now pushed the patch in the latest commit. @andreistoian, please use the new issue I created for your specific problem: #84

olivierbichler-cea commented 3 years ago

Closing this issue due to inactivity; it may be solved by the latest commits. Please reopen it if the issue still arises in the latest version.