VeriSilicon / TIM-VX

VeriSilicon Tensor Interface Module

Valgrind Reporting Many Warnings With Graphs and Contexts #544

Open lkaneda opened 1 year ago

lkaneda commented 1 year ago

Running valgrind on our software, we found many errors coming from TIM-VX related to the graph and context instances. To determine whether this was an issue with our software or something happening internally, we ran valgrind against the lenet example provided in this repo and saw the same output. I've attached the valgrind log here. It's hard for me to tell whether this is a TIM-VX issue, an OpenVX issue, or potentially an issue on our side, so I'm hoping this log can help figure out what is happening.

The trend I see in the log is that it happens in all TIM-VX functions: creating, initializing, validating (compile), executing (run), and destroying.

The command we ran to get this output: valgrind --tool=memcheck --leak-check=full --error-limit=no --log-file="{filename}" ./{program executable filename}

valgrindOutput3.txt

sunshinemyson commented 1 year ago

@lhawana ,

Thanks for sharing. We are working on this internally and will keep you posted once we have addressed the issues.

sunshinemyson commented 1 year ago

@lhawana ,

We fixed some of the issues detected by valgrind in tim-vx/vx-delegate over the past month. You can check the commit history for the fixes.

We also double-checked and confirmed that most of the issues reported in our low-level driver are false positives.

Thanks

BralSLA commented 1 year ago

Hey @sunshinemyson, sorry for the silence on this ticket; I'll be handling it from here. Do you have a particular commit that I should check out? I tried merging this commit without success. After merging just the change in this file, valgrind still reports the same errors/warnings.

If I try merging the entire file, there are a lot of other dependencies I have to merge as well to get it to compile in our version, and even still I'm unable to get it to load successfully.

Is there either a commit you can point me to that should have this addressed, or are you able to tell me what I need to merge between the version of tim-vx we are using, and this version?

Thanks

sunshinemyson commented 1 year ago

@BralSLA ,

Can you update to the latest version? We don't maintain legacy versions.

Thanks

BralSLA commented 1 year ago

Hey @sunshinemyson ,

I've updated to the latest version, but it's failing to compile in my Yocto build. I'm getting the following 2 errors:

In constructor 'tim::vx::ops::Topk::Topk(tim::vx::Graph*, uint32_t, int32_t)':
| /home/slroot/build_001/NXPBuild/build-ucm-imx8m-plus/workspace/sources/tim-vx/src/tim/vx/ops/topk.cc:37:39: error: 'vsi_nn_topk_param' {aka 'struct _vsi_nn_topk_param'} has no member named 'axis'
| 37 | this->impl()->node()->nn_param.topk.axis = axis;

If I look at the definition of the vsi_nn_topk_param struct, it does have an axis member, but I'm not sure what the include hierarchy for the struct is at the moment, so I can't tell whether the definition I'm looking at is the one the Topk class actually picks up. I suppose it's also possible that something in our Yocto build process is messing something up. Have you run into this error?

Thanks again

BralSLA commented 1 year ago

Hey @sunshinemyson ,

Update: I needed to update to the latest available GPU drivers and then clean my build environment. I've gotten the latest version to build and have it on our system; however, since upgrading to the latest TIM-VX and updating our GPU drivers to be compatible, we are experiencing a segfault. We see a lot of "Create Tensor Fail" messages in the output, followed by the segfault. The top of the call stack at the point of the segfault is inside a call to CreateOperation(). Below is the output:

Program received signal SIGSEGV, Segmentation fault.
0x0000fffff7ddcca0 in tim::vx::BuiltinOpImpl::SetRoundingPolicy(tim::vx::OverflowPolicy, tim::vx::RoundingPolicy, tim::vx::RoundType, unsigned int) () from /usr/lib/libtim-vx.so
Segmentation fault

Here is the line where we are calling CreateOperation()

auto conv1 = this->graph->CreateOperation<tim::vx::ops::Conv2d>(conv1_weight_shape[3], conv1_pad_type, conv1_ksize, conv1_stride, conv1_dilation, conv1_pad);

Do you know why this may be happening? Let me know if you need any more information.

Thanks again
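
One way to narrow down a pattern like "Create Tensor Fail" messages followed by a segfault is to stop at the first failed creation instead of letting an unusable handle reach a later call. The following is a minimal sketch of that idea only. FakeGraph, FakeTensor, Require and FailureIsCaughtEarly are hypothetical stand-ins invented for illustration; they are not TIM-VX types, and the sketch assumes a failed creation can be detected as a null std::shared_ptr, which this thread does not confirm for TIM-VX.

```cpp
#include <cassert>
#include <iostream>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration only; NOT the TIM-VX API.
struct FakeTensor {
  int id;
};

struct FakeGraph {
  // Returns nullptr to simulate a creation failure (an assumption).
  std::shared_ptr<FakeTensor> CreateTensor(bool succeed) {
    if (!succeed) return nullptr;
    return std::make_shared<FakeTensor>(FakeTensor{next_id_++});
  }
  int next_id_ = 0;
};

// Throws at the creation site so a null handle never propagates further.
template <typename T>
std::shared_ptr<T> Require(std::shared_ptr<T> p, const std::string& what) {
  if (!p) throw std::runtime_error("creation failed: " + what);
  return p;
}

// Example flow: returns true if the failure is reported before any use.
bool FailureIsCaughtEarly() {
  FakeGraph graph;
  auto input = Require(graph.CreateTensor(true), "input");
  assert(input != nullptr);
  try {
    auto weights = Require(graph.CreateTensor(false), "weights");
    (void)weights;  // never reached in this example
  } catch (const std::runtime_error& e) {
    std::cerr << e.what() << '\n';
    return true;
  }
  return false;
}
```

With a guard of this kind around each creation call, the first failing object is named in the error message, which is usually easier to report than a crash inside a later, unrelated call.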

sunshinemyson commented 1 year ago

@BralSLA

I suppose you are hitting this issue on an NXP platform. We have not received any such report internally, even though we run daily tests on NXP platforms.

Can you provide more version information about your system and driver so that I can forward it to NXP?

BralSLA commented 1 year ago

@sunshinemyson Thanks for getting back to me. We are running Yocto version 5.15.71, with the GPU driver version imx-gpu-viv-6.4.11.
Let me know if you need any more info.

sunshinemyson commented 1 year ago

@BralSLA ,

We have not seen this issue in our internal tests or heard of it from NXP. Since your crash point is unusual, I suspect the problem is that your build is not clean. Please double-check that you built TIM-VX against the external SDK correctly; this looks like a binary incompatibility issue.

BTW, our CI verifies TIM-VX on an NXP i.MX 8MP silicon board with the 6.4.11 driver for each patch, and we see no such issue there.
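
To illustrate what a binary incompatibility between headers and a prebuilt library can look like, here is a minimal sketch. The two TopkParam structs are hypothetical layouts made up for this example; the real vsi_nn_topk_param definition is not shown in this thread, and whether this is the actual cause of the crash above is not confirmed. The point is only that code compiled against one version of a struct and linked against a library built with another will disagree on sizes and offsets, and that a size check can surface the mismatch at build time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layouts, for illustration only (not the real SDK structs).
namespace old_sdk {
struct TopkParam {
  uint32_t k;
};
}  // namespace old_sdk

namespace new_sdk {
struct TopkParam {
  uint32_t k;
  int32_t axis;  // member added in the newer header
};
}  // namespace new_sdk

// Size the (hypothetical) prebuilt library was compiled with.
constexpr std::size_t kLibraryParamSize = sizeof(new_sdk::TopkParam);

// True when the caller's view of the struct has the library's size.
template <typename CallerParam>
constexpr bool LayoutMatchesLibrary() {
  return sizeof(CallerParam) == kLibraryParamSize;
}

// Build-time guard: a stale header becomes a compile error, not a crash.
static_assert(LayoutMatchesLibrary<new_sdk::TopkParam>(),
              "header/library struct size mismatch");

// Runtime variant of the same check, e.g. for a startup sanity test.
inline void AssertCallerMatchesLibrary() {
  assert(LayoutMatchesLibrary<new_sdk::TopkParam>());
}
```

In this sketch, LayoutMatchesLibrary<old_sdk::TopkParam>() evaluates to false, which is the situation a stale sysroot or a partially cleaned build directory can produce.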