AxisCommunications / acap-computer-vision-sdk-examples

Example applications that provide developers with the tools and knowledge to use Axis Camera Application Platform (ACAP) Computer Vision solution
Apache License 2.0

Unable to load custom model on firmware 11 #160

Closed Duckypu closed 1 year ago

Duckypu commented 1 year ago

Description

Thank you for your attention. I've trained a custom ssdlite_mobiledet model using the TensorFlow API. Building on previous work, I changed the paths in Dockerfile.model and env.aarch64.artpec8, and I was able to run it successfully in the following environment:

However, when I upgraded the Axis firmware to version 11, I encountered the following issue during inference:


inference-server_1            | ERROR in Inference: Failed to load model model.tflite (Could not send message: Transport endpoint is not connected)
object-detector-api-python_1  | <_InactiveRpcError of RPC that terminated with:
object-detector-api-python_1  |         status = StatusCode.CANCELLED
object-detector-api-python_1  |         details = ""
object-detector-api-python_1  |         debug_error_string = "{"created":"@1695725084.818779200","description":"Error received from peer unix:/tmp/acap-runtime.sock","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"","grpc_status":1}"

Issue environment

Please help me, thanks in advance.

Corallo commented 1 year ago

Hi @Duckypu

First, I'd recommend making sure that you are using the correct firmware version with the correct SDK version, since we only test those combinations: https://axiscommunications.github.io/acap-documentation/docs/api/computer-vision-sdk-apis.html

Do you have this issue also when you try the model provided in the example?

For debugging, first upload your model to the camera and run this on the device:

larod-client -g model_path -c axis-a8-dlpu-tflite

This will test the loading of the model. If it fails, run:

journalctl -u larod

and check the output.
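The two debugging steps above can be chained so that the larod journal is only collected when the load fails. This is a minimal sketch, assuming a Python interpreter is available wherever the commands are run; `check_model_load`, `run_command`, and the `runner` parameter are hypothetical helper names, and only the `larod-client` and `journalctl` invocations quoted in this thread are used. Whether `larod-client` returns a non-zero exit code on failure is not stated here, so the sketch also treats an `ERROR` line in the output as a failure.

```python
import subprocess
from typing import Callable, Dict, List, Tuple

# A runner takes an argv list and returns (exit_code, combined_output).
Runner = Callable[[List[str]], Tuple[int, str]]


def run_command(argv: List[str]) -> Tuple[int, str]:
    """Run a command and return its exit code and combined stdout/stderr."""
    proc = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return proc.returncode, proc.stdout


def check_model_load(
    model_path: str,
    chip: str = "axis-a8-dlpu-tflite",
    runner: Runner = run_command,
) -> Dict[str, object]:
    """Try to load a model with larod-client; fetch the larod journal on failure."""
    code, output = runner(["larod-client", "-g", model_path, "-c", chip])
    # Assumption: a failed load shows up as a non-zero exit code or an ERROR line.
    failed = code != 0 or "ERROR" in output
    journal = ""
    if failed:
        _, journal = runner(["journalctl", "-u", "larod"])
    return {"loaded": not failed, "client_output": output, "journal": journal}
```

Passing a different `runner` makes it possible to dry-run the logic without a camera, or to wrap the commands in whatever remote shell is used to reach the device.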

Duckypu commented 1 year ago

Hi @Corallo Thank you for your prompt response. I have considered version compatibility and have paired different firmware versions with the corresponding SDK versions, as described under the issue environment above (in fact, I have tried even more combinations than those listed). What I can confirm is that the model loads successfully on version 10 but not on version 11.

I ran the command you mentioned:

larod-client -g model_path  -c axis-a8-dlpu-tflite

I got:

2023-09-27T09:44:58.848 Connecting to larod...
2023-09-27T09:44:58.863 Connected
2023-09-27T09:44:59.295 ERROR: When loading model synchronously (-6): Could not load model: Asynchronous connection has been closed

And then

journalctl -u larod

I got:

Sep 27 09:44:58 axis-b8a44f495376 larod[73380]: Created a new session ID: 1, client: :1.519
Sep 27 09:44:58 axis-b8a44f495376 sh[73380]: WARNING: Fallback unsupported op 32 to TfLite
Sep 27 09:44:59 axis-b8a44f495376 sh[73380]: double free or corruption (out)
Sep 27 09:44:59 axis-b8a44f495376 systemd[1]: larod.service: Main process exited, code=killed, status=6/ABRT
Sep 27 09:44:59 axis-b8a44f495376 systemd[1]: larod.service: Failed with result 'signal'.
Sep 27 09:44:59 axis-b8a44f495376 systemd[1]: larod.service: Scheduled restart job, restart counter is at 9.
Sep 27 09:44:59 axis-b8a44f495376 systemd[1]: Stopped Machine learning service.
Sep 27 09:44:59 axis-b8a44f495376 systemd[1]: Starting Machine learning service...
Sep 27 09:44:59 axis-b8a44f495376 systemd[1]: Started Machine learning service.
Sep 27 09:44:59 axis-b8a44f495376 larod[73542]: Service started
Sep 27 09:44:59 axis-b8a44f495376 larod[73542]: Created a new session ID: 0, client: :1.523
Sep 27 09:44:59 axis-b8a44f495376 larod[73542]: Session 0 killed since client's (:1.523) connection has been lost
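When comparing firmware and SDK combinations, it can help to reduce each journal capture to the few signals that matter in the output above: the unsupported-op warning, the heap-corruption abort, and the service restart counter. The sketch below only matches the literal message strings that appear in this thread; `scan_journal` is a hypothetical helper, not part of larod or the SDK.

```python
import re
from typing import Any, Dict

# Patterns copied from the messages in the journal output of this thread.
UNSUPPORTED_OP = re.compile(r"Fallback unsupported op (\d+) to TfLite")
RESTART_COUNTER = re.compile(r"restart counter is at (\d+)")


def scan_journal(text: str) -> Dict[str, Any]:
    """Summarize crash-related signals found in `journalctl -u larod` output."""
    ops = sorted({int(code) for code in UNSUPPORTED_OP.findall(text)})
    counters = [int(value) for value in RESTART_COUNTER.findall(text)]
    return {
        # Op codes that were reported as falling back to TfLite.
        "unsupported_ops": ops,
        # glibc prints this message when it detects heap corruption and aborts.
        "heap_corruption": "double free or corruption" in text,
        # systemd reports a SIGABRT-terminated main process this way.
        "aborted": "status=6/ABRT" in text,
        # Highest restart counter seen, or None if the service never restarted.
        "restart_counter": max(counters) if counters else None,
    }
```

One possible reading of the warning, offered as an assumption rather than something confirmed in this thread: if the number follows the TFLite `BuiltinOperator` enum, 32 is `CUSTOM`, which would match the detection post-processing op that SSD models of this kind usually carry.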

In addition, I compared "ssd_mobilenet_v2_coco_quant_postprocess.tflite" and "my_custom.tflite" in Netron. I compared the inputs, the outputs, and even the structure, and I can't see anything unusual.

ssd_mobilenet_v2_coco_quant_postprocess.tflite: image

my_custom.tflite: image

Lastly, I'm happy to provide my unweighted model to you personally if you want. Thank you!

Corallo commented 1 year ago

Thanks for the detailed report. This looks like a bug on our side. We would have to investigate and try to replicate.

If you can't share your model publicly, the best option is to open a ticket here: https://www.axis.com/support/helpdesk/cases Attach an unweighted version of your model and, if possible, add a link to this issue for reference.

Duckypu commented 1 year ago

Thank you, I've already opened a ticket (#02150437). I provided two versions of the model for you (TensorFlow 1 and TensorFlow 2).

If you have any problems with the attached models, please let me know.

Corallo commented 1 year ago

Hi @Duckypu

Could you try running the larod-client command again, like this:

larod-client -g model_path -c axis-a8-dlpu-tflite -R 10 -i ''

and this time provide the system log? You might be experiencing an out-of-memory issue.

You can find the system log in the GUI under System -> Logs -> View the system log

Duckypu commented 1 year ago

Hi @Corallo

I'm glad to receive your messages.

I ran the command you mentioned:

larod-client -g model_path  -c axis-a8-dlpu-tflite -R 10 -i ''

I got:

2023-10-03T14:47:12.703 Connecting to larod...
2023-10-03T14:47:12.719 Connected
2023-10-03T14:47:13.382 ERROR: When loading model synchronously (-6): Could not load model: Asynchronous connection has been closed

and then checked the GUI under System -> Logs -> View the system log

I got:

2023-10-03T14:47:12.902+08:00 axis-b8a44f495376 [ INFO ] sh[1841]: WARNING: Fallback unsupported op 32 to TfLite
2023-10-03T14:47:13.325+08:00 axis-b8a44f495376 [ INFO ] sh[1841]: double free or corruption (out)
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ ERR ] kernel: [ 225.208127][ T1845] larod: singleprocq: potentially unexpected fatal signal 6.
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208168][ T1845] CPU: 0 PID: 1845 Comm: singleprocq Kdump: loaded Tainted: G O 5.15.13-axis9 #1
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208181][ T1845] Hardware name: AXIS P3265/P3267/P3268 Dome Camera (DT)
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208190][ T1845] pstate: 60000000 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208202][ T1845] pc : 0000007f9a7af0f8
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208209][ T1845] lr : 0000007f9a7af0e4
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208216][ T1845] sp : 0000007f987c8100
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208223][ T1845] x29: 0000007f987c8100 x28: 0000007f987c88b8 x27: 0000007f987c85a8
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208242][ T1845] x26: 0000007f9a8b7a60 x25: 0000007f987c8ba8 x24: 0000007f9a880d8a
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208259][ T1845] x23: 0000007f94026000 x22: 0000000000000001 x21: 0000000000000006
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208275][ T1845] x20: 0000007f9a8b96e0 x19: 0000000000000735 x18: 000000004c41bed4
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208291][ T1845] x17: 0000000000000000 x16: 0000000000000000 x15: 000000005636a287
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208307][ T1845] x14: 0000000000000000 x13: 2974756f28206e6f x12: 6974707572726f63
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208323][ T1845] x11: 6333323930316363 x10: 000000000000000a x9 : 0000007f987c8440
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208339][ T1845] x8 : 0000000000000083 x7 : 6320726f20656572 x6 : 0000000000000020
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208355][ T1845] x5 : 0000000000000001 x4 : 0000007f9a8b96e0 x3 : 0000007f987ca0c0
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208371][ T1845] x2 : 0000000000000006 x1 : 0000000000000735 x0 : 0000000000000000
2023-10-03T14:47:13.360+08:00 axis-b8a44f495376 [ INFO ] dbg-cgi[1141]: Core dump ID: axis-b8a44f495376_1696315633_1841.core
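Some of the register values in the dump above are printable ASCII when read as little-endian bytes, which gives a quick cross-check of what the process was doing when it received signal 6. The sketch below is an interpretation aid only, assuming the usual little-endian AArch64 Linux configuration; `decode_register` and `ascii_registers` are hypothetical helper names.

```python
import re
from typing import Dict, Optional

# Matches entries such as "x13: 2974756f28206e6f" or "x9 : 0000007f987c8440".
REGISTER_RE = re.compile(r"\b(x\d+)\s*:\s*([0-9a-f]{16})\b")


def decode_register(hex_value: str) -> Optional[str]:
    """Return the register content as text if all 8 bytes are printable ASCII."""
    # The log prints the 64-bit value most significant byte first; reverse it
    # to get the byte order as stored in little-endian memory.
    raw = bytes.fromhex(hex_value)[::-1]
    if all(32 <= byte < 127 for byte in raw):
        return raw.decode("ascii")
    return None


def ascii_registers(log_text: str) -> Dict[str, str]:
    """Collect every general-purpose register that holds printable ASCII."""
    found = {}
    for name, value in REGISTER_RE.findall(log_text):
        text = decode_register(value)
        if text is not None:
            found[name] = text
    return found
```

Applied to the dump above, x12, x13, and x7 decode to "corrupti", "on (out)", and "ree or c", fragments of the "double free or corruption (out)" message logged just before the signal. That is consistent with the abort coming from the allocator's corruption check rather than from the kernel killing the process for lack of memory.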

In my opinion, I don't think the issue is related to memory. After all, in the OS 10 version, the model could be successfully loaded. Or is it the case that there is a default model running on the device after the OS 11 version?

Corallo commented 1 year ago

I have been trying your model on a P3265-LVE with 11.5 and 11.6, and it works fine for me. Because that camera model has only 1 GB of RAM, I was expecting an out-of-memory issue: a known difference between 10.x and 11.x is that the peak memory used during model loading is higher. But looking at the log you provided, that doesn't seem to be the case. Did you try the command with the latest firmware?

The only thing I can reproduce is the warning message about OP32, even though it doesn't result in a crash for me. Can you elaborate on how you perform the quantization and the conversion to TFLite?
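One way to test the peak-memory hypothesis directly is to sample the kernel's memory counters while the model is loading and look at the lowest amount of available memory reached. The following is a rough sketch, assuming a Python interpreter on the device and the standard Linux `/proc/meminfo` file; all function names are hypothetical, and the sampling interval is an arbitrary choice that may miss short spikes.

```python
import time
from typing import Dict, Iterable, List


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo content into a mapping of field name to value (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def read_meminfo(path: str = "/proc/meminfo") -> str:
    """Read one snapshot of the kernel memory counters."""
    with open(path, "r", encoding="ascii") as handle:
        return handle.read()


def sample_meminfo(duration_s: float = 10.0, interval_s: float = 0.2) -> List[str]:
    """Collect snapshots for duration_s seconds, one every interval_s seconds."""
    snapshots = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        snapshots.append(read_meminfo())
        time.sleep(interval_s)
    return snapshots


def lowest_available_kb(snapshots: Iterable[str]) -> int:
    """Return the smallest MemAvailable value (kB) across the snapshots."""
    return min(parse_meminfo(snapshot)["MemAvailable"] for snapshot in snapshots)
```

Starting `sample_meminfo()` in one shell, running the `larod-client` command in another, and then calling `lowest_available_kb()` on the result gives a rough lower bound on free memory during loading. A value that stays well above zero would point away from an out-of-memory condition.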

Duckypu commented 1 year ago

Hi @Corallo

After digging deeper into this, I found a mistake on my end. My camera model is actually the P3265-LV, not the P3265-LVE. Can you also load the model successfully on this camera model?

Now, I've managed to successfully load the model with version 11.5.64 and SDK 1.9 randomly. However, even after making sure I have the correct firmware version, I'm still facing problems loading the model in other versions:

Additionally, I'd be happy to share the conversion method with you privately. Can I send it to you through a private channel?

Corallo commented 1 year ago

Hi @Duckypu

Yes, I actually tested on the P3265-LV too, but the two devices should be equivalent.

What do you mean by "randomly"? Is it not consistent/reproducible?

Duckypu commented 1 year ago

Hi @Corallo I've already sent the mail, please let me know if you didn't receive it.

Corallo commented 1 year ago

@Duckypu Hi, I am sorry for the mistake, I had a typo in my mail.

ThenoobMario commented 1 year ago

Hi @Corallo,

I am facing the same issue when I try to load my custom model as well.

Corallo commented 1 year ago

@ThenoobMario Please open a new discussion or issue and provide some more context :)

Corallo commented 1 year ago

Moving this issue into discussions, as for now it doesn't seem like a bug.