Bad accuracy with compiled elf-file for custom DPU on the MNIST tutorial

chhait07 commented 3 years ago

Hi everybody, I have first downloaded the Xilinx zcu102 image and flashed it to SD card. Then I executed the MNIST-Classification-TensorFlow with default configurations. Now I can successfully run the resulting program on my ZCU102 board.

Now I want to execute the MNIST-Classification-TensorFlow example on a custom DPU, so I created a DPU (image) via the DPU-TRD Vitis flow. To use the custom DPU, I copied the BOOT.BIN and dpu.xclbin files to the BOOT partition (of the previously working Xilinx image) and also copied dpu.xclbin to /usr/lib. My DPU is then recognized successfully, which I verified with the dexplorer -w command.

To execute the MNIST program on my custom DPU, I took the .hwh file (from the DPU generation) and used the dlet command to generate a .dcf file. In step 6 of the tutorial I added --options "{'dcf':'<my dcf file>'}" to the vai_c_tensorflow command to get an .elf file that fits my custom DPU. I then copied the .elf file to my board and executed the MNIST program. I do not get an error; here is the corresponding output:

Command line options:
 --image_dir :  images
 --threads   :  1
 --model     :  model_B512_LowPerformance/dpu_customcnn.elf
Pre-processing 10000 images...
Starting 1 threads...
FPS=2822.20, total frames = 10000 , time=3.5433 seconds
Correct: 980 Wrong: 9020 Accuracy: 0.098

The accuracy is very low, which is not normal, so where is the problem? When I compare the configuration of my custom DPU with the output of the ddump command, the .elf file should fit my DPU perfectly.

When I execute the MNIST program with the default configuration on my custom DPU, I get an error. When I execute the MNIST program compiled for my custom DPU, I do not get an error. So I assume that the .elf file is correct for my custom DPU.

So, why is the accuracy for the 'custom' elf file on the custom DPU so bad? What am I doing wrong?
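
One observation about the reported number: 980 correct out of 10000 is exactly the number of images of the digit 0 in the standard MNIST test set. That would be consistent with (though it does not prove) the network returning the same class for every image, which points to invalid DPU output rather than a slightly degraded model. The following Python sketch checks the arithmetic; the per-class counts are an assumption taken from the public MNIST test set, not from this thread, and they only apply if the tutorial's 10000 images are that unmodified test set.

```python
# Per-class image counts of the standard MNIST test set (10,000 images).
# Assumption: these come from the public dataset, not from this thread.
MNIST_TEST_COUNTS = {0: 980, 1: 1135, 2: 1032, 3: 1010, 4: 982,
                     5: 892, 6: 958, 7: 1028, 8: 974, 9: 1009}


def constant_prediction_accuracy(counts, predicted_class):
    """Accuracy obtained when every image is assigned the same class."""
    total = sum(counts.values())
    return counts[predicted_class] / total


def classes_matching(counts, correct):
    """Classes whose image count equals the observed number of correct results."""
    return [label for label, n in counts.items() if n == correct]


if __name__ == "__main__":
    # Accuracy if the model always answers "0".
    print(constant_prediction_accuracy(MNIST_TEST_COUNTS, 0))
    # Which single class would explain exactly 980 correct predictions?
    print(classes_matching(MNIST_TEST_COUNTS, 980))
```

If this reading is right, printing the predicted class for a handful of images on the board would show the same value every time.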

chhait07 commented 3 years ago

I have made some additional tests. I have 3 different DPUs and the 3 .elf files belonging to them. The 3 DPUs are:

From the standard Xilinx image (called "Xilinx DPU" below): 3 DPUs, architecture B4096, RAM usage low, DSP usage high
"B4096_HP": 1 DPU, architecture B4096, RAM usage high, DSP usage high
"B512_LP": 1 DPU, architecture B512, RAM usage low, DSP usage low

I now tested all three .elf files on all 3 DPUs. Result:

On B512_LP I can execute all 3 .elf files with the same result: no error, accuracy 9.8%
On B4096_HP I can execute all 3 .elf files with the same result: no error, accuracy 9.8%
On the Xilinx DPU I can execute the corresponding .elf file successfully with 97.75% accuracy
On the Xilinx DPU, executing the .elf files for B4096_HP and B512_LP produces the following error:
```
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0213 05:00:37.235242 1151 xrt_cu.cpp:165] Check failed: is_done cu timeout! core_idx 1 handle=0x55abe5fd50 ENV_PARAM(XLNX_DPU_TIMEOUT) 10000 state 1 ERT_CMD_STATE_COMPLETED 4 ms 10010 bo=1is_done 0
*** Check failure stack trace: ***
Aborted
```
So there are 2 questions left:
1. Why is the accuracy with B512_LP and B4096_HP with their fitting .elf files so bad?
2. Why can I execute .elf files that do not fit to the running DPU? Shouldn't I get the "DPU configuration mismatch for kernel customcnn"-error in this case?
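
For comparing many DPU/.elf combinations like the ones above, it can help to collect the summary line of each run automatically. The following is a minimal Python sketch that parses the "Correct / Wrong / Accuracy" line as it appears in the outputs quoted in this thread; the function names and the dictionary layout are hypothetical, and the script does not launch the application itself.

```python
import re

# Matches both summary formats seen in this thread, e.g.
#   "Correct: 980 Wrong: 9020 Accuracy: 0.098"
#   "Correct:980, Wrong:9020, Accuracy:0.0980"
_SUMMARY = re.compile(
    r"Correct:\s*(\d+),?\s*Wrong:\s*(\d+),?\s*Accuracy:\s*([0-9.]+)"
)


def parse_summary(output):
    """Return the result counts from a run's output, or None if the run
    printed no summary line (for example because it aborted)."""
    match = _SUMMARY.search(output)
    if match is None:
        return None
    return {
        "correct": int(match.group(1)),
        "wrong": int(match.group(2)),
        "accuracy": float(match.group(3)),
    }


def tabulate(runs):
    """Map each (dpu, elf) label pair to its parsed summary.

    `runs` is a dict {(dpu_name, elf_name): captured_output_text}.
    """
    return {key: parse_summary(text) for key, text in runs.items()}


if __name__ == "__main__":
    runs = {
        ("B512_LP", "B512_LP.elf"): "Correct: 980 Wrong: 9020 Accuracy: 0.098",
        ("Xilinx DPU", "B512_LP.elf"): "Check failed: is_done cu timeout!",
    }
    for key, result in tabulate(runs).items():
        print(key, result)
```

A run that ends in the timeout abort shown above yields None, so failed and low-accuracy combinations stay distinguishable in the table.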

chhait07 commented 3 years ago

This time I took the .hwh file from the standard Xilinx image, created a .dcf file from it, and then generated an .elf file by passing --options "{'dcf':'<file>.dcf'}" to the vai_c_tensorflow command. With the resulting .elf file the application executes successfully. This tells me that my workflow for creating the DPU kernel is correct. I am using the same app_mt.py script to start the DPU kernel, so I do not think the script is the problem.

So, does that mean that my dpu hardware is not correct?

I would really appreciate any help. I am a student in Germany and I want to compare the execution of NNs on different DPU architectures/configurations for my master's thesis. It seems that I cannot figure out the cause of this problem by myself, so getting some support here would be very important to me.

mkmk001 commented 3 years ago

Hello, did you solve this problem? I get the same bad result as you.

Command line options:
 --image_dir : images
 --threads : 1
 --model : model_dir/customcnn.xmodel
Pre-processing 10000 images...
Starting 1 threads...
Throughput=3990.95 fps, total frames = 10000, time=2.5057 seconds
Correct:980, Wrong:9020, Accuracy:0.0980

chhait07 commented 3 years ago

Hi, no, I could not solve that problem. Xilinx Support just advised me to update my system to the most current Vitis AI version (1.3), because they do not investigate problems in older versions. I worked with version 1.2. I still have no idea why this problem occurred. Please let me know if you can solve it.

mahmoudazzam408 commented 3 years ago

Hello, yes, the problem was solved. The problem is a mismatch: all you have to do is update the Vitis runtime library to match the DPU image that you have flashed on the zcu104. Please let me know if you solved the issue. Kind regards, Mahmoud


bhargavin1872008 commented 2 years ago

When installing from requirements.txt, I get an error for coremltools. It says something like "couldn't find a version that satisfies the requirement tensorflow<=1.14 and tensorflow>=1.5 (from tfcoremltools -r requirements.txt) (from versions: 2.2.0, 2.2.1, 2.2.2, ..., 2.7.0rc0, 2.7.0rc1, ...)". Can someone help me with this? I also have a question: can we use Ubuntu 20.04, CUDA 11.7, and cuDNN 8.4.0 for this project, or does it only work with Ubuntu 18.04 and CUDA 10.0? Please help me with this; I have very little time.

Xilinx / Vitis-AI-Tutorials

Bad accuracy with compiled elf-file for custom DPU on the MNIST tutorial #20