fastmachinelearning / hls4ml

Machine learning on FPGAs using HLS
https://fastmachinelearning.org/hls4ml
Apache License 2.0
1.29k stars 418 forks

Error in export RTL by "config_array_partition" #260

Closed rubende closed 3 years ago

rubende commented 3 years ago

I am using Vitis HLS 2019.2 to export RTL (generating a .xo file) from this network (https://github.com/fastmachinelearning/hls4ml-tutorial/blob/master/part1_getting_started.ipynb).

When I try it, I get the following output:

```
ERROR: [HLS 200-642] The 'config_array_partition -maximum_size' command is not supported.
config_array_partition -maximum_size=4096 failed
command 'config_array_partition' returned error code
    while executing
"source /hls4ml_prj/myproject_prj/solution1/export.tcl"
    invoked from within
"hls::main /hls4ml_prj/myproject_prj/solution1/export.tcl"
    ("uplevel" body line 1)
    invoked from within
"uplevel 1 hls::main {*}$args"
    (procedure "hls_proc" line 5)
    invoked from within
"hls_proc $argv"
```

myproject_prj:solution1 Dec 10, 2020, 9:35:19 AM

Could you tell me how to solve it? I'm very new to Vitis HLS, so I don't know if it's a project configuration problem or one related to the hls4ml library.

On the other hand, my goal is to run the network on an Alveo U200; am I approaching something the wrong way?

Thank you.

vloncar commented 3 years ago

Use Vivado HLS, not Vitis HLS. We don't yet support the Vitis HLS flow to generate a .xo file.
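For reference, driving the (supported) Vivado HLS flow from Python looks roughly like this, following the tutorial linked above. This is a sketch only: `model` is the Keras model from the tutorial, and the U200 part number is an assumption that may need adjusting for your board.

```python
import hls4ml

# Sketch: 'model' is the trained Keras model from the linked tutorial.
# Generate a model-level hls4ml configuration from it.
config = hls4ml.utils.config_from_keras_model(model, granularity='model')

# hls4ml targets Vivado HLS here; the U200 part number is an assumption.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='hls4ml_prj',
    part='xcu200-fsgd2104-2-e',
)

# Run C synthesis and export the design (skipping C simulation).
hls_model.build(csim=False, synth=True, export=True)
```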

simplelins commented 3 years ago

> Use Vivado HLS, not Vitis HLS. We don't yet support the Vitis HLS flow to generate a .xo file.

@vloncar Are there any plans to support Vitis HLS?

vloncar commented 3 years ago

It is planned, but I don't have a timeline yet. If you are interested in running inference on an Alveo, did you look at Vitis AI?

rubende commented 3 years ago

> Use Vivado HLS, not Vitis HLS. We don't yet support the Vitis HLS flow to generate a .xo file.

Thank you, I did it with Vivado HLS, but now I "believe" the network is not being created with the right weights and biases, because when I run inference I always get zeros at the output. From what I've seen, the code that loads them is in an `ifndef SYNTHESIS` block, and taking it out of that block is not enough (it generates a lot of compilation errors). Am I misunderstanding something?

vloncar commented 3 years ago

Two possibilities:

* You're getting zeros because of the precision used. Try using more bits
* You're running the testbench without providing any input data, so it defaults to supplying zeros as input. See *_test.cpp to understand what it does

rubende commented 3 years ago

> Two possibilities:
>
> * You're getting zeros because of the precision used. Try using more bits
> * You're running the testbench without providing any input data, so it defaults to supplying zeros as input. See *_test.cpp to understand what it does

Sorry, I may have explained something incorrectly. I am not using a testbench; I am synthesizing the model into a .xo file. Then I'm passing input data with my host code (I've checked that the input is correct). I have also checked that the .txt files of weights and biases are not empty, and that the values are representable in my configuration (<16.6>).

As I have already indicated, the code skips the weight and bias loading part when synthesis is in progress, so it is generating the layers with these values at 0. How can I solve it? I can upload the kernel and host code if you need them.

Thank you very much.

vloncar commented 3 years ago

The values of the weights should still be present in the header files under the weights directory as static arrays. If the SYNTHESIS macro is defined, those should be used. Maybe Vivado/Vitis doesn't define that macro for the .xo flow. The loader function that you tried to use is not meant to be synthesized.

I'd be very interested in seeing if you got it to work. We don't support generating .xo (yet!), so it would be interesting to see what changes were required to make it work.

rubende commented 3 years ago

Okay, now I have seen the weights directory, and yes, I see the weights there correctly. I have also confirmed that the macro is declared. So... I don't understand what's going on. Do you know any way to confirm which weights, after applying the configured precision, are actually being used on the FPGA?

I can explain the steps and modifications I have made to create the xo file. I will try to solve this first to make sure everything is correct.

vloncar commented 3 years ago

There are profiling tools in hls4ml that you can use; see the later parts of the tutorial. If the profiling tools don't help, or you are unsure whether Vivado just puts zeros for the weights, I would suggest creating a simpler model and building up. For example, start with a "model" that has no weights (it just applies an activation to the inputs) to see if you get any correct output. Then create a 1-layer model with weights fixed to 1, and finally a model with weights set to some precision.
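The profiling mentioned above can be invoked roughly as follows. This is a sketch based on the tutorial flow; `model`, `hls_model`, and `X_test` come from that flow, and the argument names are worth double-checking against the hls4ml version in use.

```python
from hls4ml.model.profiling import numerical

# Sketch: compare the numerical ranges of weights and activations in the
# Keras model against the fixed-point types chosen in the hls4ml model.
# 'model', 'hls_model', and 'X_test' are assumed from the tutorial flow.
numerical(model=model, hls_model=hls_model, X=X_test)
```

The resulting plots show whether the chosen precision (e.g. <16,6>) actually covers the range of the weights, which is a quick way to rule out everything being truncated to zero.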

rubende commented 3 years ago

Well, it was a problem with my input data; solved. Thanks for the help. I have been able to run inference on the FPGA. Unfortunately, when I tried to use a more "realistic" model I hit the unrolling problem (https://github.com/fastmachinelearning/hls4ml/issues/216)... It is a pity, because the project is very interesting, but this makes it unfeasible for me at the moment. We will see if in the future (perhaps with Intel support, do you know anything about the state of that?) I can take more advantage of this library.

@vloncar, I leave you the steps that I have followed to generate a xo file in Vivado HLS 2019.2 and then use it in Vitis 2019.2:

1. I have followed this tutorial (https://github.com/fastmachinelearning/hls4ml-tutorial/blob/master/part1_getting_started.ipynb) up to the compilation step (not synthesis), so I will start from there. I have only changed my "fpga_part" in the code.
2. Create a new project in Vivado HLS. Add "myproject.cpp" as the top function and "myproject_test.cpp" as the testbench. Select the target board and check "Vitis Bottom Up Flow".
3. Modify "myproject.cpp":
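Once the .xo exists, linking it into an .xclbin for the U200 with the Vitis tools would look roughly like this. This is a sketch, not a verified recipe: the platform name is an assumption for a 2019.2 install, and the exact flags may differ per setup.

```shell
# Link the exported kernel object into a bitstream container for the Alveo U200.
# The platform name below is an assumption; check what is installed locally.
v++ -t hw \
    --platform xilinx_u200_xdma_201830_2 \
    -l -o myproject.xclbin \
    myproject.xo
```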

vloncar commented 3 years ago

Hi @rubende, thanks a lot for summarizing the changes required! I'll try to create a script that automates this to create Vitis projects.

Regarding the state of hls4ml and support for Intel: we are working on updating that code to the latest branch, but the initial implementation will mirror the functionality of the Xilinx one (meaning all weights are stored on-chip). We are interested in using accelerator cards with off-chip memory to offload weights, and we will support this type of computation in the future, so stay tuned.

simplelins commented 3 years ago

Hi @rubende @vloncar I have done some work on generating Vitis projects in my branch, please check it out. It works the same as Vivado HLS, but only a few models work, for example KERAS_3layer.json.
To use it, change the backend to vitis, or pass a parameter up front, as follows: `config = hls4ml.utils.fetch_example_model('KERAS_3layer.json', "vitis")`

rubende commented 3 years ago

Hi @simplelins, thank you, I will check it.

Thanks @vloncar for the information. I am thinking about the unrolling error, and there are things that I do not understand.

I am aware of the problem of loading the weights into the FPGA memory, but, honestly, I don't consider my network "big", or at least not big enough that it can't fit on a U200. I attach a model summary below. Reading the Xilinx forums, I get the feeling that there is an artificial limit on unrolling that has no relation to the physical capacity of the FPGA. Can you confirm whether this is true?

Also, I have tried to use "ReuseFactor" to work around this (at the cost of what I understand is a performance loss), but even with very high values (for example, a "ReuseFactor" of 2000) the error keeps coming up. Is this normal? Do you know any way to calculate the "ReuseFactor" that a specific network would need?
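As an aside on why an arbitrary value like 2000 may not behave as expected: roughly speaking, for a Dense layer with n_in x n_out weights, hls4ml wants the ReuseFactor to divide the number of multiplications evenly, and adjusts a requested value to a nearby valid one. The helper below is illustrative arithmetic only, not hls4ml code (the real validity rules are more involved).

```python
# Illustrative only: hls4ml's actual ReuseFactor validity check is more
# involved, but divisibility of n_in * n_out is the basic constraint.
def valid_reuse_factors(n_in, n_out):
    """All ReuseFactor values that evenly divide the multiplication count."""
    n_mult = n_in * n_out
    return [rf for rf in range(1, n_mult + 1) if n_mult % rf == 0]

def nearest_valid(rf, n_in, n_out):
    """The valid ReuseFactor closest to the requested one."""
    return min(valid_reuse_factors(n_in, n_out), key=lambda v: abs(v - rf))

# A 128x128 Dense layer has 16384 multiplications; 2000 is not a divisor,
# so it would be adjusted to the nearest value that is.
print(nearest_valid(2000, 128, 128))  # -> 2048
```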

Thank you.

[model summary attached]

vloncar commented 3 years ago

Hi @rubende since some of your weight arrays are larger than the partition limit (for example, 3x3x128x128), you need to use the "Resource" strategy. Furthermore, for CNNs you need to use io_stream. The ReuseFactor controls how many times each DSP will be (re)used. While hls4ml doesn't select the optimal ReuseFactor for you right now (it will in the future), it will make sure the nearest valid one is used, whatever you specify. So you can try setting it to some large value (beyond the number of multiplications in a layer) and running synthesis. This will use the minimum number of DSPs, and if it works, from there you can try reducing the ReuseFactor for specific layers.
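Putting those suggestions together in configuration form might look like the sketch below. The ReuseFactor value and output directory are placeholders, and `model` is assumed to be the Keras CNN in question.

```python
import hls4ml

# Sketch: per-layer-named config so individual layers can be tuned later.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

# 'Resource' strategy for large weight arrays; the ReuseFactor here is a
# deliberately large placeholder to start from minimal DSP usage.
config['Model']['Strategy'] = 'Resource'
config['Model']['ReuseFactor'] = 16384

# io_stream is required for CNNs, as noted above.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    io_type='io_stream',
    output_dir='hls4ml_prj',
)
```

From there, individual entries under `config['LayerName']` can be given smaller ReuseFactor values once the large-value build is known to work.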

rubende commented 3 years ago

Thanks for your advice, I think I've got this on the right track now.