fastmachinelearning / hls4ml

Machine learning on FPGAs using HLS
https://fastmachinelearning.org/hls4ml
Apache License 2.0

FPGA porting issues #282

Closed freddyD77 closed 1 year ago

freddyD77 commented 3 years ago

Has anybody successfully got any components of hls4ml working on an FPGA? I've tried to port simple GNN models and MLPs to my FPGA using hls4ml, and despite very promising HLS simulation results, the output I get when I actually put the generated firmware on my FPGA does not match and is wildly inaccurate. Just looking at the results coming out of the first dense layer of my MLP, the FPGA is already inaccurate.

I have troubleshot a lot on the Vivado side of things and do not believe there is anything I'm explicitly doing wrong there. All the papers and posts I've read about hls4ml only ever discuss HLS simulation results, and do not address efforts to get an actual working model onto an FPGA.

I would like to know if anybody has had any success putting any hls4ml component onto an FPGA.

thesps commented 3 years ago

Hi, yes we do run hls4ml NN IPs on FPGAs. Can you describe a bit more your workflow and setup? What FPGA/board are you targeting? What toolflow do you use to go from the IP to a bitfile - e.g. Vitis/SDAccel with an accelerator card, BD/IPI with Vivado, some custom flow?

Some guesses about the cause: the input data used in hardware differing from the simulation, or the control signals being set out of sync with the data (depending on the flow).

freddyD77 commented 3 years ago

I am using a ZCU102 Eval Board. In general, after HLS synthesis I use the Export RTL button (in Vivado HLS), take the generated 'ip' folder, and make a block diagram in Vivado referring to the IP block in that 'ip' folder. I develop my own embedded C code running on the Zynq processor to communicate with the IP. The drivers for the IP, generated by the Export RTL button, make the C code relatively simple to develop. I'm using Vivado 2018.3. I'm unfamiliar with SDAccel or BD/IPI. Would you suggest I use those?

freddyD77 commented 3 years ago

To specify further: when my block diagram in Vivado is ready, I run Synthesis, Implementation, and then Generate Bitstream. Then I export the bitstream, launch Vivado SDK to develop the C code, and program the FPGA through Vivado SDK.

thesps commented 3 years ago

The "block diagram" flow is what I meant by "BD/IPI", so I think that should be fine - as long as the blocks are connected correctly. Did you add some HLS interface pragmas to specify the type of interface you're using, e.g. axilite or axi stream? Another thought is whether your C code encodes the data correctly for whatever ap_fixed<W,I> you chose.

freddyD77 commented 3 years ago

Yes, I changed the interface to axilite, and I believe I do encode the data correctly. I take my input float data and multiply it by 2^(W-I) before sending it as a u32 value, effectively left-shifting it to account for the fractional bits that the IP will interpret. I've troubleshot this extensively, although it would be great to verify against an example.
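
A minimal sketch of the encode/decode arithmetic I mean, assuming ap_fixed<W,I> with W=16, I=6 (so W-I = 10 fractional bits); the names and values here are illustrative, not taken from the generated drivers:

```cpp
#include <cstdint>
#include <cmath>

const int FRAC_BITS = 16 - 6;  // W - I fractional bits

// Host-side encode: float -> raw fixed-point bits (only the low W bits matter).
uint32_t encode_fixed(float x) {
    return (uint32_t)(int32_t)std::lround(x * (1 << FRAC_BITS));
}

// Host-side decode: raw 16-bit result back to float (sign-extend first).
float decode_fixed(uint32_t raw) {
    int16_t v = (int16_t)(raw & 0xFFFF);
    return (float)v / (float)(1 << FRAC_BITS);
}
```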

thesps commented 3 years ago

Okay that does sound correct. What is W in your case? I found when using the default W=16 I had to pack two words into one u32 value for the axilite interface.
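
For reference, a sketch of the kind of packing I mean; the low/high half-word ordering below is an assumption, so it's worth checking against the generated HLS driver headers for your IP:

```cpp
#include <cstdint>

// Pack two 16-bit fixed-point words into one u32 for the AXI-Lite array.
uint32_t pack_pair(uint16_t first, uint16_t second) {
    return (uint32_t)first | ((uint32_t)second << 16);
}

// Split one u32 read back from the IP into its two 16-bit words.
void unpack_pair(uint32_t word, uint16_t &first, uint16_t &second) {
    first  = (uint16_t)(word & 0xFFFF);
    second = (uint16_t)(word >> 16);
}
```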

I think the closest thing I have to what you describe is this script of the block diagram - from a work in progress development to automate the type of bitfile creation you're describing: https://github.com/thesps/hls4ml/blob/pynq/hls4ml/templates/pynq/pynq_design.tcl And the corresponding (Python) host code: https://github.com/thesps/pynq_hls4ml/blob/main/part2_pynq.ipynb

freddyD77 commented 3 years ago

Yes, I also packed two words into one u32 value. W=16, I=6.

freddyD77 commented 3 years ago

I do the same encoding and decoding as found in the Python code. That Jupyter notebook also describes how the I/O and the encoding/decoding consume most of the processing time, but in other attempts I've made the IP take float values and do the float to ap_fixed<W,I> casting within the IP. That was a great improvement to the processing time.
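
Roughly what I mean, as a hypothetical sketch (the wrapper name and sizes are mine, not generated code):

```cpp
#include "ap_fixed.h"

typedef ap_fixed<16, 6> input_t;
typedef ap_fixed<16, 6> result_t;
#define N_INPUT  16   // placeholder sizes
#define N_OUTPUT 5

// Top level takes plain floats; the float <-> fixed conversion happens
// inside the IP instead of on the processor.
void myproject_float(float input_f[N_INPUT], float output_f[N_OUTPUT]) {
    #pragma HLS INTERFACE s_axilite port=input_f
    #pragma HLS INTERFACE s_axilite port=output_f
    #pragma HLS INTERFACE s_axilite port=return

    input_t input[N_INPUT];
    result_t output[N_OUTPUT] = {};
    for (int i = 0; i < N_INPUT; i++)
        input[i] = (input_t)input_f[i];   // cast float -> ap_fixed on the FPGA

    // ... run the fixed-point network to fill output[] ...

    for (int i = 0; i < N_OUTPUT; i++)
        output_f[i] = (float)output[i];   // cast back to float for the host
}
```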

thesps commented 3 years ago

Yeah I've tried that, it was something like a factor 2 faster with float on the interface. It may become an option in that development branch...

Regarding your issue, I'm stuck on what to suggest without more detailed debugging. We have run our IPs in setups very similar to what you're using and see correct output, but I really don't know what part of your flow to look at next, since it seems like you have already done most of the things I could think of.

freddyD77 commented 3 years ago

Interesting. Is there any project that works end to end that I could look at? It doesn't need to be axilite, but I don't have a PYNQ board. Anything that I could take from Python or HLS through to my local FPGA, to compare against what I've done?

And thanks again for the timely and informative replies, I really appreciate it!

thesps commented 3 years ago

You might be able to try the flow in the PYNQ repo here: https://github.com/thesps/pynq_hls4ml - the part 1 notebook - to create a bitfile. The network it trains is our basic three-hidden-layer jet tagging model from the first paper, which also features in the tutorials. That uses the branch of hls4ml I mentioned, and it is fixed to target the Pynq-Z2 for now. It should be possible, however, to adapt the board design script from the Pynq-Z2 to the ZCU102. I may be able to try creating the relevant script for your part tomorrow.

freddyD77 commented 3 years ago

Thanks! I will look further into it.

freddyD77 commented 3 years ago

Wanted to give you a heads-up with some problems I had following the notebooks you referred to:

1) Since this was my first PYNQ project, I didn't have the PYNQ board files downloaded and placed into Vivado beforehand. This led to an error when trying to generate the bitfile in part 1. I also had to change the line in pynq_design.tcl that sets the board part, to match what Vivado does (on my machine) when I select the PYNQ board for a project.

2) The print_dt function prints misleading results at the 5000-sample checkpoints in part 2. Since it uses len(X) instead of just the 5000 samples processed so far, the reported inferences/s is much higher than it actually is.

I'm trying to port a smaller version of the GNN found in https://github.com/vesal-rm/hls4ml/tree/graph_pipeline/example-prjs/graph/gnn_simple. Hopefully, using this PYNQ setup, I will get better results than my efforts with the ZCU102.