fastmachinelearning / hls4ml

Machine learning on FPGAs using HLS
https://fastmachinelearning.org/hls4ml
Apache License 2.0

Weights in Verilog file differ from those generated by the model #893

Closed freaksie closed 1 month ago

freaksie commented 8 months ago


Quick Summary

While building an HLS project, C simulation reports "Unable to open input/predictions file, using default input". [screenshot] In "myproject_test.cpp" it tries to find the .dat files in "tb_data/", which contains only csim_result.log.

Also, when it falls back to a default value for the input, why does it pick only one value? The input to my neural network model has shape (2,).

And going through the main Verilog file, "myproject.v", I see it takes only one [31:0] input_2_V. [screenshot]

Please let me know if I am missing something.

I have attached the architecture of the neural network for a better understanding of the problem. [screenshot]

Inside the first hidden-layer module of the network: [screenshot]

(26'd406) is supposed to be my weight for node 4 and input feature 1. "$signed(r_V_10_0_4_fu_103_p1)" holds the first input feature, a 16-bit fixed-point value, but I can't find the weight (26'd406) in the original w2.h file.

Below is w2.h file

//Numpy array shape [2, 8]
//Min -0.643949806690
//Max 0.849546194077
//Number of zeros 0

#ifndef W2_H_
#define W2_H_

#ifndef __SYNTHESIS__
hiddenlayer1_weight_t w2[16];
#else
hiddenlayer1_weight_t w2[16] = {-0.2819478810, -0.2423544228, -0.4932917356, 0.3381284177, 0.3965316117, 0.4984088242, 0.8495461941, 0.5675451756, 0.2973865569, -0.4815380573, -0.2879633009, 0.3123586774, -0.6439498067, 0.1229862347, -0.3351014256, -0.1838431954};
#endif

#endif

Question Summary

1) Why doesn't tb_data/ have an input file?
2) Why is only one random value taken, while the input layer has 2 inputs?
3) Why is the input to the main module a single [31:0] instead of two?
4) Why don't the weights match the ones in the model?
5) Why are these weights 26 bits wide (26'd)? According to the hls4ml config they should be <16,6>.

bo3z commented 8 months ago
  1. If you want to verify your model against some pre-determined inputs/outputs, you need to specify it. Have a look at the converter function: https://github.com/fastmachinelearning/hls4ml/blob/b67e730dcc0c28e8441f787a01045cbd9c8cf6b4/hls4ml/converters/__init__.py#L177 and the input_data/output_data parameters. If not specified, hls4ml uses a default input of all zeros. The converter function will internally store the inputs/outputs to tb_data if you specify it, either as NumPy array or DAT file.
  2. (and 3) If you kept the default precision, it is 16 bits. All the inputs are concatenated into one wide word in hardware: 2 inputs x 16 bits = a 32-bit input. The value displayed below INFO (0.756...) is, I believe, the output of the all-zeros prediction, but please verify it by comparing with Keras.
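
For reference, passing pre-determined testbench data through the converter might look like the following. This is a minimal sketch based on the call that appears later in this thread; the file paths and output directory are illustrative:

```python
import hls4ml

# Sketch: supply testbench inputs/outputs so CSim does not fall back to
# the default all-zeros input. NumPy arrays (.npy) or .dat files both work;
# the converter stores them into tb_data/ internally.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='my_project',         # illustrative path
    input_data_tb='xtest.npy',       # illustrative path
    output_data_tb='ytest.npy',      # illustrative path
)
```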
freaksie commented 8 months ago

@bo3z Thanks for the response. This clears up all the questions I had. We are implementing it on the Xilinx ZCU216 board, but I see that hls4ml doesn't support Vivado versions after 2020.1, and the ZCU216 is a newer board that works best with the latest version of Vivado.

With the Vivado version we currently use (2018.3), there are many dependency errors during co-simulation. Are you planning to roll out a newer version of hls4ml that supports more recent Vivado releases?

Thank you again.

freaksie commented 8 months ago

Questions 1-3 are solved, thanks to @bo3z. Questions 4 and 5 remain.

Thanks in advance

bo3z commented 8 months ago

4 and 5 are significantly harder questions. In short: it's the HLS compiler.

Depending on where the weights are stored (Latency or Resource strategy), the HLS compiler will apply different optimisations. From the screenshot, it seems you are using the Latency strategy: weights are stored in registers and loops are unrolled while limiting the number of parallel products. In this case the compiler can do several things, e.g. reduce the precision if a weight can be represented in fewer bits than allotted; reorganise the order of multiplications to shorten the adder trees; or merge weights into a single, larger word. In general, the compiler optimises quite aggressively when loops are fully unrolled and parameters sit in registers. While you can certainly spend time decoding which weight maps to which word in the generated Verilog, it becomes impractical quite quickly (and the mapping changes for different neural networks, weight values, etc.). As long as the design produces equal outputs in CSim (C++) and CoSim (RTL) simulation, you can assume the compiler did a good job. To check equality, use the validation flag in build.

On the other hand, the Resource strategy is more predictable: it transposes the weights and stores them in block RAM, whose contents can be inspected in a predictable manner. A paper with an explanation is here: https://arxiv.org/abs/2308.05170
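
The strategy itself is chosen in the hls4ml config dictionary. A minimal sketch, using the `config_from_keras_model` call shown later in this thread (the layer name 'fc1' is illustrative):

```python
import hls4ml

# Sketch: select the implementation strategy, globally or per layer.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')
config['Model']['Strategy'] = 'Resource'      # default is 'Latency'
# Per-layer override for a single Dense layer (name is illustrative):
config['LayerName']['fc1']['Strategy'] = 'Resource'
```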

freaksie commented 8 months ago

Oh, I understand. So the HLS compiler is responsible for the weight representation. Thanks @bo3z, this clears many things up. And that paper is really insightful.

A few more questions, though. 1) According to hls4ml the input shape is 2 x <16,6>, concatenated into a 32-bit word. Now assume my input is [-5.306, -1.689], with six bits for the signed integer part and 10 bits for the fractional part. I computed 111101=-5; 0100110000=306; 111111=-1; 1010110001=689, giving the 32 bits 11101101001100001111111010110001, whose integer equivalent is 3979411121.

Now I am using a cocotb testbench. [screenshot] Is this the correct way to represent the signed input value?

2) The output I got from the network is 16 bits, i.e. 0000000000111001, whose integer equivalent is 57, which is not the correct output for the above input. The correct output is 0.0527344, which also matches csim_result.log. What am I doing wrong here?

3) How do I represent inputs like [-5.0288, -3.002] or [-0.294, 0.201]?

For reference, the network architecture and other details are in the opening comment.

Let me know if more information is required.

Thanks in advance.

vloncar commented 8 months ago

It doesn't really work like that. The numbers are represented as integer+fractional parts, but for 5.306 the fractional part is not 306, it is 0.306. You use 10 bits for that, giving you 1024 steps between 0 and 1 (the increment is ~0.000976562). So instead of 0.306 you represent 0.306 ÷ 0.000976562 = 313.344.... Thus you store 313 and lose a bit of precision (the number becomes 5.305664). 313 is 0100111001 in 10 bits in two's complement. Combining this with 5, which is 000101 in 6 bits, you get the final number 000101 0100111001 (without the space, of course).

But this only covers positive numbers; what about negative ones? Let's take -5.306. To get the integer+fractional parts you need to write it as -6 + 0.694. Using the formula above, 0.694 is represented as 710, or 1011000110. So the final number is 111010 1011000110.

In a similar way -1.689 is 111110 0100111110.

If this math is tricky, a shortcut I use is to print out the values from the generated C++ testbench, since they are already converted to ap_fixed types before the top function is called. You can access individual bits with the [] operator; the least significant bit has index 0. For example:

ap_fixed<16,6> a = -1.689;
for (int i = a.width-1; i >= 0 ; i--) {
    printf("%d", a[i].get());
}

prints out 1111100100111110.
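
The same arithmetic can also be reproduced offline. Below is a Python sketch, assuming the floor-style (AP_TRN) quantization that ap_fixed applies by default, and, for the decoding helper, that the output type is the default <16,6> (verify against your generated project):

```python
import math

def to_fixed_bits(x, width=16, int_bits=6):
    """Two's-complement bit string of x as ap_fixed<width,int_bits>,
    assuming AP_TRN-style (floor) quantization."""
    frac_bits = width - int_bits
    q = math.floor(x * (1 << frac_bits))  # quantize: floor(x * 2^frac_bits)
    q &= (1 << width) - 1                 # wrap into two's complement
    return format(q, f"0{width}b")

def from_fixed_bits(bits, int_bits=6):
    """Decode a two's-complement ap_fixed bit string back to a float."""
    width = len(bits)
    q = int(bits, 2)
    if q >= 1 << (width - 1):             # sign bit set: negative value
        q -= 1 << width
    return q / (1 << (width - int_bits))

print(to_fixed_bits(-5.306))                # 1110101011000110
print(to_fixed_bits(-1.689))                # 1111100100111110
print(from_fixed_bits("0000000000111001"))  # 0.0556640625 (raw 57 / 1024)
```

Under a <16,6> reading, the raw output word 57 from question 2 above would decode to 57/1024 ≈ 0.0557, a fixed-point fraction rather than the integer 57.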

freaksie commented 8 months ago

Thank you so much @vloncar. I was also confused about whether to represent values in two's complement or sign-magnitude, but your answer has made that clear. And thanks for the code snippet too.

freaksie commented 1 month ago

@bo3z

Is there a way to increase the size of the LUT? I believe the default is 1024, and I want to increase it to 4096. Moreover, I am using the softmax function, which in turn uses LUTs for inv_exp() and exp(). Can I increase the LUT size for inv_exp() as well?

Thanks in advance.

vloncar commented 1 month ago

Hi Neel, this is doable with the table_size config parameter, but it should really be a last resort. It's rarely needed, and there are many alternatives that would be a better choice.

Check out the hls4ml-tutorial, where all these things are showcased.

freaksie commented 1 month ago

hello @vloncar

Thanks for the response.

Where can I pass this table_size parameter?

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='../lbl',
    part='xc7vx485tffg17612',
    input_data_tb='../Data/xtest.npy',
    output_data_tb='../Data/ytest.npy',
    clock_period=2
)

Here?

freaksie commented 1 month ago

@vloncar

There is a deviation from the original output in the softmax layer:

Original output: [[0.80757248, 0.1678711, 0.02455642]]
VHDL output: [[1, 0.135742, 0.0185547]]

Let me know if you need more information. Also, can you tell me where to use the table_size argument?

vloncar commented 1 month ago
config = hls4ml.utils.config_from_keras_model(keras_model, granularity='name')
config['LayerName']['your_softmax_layer_name']['TableSize'] = whatever_you_want_but_remember_this_is_a_bad_idea
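
Putting the pieces together, a full reconversion flow might look like the sketch below. The layer name 'softmax' and the table size 4096 are illustrative; the config keys and converter arguments are taken from earlier in this thread:

```python
import hls4ml

# Sketch: raise the softmax LUT size, then rebuild the HLS model.
# 'softmax' is an illustrative layer name; use the actual name from the
# Keras model summary.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')
config['LayerName']['softmax']['TableSize'] = 4096  # default is 1024

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='../lbl',  # illustrative path
)
hls_model.compile()
```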