fastmachinelearning / hls4ml

Machine learning on FPGAs using HLS
https://fastmachinelearning.org/hls4ml
Apache License 2.0

Multidimensional input support for Dense layer, Resource vs resource in Conv2D layer, lower reusefactor results in almost-0 resource usage #782

Open vandenBergArthur opened 1 year ago

vandenBergArthur commented 1 year ago

TL;DR at the bottom

Hi all. As mentioned in my other post #747, I am trying to implement a graph convolution. So I need a matrix multiplication A * B = C, where A is my input tensor and B is an adjacency matrix. To realize this, I have created two alternatives that use supported Keras layers, so that I can use hls4ml to deploy this model. (We are also trying to use the extension API to implement the whole model.)
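
In plain NumPy, the computation I am trying to reproduce looks roughly like this (just a sketch of the math, not the deployed code):

import numpy as np

nodes, in_channels, out_channels = 25, 32, 32

X = np.random.rand(nodes, in_channels)          # input tensor A: per-node features
W = np.random.rand(in_channels, out_channels)   # 1x1 convolution / dense1 kernel
adj = np.random.randint(0, 2, size=(nodes, nodes)).astype(float)  # adjacency matrix B

H = X @ W          # feature transform (dense1)
C = H.T @ adj      # matrix multiplication with the adjacency matrix (permute + dense2)
print(C.shape)     # (out_channels, nodes)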

Alternative 1

In the first alternative, I simply use Dense layers to mimic the matrix multiplication.

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Permute
from tensorflow.keras.models import Model

in_channels = 32
out_channels = 32
nodes = 25
input_x = Input(shape=(nodes,in_channels), name = 'input_x')

# 1x1 Convolution of incoming frame
dense1 = Dense(units=out_channels, use_bias=True, name='dense1')(input_x)

# Switch dimensions of nodes & out_channels to set up the correct matmul operation
perm1 = Permute((2,1), name='perm1')(dense1)

# Use Dense layer to perform matrix multiplication with the adjacency matrix
# Units = number of columns of adj matrix
dense2 = Dense(units=nodes, use_bias=False, kernel_initializer=tf.keras.initializers.Constant(adj1),name='dense2')(perm1)

model = Model(inputs=input_x, outputs=dense2)
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_x (InputLayer)        [(None, 25, 32)]          0         

 dense1 (Dense)              (None, 25, 96)            3168      

 perm1 (Permute)             (None, 96, 25)            0         

 dense2 (Dense)              (None, 96, 25)            625       

=================================================================
Total params: 3,793
Trainable params: 3,793
Non-trainable params: 0

Where adj1 is a tensor that represents the adjacency matrix:

# Create a 25x25 tensor with random values of 0 or 1
adj1 = tf.random.uniform(shape=(25, 25), minval=0, maxval=1)

# Round the values to the nearest integer (0 or 1)
adj1 = tf.math.round(adj1)

# Set the diagonal elements to zero
adj1 = tf.linalg.set_diag(adj1, tf.zeros(25))

# Convert to numpy
adj1 = adj1.numpy()

# Make tensor symmetric by adding its transpose to its upper triangular part
adj1 = np.triu(adj1) + np.triu(adj1, 1).T

For a starting configuration, I used the default precision and a reuse factor (RF) of 64, as in tutorial 7, where a model with Dense layers is deployed to the PYNQ-Z2 board:

config = hls4ml.utils.config_from_keras_model(model, granularity='name')
config['Model']['Strategy'] = 'Resource'

for layer in config['LayerName'].keys():
    config['LayerName'][layer]['Strategy'] = 'Resource'
    config['LayerName'][layer]['ReuseFactor'] = 64

print("-----------------------------------")
plotting.print_dict(config)
print("-----------------------------------")

hls_model = hls4ml.converters.convert_from_keras_model(model,
                                                       hls_config=config,
                                                       output_dir='/home/arthur/Documents/Testing/configs/ourModel_32_resource/',
                                                       backend='VivadoAccelerator',
                                                       board='pynq-z2')
hls_model.compile()

But when I build this model with `hls_model.build(csim=False, export=True)`, I get a rather odd output:

================================================================
== Utilization Estimates
================================================================
* Summary: 
+-----------------+---------+-------+--------+-------+-----+
|       Name      | BRAM_18K| DSP48E|   FF   |  LUT  | URAM|
+-----------------+---------+-------+--------+-------+-----+
|DSP              |        -|      -|       -|      -|    -|
|Expression       |        -|      -|      40|    661|    -|
|FIFO             |        -|      -|       -|      -|    -|
|Instance         |        -|      -|    9118|   2949|    -|
|Memory           |        -|      -|       -|      -|    -|
|Multiplexer      |        -|      -|       -|    105|    -|
|Register         |        0|      -|     969|    192|    -|
+-----------------+---------+-------+--------+-------+-----+
|Total            |        0|      0|   10127|   3907|    0|
+-----------------+---------+-------+--------+-------+-----+
|Available        |      280|    220|  106400|  53200|    0|
+-----------------+---------+-------+--------+-------+-----+
|Utilization (%)  |        0|      0|       9|      7|    0|
+-----------------+---------+-------+--------+-------+-----+

I compared these results with those of the untrained model from the tutorial, and my resource usage is suspiciously low.

Model from tutorial:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

model = Sequential()
model.add(QDense(64, input_shape=(16,), name='fc1',
                 kernel_quantizer=quantized_bits(6,0,alpha=1), bias_quantizer=quantized_bits(6,0,alpha=1),
                 kernel_initializer='glorot_uniform'))
model.add(QActivation(activation=quantized_relu(6), name='relu1'))
model.add(QDense(32, name='fc2',
                 kernel_quantizer=quantized_bits(6,0,alpha=1), bias_quantizer=quantized_bits(6,0,alpha=1),
                 kernel_initializer='glorot_uniform'))
model.add(QActivation(activation=quantized_relu(6), name='relu2'))
model.add(QDense(32, name='fc3',
                 kernel_quantizer=quantized_bits(6,0,alpha=1), bias_quantizer=quantized_bits(6,0,alpha=1),
                 kernel_initializer='glorot_uniform'))
model.add(QActivation(activation=quantized_relu(6), name='relu3'))
model.add(QDense(5, name='output',
                 kernel_quantizer=quantized_bits(6,0,alpha=1), bias_quantizer=quantized_bits(6,0,alpha=1),
                 kernel_initializer='glorot_uniform'))
model.add(Activation(activation='softmax', name='softmax'))
model.summary()
Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 fc1 (QDense)                (None, 64)                1088      

 relu1 (QActivation)         (None, 64)                0         

 fc2 (QDense)                (None, 32)                2080      

 relu2 (QActivation)         (None, 32)                0         

 fc3 (QDense)                (None, 32)                1056      

 relu3 (QActivation)         (None, 32)                0         

 output (QDense)             (None, 5)                 165       

 softmax (Activation)        (None, 5)                 0         

=================================================================
Total params: 4,389
Trainable params: 4,389
Non-trainable params: 0

I have found out that the 3D input to the Dense layer is probably the reason. I tested a similar model with a 2D input shape, and the resource usage seems more normal.

So, does hls4ml not support Dense layers with 3D inputs? In Keras itself it should be allowed:

> N-D tensor with shape: (batch_size, ..., input_dim). The most common situation would be a 2D input with shape (batch_size, input_dim).
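
A quick check in plain Keras confirms this: Dense on a 3D input just applies its kernel to the last axis:

import numpy as np
import tensorflow as tf

layer = tf.keras.layers.Dense(units=4, use_bias=False)
x = np.random.rand(1, 25, 32).astype('float32')   # (batch, nodes, in_channels)
print(layer(x).shape)                             # (1, 25, 4): kernel applied along the last axis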

Alternative 2

a = Input(shape=(1, nodes, in_channels), name='input_x')
b = Conv2D(filters=out_channels, kernel_size=1, strides=1, padding='valid', data_format='channels_last', use_bias=True, name='conv2d_1x1')(a)
b = Reshape(target_shape=(nodes, out_channels), name='reshape1')(b)
c = Permute((2, 1), name='permute1')(b)
c = Reshape(target_shape=(out_channels, nodes, 1), name='reshape2')(c)
d = Conv2D(filters=nodes, kernel_size=(1, nodes), strides=1, padding='valid', data_format='channels_last', use_bias=False, kernel_initializer=tf.keras.initializers.Constant(adj1), name='matmul')(c)
model = Model(inputs=a, outputs=d)

If I then configure this model with the same settings as in `alternative 1`, I get this error:

In file included from firmware/myproject.cpp:4:
firmware/parameters.h:28:44: error: ‘Resource’ is not a member of ‘nnet’; did you mean ‘resource’?
   28 |     static const unsigned strategy = nnet::Resource;
      |                                            ^~~~~~~~
      |                                            resource
firmware/parameters.h:59:44: error: ‘Resource’ is not a member of ‘nnet’; did you mean ‘resource’?
   59 |     static const unsigned strategy = nnet::Resource;
      |                                            ^~~~~~~~
      |                                            resource
g++: error: myproject.o: No such file or directory

Changing `Resource` to `resource` does fix the problem, but I thought this was worth pointing out.
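
In case it helps anyone hitting the same error, this is roughly the patch I applied by hand (a throwaway workaround, not a proper fix; the path uses the output_dir from alternative 1 for illustration):

from pathlib import Path

# quick hack: lowercase the strategy in the generated header before rebuilding
params = Path('/home/arthur/Documents/Testing/configs/ourModel_32_resource/firmware/parameters.h')
params.write_text(params.read_text().replace('nnet::Resource', 'nnet::resource'))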

Then, in the `Utilization Estimates` I noticed that my DSP usage was rather low, so I tried decreasing the reuse factor to 25 (instead of 64). But then the resource usage was suspiciously low again, just like in alternative 1.

### TL;DR

- Is a 3D input to a Dense layer unsupported by hls4ml?
- I get the following error when using a Conv2D layer:

In file included from firmware/myproject.cpp:4:
firmware/parameters.h:28:44: error: ‘Resource’ is not a member of ‘nnet’; did you mean ‘resource’?
   28 |     static const unsigned strategy = nnet::Resource;
      |                                            ^~~~~~~~
      |                                            resource
firmware/parameters.h:59:44: error: ‘Resource’ is not a member of ‘nnet’; did you mean ‘resource’?
   59 |     static const unsigned strategy = nnet::Resource;
      |                                            ^~~~~~~~
      |                                            resource
g++: error: myproject.o: No such file or directory


- Changing the reuse factor from 64 to 25 results in almost zero resource utilization, while changing it to 32 is fine. Why do `Latency` and `Resource` have different sets of valid reuse factors?

I added a Jupyter notebook file that contains all the code.
[github_issue.zip](https://github.com/fastmachinelearning/hls4ml/files/11412652/github_issue.zip)
vloncar commented 1 year ago

Dense on 3D input is supported. Behind the scenes, it will result in exactly the same code as in your 2nd alternative, i.e., pointwise Conv2D. The reason you don't see DSPs being used is because of the number of bits involved. With less than 9 bits, the compiler chooses not to allocate DSPs and instead performs the multiplication in LUTs.
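
If you want to see DSPs being inferred, one thing you could try (just a sketch, the exact widths are up to you) is regenerating the config with a wider precision:

import hls4ml

# widen the default fixed-point type so the products are at least 9 bits wide
# and can be mapped onto DSP48s
config = hls4ml.utils.config_from_keras_model(
    model, granularity='name', default_precision='ap_fixed<18,8>'
)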

vandenBergArthur commented 1 year ago

> Dense on 3D input is supported. Behind the scenes, it will result in exactly the same code as in your 2nd alternative, i.e., pointwise Conv2D. The reason you don't see DSPs being used is because of the number of bits involved. With less than 9 bits, the compiler chooses not to allocate DSPs and instead performs the multiplication in LUTs.

I understand that DSPs are not used when the number of bits is less than 9. But I don't understand how changing the reuse factor from 64 to 25 changes the resources from this: [utilization screenshot] to this: [utilization screenshot]. Is it because 25 is an invalid reuse factor for the Resource strategy?

I am using 0.7.0, but this error is still present on the main branch:

In file included from firmware/myproject.cpp:4:
firmware/parameters.h:28:44: error: ‘Resource’ is not a member of ‘nnet’; did you mean ‘resource’?
   28 |     static const unsigned strategy = nnet::Resource;
      |                                            ^~~~~~~~
      |                                            resource
firmware/parameters.h:59:44: error: ‘Resource’ is not a member of ‘nnet’; did you mean ‘resource’?
   59 |     static const unsigned strategy = nnet::Resource;
      |                                            ^~~~~~~~
      |                                            resource
g++: error: myproject.o: No such file or directory

Anyway, thanks for the insights @vloncar

vloncar commented 1 year ago

You can see in the report exactly what the DSP is used for. When using the resource strategy, the weights will be stored in BRAM, and there will be accounting logic to access the right BRAM and fetch the right element. This arithmetic requires multiplication. Verify in the logs.

vandenBergArthur commented 1 year ago

> You can see in the report exactly what the DSP is used for. When using the resource strategy, the weights will be stored in BRAM, and there will be accounting logic to access the right BRAM and fetch the right element. This arithmetic requires multiplication. Verify in the logs.

Sorry for the many questions. But in the paper I read (Fast convolutional neural networks on FPGAs with hls4ml), it is mentioned that the BRAM consumption does not depend on the reuse factor. So I don't understand how changing the RF drops the BRAM from 18 to 0.

Also, I've looked into the log, but I cannot find what exactly the DSP is used for. My apologies, but I'm still a novice in the FPGA field.

(Sorry for closing and reopening this issue all the time; it closes automatically with my comment and I don't know how to change that.)

vloncar commented 1 year ago

The statement from the paper refers to that particular model, not the general case. When using the resource strategy, the weights are partitioned with an ARRAY_RESHAPE pragma (read about it in the Xilinx docs) with a block factor of ceil(n_in * n_out / reuse_factor), where n_in and n_out are the number of input and output neurons of a fully-connected layer, or the number of input features times the kernel size and the number of output filters for convolutional layers. We don't enforce the resource allocated for these arrays, though it is assumed that they go to BRAMs if sufficiently large. This is controlled by a heuristic inside the Xilinx compiler (a black box), and we chose to trust it rather than expose the setting to the user (though we may do that in the future). The heuristic may choose to implement everything in LUTs, hence you no longer see BRAMs allocated. If BRAMs are allocated, they will be affected by the reshape pragma, so some arithmetic to index into them is required, especially if the reuse factor is not a power of two.

Check the report in detail to see exactly what the DSP is used for. You are only looking at the big summary, but there are per-layer reports. You can browse them via the GUI if you run vivado_hls -p myproject_prj (myproject_prj is generated in the output directory when you call build()), or you can view the .rpt text files buried in the myproject_prj directory. Start with the top-level report and go deeper into the layers to see which resources correspond to them. You can also use the analysis view of the GUI to match source lines to resources, though the matching is not perfect.
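
To make the block factor concrete, here is the arithmetic for the 25x25 kernel of your dense2/matmul layer (a quick sketch):

import math

# block factor of the ARRAY_RESHAPE pragma: ceil(n_in * n_out / reuse_factor)
n_weights = 25 * 25   # the dense2 / matmul kernel has 625 weights
for rf in (25, 32, 64):
    print(rf, math.ceil(n_weights / rf))
# -> 25: 25, 32: 20, 64: 10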

If you used an invalid reuse factor, the tool should report it during conversion. A quick glance at the log in the shared notebook reveals that you also miss timing. I would consider that a more important issue. You should experiment to find what's causing it (play with the reuse factor).

Oh, and the paper you quote refers to the older implementation, which you aren't using, but everything I said above is also valid for the current default implementation.

vandenBergArthur commented 1 year ago

Hi, thank you so much for the insightful reply @vloncar! However, by being creative with the supported Keras layers, it seems I will not be able to successfully implement the whole model. So we want to use the extension API, because it seems like the only option we have left. We have two files that perform the calculations: one in Python and one in C++. But when trying to implement the model as a Keras model, it seems that our model is too big for the Vivado backend (based on my other issues). Using the VivadoAccelerator backend resulted in a synthesized model (a part of the entire model). So my question is: can we use the VivadoAccelerator backend with the extension API? All the examples I have seen (KLLoss and the example on the documentation page) use the regular Vivado backend.

vloncar commented 1 year ago

You can use the extension API with the VivadoAccelerator backend. If your model doesn't work in Vivado but works in VivadoAccelerator, you're probably doing something wrong, since there should be no differences in the generated model architecture and the HLS it uses.