fastmachinelearning / hls4ml

Machine learning on FPGAs using HLS
https://fastmachinelearning.org/hls4ml
Apache License 2.0
1.17k stars 388 forks

LUT utilization exceeds 100% for CNN example (part6_cnns.ipynb) for ZCU104 Ultrascale+ MPSoCs #546

Closed wilfredkisku closed 1 year ago

wilfredkisku commented 2 years ago

I have been trying to recreate models that can be deployed effectively on the ZCU104 UltraScale+ MPSoC (Zynq platform). These are CNN models for CV tasks such as classification and object detection.

While synthesis for the Xilinx Alveo U200 and U250 Data Center Accelerator Cards (e.g., xcu250-figd2104-2L-e) completes without this issue, boards with fewer resources are constrained, as utilization easily exceeds 100%, generally for DSP blocks and LUTs.

The utilization estimate below is for a classification model with only 14,210 parameters, of which 13,886 are trainable (the operations include Conv2D, ReLU, pooling, FC, and Softmax). Is there a way to accommodate larger parameter counts and deeper networks on low-end cards (without losing on latency)?

================================================================
== Utilization Estimates
================================================================
* Summary: 
+-----------------+---------+-------+--------+--------+-----+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  | URAM|
+-----------------+---------+-------+--------+--------+-----+
|DSP              |        -|      -|       -|       -|    -|
|Expression       |        -|      -|       0|      32|    -|
|FIFO             |      128|      -|    7868|   20924|    -|
|Instance         |        3|    389|   44498|  267242|    -|
|Memory           |        -|      -|       -|       -|    -|
|Multiplexer      |        -|      -|       -|      36|    -|
|Register         |        -|      -|       6|       -|    -|
+-----------------+---------+-------+--------+--------+-----+
|Total            |      131|    389|   52372|  288234|    0|
+-----------------+---------+-------+--------+--------+-----+
|Available        |      624|   1728|  460800|  230400|   96|
+-----------------+---------+-------+--------+--------+-----+
|Utilization (%)  |       20|     22|      11|     125|    0|
+-----------------+---------+-------+--------+--------+-----+

Please provide suggestions in this regard. Thanks.

jmduarte commented 2 years ago

Hi @wilfredkisku what bit width are you using?

A simple way to reduce resources is to quantize the layers more aggressively (e.g., 8-bit or lower), by using QKeras layers.

Also, do you have the results of the resource usage after Vivado (logic) synthesis (vsynth)? Often the LUT usage goes down by quite a bit because it's overestimated at the C (HLS) synthesis stage.
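Acting on the quantization suggestion does not require changing the model itself: the fixed-point precision can be tightened directly in the hls4ml configuration. A minimal sketch, assuming a config dict shaped like the one returned by `hls4ml.utils.config_from_keras_model`; the layer name `fc1` is hypothetical, used only for illustration:

```python
# Sketch: lower the fixed-point width model-wide in an hls4ml-style config.
# 'fc1' is a hypothetical layer name for illustration.
hls_config = {
    'Model': {'Precision': 'ap_fixed<16,6>', 'ReuseFactor': 1},
    'LayerName': {'fc1': {'Precision': 'ap_fixed<16,6>'}},
}

# Drop from 16 total bits to 8 (3 integer bits). Narrower operands mean
# narrower multipliers, which can pull LUT-mapped logic back under budget.
hls_config['Model']['Precision'] = 'ap_fixed<8,3>'
for layer_cfg in hls_config['LayerName'].values():
    layer_cfg['Precision'] = 'ap_fixed<8,3>'
```

For the second suggestion, the Vivado backend's `hls_model.build(csim=False, synth=True, vsynth=True)` runs logic synthesis after HLS and reports the post-vsynth numbers, which are usually lower than the C-synthesis LUT estimates. Lowering precision this aggressively does cost accuracy unless the model is trained quantization-aware (e.g., with QKeras), so the new bit width should be validated against the test set.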

wilfredkisku commented 2 years ago

@jmduarte thank you for the reply. This is the synthesis report for the CNN model with ~14,000 parameters given in the hls4ml tutorial repo.

I kept the hls_config of the pruned (but not quantized) model unchanged to see the synthesis results.

# Model-wide defaults: 16-bit fixed point, fully parallel multipliers
hls_config['Model']['Precision'] = 'ap_fixed<16,6>'
hls_config['Model']['ReuseFactor'] = 1

# Fully parallel (Latency) strategy for every layer
for layer in hls_config['LayerName'].keys():
    hls_config['LayerName'][layer]['Strategy'] = 'Latency'
    hls_config['LayerName'][layer]['ReuseFactor'] = 1

# Stable softmax subtracts the max before exponentiating, avoiding overflow
hls_config['LayerName']['output_softmax']['Strategy'] = 'Stable'
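With `ReuseFactor = 1` everywhere, every multiplication gets its own multiplier, which is what pushes the DSP count far past the budget in the report below. A hedged sketch of the opposite trade-off, sharing each multiplier across several operations via the Resource strategy; the layer names are hypothetical and the reuse value is illustrative, not tuned:

```python
# Sketch: trade latency for resources in an hls4ml-style config.
# 'conv2d_1' and 'dense_1' are hypothetical layer names.
hls_config = {
    'Model': {'Precision': 'ap_fixed<16,6>', 'ReuseFactor': 1},
    'LayerName': {
        'conv2d_1': {'Strategy': 'Latency', 'ReuseFactor': 1},
        'dense_1': {'Strategy': 'Latency', 'ReuseFactor': 1},
        'output_softmax': {'Strategy': 'Stable'},
    },
}

# Share each multiplier across 4 multiplications:
# roughly 4x fewer DSPs, at the cost of roughly 4x more cycles per layer.
hls_config['Model']['ReuseFactor'] = 4
for name, cfg in hls_config['LayerName'].items():
    if name != 'output_softmax':
        cfg['Strategy'] = 'Resource'  # Resource strategy favors low utilization
        cfg['ReuseFactor'] = 4
```

The reuse factor is the main latency/area knob in hls4ml, so it can also be raised only on the layers that dominate the DSP count rather than uniformly.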

This produces the synthesis report below for the ZCU104 target device (xczu7ev-ffvc1156-2-e), taken from the report file inside the synthesis folder of the generated code.

================================================================
== Vivado HLS Report for 'myproject'
================================================================
* Date:           Sat May  7 18:55:31 2022

* Version:        2019.2 (Build 2704478 on Wed Nov 06 22:10:23 MST 2019)
* Project:        myproject_prj
* Solution:       solution1
* Product family: zynquplus
* Target device:  xczu7ev-ffvc1156-2-e

================================================================
== Performance Estimates
================================================================
+ Timing: 
    * Summary: 
    +--------+---------+----------+------------+
    |  Clock |  Target | Estimated| Uncertainty|
    +--------+---------+----------+------------+
    |ap_clk  | 5.00 ns | 4.355 ns |   0.62 ns  |
    +--------+---------+----------+------------+

+ Latency: 
    * Summary: 
    +---------+---------+----------+----------+------+------+----------+
    |  Latency (cycles) |  Latency (absolute) |   Interval  | Pipeline |
    |   min   |   max   |    min   |    max   |  min |  max |   Type   |
    +---------+---------+----------+----------+------+------+----------+
    |     1061|     1061| 5.305 us | 5.305 us |  1029|  1029| dataflow |
    +---------+---------+----------+----------+------+------+----------+

    + Detail: 
        * Instance: 
        +-------------------------------------------------------+------------------------------------------------------+---------+---------+-----------+-----------+------+------+----------+
        |                                                       |                                                      |  Latency (cycles) |   Latency (absolute)  |   Interval  | Pipeline |
        |                        Instance                       |                        Module                        |   min   |   max   |    min    |    max    |  min |  max |   Type   |
        +-------------------------------------------------------+------------------------------------------------------+---------+---------+-----------+-----------+------+------+----------+
        |dense_array_array_ap_fixed_16_6_5_3_0_42u_config18_U0  |dense_array_array_ap_fixed_16_6_5_3_0_42u_config18_s  |        8|        8| 40.000 ns | 40.000 ns |     8|     8|   none   |
        |conv_2d_cl_array_array_ap_fixed_24u_config12_U0        |conv_2d_cl_array_array_ap_fixed_24u_config12_s        |       41|       41|  0.205 us |  0.205 us |    41|    41|   none   |
        |conv_2d_cl_array_array_ap_fixed_16u_config7_U0         |conv_2d_cl_array_array_ap_fixed_16u_config7_s         |      230|      230|  1.150 us |  1.150 us |   230|   230|   none   |
        |dense_array_array_ap_fixed_16_6_5_3_0_64u_config22_U0  |dense_array_array_ap_fixed_16_6_5_3_0_64u_config22_s  |        3|        3| 15.000 ns | 15.000 ns |     3|     3|   none   |
        |dense_array_array_ap_fixed_16_6_5_3_0_10u_config26_U0  |dense_array_array_ap_fixed_16_6_5_3_0_10u_config26_s  |        3|        3| 15.000 ns | 15.000 ns |     3|     3|   none   |
        |conv_2d_cl_array_array_ap_fixed_16u_config2_U0         |conv_2d_cl_array_array_ap_fixed_16u_config2_s         |     1028|     1028|  5.140 us |  5.140 us |  1028|  1028|   none   |
        |pooling2d_cl_array_array_ap_fixed_24u_config16_U0      |pooling2d_cl_array_array_ap_fixed_24u_config16_s      |       20|       20|  0.100 us |  0.100 us |    20|    20|   none   |
        |softmax_array_array_ap_fixed_10u_softmax_config28_U0   |softmax_array_array_ap_fixed_10u_softmax_config28_s   |       10|       10| 50.000 ns | 50.000 ns |    10|    10|   none   |
        |normalize_array_array_ap_fixed_64u_config24_U0         |normalize_array_array_ap_fixed_64u_config24_s         |        1|        1|  5.000 ns |  5.000 ns |     1|     1| function |
        |relu_array_array_ap_fixed_64u_relu_config25_U0         |relu_array_array_ap_fixed_64u_relu_config25_s         |        1|        1|  5.000 ns |  5.000 ns |     1|     1| function |
        |pooling2d_cl_array_array_ap_fixed_16u_config6_U0       |pooling2d_cl_array_array_ap_fixed_16u_config6_s       |      903|      903|  4.515 us |  4.515 us |   903|   903|   none   |
        |pooling2d_cl_array_array_ap_fixed_16u_config11_U0      |pooling2d_cl_array_array_ap_fixed_16u_config11_s      |      172|      172|  0.860 us |  0.860 us |   172|   172|   none   |
        |normalize_array_array_ap_fixed_42u_config20_U0         |normalize_array_array_ap_fixed_42u_config20_s         |        1|        1|  5.000 ns |  5.000 ns |     1|     1| function |
        |relu_array_array_ap_fixed_42u_relu_config21_U0         |relu_array_array_ap_fixed_42u_relu_config21_s         |        1|        1|  5.000 ns |  5.000 ns |     1|     1| function |
        |normalize_array_array_ap_fixed_24u_config14_U0         |normalize_array_array_ap_fixed_24u_config14_s         |       19|       19| 95.000 ns | 95.000 ns |    19|    19|   none   |
        |relu_array_array_ap_fixed_24u_relu_config15_U0         |relu_array_array_ap_fixed_24u_relu_config15_s         |       19|       19| 95.000 ns | 95.000 ns |    19|    19|   none   |
        |normalize_array_array_ap_fixed_16u_config4_U0          |normalize_array_array_ap_fixed_16u_config4_s          |      903|      903|  4.515 us |  4.515 us |   903|   903|   none   |
        |normalize_array_array_ap_fixed_16u_config9_U0          |normalize_array_array_ap_fixed_16u_config9_s          |      172|      172|  0.860 us |  0.860 us |   172|   172|   none   |
        |relu_array_array_ap_fixed_16u_relu_config5_U0          |relu_array_array_ap_fixed_16u_relu_config5_s          |      903|      903|  4.515 us |  4.515 us |   903|   903|   none   |
        |relu_array_array_ap_fixed_16u_relu_config10_U0         |relu_array_array_ap_fixed_16u_relu_config10_s         |      172|      172|  0.860 us |  0.860 us |   172|   172|   none   |
        |linear_array_array_ap_fixed_16u_linear_config3_U0      |linear_array_array_ap_fixed_16u_linear_config3_s      |      902|      902|  4.510 us |  4.510 us |   902|   902|   none   |
        |linear_array_array_ap_fixed_24u_linear_config13_U0     |linear_array_array_ap_fixed_24u_linear_config13_s     |       18|       18| 90.000 ns | 90.000 ns |    18|    18|   none   |
        |linear_array_array_ap_fixed_16u_linear_config8_U0      |linear_array_array_ap_fixed_16u_linear_config8_s      |      171|      171|  0.855 us |  0.855 us |   171|   171|   none   |
        |linear_array_array_ap_fixed_64u_linear_config23_U0     |linear_array_array_ap_fixed_64u_linear_config23_s     |        0|        0|    0 ns   |    0 ns   |     1|     1| function |
        |linear_array_array_ap_fixed_42u_linear_config19_U0     |linear_array_array_ap_fixed_42u_linear_config19_s     |        0|        0|    0 ns   |    0 ns   |     1|     1| function |
        |linear_array_array_ap_fixed_10u_linear_config27_U0     |linear_array_array_ap_fixed_10u_linear_config27_s     |        0|        0|    0 ns   |    0 ns   |     1|     1| function |
        |Block_proc_U0                                          |Block_proc                                            |        0|        0|    0 ns   |    0 ns   |     0|     0|   none   |
        +-------------------------------------------------------+------------------------------------------------------+---------+---------+-----------+-----------+------+------+----------+

        * Loop: 
        N/A

================================================================
== Utilization Estimates
================================================================
* Summary: 
+-----------------+---------+-------+--------+--------+-----+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  | URAM|
+-----------------+---------+-------+--------+--------+-----+
|DSP              |        -|      -|       -|       -|    -|
|Expression       |        -|      -|       0|      32|    -|
|FIFO             |      160|      -|    9748|   25232|    -|
|Instance         |        3|   5155|   69143|  248324|    -|
|Memory           |        -|      -|       -|       -|    -|
|Multiplexer      |        -|      -|       -|      36|    -|
|Register         |        -|      -|       6|       -|    -|
+-----------------+---------+-------+--------+--------+-----+
|Total            |      163|   5155|   78897|  273624|    0|
+-----------------+---------+-------+--------+--------+-----+
|Available        |      624|   1728|  460800|  230400|   96|
+-----------------+---------+-------+--------+--------+-----+
|Utilization (%)  |       26|    298|      17|     118|    0|
+-----------------+---------+-------+--------+--------+-----+

I just need to know how deep a model with around ~20,000 parameters can be while still fitting on the ZCU104 board. Thanks again.

thesps commented 2 years ago
+-----------------+---------+-------+--------+--------+-----+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  | URAM|
+-----------------+---------+-------+--------+--------+-----+
...
+-----------------+---------+-------+--------+--------+-----+
|Utilization (%)  |       20|     22|      11|     125|    0|
+-----------------+---------+-------+--------+--------+-----+

I think you could try to make a bitfile from this one. Did you try it? This report is from the HLS estimates, and often we see that the LUTs are overestimated. You could also try a newer Vivado version, up to 2020.2 or so.

The DSPs are normally well estimated, so the second one with ~300% DSP utilisation probably won't work.

+-----------------+---------+-------+--------+--------+-----+
|Utilization (%)  |       26|    298|      17|     118|    0|
+-----------------+---------+-------+--------+--------+-----+
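That rule of thumb can be turned into a quick back-of-envelope check: with the Latency strategy each multiplication tends to claim its own DSP48E, so dividing the estimated DSP count by the device's DSP budget gives a lower bound on the uniform ReuseFactor needed to fit. A rough sketch using the figures from the report above (this ignores per-layer rounding and any multipliers Vivado maps to LUTs instead):

```python
import math

dsp_estimated = 5155  # DSP48E estimate from the HLS report above
dsp_available = 1728  # DSP48E slices on the xczu7ev (ZCU104)

# Minimum uniform ReuseFactor so the multiplications fit, all else equal.
min_reuse = math.ceil(dsp_estimated / dsp_available)
print(min_reuse)  # 3
```

So a ReuseFactor of at least ~3, or more aggressive quantization as suggested above, would be needed before this second design has a chance of fitting.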