Hi @wilfredkisku, what bit width are you using?
A simple way to reduce resources is to quantize the layers more aggressively (e.g., 8-bit or lower), by using QKeras layers.
Also, do you have the results of the resource usage after Vivado (logic) synthesis (vsynth)? Often the LUT usage goes down by quite a bit because it's overestimated at the C (HLS) synthesis stage.
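For illustration, here is a minimal sketch of what more aggressive quantization with QKeras could look like; the layer sizes and quantizer settings are hypothetical, not taken from the model in this thread:

from tensorflow.keras.layers import Input, Flatten
from tensorflow.keras.models import Model
from qkeras import QConv2D, QDense, QActivation, quantized_bits, quantized_relu

inputs = Input(shape=(32, 32, 3))
# 8-bit weights and biases (0 integer bits, fixed scale alpha=1)
x = QConv2D(16, (3, 3),
            kernel_quantizer=quantized_bits(8, 0, alpha=1),
            bias_quantizer=quantized_bits(8, 0, alpha=1))(inputs)
x = QActivation(quantized_relu(8))(x)  # 8-bit activations
x = Flatten()(x)
outputs = QDense(10,
                 kernel_quantizer=quantized_bits(8, 0, alpha=1),
                 bias_quantizer=quantized_bits(8, 0, alpha=1))(x)
model = Model(inputs, outputs)

hls4ml picks up the QKeras quantizers during conversion, so the on-chip precision follows the training-time quantization. For the post-logic-synthesis numbers, recent hls4ml versions accept a vsynth flag on build (keyword names can differ slightly between versions):

import hls4ml
config = hls4ml.utils.config_from_keras_model(model, granularity='name')
hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, part='xczu7ev-ffvc1156-2-e')
hls_model.build(synth=True, vsynth=True)  # vsynth=True also runs Vivado logic synthesis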
@jmduarte thank you for the reply. Below is the synthesis report for a CNN model with ~14,000 parameters given in the hls4ml tutorial repo. I have kept the hls_config of the pruned (but not quantized) model as follows, to see the synthesis results:
hls_config['Model']['Precision'] = 'ap_fixed<16,6>'  # 16-bit word, 6 integer bits
hls_config['Model']['ReuseFactor'] = 1               # fully parallel multipliers
for layer in hls_config['LayerName'].keys():
    hls_config['LayerName'][layer]['Strategy'] = 'Latency'
    hls_config['LayerName'][layer]['ReuseFactor'] = 1
hls_config['LayerName']['output_softmax']['Strategy'] = 'Stable'
Synthesizing this for the ZCU104 target device (xczu7ev-ffvc1156-2-e) produces the report shown below, taken from the report file inside the synthesis folder of the generated code.
================================================================
== Vivado HLS Report for 'myproject'
================================================================
* Date: Sat May 7 18:55:31 2022
* Version: 2019.2 (Build 2704478 on Wed Nov 06 22:10:23 MST 2019)
* Project: myproject_prj
* Solution: solution1
* Product family: zynquplus
* Target device: xczu7ev-ffvc1156-2-e
================================================================
== Performance Estimates
================================================================
+ Timing:
* Summary:
+--------+---------+----------+------------+
| Clock | Target | Estimated| Uncertainty|
+--------+---------+----------+------------+
|ap_clk | 5.00 ns | 4.355 ns | 0.62 ns |
+--------+---------+----------+------------+
+ Latency:
* Summary:
+---------+---------+----------+----------+------+------+----------+
| Latency (cycles) | Latency (absolute) | Interval | Pipeline |
| min | max | min | max | min | max | Type |
+---------+---------+----------+----------+------+------+----------+
| 1061| 1061| 5.305 us | 5.305 us | 1029| 1029| dataflow |
+---------+---------+----------+----------+------+------+----------+
+ Detail:
* Instance:
+-------------------------------------------------------+------------------------------------------------------+---------+---------+-----------+-----------+------+------+----------+
| | | Latency (cycles) | Latency (absolute) | Interval | Pipeline |
| Instance | Module | min | max | min | max | min | max | Type |
+-------------------------------------------------------+------------------------------------------------------+---------+---------+-----------+-----------+------+------+----------+
|dense_array_array_ap_fixed_16_6_5_3_0_42u_config18_U0 |dense_array_array_ap_fixed_16_6_5_3_0_42u_config18_s | 8| 8| 40.000 ns | 40.000 ns | 8| 8| none |
|conv_2d_cl_array_array_ap_fixed_24u_config12_U0 |conv_2d_cl_array_array_ap_fixed_24u_config12_s | 41| 41| 0.205 us | 0.205 us | 41| 41| none |
|conv_2d_cl_array_array_ap_fixed_16u_config7_U0 |conv_2d_cl_array_array_ap_fixed_16u_config7_s | 230| 230| 1.150 us | 1.150 us | 230| 230| none |
|dense_array_array_ap_fixed_16_6_5_3_0_64u_config22_U0 |dense_array_array_ap_fixed_16_6_5_3_0_64u_config22_s | 3| 3| 15.000 ns | 15.000 ns | 3| 3| none |
|dense_array_array_ap_fixed_16_6_5_3_0_10u_config26_U0 |dense_array_array_ap_fixed_16_6_5_3_0_10u_config26_s | 3| 3| 15.000 ns | 15.000 ns | 3| 3| none |
|conv_2d_cl_array_array_ap_fixed_16u_config2_U0 |conv_2d_cl_array_array_ap_fixed_16u_config2_s | 1028| 1028| 5.140 us | 5.140 us | 1028| 1028| none |
|pooling2d_cl_array_array_ap_fixed_24u_config16_U0 |pooling2d_cl_array_array_ap_fixed_24u_config16_s | 20| 20| 0.100 us | 0.100 us | 20| 20| none |
|softmax_array_array_ap_fixed_10u_softmax_config28_U0 |softmax_array_array_ap_fixed_10u_softmax_config28_s | 10| 10| 50.000 ns | 50.000 ns | 10| 10| none |
|normalize_array_array_ap_fixed_64u_config24_U0 |normalize_array_array_ap_fixed_64u_config24_s | 1| 1| 5.000 ns | 5.000 ns | 1| 1| function |
|relu_array_array_ap_fixed_64u_relu_config25_U0 |relu_array_array_ap_fixed_64u_relu_config25_s | 1| 1| 5.000 ns | 5.000 ns | 1| 1| function |
|pooling2d_cl_array_array_ap_fixed_16u_config6_U0 |pooling2d_cl_array_array_ap_fixed_16u_config6_s | 903| 903| 4.515 us | 4.515 us | 903| 903| none |
|pooling2d_cl_array_array_ap_fixed_16u_config11_U0 |pooling2d_cl_array_array_ap_fixed_16u_config11_s | 172| 172| 0.860 us | 0.860 us | 172| 172| none |
|normalize_array_array_ap_fixed_42u_config20_U0 |normalize_array_array_ap_fixed_42u_config20_s | 1| 1| 5.000 ns | 5.000 ns | 1| 1| function |
|relu_array_array_ap_fixed_42u_relu_config21_U0 |relu_array_array_ap_fixed_42u_relu_config21_s | 1| 1| 5.000 ns | 5.000 ns | 1| 1| function |
|normalize_array_array_ap_fixed_24u_config14_U0 |normalize_array_array_ap_fixed_24u_config14_s | 19| 19| 95.000 ns | 95.000 ns | 19| 19| none |
|relu_array_array_ap_fixed_24u_relu_config15_U0 |relu_array_array_ap_fixed_24u_relu_config15_s | 19| 19| 95.000 ns | 95.000 ns | 19| 19| none |
|normalize_array_array_ap_fixed_16u_config4_U0 |normalize_array_array_ap_fixed_16u_config4_s | 903| 903| 4.515 us | 4.515 us | 903| 903| none |
|normalize_array_array_ap_fixed_16u_config9_U0 |normalize_array_array_ap_fixed_16u_config9_s | 172| 172| 0.860 us | 0.860 us | 172| 172| none |
|relu_array_array_ap_fixed_16u_relu_config5_U0 |relu_array_array_ap_fixed_16u_relu_config5_s | 903| 903| 4.515 us | 4.515 us | 903| 903| none |
|relu_array_array_ap_fixed_16u_relu_config10_U0 |relu_array_array_ap_fixed_16u_relu_config10_s | 172| 172| 0.860 us | 0.860 us | 172| 172| none |
|linear_array_array_ap_fixed_16u_linear_config3_U0 |linear_array_array_ap_fixed_16u_linear_config3_s | 902| 902| 4.510 us | 4.510 us | 902| 902| none |
|linear_array_array_ap_fixed_24u_linear_config13_U0 |linear_array_array_ap_fixed_24u_linear_config13_s | 18| 18| 90.000 ns | 90.000 ns | 18| 18| none |
|linear_array_array_ap_fixed_16u_linear_config8_U0 |linear_array_array_ap_fixed_16u_linear_config8_s | 171| 171| 0.855 us | 0.855 us | 171| 171| none |
|linear_array_array_ap_fixed_64u_linear_config23_U0 |linear_array_array_ap_fixed_64u_linear_config23_s | 0| 0| 0 ns | 0 ns | 1| 1| function |
|linear_array_array_ap_fixed_42u_linear_config19_U0 |linear_array_array_ap_fixed_42u_linear_config19_s | 0| 0| 0 ns | 0 ns | 1| 1| function |
|linear_array_array_ap_fixed_10u_linear_config27_U0 |linear_array_array_ap_fixed_10u_linear_config27_s | 0| 0| 0 ns | 0 ns | 1| 1| function |
|Block_proc_U0 |Block_proc | 0| 0| 0 ns | 0 ns | 0| 0| none |
+-------------------------------------------------------+------------------------------------------------------+---------+---------+-----------+-----------+------+------+----------+
* Loop:
N/A
================================================================
== Utilization Estimates
================================================================
* Summary:
+-----------------+---------+-------+--------+--------+-----+
| Name | BRAM_18K| DSP48E| FF | LUT | URAM|
+-----------------+---------+-------+--------+--------+-----+
|DSP | -| -| -| -| -|
|Expression | -| -| 0| 32| -|
|FIFO | 160| -| 9748| 25232| -|
|Instance | 3| 5155| 69143| 248324| -|
|Memory | -| -| -| -| -|
|Multiplexer | -| -| -| 36| -|
|Register | -| -| 6| -| -|
+-----------------+---------+-------+--------+--------+-----+
|Total | 163| 5155| 78897| 273624| 0|
+-----------------+---------+-------+--------+--------+-----+
|Available | 624| 1728| 460800| 230400| 96|
+-----------------+---------+-------+--------+--------+-----+
|Utilization (%) | 26| 298| 17| 118| 0|
+-----------------+---------+-------+--------+--------+-----+
I just needed to know how well, and how deep, a model with around 20,000 parameters can be ported onto the ZCU104 board. Thanks again.
+-----------------+---------+-------+--------+--------+-----+
| Name | BRAM_18K| DSP48E| FF | LUT | URAM|
+-----------------+---------+-------+--------+--------+-----+
...
+-----------------+---------+-------+--------+--------+-----+
|Utilization (%) | 20| 22| 11| 125| 0|
+-----------------+---------+-------+--------+--------+-----+
I think you could try to make a bitfile from this one. Did you try it? This report is from the HLS estimates, and often we see that the LUTs are overestimated. You could also try a newer Vivado version, up to 2020.2 or so.
The DSPs are normally well estimated, so the second one with ~300% DSP utilisation probably won't work.
+-----------------+---------+-------+--------+--------+-----+
|Utilization (%) | 26| 298| 17| 118| 0|
+-----------------+---------+-------+--------+--------+-----+
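Not from this thread, but for completeness, a minimal sketch of the usual hls4ml knobs for pulling DSP usage under 100%, reusing the hls_config dict from the snippet above (the values are illustrative only):

# Illustrative resource-saving settings, not a verified configuration.
hls_config['Model']['Precision'] = 'ap_fixed<8,3>'  # narrower default word width
for layer in hls_config['LayerName'].keys():
    # 'Resource' time-multiplexes the multipliers: with ReuseFactor = 8, each
    # DSP is shared by 8 multiplications, so the DSP count drops roughly 8x
    # while the initiation interval grows by about the same factor.
    hls_config['LayerName'][layer]['Strategy'] = 'Resource'
    hls_config['LayerName'][layer]['ReuseFactor'] = 8

The trade-off is latency/throughput, so this is the complement of the quantization route: lower precision shrinks each multiplier, while reuse shares the multipliers that remain.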
I have been trying to recreate models that can be deployed effectively on the ZCU104 UltraScale+ MPSoC (Zynq platform). These are CNN models for CV tasks such as classification and object detection.
While synthesis for the Xilinx Alveo U200 and U250 (xcu250-figd2104-2L-e) Data Center Accelerator Cards completes easily without this issue, boards with fewer resources are constrained, since the utilization easily exceeds 100%, generally for the DSP blocks and LUTs.
The utilization estimate is for a classification model with only 14,210 parameters, of which 13,886 are trainable (the operations include Conv2D, ReLU, pooling, fully connected layers, and Softmax). Is there a way to accommodate larger parameter counts and deeper networks on low-end cards (without losing latency)?
Please provide suggestions in this regard. Thanks.
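A related knob, sketched under the assumption that the config was generated with granularity='name' (the layer name 'conv2d' is a placeholder, not from this thread): per-layer precision overrides let only the layers that actually need dynamic range pay for wide words:

# 'conv2d' is hypothetical; the real keys are the Keras layer names.
hls_config['LayerName']['conv2d']['Precision']['weight'] = 'ap_fixed<8,2>'
hls_config['LayerName']['conv2d']['Precision']['bias'] = 'ap_fixed<8,2>'
hls_config['LayerName']['conv2d']['Precision']['result'] = 'ap_fixed<16,6>'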