Xilinx / finn

Dataflow compiler for QNN inference on FPGAs
https://xilinx.github.io/finn
BSD 3-Clause "New" or "Revised" License

Generalized DataWidthConverter #1186

Open lstasytis opened 2 months ago

lstasytis commented 2 months ago

This PR, alongside https://github.com/Xilinx/finn-hlslib/pull/144, introduces a new variant of the DataWidthConverter (DWC) called DataWidthConverterGeneralized_Batch, which should eventually completely replace the old DataWidthConverter_Batch function from finn-hlslib.

The new DWC has two key improvements over the previous HLS version:

a.) Cases where the input and output stream widths are incompatible with the RTL variant (neither width is an integer multiple of the other) no longer result in an intermediate buffer whose size equals the least common multiple (LCM) of the two widths.

Instead, a single intermediate buffer of size input width + output width is always generated.

This keeps the intermediate buffer from growing enormous when the LCM of the two widths is very large, and limits the node to a single module instantiation instead of 3. As a result, an intermediate AXI stream data bus wider than 4K bits, which would otherwise have broken HLS, is no longer generated (unless the input and output stream widths together exceed 4K bits).

b.) The node supports padding the tail end of each transaction passing through it with zeroes, as well as cropping that tail end. This allows arbitrary padding of nodes in FINN in order to relax folding factor constraints.
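To make the buffer-size argument in point a.) concrete, here is a minimal Python sketch that compares the two intermediate buffer sizes for a few width pairs. The function names and the 4096-bit constant are illustrative assumptions taken from the ">4K" figure above; they are not part of FINN or finn-hlslib.

```python
from math import lcm  # available in Python 3.9+

# Assumption: the ">4K bit" bus limit mentioned above, expressed in bits.
MAX_BUS_BITS = 4096


def old_buffer_bits(in_width: int, out_width: int) -> int:
    """Intermediate buffer width of the LCM-based approach, in bits."""
    return lcm(in_width, out_width)


def generalized_buffer_bits(in_width: int, out_width: int) -> int:
    """Intermediate buffer width of the generalized DWC, in bits."""
    return in_width + out_width


# Hypothetical width pairs where neither width divides the other.
for in_w, out_w in [(24, 16), (97, 64), (130, 72)]:
    old = old_buffer_bits(in_w, out_w)
    new = generalized_buffer_bits(in_w, out_w)
    flag = "exceeds 4K" if old > MAX_BUS_BITS else "ok"
    print(f"{in_w:>4} -> {out_w:<4} LCM buffer: {old:>5} bits ({flag}), "
          f"sum buffer: {new:>4} bits")
```

For the pair (97, 64), the LCM-based buffer would be 6208 bits wide, while the sum-based buffer is 161 bits.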

Architecture: The node uses an intermediate shift-register-based buffer of size input width + output width. The input stream is written into the intermediate buffer at a position given by an offset variable, which tracks how many elements the buffer currently holds. The output stream is tied to the right-most output-stream-width bits of the intermediate buffer. We also track the total number of input and output words which the DWC needs to process in a single transaction. Whenever we run out of input words or output words, relative to the counts fixed at compile time, we either shift in zeroes (padding) or stop writing to the output stream (cropping).
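The behaviour described above can be approximated with a small software model. This is a sketch under stated assumptions, not the HLS implementation: the function name `dwc_generalized_model` is hypothetical, "right-most" is taken to mean the least significant bits, the offset is counted in bits, and leftover input words in the cropping case are simply ignored rather than drained from a stream.

```python
def dwc_generalized_model(words, in_width, out_width, num_out_words):
    """Behavioural model of one DWC transaction using Python integers as bit vectors."""
    in_mask = (1 << in_width) - 1
    out_mask = (1 << out_width) - 1
    buf = 0      # intermediate buffer, never holds more than in_width + out_width bits
    offset = 0   # number of valid bits currently in the buffer
    next_in = 0  # index of the next input word to load
    out = []

    # Stopping once num_out_words have been written models cropping.
    while len(out) < num_out_words:
        if offset >= out_width:
            # Output is tied to the lowest out_width bits; then shift them out.
            out.append(buf & out_mask)
            buf >>= out_width
            offset -= out_width
        else:
            # Load the next input word at the current offset; once the input
            # words are exhausted, shift in zeroes instead (padding).
            word = words[next_in] & in_mask if next_in < len(words) else 0
            next_in += 1
            buf |= word << offset
            offset += in_width
            assert offset < in_width + out_width  # buffer bound holds

    return out


# Three 4-bit words (12 bits) repacked into 6-bit words.
exact = dwc_generalized_model([0xA, 0xB, 0xC], 4, 6, num_out_words=2)
padded = dwc_generalized_model([0xA, 0xB, 0xC], 4, 6, num_out_words=3)
cropped = dwc_generalized_model([0xA, 0xB, 0xC], 4, 6, num_out_words=1)
print([hex(w) for w in exact], [hex(w) for w in padded], [hex(w) for w in cropped])
```

Because a word is only loaded while fewer than out_width bits are valid, the buffer occupancy stays below in_width + out_width, which is why a buffer of that size is sufficient in this model.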

Downsides: The architecture does not produce efficient HLS code. Multiplexing the input stream into the intermediate buffer leads to massive LUT use, because HLS instantiates a general IP core for the task. The node is more LUT-efficient than the old HLS variant only in cases where the intermediate buffer produced by the old DWC is 3-4x larger than the sum of the input and output stream widths.

Improvements to be made: An RTL variant for the DWC should eventually be pushed to finn-rtllib, at which point the old DWC can be retired entirely in favor of this current architecture.

Use of padding functionality: Introducing padding to FINN nodes is extremely error-prone, so it should be done carefully. The recommendation is to use the new generalized folding optimizer from the following branch: https://github.com/lstasytis/finn/tree/feature/set-folding-optimizer and to allow it to use padding by setting the folding_maximum_padding dataflow builder argument to a value greater than 0. The InsertDWC transformation will then insert DWCs that may perform padding, because the SetFolding() transformation relaxes the stream shape restrictions on the assumption that DWCs perform the padding.

For a breakdown of padding restrictions in FINN nodes, refer to the code in the new SetFolding() transformation in the aforementioned branch.
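As a rough illustration of why padding relaxes folding constraints, the sketch below lists the folding factors that become available once a dimension may be zero-padded. It assumes the simple rule that a folding factor must evenly divide the (padded) dimension. The function `legal_folds` and its `max_padding` parameter are hypothetical and do not reproduce the actual SetFolding() logic or the per-node restrictions in the branch.

```python
def legal_folds(channels: int, max_padding: int = 0) -> list[int]:
    """Folding factors that evenly divide `channels` after adding
    between 0 and `max_padding` zero-padded elements."""
    folds = set()
    for pad in range(max_padding + 1):
        padded = channels + pad
        # Collect every divisor of the padded dimension.
        folds.update(f for f in range(1, padded + 1) if padded % f == 0)
    return sorted(folds)


# Without padding, a 10-element dimension only folds by its own divisors.
print(legal_folds(10))
# Allowing up to 2 padded elements also admits divisors of 11 and 12.
print(legal_folds(10, max_padding=2))
```

In this toy setting, allowing two padded elements adds factors such as 3, 4 and 6 for a 10-element dimension, with the padded zeroes supplied by a DWC.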

Integration: The finn-hlslib PR should be merged first, and the resulting commit should be linked from this PR before merging, so that the new component is actually used.