Xilinx / finn

Dataflow compiler for QNN inference on FPGAs
https://xilinx.github.io/finn
BSD 3-Clause "New" or "Revised" License
723 stars 230 forks source link

[RTL SWG] Support SIMD < C in window-parallel mode #922

Closed fpjentzsch closed 9 months ago

fpjentzsch commented 10 months ago

Previously, full SWG SIMD parallelism (SIMD = # Channels) was required before enabling the window-parallel mode. Due to the depthwise data layout, this prevented VVAU SIMD unfolding (across the kernel dimensions) unless VVAU PE (across the channel dimension) was maxed out.

This adds support for SWG SIMD < C when the SWG is in window-parallel and depthwise mode. Note that the SIMD of the SWG must match the PE of the following VVAU. VVAU SIMD < K is supported via a normal DWC, which is inserted automatically by the compiler.

This experimental HLS DWC component was previously introduced as a workaround for this problem and should now be obsolete: https://github.com/Xilinx/finn-hlslib/pull/134

SWG SIMD < C is also allowed in the 1x1 kernel case (no matter whether parallel_window is set or not), which should fix https://github.com/Xilinx/finn/discussions/895.

maltanar commented 10 months ago

Many thanks for this feature @fpjentzsch ! Not a full review but while testing this on a larger network, I did come across one issue during FIFO sizing with verilator. Although not part of the changes from this PR itself, I think the kinds of SWGs enabled by the PR are more likely to run into this issue: the DEPTH of the swg_reg_buffer is sometimes large enough to run into particular verilator coding style limitations.

Specifically, this is the error message I observed:

%Error-BLKLOOPINIT: /scratch/finn/verilator_fifosim_iletty6g/finn_design_wrapper.v:3250:17: Unsupported: Delayed assignment to array inside for loops (non-delayed is ok - see docs)
 3250 |             Data[i] <= Data[i-1];

The offending piece of code in context below - the DEPTH of the swg_reg_buffer instances in this example can be as large as 832 and this seems to be larger than the verilator default limits for loop unrolling in this scenario.

image

I'll give this a try with --unroll-count 1000 to see if it makes a difference for my use-case, but perhaps there is a better solution here, either by changing the coding style or getting the FINN compiler to generate code with those statements unrolled. (update: that did not help, unfortunately)

maltanar commented 10 months ago

A suggestion from @preusser (which verilator seems to be happy with) is to use a sliced vector assignment instead of the for-loop:

if(shift_enable) begin
  if(DEPTH > 1)  Data[DEPTH-1:1] <= Data[DEPTH-2:0];
  Data[0] <= shift_in;
end
fpjentzsch commented 9 months ago

Thanks @maltanar, I incorporated this fix and, from my side, we could merge this PR already.

To increase resource efficiency in cases like this, I'm currently experimenting with a "depth threshold" setting, which would split up deep shift registers where not all elements need to be accessed in parallel (such as for large dilation_w or large #Channels/SIMD) into smaller shift registers and LUTRAM/BRAM buffers. I prefer to avoid yet another configurable attribute for the custom_op, so I'm doing some benchmarking in an attempt to find a reasonable value for this threshold and the overall resource impact that this could have.