(This PR combines the previously closed PRs #976, #975 and #794.)
Adds support for multi-packed DSP48s and DSP58s in the 'MatrixVectorActivation' layer. For weights and activations between 4 and 8 bits wide (activations may be up to 9 bits wide on DSP58), the custom layer packs 2, 3 or 4 elements onto the input datapath of the DSP, achieving 2, 3 or 4 MACs per cycle per DSP48/DSP58 depending on bit width and board.
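To illustrate the packing idea described above, here is a minimal Python sketch of the underlying arithmetic: two signed 4-bit weights are placed in separate lanes of one wide operand, so that a single multiplication by an unsigned 4-bit activation yields both products. The 8-bit lane offset, the two-lane layout and all function names are assumptions chosen for illustration; they are not taken from `mvu_4sx4u.sv`, which may use a different lane arrangement and correction scheme.

```python
# Illustrative sketch only: the lane width and helper names are assumptions,
# not the layout used by the RTL in this PR.
LANE_BITS = 8  # assumed lane offset; a 4-bit signed x 4-bit unsigned product fits in 8 signed bits


def pack_weights(w0: int, w1: int) -> int:
    """Place two signed weights into one wide multiplier operand."""
    return (w1 << LANE_BITS) + w0


def unpack_products(p: int) -> tuple[int, int]:
    """Split a packed product back into its two signed lane products."""
    mask = (1 << LANE_BITS) - 1
    lo = p & mask
    # Sign-extend the low lane.
    if lo >= 1 << (LANE_BITS - 1):
        lo -= 1 << LANE_BITS
    # Removing the low lane first undoes its borrow from the high lane.
    hi = (p - lo) >> LANE_BITS
    return lo, hi


def packed_mul(w0: int, w1: int, a: int) -> tuple[int, int]:
    """Compute (w0 * a, w1 * a) with a single wide multiplication."""
    return unpack_products(pack_weights(w0, w1) * a)


if __name__ == "__main__":
    print(packed_mul(-3, 5, 7))
```

The sign extension of the low lane is what keeps the high lane correct when the low product is negative. Accumulating several packed products before unpacking would need wider lanes or an additional correction step, which this sketch does not model.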
[x] rtllib: RTL implementation for the DSP58-based MVU
[x] 4-bit weights x 4-bit activations DSP48 & DSP58: mvu_4sx4u.sv
[x] >4-bit weights x >4-bit activations DSP48: mvu_8sx8_dsp48.sv
[x] (>)4-bit weights x (>)4-bit activations DSP58: mvu_vvu_8sx9_dsp58.sv
[x] Flow control and AXI wrapper: mvu_vvu_axi.sv and mvu_vvu_axi_wrapper.v, respectively
[x] Custom-op for the new RTL component: see matrixvectoractivation_rtl.py
[x] Code generation
[x] IP-stitching
[x] Resource estimations
[x] Cycle estimations
[x] CPPsim & RTLsim
[x] Transformation to instantiate the newly created custom-op: see additions in specialize_layers.py
Tests
[x] FINN unit test -- test for the MVU custom-op & transformation (node-by-node CPPsim, node-by-node RTLsim, stitched-ip RTLsim): see test_fpgadataflow_rtl_mvau under test_fpgadataflow_mvau.py
[x] PyVerilator bug when simulating arrays with a loop-carried dependency: use packed arrays instead of unpacked arrays, and ensure signed arithmetic is explicitly enforced wherever expected.
[x] Support for DSP48E1.
[x] 4-bit weights x 4-bit activations DSP48 & DSP58: support for unsigned activations.
[x] Relaxing SIMD constraint (SIMD being a multiple of 3) for DSP58-based implementation.
Important: note that the commit hash of PyVerilator is set to point to ce0a08c (https://github.com/maltanar/pyverilator/tree/refactor/drive_rising_edge) to ensure the RTL simulation tests (for the MVU) pass.
Functionalities to be added for the MVU
rtllib: RTL implementation for the DSP58-based MVU: mvu_4sx4u.sv, mvu_8sx8_dsp48.sv, mvu_vvu_8sx9_dsp58.sv, and mvu_vvu_axi.sv and mvu_vvu_axi_wrapper.v (flow control and AXI wrapper, respectively)
Custom-op: matrixvectoractivation_rtl.py
Transformation: specialize_layers.py
Tests
test_fpgadataflow_rtl_mvau under test_fpgadataflow_mvau.py
mvu_axi_tb.sv
Outstanding bugs & features