Fix: Add stream padding to RTL SWG

I did some more digging to find out why the Verilator simulation gives unexpected results in this scenario:

Verilator exposes (and internally works with) signals between 2-8 bit as UINT8, 9-16 bit as UINT16, 17-32 bit as UINT32, 33-64 bits as UINT64, and >64 bits as arrays of UINT32 or UINT64.
It is illegal to set more bits than the signal actually has, e.g. for a 7 bit signal represented as a UINT8, the 8th bit (MSB) must be zero (padded/masked).
This restriction (2) is not documented very well (I only found this issue), not enforced by FINN or PyVerilator, and only caught by Verilator if -CFLAGS -DVL_DEBUG=1 is set (triggering an overWidthError).
In my experiments, if (2) is not caught, it will not result in a crash or incorrect initial read-in of the data (judging by the input signal waveworm), but lead to undefined behavior later during the simulation (in my case some SWG outputs were incremented by 1 in a seemingly random fashion).
If the last input dimension (SIMD) is 1, FINN applies "fast" data packing (originally introduced here), which behaves differently from the default data packing via hexstring conversion. The latter does not sign-extend, while the former does (to the full width of a Python int, which is later converted to the corresponding C UINT8/UINT16/UINT32 etc.).
The extended sign-bit(s) cause the problem because they violate (2).
Padding the top level I/O streams of the RTL module to a width that is a multiple of 8 fixes this issue because the C data type used by Verilator corresponds exactly to the RTL signal width. The unneeded MSBs are sliced off within the RTL design itself, which naturally works fine during the simulation.
This issue remains for input bit widths between 17-24 because they are padded to 24=3*8, but Verilator exposes them as UINT32. Verilator catches the error in these cases if debugging is enabled like described above, but I was not able to reproduce any incorrect simulation output for the SWG. Since this is very much an extreme edge case, I don't think we need to spend more time to fix this for now.

Xilinx / finn

Fix: Add stream padding to RTL SWG #983