32-bit cache and multiplier/accumulator

This PR adds the 32-bit cache and the multiplier/accumulator to the fx branch.

It does the following things:

adds the multiplier and accumulator
adds the 32-bit cache
adds cache filling (one byte at a time)
adds cache writing (4 bytes at a time, with the ability to mask at the nibble level)
adds one-byte cache cycling
adds 16-bit hopping (for +4 and +320 increments of ADDR1)
adds transparency writes (for 32-bit cache writes as well as normal 8-bit writes)
LUT reduction: adds "hints" for the optimizer so that it does not get stuck in a local minimum, which saves a lot of LUTs (done by adding "syn_hier" and "syn_keep" attributes in specific comments). For more detail you can read this documentation.
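
To make the cache items above more concrete, here is a rough behavioral sketch in Python. It is not taken from the RTL in this PR. The class name `Cache32`, the `nibble_enable` argument, its polarity (1 = write the nibble), its bit ordering, and the alignment of the write to a 4-byte boundary are all assumptions made for illustration only.

```python
class Cache32:
    """Behavioral sketch of a 4-byte write cache (hypothetical, not the VERA RTL)."""

    def __init__(self):
        self.data = [0, 0, 0, 0]  # four cache bytes, index 0 = lowest address
        self.index = 0            # which cache byte the next fill goes to

    def fill(self, value):
        """Store one byte, then cycle the fill index (wraps after byte 3)."""
        self.data[self.index] = value & 0xFF
        self.index = (self.index + 1) % 4

    def write(self, vram, addr, nibble_enable=0xFF, transparent=False):
        """Write all four cache bytes to vram in one go.

        ASSUMPTIONS (not confirmed by the PR text):
        - the write targets the 4-byte aligned block containing addr
        - nibble_enable has two bits per byte: bit 2*i enables the low
          nibble of byte i, bit 2*i+1 the high nibble; 1 means "write"
        - transparent=True skips any cache byte equal to 0 (8-bit mode)
        """
        base = addr & ~3
        for i, byte in enumerate(self.data):
            if transparent and byte == 0:
                continue  # transparent byte: leave the destination untouched
            keep = 0  # mask of destination bits that must be preserved
            if not (nibble_enable >> (2 * i)) & 1:
                keep |= 0x0F
            if not (nibble_enable >> (2 * i + 1)) & 1:
                keep |= 0xF0
            vram[base + i] = (vram[base + i] & keep) | (byte & ~keep & 0xFF)


# Usage: fill the cache one byte at a time, then do one masked 4-byte write.
cache = Cache32()
for b in (0x12, 0x00, 0x56, 0x78):
    cache.fill(b)

vram = bytearray([0xAA] * 8)
# High nibble of byte 0 is masked off; byte 1 is 0 and therefore transparent.
cache.write(vram, 4, nibble_enable=0b11111101, transparent=True)
print(vram[4:8].hex())  # a2aa5678
```

In this sketch a single call to `write` replaces four separate 8-bit writes, which is the point of the 32-bit cache. The 4-bit variants of filling and transparency are left out on purpose, matching the scope of this PR.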

Note that this PR still only contains the 8-bit version of cache filling and transparency. The 4-bit versions will follow later on. This is to keep the PRs as small as possible.

-- Sidenote: due to the need for merging back and forth between forks, commits from earlier PRs are included in this PR. It is therefore best to concentrate on the actual file differences.

X16Community / vera-module

32-bit cache and multiplier/accumulator #15