Vivado fails to infer block ram

adamwalker commented 8 years ago

Vivado's fragile block ram inference fails again in this example:

https://gist.github.com/adamwalker/593144773095e624cab6

If I remove the call to init_to_bv in the generated code it succeeds.

christiaanb commented 8 years ago

@adamwalker and if you replace the undefined in your Haskell with repeat 0? (or in the VHDL, replace the inner (others => 'X') by (others => '0'))

adamwalker commented 8 years ago

I made the replacement in VHDL and it was still not able to infer that it should use block ram.

christiaanb commented 8 years ago

Using Vivado 2015.4 (WebPack edition) for the Xilinx Zync XC7Z020 (4.9 MegaBit blockRAM), and the following CLaSH definition:

module BramVivado where

import CLaSH.Prelude

topEntity :: Signal (Unsigned 14) -> Signal (Unsigned 14) -> Signal Bool -> Signal (Unsigned 64) -> Signal (Unsigned 64)
topEntity =  blockRamPow2 undefined

Vivado tells me that is does infer BRAM:

3. Memory
---------

+-------------------+------+-------+-----------+-------+
|     Site Type     | Used | Fixed | Available | Util% |
+-------------------+------+-------+-----------+-------+
| Block RAM Tile    |   32 |     0 |       140 | 22.86 |
|   RAMB36/FIFO*    |   32 |     0 |       140 | 22.86 |
|     RAMB36E1 only |   32 |       |           |       |
|   RAMB18          |    0 |     0 |       280 |  0.00 |
+-------------------+------+-------+-----------+-------+

Note that Unsigned 14 * Unsigned 64 = 2^14 * 64 = 1048576 bit = 1.04 MegaBit, which should be 21.2% of the BRAM resources; however, as you can see, we cannot use the BRAM resources 100% effectively. I don't know what Xilinx device you are using, but could it be that Unsigned 16 * Unsigned 64 = 2^16 * 64 = 4194304 bit = 4.19 MegaBit is hitting the BRAM resource limit?

I haven't tried your original Unsigned 16 design, because I found that Vivado was already taking a long time for my Unsigned 14 design. The problem seems to be that it is elaborating the entire initialisation vector of 2^16 elements. I guess I should add a blockRam primitive which cannot be initialised, i.e. all initial values are undefined.

adamwalker commented 8 years ago

I dont think its hitting the BRAM limit. I incorrectly assumed that it was synthesizing the entire array instead of block ram because it was taking forever, but did not take forever when I removed the init_to_bv function call. However, it looks like the real cause of the delay is the init_to_bv function since this loops over all 2^16 elements.

A block ram primitive that cannot be initialised would certainly help. But, there is still a problem when you want to initialise the block ram to a constant value. The current block ram template would loop over all 2^N elements which is very slow.

Is it possible to avoid the init_to_bv function somehow and directly specify the default value when the ram signal is declared?

christiaanb commented 8 years ago

I introduced the init_to_bv function because in https://github.com/clash-lang/clash-compiler/issues/113 it turned out that Vivado couldn't infer BRAMs when the element type was anything other than std_logic_vector/signed/unsigned.

The reason why the solution to https://github.com/clash-lang/clash-compiler/issues/113 is the init_to_bv function, is that the initialisation vector can be an arbitrary Haskell expression (as long as it expresses a constant. A constant is either a data constructor or primitive (applied to constants). Now, let's consider someone used:

vec :: Vec 1024 (Unsigned 32, Unsigned 32)
vec = replicate d1020 (0,0) ++ ((1,1) :> (2,2) :> (3,3) :> (4,4) :> Nil)

where replicate and ++ are primitives in CLaSH. Now, to turn this is an VHDL array of 1024 std_logic_vector(63 downto 0) I have two options:

Inside the CLaSH compiler: unfold the definitions of replicate and ++, and convert each element inside the CLaSH compiler to std_logic_vector(63 downto 0);
Just use the primitive VHDL defintions for replicate and ++, and use the init_to_bv function to let the VHDL synthesis tool do the conversion to std_logic_vector(63 downto 0);

I went with option no.2 because the VHDL synthesis tool is probably going to be much faster than the CLaSH compiler currently is.

So, the only "solution" seems to be to not have CLaSH generate structured VHDL types (e.g. VHDL records and VHDL arrays), and flatten everything to std_logic_vectors; this makes me sad :-(

christiaanb commented 8 years ago

I also notice that Verilog synthesis fails using Vivado with:

[Synth 8-4556] size of variable 'ram_init_n_16' is too large to handle; the size of the variable is 4194304, the limit is 1000000 ["/home/baaijcpr/devel/clash-compiler/verilog/BramVivado/BramVivado_topEntity_0.v":50]

The reason being that I literally flatten everything to bitvectors in Verilog, also arrays... I guess Vivado is going to bring me a lot of work :-P

adamwalker commented 8 years ago

Yeah, that is annoying.

Being able to initialize the blockram to arbitrary values is certainly nice, but in practice I have only ever initialized them to either undefined or a constant value for the entire array.

Constant initialization seems to work fine in Vivado e.g.

signal RAM_n_15  : RamType (0 to 65536-1) :=(((others => some_value)));

where some_value can be a function call synthesizes block ram for me

I think that a blockram primitive where a single (possible undefined) value can be specified which defines the initial values of all elements in the blockram would be enough for most use cases, and certainly all of mine.

Also - does the current for loop based init_to_bv function synthesize in a reasonable amount of time on Altera tools?

christiaanb commented 8 years ago

Yes, using Altera Quartus Prime 15.1.0 Build 185 on:

topEntity :: Signal (Unsigned 16) -> Signal (Unsigned 16) -> Signal Bool -> Signal (Unsigned 32, Signed 32) -> Signal (Unsigned 32, Signed 32)
topEntity = blockRamPow2 ((2,-1):>(8,8):>repeat (2,-5))

synthesis took like 10 seconds, and full place&route took like 1 or 2 minutes.

polygonhell commented 8 years ago

FWIW the existing implementation synthesizes quickly in ISE as well, it's entirely a Vivado issue. I recently moved a design to VIvado and was stunned how much slower it was, and finally tracked it down to the RAM initialization.

christiaanb commented 8 years ago

So, I've added a new static flag -clash-hdlsyn Vivado, which tweaks the VHDL output to be "compatible" with Vivado. Using that flag, Vivado can succesfully, and quickly, infer blockRAM from the generated VHDL.

Synthesis using the latest Vivado (no place&route) for:

topEntity :: Signal (Unsigned 16) -> Signal (Unsigned 16) -> Signal Bool -> Signal (Unsigned 32, Signed 32) -> Signal (Unsigned 32, Signed 32)
topEntity = blockRamPow2 ((2,-1):>(8,8):>repeat (2,-5))

now only takes like 15 seconds (as opposed to god knows how many minutes before). This fix will be part of the next release, most likely due this friday.

adamwalker commented 8 years ago

Thanks!

christiaanb commented 8 years ago

OK, a whole week later than promised, but I just released version 0.6.11 of CLaSH, which includes the above bugfix.

clash-lang / clash-compiler

Vivado fails to infer block ram #127