mwkmwkmwk closed this issue 2 years ago
Better handling of transparency would also be something I'd like to see, needed to support most TDP patterns - c.f. #1390 and nmigen/nmigen#216
Not in scope:

- merging $dff* into the cell

It should handle multiple write ports without giving up and falling back to flops.

It should handle multiple write ports without giving up and falling back to flops.

I mean, uh, that's kind of obvious :p

A bunch of related issues that should be fixed:
I've done a bit of digging on the TDP inference, and managed to get a prototype mapping going through the Python API, but got sidetracked while working on test scenarios. So, not sure if this helps, but referring to the 'ghdl hemisphere':
https://github.com/ghdl/ghdl/issues/1069
The MyHDL test cases are outdated, I'll have to revisit/CI them again.
From an RTLIL (post-inference) PoV:
Currently, memories seem to have two possible representations: either as a combination of an RTLIL::Memory and zero or more of each $memrd, $memwr, and $meminit, or as a single $mem cell. The former could be used to represent most of the requirements outlined here right now, while $mem falls completely flat for asymmetric memories (all read address, read data, write address, and write data ports have to be the same length, respectively, simply because of how the cell's inputs and outputs are structured). So, first of all, what should happen to $mem?
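To make the structural limitation concrete, here is a small illustrative Python model of the packed $mem layout (not Yosys code; `rd_data_slice` is an invented helper): the cell exposes all read data as one flat concatenated RD_DATA vector of RD_PORTS equal WIDTH-bit slices, so every port is forced to the same width.

```python
# Illustrative model (not actual Yosys code) of the packed $mem cell's
# read-data layout: RD_DATA is one flat vector made of RD_PORTS slices,
# each exactly WIDTH bits wide.

def rd_data_slice(port: int, width: int) -> range:
    """Bit positions occupied by read port `port` in the packed RD_DATA."""
    return range(port * width, (port + 1) * width)

# With WIDTH=8 and RD_PORTS=3, RD_DATA is 24 bits, and port 1 always
# occupies bits 8..15 - there is no way to make one port 8 bits wide
# and another 16 bits wide.
WIDTH, RD_PORTS = 8, 3
bits = rd_data_slice(1, WIDTH)
print(WIDTH * RD_PORTS, bits.start, bits.stop - 1)  # 24 8 15
```

The unpacked $memrd/$memwr cells avoid this because each port is a separate cell with its own data signal.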
- … both $mem and the split cells in all consumers)
- … $mem)

I'm looking forward to input, including options I may have overlooked - especially from the devs who ultimately have the final say in this :)
We could also support arbitrarily many independent ports by basically concatenating them in $mem interface ports (see what the $macc cell is doing), though I don't consider it a particularly clean approach.
@clairexen Is it really not an option to replace $mem with fine grained $mem* cells? One problem this would solve is formal verification of indirect memory accesses of the form mem[mem[x]] which currently has no good solution.
@clairexen any update on this?
Is it really not an option to replace $mem with fine grained $mem* cells?
@whitequark Sorry, I don't understand what "fine grained $mem* cells" means in this context. Can you elaborate?
Can you elaborate?
I meant, can we get rid of $mem completely and instead use only $memrd, $memwr and $meminit? The only problem created by this change (that I'm aware of) that couldn't be fixed within Yosys is simulating structural Verilog netlists, but I think we can just put the logic that currently lives in the $mem simulation model in write_verilog itself.
After some great discussion on IRC yesterday, I thought I'd add my memory cell redesign draft here as well: https://gist.github.com/Xiretza/fec08b9810a66efd368705a7f4a51aa7
Executive summary (of revision 2):

- $mem …
- $memrd …
- added a $memrdwr cell for a combined read+write port with shared clock and address; read and write port have the same dimensions
- dropped the \TRANSPARENT attribute from $memrd and instead added a \TRANSPARENCY_GROUP attribute to all cells, allowing transparency between arbitrary sets of read+write ports

Please let me know if you can think of any requirements that can't be modeled easily with these proposed cells.
Removing $mem appears a bit radical to me, I believe you could squeeze all asymmetric issues into it (by using extra parameters) while staying backwards compatible. Apart from that, $mem should probably remain for a while just for the (tech agnostic) simulation model that some might depend on.
Next, I am not sure whether asymmetrical properties should end up in a (virtual) primitive, most BRAM primitives I am aware of don't have such a built-in property and infer with glue logic to implement asymmetric behaviour.
For what I've experimented with, TDP problems turned up mostly during technology mapping. The above-mentioned mapping prototype was revisited; the test cases can be found in this repo:

https://github.com/hackfin/hdlplayground

(quick start: fire up binder, open up tests/memory.ipynb). There are modified Python scripts from the yosys techlibs used (/lib/..) to enable the mapping for TDP capabilities (A1DATA -> [A1READ, A1WRITE]). The custom mapping procedure is in ram_mapper.py. To verify using post-techmap simulation, you need to install the ECP5 DP16KD.v vendor model manually as described in index.ipynb.
There are also a few other 'read-after-write/writethrough' test scenarios, specific to VHDL (but you could emit Verilog as well and run that through the Cosimulation).
I fear though that there is no (and never will be) full consensus on a totally portable RAM description in any HDL that infers 'just right' on all known synthesis tools. So if there will be yosys-specific design rules on how to describe TDP-RAMs to infer correctly, no prob, as long as they tech-map accordingly.
More comments:
- $memrdwr proposal: Can we not just derive the sharedness from collecting $memrd and $memwr?
- \TRANSPARENCY_GROUP: I'd second that.

Removing $mem appears a bit radical to me, I believe you could squeeze all asymmetric issues into it (by using extra parameters) while staying backwards compatible.
The point of removing $mem is specifically not having to shoehorn all the asymmetric (and transparency group) parameters into a cell not suited for them.
Apart from that, $mem should probably remain for a while just for the (tech agnostic) simulation model that some might depend on.
If write_verilog can understand $memrd and $memwr and such cells directly then the simulation model becomes unnecessary.
Some thoughts after doing a lot of reading:
We need to make a few upgrades to the memory cells, but not that many:
The $memrd cell in the synchronous version needs support for output register initialization and synchronous/asynchronous reset. I propose:

- an INIT_VALUE parameter with the initial value (all-x means uninitialized)
- an RST_VALUE parameter with the reset value
- SRST and ARST single-bit ports for synchronous and asynchronous reset and corresponding SRST_POLARITY and ARST_POLARITY parameters (if a reset is unused, it should be tied to 0 and polarity set to 1; in the initial version, only one of the two resets can be used)
- a CE_OVER_SRST boolean parameter to select priority of SRST vs CE

The $memrd and $memwr cells need support for mixed widths. I propose allowing a port to have a width that is a power-of-two multiple of the memory's width (with the address being appropriately shifted), without adding any new parameters for this purpose. Since this is commonly described in Verilog as a loop over the low bits of the narrow memory's address (at least Xilinx infers wide ports from this pattern), I propose adding logic to the memory_share pass that would recognize ports sharing high bits of the address and having constants in the low bits, merging them to a larger port. This should cover the inference side.
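The merge rule can be sketched as follows. This is a hypothetical model of the check, not the actual memory_share implementation (`can_merge` and the address encoding are invented for illustration): addresses are LSB-first bit tuples where constant bits are 0/1 and symbolic bits are strings, and two narrow ports may become one double-width port when their low bits are distinct constants and their high bits are identical.

```python
# Hypothetical sketch (NOT the real memory_share pass) of the wide-port
# merge rule: ports whose addresses share the same high bits and carry
# distinct constant low bits can be fused into one wider port.

def can_merge(addr_a, addr_b, low_bits=1):
    """Addresses are LSB-first tuples; constants are 0/1, symbolic
    address bits are strings naming the driving signal bit."""
    lo_a, hi_a = addr_a[:low_bits], addr_a[low_bits:]
    lo_b, hi_b = addr_b[:low_bits], addr_b[low_bits:]
    # low bits must all be constants, and differ between the ports
    consts = all(b in (0, 1) for b in lo_a + lo_b)
    return consts and lo_a != lo_b and hi_a == hi_b

# mem[{a, 1'b0}] and mem[{a, 1'b1}]: mergeable into one 2x-wide port
print(can_merge((0, "a1", "a2"), (1, "a1", "a2")))  # True
# different high address bits: not mergeable
print(can_merge((0, "a1", "a2"), (1, "b1", "a2")))  # False
```

A real pass would additionally have to check that the ports share clocking and enable conditions before fusing them.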
I propose not adding a $memrdwr cell — having thought of it, it seems to be more trouble than it's worth, and the memory port sharing should rather be decided in the memory_bram replacement that has more information about shareability (for example, on Xilinx, read and write ports sharing an address cannot be merged into a read-write port if they have completely independent enables — only "read exclusive with write" and "write implies read" can actually be realized).
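To make that shareability constraint concrete, here is a toy Python encoding (`shareable` and the predicate representation are invented for illustration, not Yosys API): enables are predicates over the design's inputs, and a read/write pair at a shared address can fold into one read-write port only if "read exclusive with write" or "write implies read" holds for every input combination.

```python
from itertools import product

# Toy encoding of the port-sharing rule: a read and a write port at the
# same address can become one read-write port only if their enables
# satisfy "read exclusive with write" or "write implies read".

def shareable(re, we, n_inputs):
    """re, we: functions mapping an input bit-tuple to bool."""
    combos = list(product([False, True], repeat=n_inputs))
    exclusive = all(not (re(i) and we(i)) for i in combos)
    write_implies_read = all(re(i) for i in combos if we(i))
    return exclusive or write_implies_read

# enables driven by two fully independent signals: not shareable
print(shareable(lambda i: i[0], lambda i: i[1], 2))  # False
# write enable gated by the read enable ("write implies read"): shareable
print(shareable(lambda i: i[0], lambda i: i[0] and i[1], 2))  # True
```

The point is that this decision needs knowledge of the enable logic, which is why it belongs in the BRAM mapping pass rather than in a generic cell.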
I propose removing the current TRANSPARENT parameter and instead being able to mark read ports as transparent with an arbitrary subset of write ports. To this end, I propose adding a unique sequential identifier parameter ID for write ports, and adding a new TRANSPARENT parameter to read ports that is a bitmask of write port IDs that this port should be transparent with. I also propose adding a new pass that would recognize soft transparency logic on read ports and convert it to the appropriate TRANSPARENT bit.
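A toy model of the proposed semantics (illustrative only; `clock_edge` is an invented helper, not RTLIL simulation code): on a shared clock edge, a synchronous read port returns the newly written data for write ports it is marked transparent with, and the pre-edge data otherwise.

```python
# Toy simulation of the proposed per-read-port TRANSPARENT bitmask.

def clock_edge(mem, raddr, writes, transparent_mask):
    """writes: list of (wid, waddr, data); wid indexes the bitmask.
    Simplified: assumes at most one write hits raddr per edge."""
    old = mem[raddr]
    for wid, waddr, data in writes:
        mem[waddr] = data
    # transparent with a write port that hits raddr -> see the new data
    hit = any(waddr == raddr and (transparent_mask >> wid) & 1
              for wid, waddr, _ in writes)
    return mem[raddr] if hit else old

mem = [0] * 8
# write port 0 hits address 3; read port transparent with port 0
print(clock_edge(mem, 3, [(0, 3, 42)], transparent_mask=0b01))  # 42
mem = [0] * 8
# same write, but the read port is transparent only with port 1
print(clock_edge(mem, 3, [(0, 3, 42)], transparent_mask=0b10))  # 0
```

This is exactly the flexibility the single boolean TRANSPARENT parameter cannot express once a memory has several write ports.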
Analogously to transparency, I propose replacing the current PRIORITY parameter on write ports with a bitmask of which other write ports this write port has priority over, thus changing a total ordering into a partial ordering, better able to represent write ports from disjoint processes.
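The partial ordering can be sketched like this (hypothetical helper names, not Yosys code): each write port carries a bitmask of the port IDs it takes priority over, and a simultaneous write conflict is resolved only when some port dominates all the others.

```python
# Toy model of the proposed per-write-port PRIORITY bitmask.

def resolve(writes, priority_masks):
    """writes: dict wid -> data, all targeting the same address.
    Returns the winning data, or None if no port dominates all others
    (i.e. the ports are unordered, as with disjoint processes)."""
    for wid in writes:
        others = [o for o in writes if o != wid]
        if all((priority_masks[wid] >> o) & 1 for o in others):
            return writes[wid]
    return None

masks = {0: 0b000, 1: 0b001, 2: 0b011}  # 2 beats 1 and 0; 1 beats 0
print(resolve({0: 10, 1: 20}, masks))   # 20
print(resolve({0: 10, 2: 30}, masks))   # 30
masks2 = {0: 0, 1: 0}                   # no relative order defined
print(resolve({0: 10, 1: 20}, masks2))  # None
```

With the old total-ordering PRIORITY parameter, the last case could not be represented: some artificial order would have to be picked between ports that the source never ordered.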
I propose removing the $mem cell and converting all passes to use the unpacked cells directly. While it'd be possible to update the $mem cell with the above improvements as well, the unpacked cells are just as easy to deal with in passes, and maintaining two sets of cells is more trouble than it's worth.
I propose the following changes to the flow, other than the obvious updates to support the above functionality:

- remove opt_mem, upgrade opt_clean to do its job instead
- add a new pass (memory_transparent? or maybe a part of memory_dff?) to recognize transparency implemented in soft logic and set the TRANSPARENT bit
- upgrade memory_share for recognizing wide ports — eg. when there are read ports differing only in low address bits (that are constant), merge them into one wide read port
- memory_bram: do a complete rewrite, to be designed later. The basic idea stays the same (slurp a text file describing various blockrams, pick the best type that matches, emit temp cells that will be converted to actual vendor cells via techmap), but a lot of new capabilities need to be added, and the text format will probably be all-new.

I propose removing the $mem cell and converting all passes to use the unpacked cells directly. While it'd be possible to update the $mem cell with the above improvements as well, the unpacked cells are just as easy to deal with in passes, and maintaining two sets of cells is more trouble than it's worth.
It's still not entirely clear to me how $memrd and $memwr would infer to a simulatable model (post-synthesis), unless running an architecture-specific (post-map) simulation with the mapped vendor primitives. As of now, I don't see how two distinct modules would share the actual memory buffer variable except by changing the interface. Could you elaborate on that?
Question is, if $mem could be upgraded so far to be the 'cover-all-scenarios' abstract primitive to work for all BRAM architectures and map into a set of architecture-agnostic blocks like TRELLIS_DPR16X4 LUT RAMs. This would also make creation of libraries for new architectures easier (it's better to touch *.v than *.cc, IMHO).

There might be a few users still depending on $mem as of now, including me.
I propose removing the $mem cell and converting all passes to use the unpacked cells directly. While it'd be possible to update the $mem cell with the above improvements as well, the unpacked cells are just as easy to deal with in passes, and maintaining two sets of cells is more trouble than it's worth.

It's still not entirely clear to me how $memrd and $memwr would infer to a simulatable model (post-synthesis), unless running an architecture-specific (post-map) simulation with the mapped vendor primitives. As of now, I don't see how two distinct modules would share the actual memory buffer variable except by changing the interface. Could you elaborate on that?
This has been said already — by not having a simulatable model, and handling $mem* in write_verilog instead to convert them back to Verilog code, so that they never actually appear in the output netlist.
Question is, if $mem could be upgraded so far to be the 'cover-all-scenarios' abstract primitive to work for all BRAM architectures and map into a set of architecture-agnostic blocks like TRELLIS_DPR16X4 LUT RAMs. This would also make creation of libraries for new architectures easier (it's better to touch *.v than *.cc, IMHO).
It could, but why would we ever want to do it specifically using $mem if we can have just one memory representation ($mem* separate cells) instead of two?
There might be a few users still depending on $mem as of now, including me.
If you're depending on the raw cell in some way, you'd better describe how instead of just saying it.
This has been said already — by not having a simulatable model, and handling $mem* in write_verilog instead to convert them back to Verilog code, so that they never actually appear in the output netlist.

Yes, and how would other languages be handled? This would implicitly output a different structure than the yosys kernel is aware of, unless run through a pass, no? So you'd offload the problem of conversion to the language output plugins.
If you're depending on the raw cell in some way, you'd better describe how instead of just saying it.
I think I've given quite a few examples on that subject (above links), incl. hints on how to handle TDP for the ECP5 case with minor changes to the *.txt.
Not sure if I'm still up to date, but the GHDL plugin also uses '$mem'.
So this would introduce a compatibility break at some point, how are you going to handle this?
Interfacing towards synthesis is not so much of an issue, if you want to offload $mem[rd|wr] intermediate mapping to language output. But this doesn't look "nice", really.
This has been said already — by not having a simulatable model, and handling $mem* in write_verilog instead to convert them back to Verilog code, so that they never actually appear in the output netlist.

Yes, and how would other languages be handled? This would implicitly output a different structure than the yosys kernel is aware of, unless run through a pass, no? So you'd offload the problem of conversion to the language output plugins.
Correct, and so what? The language output handles this already anyway. The current Verilog backend does not emit the $mem cell raw into the output anyway, it's converted to an actual Verilog memory.
If you're depending on the raw cell in some way, you'd better describe how instead of just saying it.
I think I've given quite a few examples on that subject (above links), incl. hints on how to handle TDP for the ECP5 case with minor changes to the *.txt.
What, your one-off TDP scripts? Given that the point of this issue is to handle this properly within yosys core, I don't think keeping them working is worth it.
Not sure if I'm still up to date, but the GHDL plugin also uses '$mem'.
... this is pretty much a GHDL bug that needs to be fixed, given that the current memory optimization passes operate only on the unpacked cells anyway (memory_share specifically) and thus GHDL output doesn't trigger those.
So this would introduce a compatibility break at some point, how are you going to handle this?
I'm not — RTLIL, including the cell primitives, is not and has never been considered a stable interface. Perhaps it should be, but at the moment we really don't have the means to do that (and this is, btw, independent of the $mem removal discussion — adding new parameters to the $memrd cell is already causing compatibility problems, let alone the TRANSPARENT changes). I'll probably make a GHDL plugin patch to fix the fallout, but that's about it.
Interfacing towards synthesis is not so much of an issue, if you want to offload $mem[rd|wr] intermediate mapping to language output. But this doesn't look "nice", really.
I'm not offloading anything — the $mem cell already has special handling in the Verilog frontend. What I want to do is to get rid of two needlessly redundant representations of the same thing. Perhaps I could be convinced to go the other way around and kill $memwr/$memrd/$meminit instead, but one of the two definitely needs to go.
What, your one-off TDP scripts? Given that the point of this issue is to handle this properly within yosys core, I don't think keeping them working is worth it.
See also https://github.com/ghdl/ghdl/issues/1069. And I tend to keep them working (as a reference) until I see these test cases inferring like in other $-Tools.
I'm not — RTLIL, including the cell primitives, is not and has never been considered a stable interface.
The yosys LaTeX docs kinda insinuate so. I have no problem with that, if there's an entry point documenting changes. But that's another topic indeed. Compatibility breaks are kinda painful when introduced suddenly without a reference test that demonstrates the 'new' way.
What I want to do is to get rid of two needlessly redundant representations of the same thing. Perhaps I could be convinced to go the other way around and kill $memwr/$memrd/$meminit instead, but one of the two definitely needs to go.
I fully agree on reducing maintenance work. I just found it easier to map and verify a single pseudo primitive instead of different ones that need to get collected and sorted out accordingly. So you're saying that collecting this information from opt passes, techmapping and language backends (I'm aware of $mem handling) takes less maintenance than conversion into an abstracted intermediate (complex) $memx splitting into a *.v library? Happy to check out a prototype branch for demonstration.
Here goes a proposal for the new text format description, along with some examples: https://gist.github.com/mwkmwkmwk/2e51a10ba55bda239ac1b667ca062693
Open issues:

- … a depthcascade capability that would pass over this job to techmap, add a parameter saying how many times to duplicate depthwise; maybe also widthcascade for RAMs with usable control-only cascades like Versal UltraRAM. The current proposal contains wrcs, which is usable for the simplistic CS logic on ECP5 and derived blockrams, but describing something like Xilinx cascade ports seems impossible in a generic manner.

So, since people have been asking about this, here's the current status:
First, the memory model changes and recognition passes for newly-supported features have all landed. This includes:

- … (memory_dff) that should be able to recognize reasonable ways of writing transparent ports in Verilog

HOWEVER, our current BRAM mapping pass (memory_bram) is unable to make use of most of the above improvements — it will either emulate the newly-added functionality instead of realizing it in hardware, or completely reject memories using the newly added features (however, it will not reject anything it would have accepted before). This means very little of the above is actually usable now.
While some new functionality (init/reset values, specifically) could be reasonably retrofitted into memory_bram, in general its architecture is just not powerful enough for some features we really want (single-port memory, true dual-port memory, wide ports), so the plan is to instead spend the resources on writing a more flexible replacement (using the design detailed above), which will complete this saga and close a whole lot of long-standing issues in one big swoop. The timeframe for this is 1-2 months from now.
This is excellent work @mwkmwkmwk and I want to say a huge thank you for taking on this epic task!
A very, very untested solution is now available as #3189.
Requirements: