[Cider 2] Handle memory input/output

EclecticGriffin commented 2 months ago

Currently Cider 2.0 cannot print out the contents of memory in a way that is compatible with the JSON tooling we use for snapshot testing. The old infrastructure for this is somewhat tangled and tortured in both directions since it involves some base64 encodings and a nightmare fud Python script. Rather than hooking back into this, it could be worthwhile to do something different for the new version, specifically a more "stupid" binary encoding.

@sampsyo said:

I had suggested, in the spirit of trying to do the simplest possible thing here, that it might not be too hard to load/dump "raw bits," i.e., the exact bit-level contents of the memories. This is more or less what the RTL simulators already do (they are using hex-encoded text, but same difference). We could then pre-/post-process these files into our JSON format externally, alleviating the need for any serde hacking on Cider's side.

This of course omits all the non-memory results that Cider 1.0 can already produce. But I believe this is fine: all we really need to check correctness is those memory dumps.

There are a few things to pin down for this, since memories can contain values of arbitrary bit width:

little-endian or big-endian
padding
structure validation

To that end I propose the following:

we use little-endian
values are padded up to the next whole byte, with the padding bits always being zeros (which will be discarded/ignored by Cider)
multidimensional memories are flattened with row-major order
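
To make the three choices above concrete, here is a minimal Python sketch of what they could look like. It is only an illustration, not Cider's implementation: the function names are made up for this example, and it assumes values are handled as unsigned bit patterns (signed values would first be converted to their two's-complement pattern).

```python
def bytes_per_value(width: int) -> int:
    """Bytes one value occupies once padded up to a whole byte."""
    return (width + 7) // 8


def encode_value(value: int, width: int) -> bytes:
    """Little-endian encoding of an unsigned `width`-bit value, zero-padded."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} bits")
    return value.to_bytes(bytes_per_value(width), "little")


def decode_value(raw: bytes, width: int) -> int:
    """Inverse of encode_value; the padding bits are masked off and ignored."""
    return int.from_bytes(raw, "little") & ((1 << width) - 1)


def flat_index(indices, dims) -> int:
    """Row-major position of a multi-dimensional index (last dimension varies fastest)."""
    flat = 0
    for i, d in zip(indices, dims):
        if not 0 <= i < d:
            raise IndexError(f"index {i} out of range for dimension of size {d}")
        flat = flat * d + i
    return flat


def encode_memory(values, width: int) -> bytes:
    """Concatenate the encodings of an already-flattened memory."""
    return b"".join(encode_value(v, width) for v in values)
```

For instance, a 13-bit value takes two bytes, and element (1, 2) of a 3x4 memory lands at flat position 6.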

The non-obvious part is what we do about structure and validation. We could make the dump be as dumb as possible, i.e., just bits with no embedded information about memory names or structures. But that requires us to have a data file available to deserialize the dump, and it means that when loading memory we just have to assume that the data file was created with the memory information laid out correctly and in the exact order in which the memory instances are defined in the main component. To that end, it may be worth having a preamble which contains the names of the memories, their dimensions, and the definition order and have a tool binary, in the fud2 philosophy, which can convert json to memory dumps and vice-versa without requiring us to retain a data file (or to construct a dummy one for the cases in which we run a program without providing data and still wish to observe the output).

EclecticGriffin commented 2 months ago

I am currently leaning toward having a simple preamble with this info since I think it will make things simpler in the long run, but I am open to other ideas. CC: @sampsyo

sampsyo commented 2 months ago

All sounds awesome. I agree with the decisions you summarized briefly:

little endian
pad to bytes
row-major for multi-dimensional memories

And very good reasoning about the metadata that describes what's in the file(s), which of course is necessary for producing any other format from such a binary dump. I actually think this dovetails nicely with two other things going on in the ecosystem.

First:

it may be worth having a preamble which contains the names of the memories, their dimensions, and the definition order

This is pretty much the goal of the "YXI" interface definition format created by @nathanielnrn for AXI purposes! Check it out:

$ calyx -b yxi examples/futil/dot-product.futil
{
  "toplevel": "main",
  "memories": [
    {
      "name": "A0",
      "width": 32,
      "size": 8
    },
    {
      "name": "B0",
      "width": 32,
      "size": 8
    },
    {
      "name": "v0",
      "width": 32,
      "size": 1
    }
  ]
}

That is, YXI is just a JSON document that includes the names and dimensions of the exposed top-level memories (using @external or ref). It's meant to be a comprehensive description of the "external interface" to a Calyx program. So maybe it is exactly what we want here?

So, Cider 2.0 could dump "just the bytes" and rely on a separate YXI file to interpret it. Or, it could produce a single file that consists of the YXI data (presumably serialized to some other format) followed by all the bytes. Either way, maybe it would be cool to standardize on YXI being the way to describe this stuff, extending it if necessary to address this use case (as opposed to inventing a new/different format with similar-but-not-quite-identical contents)?
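
As a rough sketch of the first option ("just the bytes" plus a separate YXI file), something like the following Python could split a raw dump back into per-memory values. This is an assumption-laden illustration rather than an existing tool: `split_dump` is a made-up name, and it assumes the memories are stored back-to-back in the order the YXI document lists them, little-endian and padded to whole bytes as proposed earlier in the thread. It only uses the `name`, `width`, and `size` fields shown in the YXI output above.

```python
import json


def split_dump(yxi_text: str, dump: bytes) -> dict:
    """Split a raw byte dump into {memory name: [values]} using a YXI document."""
    yxi = json.loads(yxi_text)
    result = {}
    offset = 0
    for mem in yxi["memories"]:
        nbytes = (mem["width"] + 7) // 8   # each value padded to whole bytes
        mask = (1 << mem["width"]) - 1     # drop the zero padding bits
        values = []
        for _ in range(mem["size"]):
            chunk = dump[offset:offset + nbytes]
            if len(chunk) != nbytes:
                raise ValueError(f"dump ends inside memory {mem['name']}")
            values.append(int.from_bytes(chunk, "little") & mask)
            offset += nbytes
        result[mem["name"]] = values
    if offset != len(dump):
        raise ValueError("dump has trailing bytes not described by the YXI file")
    return result
```

The length checks double as the "structure validation" step: a dump whose size disagrees with the YXI description is rejected instead of being silently misread.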

and have a tool binary, in the fud2 philosophy, which can convert json to memory dumps and vice-versa

I don't think I've broadcasted this too broadly, but @bcarlet and I have recently started working with @Angelica-Schell to do something like this!! That is, we are starting small, but we are hoping to build a standalone data converter tool (as you say, taking the fud2 approach) that can convert between many different data formats. Including Verilog-simulator-friendly binary files, OG fud-style JSON, anything else we can think of. And Cider's preferred format could be wrapped up into that!

Anyway, this is just to say that we should totally build such a thing, and it should probably be a command-line flag tacked onto what @Angelica-Schell is beginning to construct now.

EclecticGriffin commented 2 months ago

I don't think I've broadcasted this too broadly, but @bcarlet and I have recently started working with @Angelica-Schell to do something like this!! That is, we are starting small, but we are hoping to build a standalone data converter tool (as you say, taking the fud2 approach) that can convert between many different data formats. Including Verilog-simulator-friendly binary files, OG fud-style JSON, anything else we can think of. And Cider's preferred format could be wrapped up into that!

Ah brilliant, that's exactly what I was thinking about. Happy to help out if needed.

This is pretty much the goal of the "YXI" interface definition format created by @nathanielnrn for AXI purposes!

Amazing! How does this look for multidimensional memories with the size definition?

Essentially what I was imagining was basically the yxi interface info (serialized into a binary format) followed by the raw binary data for all the memories one after another. I think that is better than the version where we have the header for a single memory followed by the data for that memory and so on, since we can easily extract the header info from that without needing to parse the entire file.
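
A minimal Python sketch of that single-file layout, to show why a header up front is convenient. The concrete layout here is hypothetical and not something decided in this thread: an 8-byte little-endian length prefix, then the header, then the raw data for all memories. JSON stands in for whatever binary serialization of the interface info ends up being used, and `write_dump` / `read_header` are illustrative names.

```python
import io
import json
import struct


def write_dump(out, header: dict, data: bytes) -> None:
    """Write the length-prefixed header, then the raw data for all memories."""
    header_bytes = json.dumps(header).encode("utf-8")
    out.write(struct.pack("<Q", len(header_bytes)))  # 8-byte little-endian length
    out.write(header_bytes)
    out.write(data)


def read_header(inp) -> dict:
    """Read only the header; the stream is left positioned at the first data byte."""
    prefix = inp.read(8)
    if len(prefix) != 8:
        raise ValueError("file too short to contain a header length")
    (length,) = struct.unpack("<Q", prefix)
    header_bytes = inp.read(length)
    if len(header_bytes) != length:
        raise ValueError("file ends inside the header")
    return json.loads(header_bytes.decode("utf-8"))
```

Because the header length is known after the first 8 bytes, a reader can recover the memory names and sizes without touching the data section, which is the advantage over interleaving one header per memory.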

sampsyo commented 2 months ago

I don't remember off the top of my head where we landed on multi-dimensional memories, but IIRC we either (a) don't handle them at all, or (b) just report the total size (i.e., the product of the dimensions). And glancing quickly at the code, I think it's (b), as in, it uses get_mem_info: https://github.com/calyxir/calyx/blob/23cd0a1be9724663023844756493e3a5044b53c2/calyx-ir/src/utils.rs#L39

I think that is better than the version where we have the header for a single memory followed by the data for that memory and so on, since we can easily extract the header info from that without needing to parse the entire file.

Yeah, totally makes sense to me.

calyxir / calyx

[Cider 2] Handle memory input/output #1968