GaloisInc / BESSPIN-CloudGFE

The AWS cloud deployment of the BESSPIN GFE platform.
Apache License 2.0

AWSteria GFE Step 1: BSV P2+DDR4; DMA program load; ISA tests; standard AWS-FPGA XSIM flow #55

rsnikhil closed this issue 4 years ago

rsnikhil commented 4 years ago

90% working; expect to complete and share with U.Cambridge May 1

rsnikhil commented 4 years ago

First successful run (XSim flow). 'Host side' C code reads a mem hex file for an ISA test (rv64ui-p-add); uses AWS DMA to download it into AWS DDR4 (via the BSV fabrics); uses the AWS OCL channel to tell the DUT to allow memory access from the CPU (Flute), which then executes the ISA test to completion successfully. Will now package this up to make it available to U.Camb and others.

rsnikhil commented 4 years ago

@swm11 has confirmed that he can build and run the out-of-the-box demo. Didn't quite run out-of-the-box; was missing a 'sim' directory that I thought would be generated, and a C file had a hard-coded path that had to be fixed. Have added the 'sim' directory to the repo, and am fixing the hard-coded path. (Same comment as issue #52, which should be closed since it overlaps with and is subsumed by this one.)

rsnikhil commented 4 years ago

Added facilities for inter-process interrupts in both directions, host-to-Flute and vice versa. See: commit 9b54dd22

jrtc27 commented 4 years ago

Can we please stop calling them inter-process interrupts? The host is a device, and the FPGA is a device, and both communicate with each other. In the host-to-FPGA direction, those ultimately manifest as external interrupts. In the FPGA-to-host direction, those manifest as software being told that there has been a write. The host and FPGA are not peers, so whilst there are two processors and multiple processes in the system, calling them IPIs is at best a confusing term, and at worst leads to a misunderstanding of what's going on.

The host-to-FPGA direction is a bit special, since those ultimately need to turn into interrupt lines going high at the PLIC. However, the FPGA-to-host direction needs to be far more general than it currently is. I suspect there has been some miscommunication that should have been clarified, so let me try and explain it more concretely in case that helps you see the bigger picture.

A memory-mapped VirtIO device has the following layout in memory:

uint32_t MagicValue; // R
uint32_t Version; // R
uint32_t DeviceID; // R
uint32_t VendorID; // R
uint32_t DeviceFeatures; // R
uint32_t DeviceFeaturesSel; // W
uint32_t Reserved0[2];
uint32_t DriverFeatures; // W
uint32_t DriverFeaturesSel; // W
uint32_t Reserved1[2];
uint32_t QueueSel; // W
uint32_t QueueNumMax; // R
uint32_t QueueNum; // W
uint32_t Reserved2[2];
uint32_t QueueReady; // RW
uint32_t Reserved3[2];
uint32_t QueueNotify; // W
uint32_t Reserved4[3];
uint32_t InterruptStatus; // R
uint32_t InterruptACK; // W
uint32_t Reserved5[2];
uint32_t Status; // RW
uint32_t Reserved6[3];
uint32_t QueueDescLow; // W
uint32_t QueueDescHigh; // W
uint32_t Reserved7[2];
uint32_t QueueAvailLow; // W
uint32_t QueueAvailHigh; // W
uint32_t Reserved8[2];
uint32_t QueueUsedLow; // W
uint32_t QueueUsedHigh; // W
uint32_t Reserved9[21];
uint32_t ConfigGeneration; // R

followed by more device-specific memory-mapped registers. All the values for those registers live on the host. When the FPGA's OS reads from one of those, let's say Status, the host software needs to be informed that there is a 32-bit read request for offset 0x70 (or for the absolute address within the entire address space, with the host de-muxing between VirtIO devices and calculating the offset), determine what the current status is, and send a response to the FPGA such that the Flute core ultimately sees an AXI response (to its original AXI read request) carrying the host's value of the Status register, and can then retire the load instruction from stage 2 with that data written back to the register file.

Similarly, if the FPGA's OS writes to one of those, let's say QueueDescLow (which represents the low 32 bits of the physical address where the FPGA's OS has placed a descriptor table for the queue), the host software needs to be informed that a write to that address with a given value has occurred, so that it can perform whatever actions need to happen as a result, including any host-side state updates.

All of these memory-mapped reads and writes fall under what you are currently thinking of as "Flute-to-host IPIs", in that every single one requires the host to take action; but, importantly, reads need to be delivered synchronously since you need their response (which itself carries data), and writes carry data in their requests, not just a single bit. However, I imagine that you have so far been thinking of QueueNotify as the only thing you need to support, since that is a memory-mapped register whose host-side interpretation happens to mean "go look at the queue I previously told you about because I have added to it"; but any solution that covers all the other memory-mapped registers automatically covers this particular one, and thinking about just this case loses sight of the bigger picture.
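
[Editor's illustration] As a concrete sketch of the de-muxing and dispatch described above: the base address, window stride, and raw register array below are assumptions for illustration only, not anything from this repo.

#include <cstdint>

// Assumed layout for illustration: several VirtIO devices, each with a
// 0x200-byte MMIO window, starting at a platform-chosen base address.
static const uint64_t VIRTIO_MMIO_BASE   = 0x40000000ULL;   // assumption
static const uint64_t VIRTIO_MMIO_STRIDE = 0x200;
static const int      NUM_VIRTIO_DEVICES = 4;

// Stand-in per-device register storage; a real device model would interpret
// each offset (e.g. 0x70 == Status, 0x80 == QueueDescLow) and update device
// state, rather than just store raw words.
static uint32_t mmio_regs[NUM_VIRTIO_DEVICES][VIRTIO_MMIO_STRIDE / 4];

// De-mux an absolute AXI address into (device, register offset) and service it.
// A read must produce data for the AXI read-data response; a write only updates
// host-side state before the AXI write response (BRESP) goes back.
bool service_virtio_access(uint64_t addr, bool is_write, uint32_t wdata, uint32_t *rdata)
{
    if (addr < VIRTIO_MMIO_BASE) return false;
    uint64_t rel = addr - VIRTIO_MMIO_BASE;
    uint64_t dev = rel / VIRTIO_MMIO_STRIDE;
    uint32_t off = (uint32_t)(rel % VIRTIO_MMIO_STRIDE);
    if (dev >= (uint64_t)NUM_VIRTIO_DEVICES) return false;

    if (is_write)
        mmio_regs[dev][off / 4] = wdata;     // e.g. QueueDescLow at offset 0x80
    else
        *rdata = mmio_regs[dev][off / 4];    // e.g. Status at offset 0x70
    return true;
}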

The underlying transport mechanism by which this is achieved doesn't matter. But what does need to happen is that any AXI request that originates from P2_Core for any of the VirtIO devices' memory-mapped registers gets followed by a corresponding AXI response, giving the illusion that there is a device speaking AXI on the other end, regardless of what complicated actions occur in between the request going out and the response coming back to make that happen.

In the Connectal-based case, the transport mechanism is whatever Connectal provides, sending AXI requests and responses in a format very similar to what your Bluespec AXI types use, but via the magic of Connectal to encode and decode it at either end and drive the relevant methods.

There is an AXI slave at the BSV top level that takes requests, turns them into the format for Connectal, and calls a Connectal-generated method to send each one over to the host. Some time later a C++ method on the host is called with the deserialised arguments representing the AXI request. That C++ function inspects the request, decides which device it refers to, and calls whatever VirtIO implementation function it deems relevant to satisfy the request. Eventually those functions terminate and return a result (either data+status for reads, or just a status for writes) to the top-level C++ method that was given the AXI request. That method then calls a Connectal-generated C++ method with that response, which gets encoded, and some time later a corresponding BSV method is called with the deserialised arguments representing the response. That BSV method then drives the relevant AXI response channel on its slave interface going back to the interconnect, and thus an AXI transaction has ultimately been round-tripped through the host-side software.

In case code makes it clearer (slightly simplified to not clutter it with unnecessary details):

The BSV top level:

module mkAWSP2#(AWSP2_Response response)(AWSP2);
   ...
   rule master1_aw if (rg_ready);
      // Take an AXI request coming from the interconnect to this slave
      let req <- pop_o(io_slave_xactor.o_wr_addr);
      let burstLen = 8 * (req.awlen + 1);
      // Tells Connectal to call the "io_awaddr" C++ method below with these arguments
      response.io_awaddr(truncate(req.awaddr), extend(burstLen), extend(req.awid));
   endrule
   rule master1_wdata if (rg_ready);
      // Take an AXI request coming from the interconnect to this slave
      let req <- pop_o(io_slave_xactor.o_wr_data);
      // Tells Connectal to call the "io_wdata" C++ method below with these arguments
      response.io_wdata(req.wdata, 0);
    endrule
   rule master1_ar if (rg_ready);
      // Take an AXI request coming from the interconnect to this slave
      let req <- pop_o(io_slave_xactor.o_rd_addr);
      let burstLen = 8 * (req.arlen + 1);
      // Tells Connectal to call the "io_araddr" C++ method below with these arguments
      response.io_araddr(truncate(req.araddr), extend(burstLen), extend(req.arid));
   endrule

   interface AWSP2_Request request;
      ...
      method Action io_rdata(Bit#(64) rdata, Bit#(16) rid, Bit#(8) rresp, Bool rlast);
         // Drives the AXI slave response back to the interconnect
         io_slave_xactor.i_rd_data.enq(AXI4_Rd_Data { rdata: rdata, rid: truncate(rid), rlast: rlast, rresp: 0 });
      endmethod
      method Action io_bdone(Bit#(16) bid, Bit#(8) bresp);
         // Drives the AXI slave response back to the interconnect
         io_slave_xactor.i_wr_resp.enq(AXI4_Wr_Resp { bid: truncate(bid), bresp: truncate(bresp), buser: 0 });
      endmethod
      ...
   endinterface
   ...
endmodule

The C++ side:

void AWSP2_Response::io_awaddr(uint32_t awaddr, uint16_t awlen, uint16_t awid) {
    fpga->io_write_queue.emplace(awaddr, awlen / 8, awid);
}
void AWSP2_Response::io_wdata(uint64_t wdata, uint8_t wstrb) {
    AXI_Write_State &io_write = fpga->io_write_queue.front();
    uint32_t awaddr = io_write.awaddr;
    PhysMemoryRange *pr = fpga->virtio_devices.get_phys_mem_range(awaddr);
    uint32_t offset = awaddr - pr->addr;
    if (awaddr & 4) wdata = (wdata >> 32) & 0xFFFFFFFF;

    // Inform VirtIO device implementation of the write
    pr->write_func(pr->opaque, offset, wdata, 2);

    // Tells Connectal to call the "io_bdone" BSV method above with these arguments
    fpga->request->io_bdone(io_write.wid, 0);
    fpga->io_write_queue.pop();
}
void AWSP2_Response::io_araddr(uint32_t araddr, uint16_t arlen, uint16_t arid) {
    PhysMemoryRange *pr = fpga->virtio_devices.get_phys_mem_range(araddr);
    uint32_t offset = araddr - pr->addr;

    // Inform VirtIO device implementation of the read and get result
    uint64_t val = pr->read_func(pr->opaque, offset, 2);
    if ((offset % 8) == 4) val = (val << 32);

    // Tells Connectal to call the "io_rdata" BSV method above with these arguments
    fpga->request->io_rdata(val, arid, 0, 1);
}

AWSteria is of course not using Connectal, and so its transport mechanism may look very different, but an abstract form of this model at a high level (i.e. where host-side software is informed of memory reads and writes to the VirtIO memory-mapped addresses, and generates the responses for those requests that end up back on the SoC's AXI interconnect) is what there needs to be.
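
[Editor's illustration] A minimal, transport-agnostic sketch of that abstract model is below; the record types and the recv_axi_req / send_axi_resp / virtio_mmio_* hooks are hypothetical placeholders, not AWSteria's (or Connectal's) actual API.

#include <cstdint>

// Hypothetical records for AXI requests/responses carried over whatever
// host<->FPGA transport AWSteria uses (names here are placeholders).
struct AXI4_Req  { bool is_write; uint64_t addr; uint64_t wdata; uint16_t id; };
struct AXI4_Resp { bool is_write; uint64_t rdata; uint16_t id; uint8_t resp; };

// Placeholder transport hooks: however the bytes move, the host must be able
// to receive a request and later send back the matching response.
bool recv_axi_req(AXI4_Req *req);          // assumed to exist (blocks or polls)
void send_axi_resp(const AXI4_Resp &resp); // assumed to exist

// Placeholder VirtIO device-model entry points (cf. read_func/write_func above).
uint64_t virtio_mmio_read(uint64_t addr);
void     virtio_mmio_write(uint64_t addr, uint64_t data);

// Abstract host-side service loop: every AXI request that reaches the host is
// answered with a corresponding AXI response, so the SoC's interconnect sees
// what looks like an ordinary AXI slave, whatever happened in between.
void host_mmio_service_loop()
{
    AXI4_Req req;
    while (recv_axi_req(&req)) {
        AXI4_Resp resp = { req.is_write, 0, req.id, 0 /* OKAY */ };
        if (req.is_write)
            virtio_mmio_write(req.addr, req.wdata);
        else
            resp.rdata = virtio_mmio_read(req.addr);
        send_axi_resp(resp);
    }
}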

rwatson commented 4 years ago

Just an aside: I agree -- calling these IPIs will only cause confusion, as that terminology is generally reserved for cores working within SMP/NUMA clusters, and used within a single OS instance for activities like TLB shootdown. We should not be thinking of this configuration that way, and especially not in a way that might confuse what we are doing with the existing RISC-V IPI primitive(s). Rather, we should think of this as two independent endpoint hosts that have an I/O channel between them. So from the perspective of the FPGA-embedded soft core, especially, this should be made to appear (and be thought about) as conventional VirtIO I/O DMA and interrupt delivery via the PLIC.

rsnikhil commented 4 years ago

Re. "Can we please stop calling them inter-process interrupts?": no problem. I had no terminology for this, and picked up the term from someone else in the discussions on this project in the last week or so.

rsnikhil commented 4 years ago

Re. "I suspect there has been some miscommunication that should have been clarified":

Indeed I think this is true (but this is easily fixable):

(a) On Tuesday's weekly phone call, I asked what's needed for VirtIO, and in particular whether ANY data structures live in host memory at all. My impression of the answers was: no, they all live in DDR4, and we just need a mechanism for interrupts in both directions to notify each side that some data (in the DDR4) is available. The host would use OCL or DMA to directly read/write data structures in DDR4 memory.

(b) The detailed description by Jessica above is a different model.

(a) and (b) are alternative models of where these data structures live. In (a), the host reaches across to data structures in FPGA DDR4; in (b) Flute reaches across to data structures on the host.

I got the impression in Tuesday's call that we're doing (a), and so we just needed an interrupt mechanism, since host access to DDR4 is already available.

But your message indicates that (b) is desired.

Assuming my reading is correct, I will fix up the HW today to do (b) instead, since my new flexible bidirectional communication channels can now support either.

jrtc27 commented 4 years ago

There are two sets of data structures in play. One set represents the devices' configurations and state, and lives as part of the device implementations on the host. The other set represents the request queues and DMA buffers, which live in FPGA DRAM under the primary control of the guest. I.e., we need both (a) and (b). Compare this to a normal DMA engine that lives somewhere out on a system bus: it has configuration registers and state living inside itself, presented to the core via a memory-mapped interface, and it also accesses DRAM for the actual rings and buffers that get copied to/from.
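
[Editor's illustration] A rough sketch of that split, using hypothetical types not from this repo: the device configuration/state lives in host memory, while the virtqueue rings and buffers are only referenced by guest-physical addresses into FPGA DDR4 that the host must reach over the DMA path.

#include <cstdint>

// (a) The queue rings and DMA buffers live in FPGA DDR4; the host only records
//     the guest-physical addresses the FPGA's OS wrote into the Queue* registers,
//     and must go over the DMA path to read or write the rings themselves.
struct VirtqueueLocation {
    uint64_t desc_gpa;   // from QueueDescLow/High
    uint64_t avail_gpa;  // from QueueAvailLow/High
    uint64_t used_gpa;   // from QueueUsedLow/High
    uint32_t num;        // from QueueNum
    bool     ready;      // from QueueReady
};

// (b) The device's configuration and state live on the host, and the FPGA only
//     ever sees them through memory-mapped reads/writes that the host services.
struct VirtioDeviceState {
    uint32_t status;              // Status
    uint32_t device_features;     // DeviceFeatures
    uint32_t driver_features;     // DriverFeatures
    uint32_t interrupt_status;    // InterruptStatus
    VirtqueueLocation queues[8];  // assumed maximum of 8 queues per device
};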

rsnikhil commented 4 years ago

In response to the above discussion: commit c005ae11, which replaces the earlier Flute-to-host notification mechanism.

The earlier mechanism just delivered a 32-bit write from Flute up to the AWS host, to be interpreted as a vector of interrupt requests.

Its replacement acts as a 'proxy' for AXI4 transactions to be serviced by the host: it forwards the AXI4 request channels from the fabric to the host, and forwards the host's AXI4 responses back into the fabric.

kiniry commented 4 years ago

I presume, given our call this morning, that there is a bit more to do on this issue, @rsnikhil, so I'm moving it into Sprint 3. Please be sure to cross-reference this issue in your MR when you are ready to call this "done"; that way the issue will be closed when the merge takes place.

rsnikhil commented 4 years ago

We've been running AWSteria with DMA program load and ISA tests under the Bluesim (as of yesterday), XSIM (for some weeks), and FPGA (for a week or so) flows. More cleanups will follow as we work towards FreeBSD and VirtIO, but I'm closing this issue now.