rsnikhil opened 4 years ago
I've added the new L1 cache with write-back policy to the GitHub Flute repo, while retaining the older write-through L1 cache as an option. @gameboo will study this to plan and do enhancements for CHERI tags/capabilities. Meanwhile I move on to integrate the RISCY-OOO L2 cache behind this write-back L1, making mods to the L1 to support coherence with L2. I will likely start with the version of RISCY-OOO L2 that U.Camb has already modified (from Toooba work) to support tags/capabilities.
Progress report
Thank you for the update, @rsnikhil.
Progress report:
Identified and fixed various bugs in new Write-back L1
In a new WB_L1_L2 directory, developing the L1+L2 cache with coherence and coherent DMA
Changed Write-back L1 Cache to have line-wide interface to L2, instead of 64b
Approaching point where there is a place-holder for the L2 from RISCY-OOO/Toooba
Will initially run all ISA tests with a 'null' L2
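A 'null' L2 of this kind can be thought of as a pass-through: no storage and no coherence logic, just forwarding line-wide L1 requests straight to backing memory. Below is a hypothetical Python sketch of that idea (the names `NullL2`, `read_line`, `write_line`, and the 64-byte line width are illustrative assumptions, not the actual Flute/Toooba interface):

```python
# Hypothetical sketch of a 'null' L2 placeholder: no storage, no coherence,
# just forwards line-wide L1 requests straight to backing memory -- enough
# to run ISA tests before the real RISCY-OOO/Toooba L2 is dropped in.

LINE_BYTES = 64   # assumed line width of the L1<->L2 interface

class NullL2:
    def __init__(self, mem):
        self.mem = mem    # backing memory: line address -> line bytes

    def read_line(self, addr):
        # pass straight through; unwritten lines read as zeros
        return self.mem.get(addr, bytes(LINE_BYTES))

    def write_line(self, addr, line):
        assert len(line) == LINE_BYTES
        self.mem[addr] = line

mem = {}
l2 = NullL2(mem)
l2.write_line(0x1000, bytes(range(64)))
print(l2.read_line(0x1000) == bytes(range(64)))   # True
```

The point of such a placeholder is that the L1-to-L2 protocol can be exercised end-to-end before any of the real L2's coherence machinery exists.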
Thanks for the update.
Progress report:
Progress report:
Next:
At that point, can hand-off to Alexandre to do CHERI fixups, while I proceed to AWSteria to connect DMA and complete the Virtio work.
Progress Report:
Merged-in some changes from @gameboo (Alexandre Jouannou) to the cache setup to abstract over 'cache words' so that it's easy to redefine it to include CHERI tags; cleaned up; handed off to Alexandre for 'CHERI-fication' of the new integrated L1-coherent-L2 system. I am moving on to integrate these new changes into AWSteria, including the coherent DMA.
Summary of progress since end of July
Currently debugging FreeBSD boot (without Virtio) in simulation, using the same image that booted in late June before restructuring AWSteria for the coherent L2 cache. That simulation took > 10 hours.
Currently encountering an assertion failure in I-Cache at 3h40m; investigating.
Results of several attempts to boot FreeBSD (without Virtio, for now) in AWSteria Bluesim simulation: reaching about 190 million instructions, then failures. Failures vary across runs. 2 failures are (different) assertion failures in the cache. 1 failure is a FreeBSD 'kernel panic' ("Fatal page fault at 0xffffffc0004728f0: 0x00100000000008") dropping it into KDB. Final console message before failure is 'start_init: trying /sbin/init'. Debugging continues.
I'm not sure if the FreeBSD kernel you are working with has internal assertions enabled -- e.g., INVARIANTS, WITNESS -- but we've found that while those hugely slow down kernel boot, they are excellent for catching memory subsystem bugs, as they do self checks on numerous data structures, atomic operations, etc. The kernel will print out a message on boot warning about the performance hit, if they are configured. If you don't get those messages (one for each debugging feature), we should be able to provide kernels with them enabled.
I have a kernel from June (or earlier) that does have WITNESS enabled. My most recent runs were with a kernel Jessica sent me on Sep 3 where WITNESS is not enabled (I had asked for this, to improve simulation speed, but I'm not sure it's improving sim speed that much). One suspicion is that my L1 cache is not interacting properly with MIT's L2 cache in the corner case where an L1-to-L2 request and an L2-to-L1 request may refer to the same cache line.
For completeness I've now added debug (i.e. INVARIANTS + WITNESS) versions alongside the existing kernels (just drop the -NODEBUG from the name).
Also, in case it helps you in your debugging quest, "start_init: trying /sbin/init"
is the point at which FreeBSD starts the first userspace process, which itself will fork and exec additional ones, and in that error message 0xffffffc0004728f0 is satp and 0x00100000000008 is stval (i.e. the virtual address for which a dereference was attempted). That address itself looks very wrong; we're deep in the kernel (and not in the few copyin/copyout-like functions) and so should always be trying to access kernel-space virtual addresses, which for FreeBSD are always the negative/top half of the address space (though that address is not even a valid Sv39 address).
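The observation that the faulting address is not even a valid Sv39 address can be checked mechanically: Sv39 virtual addresses are 39 bits, and the RISC-V privileged spec requires bits 63:39 to equal bit 38 (sign extension), so every valid address sits in the canonical low or high half. A small Python check (the function name is my own, not from any kernel source):

```python
# Check whether a 64-bit value is a canonical (valid) Sv39 virtual address:
# bits 63:39 must all equal bit 38 (sign extension of the 39-bit address).

def is_canonical_sv39(va: int) -> bool:
    bit38 = (va >> 38) & 1
    upper = va >> 39                          # bits 63:39 (25 bits)
    expected = (1 << 25) - 1 if bit38 else 0  # all-ones or all-zeros
    return upper == expected

stval = 0x00100000000008          # faulting address from the kernel panic
print(is_canonical_sv39(stval))              # False: bit 52 set, bit 38 clear
print(is_canonical_sv39(0xFFFFFFC0004728F0)) # True: high-half kernel address
```

So the faulting address could not have come from any legitimate Sv39 translation, which points at corruption somewhere (e.g. a bad load from the memory subsystem) rather than an ordinary wild pointer.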
@rsnikhil how can I help you with the debugging here? Would it be helpful if I run the simulation as well?
@podhrmic Let's hold off for the moment; your help would be most useful (I think) when I'm debugging Virtio on the setup, and I've not reached that point yet.
Since Monday Sep 14 I've been hammering at Flute's new memory system (WB_L1_L2, write-back with an L2 cache coherent with the L1 caches) using my Carnyx memory stress-test tool. I have encountered and fixed 4 bugs so far. They were all concurrency bugs in L1, due to a request from L2 that arrives "asynchronously" at L1 at certain delicate moments in L1's normal activity. The request from L2 must be serviced at higher priority to avoid possible deadlock, requiring some of L1's state to be saved and restored properly after the service.
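The hazard class described above can be illustrated with a toy model (this is a hypothetical Python sketch, not the actual BSV code; `L1Sketch` and its fields are invented for illustration): an L1 that is mid-miss must still service an incoming L2 downgrade first, saving and restoring its in-flight state around the service.

```python
# Toy model of the hazard: an L2-to-L1 downgrade request arriving while
# L1 has a miss in flight must win, with the miss state saved and restored.

from collections import deque

class L1Sketch:
    def __init__(self):
        self.lines = {}          # line addr -> MESI-like state ('M','S','I')
        self.inflight = None     # an interrupted miss, if any
        self.l2_reqs = deque()   # downgrade requests arriving from L2

    def start_miss(self, addr):
        self.inflight = ('MISS', addr)

    def step(self):
        # L2-to-L1 requests are serviced at higher priority to avoid
        # deadlock, even when they touch the line of the in-flight miss.
        if self.l2_reqs:
            addr, target = self.l2_reqs.popleft()
            saved = self.inflight         # save the interrupted work
            self.lines[addr] = target     # perform the downgrade
            self.inflight = saved         # restore; the miss resumes later
            return ('SERVICED_L2', addr)
        if self.inflight:
            _, addr = self.inflight
            self.lines[addr] = 'S'        # refill completes
            self.inflight = None
            return ('MISS_DONE', addr)
        return None

l1 = L1Sketch()
l1.lines[0x80] = 'M'
l1.start_miss(0x100)
l1.l2_reqs.append((0x80, 'I'))    # downgrade arrives mid-miss
print(l1.step())   # ('SERVICED_L2', 0x80) -- the downgrade wins
print(l1.step())   # ('MISS_DONE', 0x100) -- the miss resumes afterwards
```

The bugs in this class tend to be exactly the save/restore step: forgetting some piece of in-flight state (or restoring it at the wrong moment) works fine until the two requests collide on the same line.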
Carnyx fits into the GFE just like a normal CPU, except it is not a CPU (it does not execute any RISC-V instructions). It just generates (controlled) random memory requests into Flute's memory system, records requests and responses, and checks them against a memory model. Failures are deterministically reproducible, and it quickly uncovered the above-mentioned 4 bugs, within a few thousand to tens of thousands of requests, taking a few seconds to a few minutes.
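The Carnyx approach can be sketched in a few lines of Python (this is an illustrative model of the idea, not the actual tool; `stress`, `dut_read`, and `dut_write` are invented names): fire seeded-random reads and writes at the system under test, mirror the writes in a trivial golden memory model, and flag the first response that disagrees.

```python
# Sketch of a Carnyx-style stress test: controlled-random memory requests,
# checked against a golden memory model. A fixed seed makes every failure
# deterministically reproducible.

import random

def stress(dut_read, dut_write, n_reqs, seed=1):
    """Run n_reqs random reads/writes; return the first mismatch or None."""
    rng = random.Random(seed)        # deterministic -> reproducible failures
    model = {}                       # golden model: addr -> last data written
    for i in range(n_reqs):
        addr = rng.randrange(0, 64) * 8       # small footprint forces conflicts
        if rng.random() < 0.5:
            data = rng.randrange(0, 2**64)
            dut_write(addr, data)
            model[addr] = data
        else:
            got = dut_read(addr)
            want = model.get(addr, 0)          # unwritten memory reads as 0
            if got != want:
                return (i, addr, want, got)    # deterministic repro point
    return None

# Self-check against a trivially correct "DUT":
mem = {}
result = stress(lambda a: mem.get(a, 0), mem.__setitem__, 10_000)
print(result)   # None: no mismatch for a correct memory
```

Keeping the address footprint tiny is deliberate: it maximizes the chance that concurrent requests collide on the same cache line, which is exactly the corner case that uncovered the four L1 bugs.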
(By comparison, FreeBSD booting reached about 190 million instructions, simulating for 10+ hours, before it failed, and it's not clear what happened. Multiple such runs produced different failures after 10-hour runs. Those specific assertion failures have since been seen in the Carnyx experiments and have been fixed.)
I'm continuing with Carnyx until I can get it to execute a billion requests without error (it gets to about 123K requests so far), and will be retrying the FreeBSD boot continuously as the memory system quality improves.
Retrying FreeBSD boot, we no longer see the assertion failures seen earlier (possibly fixed by the Carnyx-based debugging described in previous comment). It now gets stuck, possibly a deadlock.
Testing the memory system with Carnyx, we also encountered a 'stuck' situation after 123K transactions, which we were able to shrink to about 15K transactions.
The issue seems to be inside the L2 (a.k.a. LLC, last-level cache):
Neither request misses in L2 (both requested lines have been loaded from mem earlier, and were not written back)
Investigating (and have also asked Sizhuo Zhang, author of the L2 code, for his opinion).
Still no joy on retrying boot of FreeBSD (without virtio), in Bluesim simulation, on AWSteria ('vanilla' version, i.e., non-CHERI).
Simulates successfully for 39 hours, 393 million instructions.
We see 80 lines of expected console output, with only 4 more lines expected before the kernel prompt. Tail of console output:
---- 67 lines of expected console output before this ----
start_init: trying /sbin/init
Setting up sysctls
sysctl: unknown oid 'kern.polling.user_frac' at line 2
sysctl: unknown oid 'machdep.unaligned_log_pps_limit' at line 5
kern.coredump: 1 -> 0
kern.random.harvest.mask: 991 -> 735
mount / rw
entropy read from /boot/entropy
entropy read from /var/db/entropy/entropy.0
entropy read from /var/db/entropy/entropy.1
create 500m TMPFS at /tmp
set up loopback
lo0: link state changed to UP
---- REACHED HERE ----
generate host keys
start sshd
random: unblocking device.
exec /bin/sh
and then we get stuck: the CPU stops executing instructions, suggesting that a fetch, load, or store is stuck.
Pondering next move ... (including whether I should abandon 2-day simulations and switch to FPGA execution).
Some expected sub-tasks:
(a) Adapt Flute L1 write-back cache and Toooba L2 cache into an L1/L2 cache system for Flute and AWS DMA access, in vanilla Flute.
(b) Redo SoC structure: AWS DMA PCIS connects to DMA port of (a); likely reduce double AXI4 fabric to one; I/O overlay network, etc.
(c) Track (a) and (b) with a CHERI-Flute version (tags, Cambridge AXI)