lmbollen opened 1 year ago
Some first thoughts right below. I primarily focused on the FPGA setup, because that is where I see most of the open decisions at the moment. The proposals mentioned below should be seen as a starting point for future discussions and can be reworked from scratch if required.
`elasticBuffer`s (optional)

One of the major challenges is the configuration management of the runtime data on each FPGA while simultaneously running the CCA. There are two proposals for realizing this:
**Proposal 1:** This approach builds on a simpler concept, but is more resource-heavy on the FPGA. The main idea is to spend an extra RISC-V soft core just on handling all the management tasks for loading and running the CCA. In particular, this core will be responsible for, among other things, managing the `elasticBuffer`s and the outcome of the CCA. The second core just runs the CCA. It only has access to the `elasticBuffer` data and the network topology, and produces the calculated clock modification strategies.
**Proposal 2:** This approach only requires one CPU core, but restricts the execution context of the CCA. As in Proposal 1, all management tasks are still implemented in software, but they are executed on the same RISC-V soft core as the CCA. Running the CCA can then be seen as calling a special-purpose `main` function in C or Rust, where the topology and buffers are passed as arguments and which returns the clock adaption strategy instead of an `int`. Note that we especially choose `main` for this analogy, since `main` is the function that is compiled to an executable, which is exchangeable at runtime. The function is then called repeatedly by the management context, and can (if necessary) also be scheduled at fixed times with constant frequency.
Below is a first proposal for an architecture implementing Proposal 1. The memory layouts are only given for illustration, to have a first working example.
- `EB`s, `CPU CCA`, `CB CCA`, and `CC` are considered to be part of the `CCA Module`.
- `CPU CMU`, `CB CMU`, `iMem CMU`, and `dMem CMU` are considered to be part of the `CMU Module`.
- `dMem CCA` and `iMem CCA` are shared between both modules.
- `CB CCA` maps (see the access sketch after the abbreviation table below):
  - the `EB`s to the addresses of `dMem CCA` from `64` to `64 + (n-1) * s_e` (read only)
  - address `64 + n * s_e` of `dMem CCA` to the clock control interface (write only)
- `CB CMU` maps:
  - address `4` of `dMem CMU` to `rst` of the `CCA Module`
  - address `4` of `dMem CMU` to `ena` of the `CCA Module`
  - `iMem CCA` to `iMem CMU`, starting at address `<a_i>`, which is stored at address `8`
  - `dMem CCA` to `dMem CMU`, starting at address `<a_d>`, which is stored at address `12`
- If `rst` is low and `ena` is high (see the control bits above), then the mapped `iMem CMU` and `dMem CMU` are read-only.

Short | Long |
---|---|
CB | Crossbar |
CCA | Clock Control Algorithm |
CMU | Central Management Unit |
CPU | Central Processing Unit |
EB | Elastic Buffer |
MM | Memory Map |
RAM | Random Access Memory |
ro | read only |
ROM | Read Only Memory |
rw | read/write |
WB | Wishbone Interface |
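For illustration, this is roughly how CCA code could use the `CB CCA` mapping from the example layout above; the entry size `s_e`, the register width, and the command encoding are assumptions:

```rust
use core::ptr::{read_volatile, write_volatile};

const N_LINKS: usize = 7;  // number of incoming links (example value)
const S_E: usize = 4;      // assumed size of one EB entry in bytes (s_e)
const EB_BASE: usize = 64; // EBs are mapped read-only at 64 .. 64 + (n-1) * s_e
const CLOCK_CTRL: usize = EB_BASE + N_LINKS * S_E; // write-only clock control word

/// Read the datacount of elastic buffer `i` through `dMem CCA`.
fn read_eb(i: usize) -> u32 {
    unsafe { read_volatile((EB_BASE + i * S_E) as *const u32) }
}

/// Write a clock modification command to the clock control interface.
fn write_clock_ctrl(cmd: u32) {
    unsafe { write_volatile(CLOCK_CTRL as *mut u32, cmd) }
}
```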
I've got a couple of questions / remarks.
- We need to get debug information from the CCA to the CMU. Would it be an idea to reserve space in the CCA's dmem: an address `x` storing how much debug info is stored? Debug info is then stored at `x + n`, where `n ~ *x` (`*x` ~ deref `x`). That way the CMU could monitor `x` and reset it to `0` when it has transferred everything to the host. The CCA would then be two functions: the program running on the CCA CPU and one interpreting the debug output on the host. (See the rough sketch below.)
- The CMU will use the FPGA's onboard oscillator/PLL. The CCA will be controlled by the clock multiplier boards. Will the CMU be responsible for talking to the clock multiplier boards?
If the CCA is active, then only the CCA should be able to change the state of the multiplier boards. If the CCA is not active, e.g., because it is currently being updated or paused, then I still don't see any requirement for the CMU to change the boards' state, other than resetting them completely to start a new experiment.
> If not, will the CMU's reset be controlled by circuitry doing the talking? (For context: we already have an FPGA design that can do the talking.)
What `talking` are we actually talking about?
> With regards to reset sequencing: I'm assuming the CMU will control the CCA's reset. Correct?
That's correct. This is what the CCA control bits are intended for.
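For illustration, toggling these control bits from the CMU side could look roughly like the sketch below; the bit positions inside address `4` of `dMem CMU` are assumptions:

```rust
use core::ptr::write_volatile;

const CCA_CTRL: usize = 4;    // word in dMem CMU mapped to the CCA control bits
const CTRL_RST: u32 = 1 << 0; // assumed bit position of `rst`
const CTRL_ENA: u32 = 1 << 1; // assumed bit position of `ena`

/// Hold the CCA in reset, let the caller update the mapped iMem/dMem
/// (per the memory map, they are only read-only while the CCA runs),
/// then release reset and enable the CCA again.
fn restart_cca(load_program: impl FnOnce()) {
    unsafe { write_volatile(CCA_CTRL as *mut u32, CTRL_RST) }; // rst high, ena low
    load_program();
    unsafe { write_volatile(CCA_CTRL as *mut u32, CTRL_ENA) }; // rst low, ena high
}
```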
> We need to get debug information from the CCA to the CMU. Would it be an idea to reserve space in the CCA's dmem: an address `x` storing how much debug info is stored? Debug info stored at `x + n` where `n ~ *x` (`*x` ~ deref `x`). That way CMU could monitor `x` and reset it to `0` when it has transferred everything to the host. The CCA would then be two functions: the program running on the CCA CPU and one interpreting the debug output on the host.
I don't think we should make things too complicated here. I would prefer not putting any restriction on the executed CCA code at all. In particular, the CCA code should not be constrained by how the CMU debugging works; otherwise we might limit the CCA's capabilities even though we don't have to.
Technically, every action of the CCA can be observed via the CCA's iMem and dMem operations. The only bit that is currently missing is the instruction pointer of the CCA, but that one can be added as well. Clearly, observing all state of the CCA is hard (unless we run the CCA much slower than the CMU), but observing only several dedicated dMem addresses should be fine and also sufficient for standard debugging tasks. Usually, you only need monitoring capabilities like

`if <iMem instruction @addr X> gets executed, then get the content of <dMem @addr A_1>, .., <dMem @addr A_n>`
Or do you see any other requirements here?
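A minimal sketch of what such a rule could look like in software, assuming the CMU can observe the CCA's instruction pointer and read the shared dMem; all names are hypothetical:

```rust
/// "if <iMem instruction @addr X> gets executed,
///  then get the content of <dMem @addr A_1>, .., <dMem @addr A_n>"
struct Watchpoint<const N: usize> {
    trigger: u32,      // iMem address X
    watched: [u32; N], // dMem addresses A_1 .. A_n
}

impl<const N: usize> Watchpoint<N> {
    /// Called whenever the observed instruction pointer changes; samples
    /// the watched dMem addresses when the trigger address is hit.
    fn on_fetch(&self, pc: u32, read_dmem: impl Fn(u32) -> u32) -> Option<[u32; N]> {
        (pc == self.trigger).then(|| self.watched.map(read_dmem))
    }
}
```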
The mining rig will eventually contain up to 9 FPGAs that all contain one (or more?) bittide domains. Each domain is controlled by a clock control algorithm that will run on a RISC-V soft-core (VexRiscV). For each domain, every incoming link is connected to an `elasticBuffer`. These elastic buffers produce `datacount`s that represent the number of elements in the buffer. The clock control algorithm uses the `datacount`s of all incoming links to control its own frequency.

In the end we'd like to have a hardware experimentation platform which is remotely accessible, presumably through GitHub Actions. This experimentation platform can be used to experiment with different configurations for the clock control algorithm for different topologies.
Currently we expect to require the following features:
Since this is mostly conceptual work, features can be added, dropped, or moved elsewhere. This issue can be closed when we have a conceptual overview of all the steps that have to be performed when running an experiment via GitHub Actions, alongside an architectural overview of the required components.