Controlling hardware simulation: Python test benches?

abe-k commented 7 years ago

From the chat:

tim sherwood: George is working towards an integration with the PYNQ board that will hopefully allow us to run real PyRTL seamlessly on real hardware. I think the plan would be to have a "hardware simulate" as an option. However, running each and every step individually would be really slow. As such we probably need a way to add some sort of "simulate_until" which will simulate until some condition (a specified PyRTL wire) goes high. We probably need some additional options to say it is okay to throw away intermediate results as well. My guess is that this would help with the other "fast simulation" approaches as well.

As the simulation becomes faster, communication between it and PyRTL has the potential to become a bottleneck, so giving it enough information to run independently seems like a very good idea. However, if the simulation can run for an indefinite number of steps, specifying the input at each step becomes more complicated than simply giving a list of inputs.

For other HDLs, the standard approach is to write a test bench—a separate piece of logic that can control the inputs and monitor the outputs. If we only intend to run hardware simulations on PYNQ, though, we can use its processor, allowing test benches to be arbitrary Python code. Under this approach, the current setup of fixed inputs would simply be the default test bench, but users could create more sophisticated test benches, such as ones that discard uninteresting outputs or stop when a specific output changes.

These Python test benches would also be used with current simulation techniques, so the interface would stay consistent: if no test bench is specified, the default one would be used, meaning no change to existing code. I'm envisioning a generator-based API for the test bench:

def my_test_bench():
    while still_running:
        inputs = ...
        outputs = yield inputs
        ...
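
To make the generator protocol concrete, here is a minimal sketch of how a driver might consume such a test bench. Everything in it is an assumption for illustration: `run_with_test_bench`, the two example benches, and the toy `make_counter` design are hypothetical names, and `step` is just any callable that maps an input dict to an output dict, not PyRTL's actual interface.

```python
def run_with_test_bench(step, test_bench):
    """Drive `step` with inputs produced by a generator-based test bench.

    `step` is a hypothetical stand-in for one simulation step: it takes a
    dict of inputs and returns a dict of outputs.
    Returns the number of steps executed.
    """
    steps = 0
    try:
        inputs = next(test_bench)              # run the bench to its first yield
        while True:
            outputs = step(inputs)             # advance the design one cycle
            steps += 1
            inputs = test_bench.send(outputs)  # hand outputs back, get next inputs
    except StopIteration:
        pass                                   # the bench returned: we are done
    return steps


def fixed_inputs_bench(input_list, trace):
    """Default-style bench: replay a fixed list of inputs, keep every output."""
    for inputs in input_list:
        outputs = yield inputs
        trace.append(outputs)


def count_until_bench(limit, trace):
    """Stop once the 'count' output reaches `limit`; keep only the final output."""
    while True:
        outputs = yield {"enable": 1}
        if outputs["count"] >= limit:
            trace.append(outputs)
            return


def make_counter():
    """Toy stand-in for a simulated design: a counter with an enable input."""
    state = {"count": 0}

    def step(inputs):
        state["count"] += inputs["enable"]
        return {"count": state["count"]}

    return step
```

With this shape, the existing fixed-input behavior is just one bench among others, and a bench that discards uninteresting outputs or stops on a condition needs no changes to the driver.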

Does this seem like a useful addition? Does PYNQ work in a way that allows this efficiently? Any other thoughts?

timsherwood commented 7 years ago

I agree with you on the problem completely, but I am not sure I quite follow the part about the test benches. Are you thinking that we pass an "input generator" as a function along with simulation? I think this makes sense from a fast simulation standpoint, but I am not sure how it maps to the FPGA space (unless the generator was something we could then synthesize).

For the hardware side I think the idea is that processors often operate in "bursts" of activity. For example, we might send over a set of commands which invoke some function or set some memory bits and then "run" the processor, which might take many thousands or even millions of cycles (with no new inputs being needed -- obviously the input wires will be set to some value but they don't necessarily need to change). For example, if we were to have a "terminal" on this hardware machine we might need to send something on each keystroke, but then nothing is sent in between. So what I was thinking was that this could be captured by some sort of "run until" operation with an explicit watchdog counter.
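
As a rough sketch of what such a "run until" with a watchdog could look like in software: the function below holds the inputs constant, polls one wire, and gives up after `max_cycles`. The names `run_until`, `watch`, and `ToyDesign` are hypothetical; `step` and `inspect` are stand-ins named after the corresponding simulation methods, not calls into PyRTL itself.

```python
def run_until(step, inspect, inputs, done_wire, max_cycles, watch=None):
    """Step with constant `inputs` until `done_wire` reads nonzero.

    `step(inputs)` advances one cycle and `inspect(name)` reads a wire's
    current value; both are hypothetical stand-ins for a simulator.
    `max_cycles` is the watchdog limit. Intermediate values are thrown
    away unless `watch` names wires to record every cycle.
    Returns (cycles_run, trace).
    """
    trace = []
    for cycle in range(1, max_cycles + 1):
        step(inputs)
        if watch:
            trace.append({name: inspect(name) for name in watch})
        if inspect(done_wire):
            return cycle, trace
    raise TimeoutError(f"{done_wire!r} still low after {max_cycles} cycles")


class ToyDesign:
    """Toy stand-in for a design: 'done' goes high once 'count' hits a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0

    def step(self, inputs):
        self.count += inputs.get("enable", 0)

    def inspect(self, name):
        values = {"count": self.count, "done": int(self.count >= self.threshold)}
        return values[name]
```

On real hardware the loop body would presumably live in logic next to the design, so that only the final cycle count (and any watched values) cross back to the host.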

However, perhaps there is a middle way -- if functions or classes are how we generate inputs we could have a particular type of function/class that was then hardware accelerated? Open to more discussion.

abe-k commented 7 years ago

That's basically my idea with the test bench, but in addition to being an input generator it could also be an output pruner of sorts, discarding or compressing uninteresting outputs to reduce memory requirements. You're right that this wouldn't work on a typical FPGA, but the PYNQ has a couple ARM Cortex-A9 cores (capable of running Python) that are directly connected to the FPGA fabric, so a Python test bench running on the ARM wouldn't necessarily have the same performance problems as one on the host computer.

For the simple case you describe of setup followed by constant inputs until some end condition, a "run until" would suffice. For anything more complicated, though, it would be easiest to express the testing conditions in code. Hardware is a pain to write compared to standard Python, so if we can get reasonable performance without requiring a synthesizable test bench, that would be ideal.

timsherwood commented 7 years ago

I agree that having synthesizable testbenches is less than ideal. However, I thought that the Cortex-A9 cores were on-board but still off chip -- having the processor provide inputs every cycle might be pretty slow. I think each transaction will require both transfer across the off-chip bus and interrupt handling. From a bandwidth standpoint my guess is that it is pretty decent, but from a latency perspective (invoke on each cycle) I am less sure. It might make sense to estimate the performance of the schemes (or run some tests)?

jolting commented 7 years ago

You probably want to store a bunch of pre-generated inputs in host memory, DMA the inputs to the programmable logic, and then DMA the outputs back. Interrupt the processor when completed. There are some IP blocks that should help you with that.

timsherwood commented 7 years ago

From a hardware performance standpoint I agree that DMA is the way to go. I think the question is how to encapsulate that interaction. Abe is absolutely right that a generator is the right way to capture that interaction without requiring the "test bench" to explicitly manage the reentrant nature. However, if you want performance out of the hardware, some structure is required of that generator (for example, that there are "blocks" within which inputs and outputs have no dependency). That is where I was going with the "types" of generators... only some of which get you hardware performance (but all should still be functional as long as the hardware supports single step). Another option would be that there is a .softsim and .hardsim method that is either hand written or automatically generated from some other specification... however, it becomes YASL (yet another specification language) pretty quickly down that path.
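
One way to picture the "blocks" idea is a generator that yields a whole list of inputs at a time and gets the matching list of outputs back. This is only a sketch under that assumption: `run_blocks`, `burst_bench`, and `make_block_runner` are hypothetical names, and `run_block` stands in for whatever would actually move a block to the hardware (for instance a DMA transfer).

```python
def run_blocks(run_block, test_bench):
    """Drive a test bench that yields *lists* of inputs (blocks).

    Within a block no input depends on an earlier output, so the whole
    block can be shipped at once. `run_block` is a hypothetical callable
    mapping a list of inputs to a list of outputs.
    Returns the number of round trips made.
    """
    transfers = 0
    try:
        block = next(test_bench)
        while True:
            outputs = run_block(block)
            transfers += 1
            block = test_bench.send(outputs)
    except StopIteration:
        pass
    return transfers


def burst_bench(results):
    """One setup cycle, then 1000 cycles of constant inputs as a single block."""
    yield [{"load": 1, "value": 7}]                   # setup outputs are ignored
    run_out = yield [{"load": 0, "value": 0}] * 1000  # one long burst
    results.append(run_out[-1])                       # keep only the final output


def make_block_runner():
    """Toy stand-in for the hardware: a loadable accumulator that counts up."""
    state = {"acc": 0}

    def run_block(block):
        outputs = []
        for inputs in block:
            if inputs["load"]:
                state["acc"] = inputs["value"]
            else:
                state["acc"] += 1
            outputs.append({"acc": state["acc"]})
        return outputs

    return run_block
```

A generator that yields one-element blocks degenerates to single stepping, which matches the point that every generator stays functional while only the block-structured ones get hardware performance.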

abe-k commented 7 years ago

I think the right approach in general is to start with functionality and convenience, and only worry about performance where necessary. This would suggest Python test benches in the general case, with an option to add a synthesizable test bench when you need pure-FPGA speed. In this view, PyRTL would provide a framework for connecting the Python test benches to the simulated logic (as well as some basic test benches), so that users can write their test benches normally, then port to logic only those pieces of the test bench that need higher speed. We could perhaps provide some pieces of logic (using this framework) that implement the optimizations you mentioned, such as running several steps without changing the inputs.

As far as I can tell, this approach meets all the requirements:

No change to existing code, because we'd provide a standard test bench.
Full control by the user, because they could write arbitrary code for controlling the simulation.
Option for higher-performance simulation (with FPGAs), because the user could write logic that directly interfaces with the simulation.
No new specification language.
Easy high-performance simulation for common special cases, because we'd provide logic that implemented it.

This would work particularly well on the PYNQ, since the CPUs and FPGA are in the same package, but even if the only connection between Python and the FPGA is over USB, we should be able to present the same API.

timsherwood commented 7 years ago

I agree that a synthesizable test bench is not the right way to start. In fact, what I originally proposed did not have test benches at all -- but I am convinced that the generator approach both would work and is a good idea in some cases. However, if performance is not part of the equation, then I guess what I don't understand is how the proposed approach takes us any closer to running things in hardware (which was the topic of my original discussion). I.e., why is it "better for hardware" to have a generator-based test bench than to just call sim.step? Both can be easily done on the ARM?

BTW if you are convinced it is a good idea then by all means do it -- sometimes it is much easier to explain when there is running code in place :)

jolting commented 7 years ago

> From a hardware performance standpoint I agree that DMA is the way to go. I think the question is how to encapsulate that interaction.

I can imagine encapsulating that interaction would have applications beyond simulation. Being able to stream data between the programmable logic and the processor system (which is the best part of having the processor on chip) seems fairly useful.

abe-k commented 7 years ago

Finally put together a (hopefully) coherent explanation of my idea for the system:

The Zynq is something of an unusual case, so let's consider using a typical FPGA test board, with no user-programmable microcontroller, but with some IC that translates between USB and UART serial, with the latter connected to a couple of FPGA pins. Let's assume that the problem of going from a PyRTL netlist to an FPGA bitstream and loading the bitstream onto the FPGA has been solved.

The problem is then one of translation, since the logic block being simulated presumably does not contain a UART as its sole input and output. Therefore, we need to include some additional logic that converts between the serial stream and the inputs and outputs of the user's logic block. Here's a diagram: [image: pyrtl-fpga-diagram]

The simplest hardware test bench would receive inputs for a single step over serial, run the logic block on them, and send the resulting outputs over serial. The user interface on the Python side could remain as .step. However, this would have terrible performance, taking at least 2 milliseconds per step from USB latency alone.

The obvious improvement is to batch the steps, sending a packet of many inputs to the HW test bench and getting back a packet with all the corresponding outputs. From the user's perspective, this is the .run method, which I implemented for CompiledSimulation because of the foreign-function-call latency. Throughput for this approach is still limited, though—full-speed USB can handle at best 1 MB/s (with most UARTs limited to under half that), and processing that fast in Python is probably unrealistic.
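
A back-of-the-envelope comparison of the two regimes, using the 2 ms latency and 1 MB/s figures from the text. The 4 bytes of inputs and 4 bytes of outputs per step are purely an assumed example width, and the estimate ignores protocol overhead and the lower UART limit.

```python
def steps_per_second_single(latency_us):
    """Upper bound when every step pays one full link latency."""
    return 1_000_000 / latency_us


def steps_per_second_batched(link_bytes_per_s, bytes_in, bytes_out):
    """Upper bound when the link is saturated with back-to-back step data."""
    return link_bytes_per_s / (bytes_in + bytes_out)


# 2 ms of USB latency per step
print(steps_per_second_single(2000))              # 500.0

# 1 MB/s link, assumed 4 bytes in + 4 bytes out per step
print(steps_per_second_batched(1_000_000, 4, 4))  # 125000.0
```

Under these assumptions batching buys roughly a 250x improvement, but it still tops out far below the clock rate of the FPGA itself, which is the motivation for not sending every input and output over the link.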

To progress, we need to not send all the inputs and outputs over USB, which means generating some inputs in the HW test bench and/or having it summarize the outputs. This can take many forms, some of which have been mentioned previously. I suggest that PyRTL include a HW test bench that implements some of these optimizations.

Of course, the packet format being sent over USB and serial is now not just simple lists of inputs and outputs. I suggest that the Python side separate concerns: one piece that talks to the operating system to send and receive packets, and another piece that builds and parses packets, wrapping the first piece with user-friendly methods. This latter piece would be the software test bench.
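
The split between the two pieces might look roughly like this. It is a sketch only: the packet layout (a 16-bit count followed by 32-bit little-endian words) is invented for the example, `SoftwareTestBench` and `LoopbackTransport` are hypothetical names, and the loopback class replaces a real serial or bus connection so the code runs without hardware.

```python
import struct


class LoopbackTransport:
    """Stand-in for the piece that talks to the OS (serial port, USB, bus).

    It parses the packet and echoes each word incremented by one, purely
    so the example runs without an FPGA attached.
    """

    def exchange(self, packet: bytes) -> bytes:
        (count,) = struct.unpack_from("<H", packet, 0)
        words = struct.unpack_from(f"<{count}I", packet, 2)
        reply = [(w + 1) & 0xFFFFFFFF for w in words]
        return struct.pack(f"<H{count}I", count, *reply)


class SoftwareTestBench:
    """Builds and parses packets; knows nothing about how bytes are moved."""

    def __init__(self, transport):
        self.transport = transport

    def run(self, input_words):
        """Send one packet holding many steps' inputs; return all the outputs."""
        packet = struct.pack(
            f"<H{len(input_words)}I", len(input_words), *input_words
        )
        reply = self.transport.exchange(packet)
        (count,) = struct.unpack_from("<H", reply, 0)
        return list(struct.unpack_from(f"<{count}I", reply, 2))

    def step(self, input_word):
        """Single-step convenience wrapper: a batch of one."""
        return self.run([input_word])[0]
```

Because only the transport object knows about the link, swapping the USB/serial path for the Zynq's internal bus would mean providing a different transport with the same `exchange` method, while the packet-building side stays unchanged.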

Once all this exists, it's not a big step to have users create their own test benches, for when the input-generation or output-summarization techniques they need aren't implemented by PyRTL. We would still provide the low-level Python code and automatically include the FPGA logic for the UART and packetization, allowing the user's test bench to be written without worrying about these details.

In the case of the Zynq, these ideas still apply, except that instead of USB and serial, we have the internal bus that connects the ARM to the FPGA. Because we'd have hidden all the interfacing details, the same test benches would work on the Zynq. In the case of simulation without an FPGA (like we currently have), the test bench approach could also be used, though purely for consistency rather than performance.

timsherwood commented 7 years ago

Got it. Thanks for the very clear write-up! I would say a next step would be to connect up with George and work out a game plan -- I am not sure if you two have had a chance to talk yet or not (but I think you have very overlapping interests!). I am happy to be involved too if you can fit a meeting into some lab time or free time.

abe-k commented 7 years ago

I talked with George, and here is my attempt to summarize the points he made:

Functional simulation on an FPGA is not likely to result in a huge speedup, because of how long place-and-route takes.
The main reason to test on an FPGA is to test at extreme conditions, where the actual behavior might not be the one specified by the logic.
This sort of testing requires running the FPGA at full speed with the final routing, which means that the test bench can't control the running of the user logic, only observe it.
The amount of data produced by inspecting an FPGA that's running at full speed is more than most connections to a computer can keep up with, so the data has to be buffered on the FPGA and retrieved later.

Based on all of this, it seems that my plan for a testing system would not be very effective. Does anyone disagree with George on the first point and think that they would benefit from doing functional simulation on an FPGA? If so, I'm happy to continue with my plan. If not, there are other things George suggested I can work on.

UCSBarchlab / PyRTL

Controlling hardware simulation: Python test benches? #226