chipsalliance / chisel

Chisel: A Modern Hardware Design Language
https://www.chisel-lang.org/
Apache License 2.0

[RFC] New Testers Proposal #725

Open ducky64 opened 6 years ago

ducky64 commented 6 years ago

This is a proposal for a new testers API, and supersedes issues #551 and #547. Nothing is currently set in stone, and feedback from the general Chisel community is desired. So please give it a read and let us know what you think!

Motivation

What’s wrong with Chisel BasicTester or HWIOTesters?

The BasicTester included with Chisel is a way to define tests as a Chisel circuit. However, as test vectors are often specified linearly in time (like imperative software), this isn’t a great match.

HWIOTesters provide a peek/poke/step API, which allows tests to be written linearly in time. However, there’s no support for parallelism (like a threading model), which makes composition of concurrent actions very difficult. Additionally, as it’s not in the base Chisel3 repository, it doesn’t seem to see as much use.

HWIOTesters also provides AdvancedTester, which allows limited background tasks to run on each cycle, supporting certain kinds of parallelism (for example, every cycle, a Decoupled driver could check if the queue is ready, and if so, enqueue a new element from a given sequence). However, the concurrent programming model is radically different from the peek-poke model, and requires the programmer to manage time as driver state.

And finally, having 3 different test frameworks really kind of sucks and limits interoperability and reuse of testing libraries.

Goal: Unified testing

The goal here is to have one standardized way to test in Chisel3. Ideally, this would be:

Proposal

Testdriver Construction API

This will define an API for constructing testdriver modules.

Basic API

These are the basic conceptual operations:

A subset of this API (poke, check, step) is intended to be synthesizable, to allow the generation of testbenches that don't require Scala to run with the simulator.

Values are specified and returned as Chisel literals, which is expected to interoperate with the future bundle literal constructors feature. In the future, this may be relaxed to be any Chisel expression.

Peek, check, and poke will be defined as extensions of their relevant Chisel types using the PML (implicit extension) pattern. For example, users would specify io.myUInt.poke(4.U), or io.myUInt.peek() would return a Chisel literal containing the current simulation value.

This is to combine driver code with their respective Bundles, allowing these to be shared and re-used without being tied to some TestDriver subclass. For example, Decoupled might define a pokeEnqueue function which sequences the ready, valid, and bits wires and can be invoked with io.myQueue.pokeEnqueue(4.U). These can then be composed, for example, a GCD IO with Decoupled input and output might have gcd.io.checkRun(4, 2, 2) which will enqueue (4, 2) on the inputs and expect 2 on the output when it finishes.
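As a rough illustration of the implicit extension pattern described above, here is a minimal sketch in plain Scala that does not depend on Chisel. MockWire, MockDecoupled, WireTester and the use of BigInt values instead of Chisel literals are all assumptions made for this example, not part of the proposal; step is left out because the mock has no clock.

```scala
object PokeExtensionSketch {
  // Hypothetical stand-in for a Chisel wire; a real tester would forward to a simulator.
  final class MockWire(val name: String) {
    var simValue: BigInt = BigInt(0)
  }

  // Hypothetical stand-in for a Decoupled bundle, seen from the driving side.
  final class MockDecoupled(val ready: MockWire, val valid: MockWire, val bits: MockWire)

  // Wire-level operations added to the wire type by implicit extension.
  implicit class WireTester(wire: MockWire) {
    def poke(value: BigInt): Unit = { wire.simValue = value }
    def peek(): BigInt = wire.simValue
    def check(expected: BigInt): Unit =
      require(peek() == expected, s"${wire.name}: expected $expected, got ${peek()}")
  }

  // Bundle-level driver composed from the wire-level operations.
  implicit class DecoupledTester(q: MockDecoupled) {
    def pokeEnqueue(data: BigInt): Unit = {
      q.ready.check(BigInt(1)) // the consumer must be ready
      q.valid.poke(BigInt(1))
      q.bits.poke(data)
    }
  }

  // Usage: returns the values seen on (valid, bits) after one enqueue.
  def demo(): (BigInt, BigInt) = {
    val q = new MockDecoupled(new MockWire("ready"), new MockWire("valid"), new MockWire("bits"))
    q.ready.poke(BigInt(1)) // pretend the DUT has raised ready
    q.pokeEnqueue(BigInt(4))
    (q.valid.peek(), q.bits.peek())
  }
}
```

The driver code lives next to the bundle type rather than in a tester subclass, which is what lets it be reused and composed.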

Pokes retain their values until updated by another poke.

Concurrency Model

Concurrency is provided by fork-join parallelism, to be implemented using threading. Note: Scala’s coroutines are too limited to be of practical use here.

Fork: spawns a thread that operates in parallel, returning that thread.

Join: blocks until all the argument threads are completed.
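A minimal sketch of fork and join on top of plain JVM threads might look like the following. TesterThreads and spawn are hypothetical names, and this sketch ignores clock synchronization and does not propagate exceptions from the spawned threads, both of which a real tester would need.

```scala
object ForkJoinSketch {
  import java.util.concurrent.atomic.AtomicInteger

  // Handle for one or more spawned threads (hypothetical name).
  final class TesterThreads(val threads: List[Thread]) {
    // Chain another thread onto the list, mirroring `fork { ... } .fork { ... }`.
    def fork(body: => Unit): TesterThreads =
      new TesterThreads(threads :+ spawn(body))
  }

  // Start `body` on a new JVM thread.
  private def spawn(body: => Unit): Thread = {
    val t = new Thread(new Runnable { def run(): Unit = body })
    t.start()
    t
  }

  // Fork: spawn a thread that runs in parallel and return its handle.
  def fork(body: => Unit): TesterThreads = new TesterThreads(List(spawn(body)))

  // Join: block until every thread in the handle has completed.
  def join(handle: TesterThreads): Unit = handle.threads.foreach(_.join())

  // Usage: two parallel actions, both finished once join returns.
  def demo(): Int = {
    val total = new AtomicInteger(0)
    def add(n: Int): Unit = { total.addAndGet(n); () }
    join(fork { add(1) } .fork { add(10) })
    total.get()
  }
}
```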

Combinational Peeks and Pokes

There are two proposals for the combinational behavior of pokes; debate is ongoing about which model to adopt, or whether both can coexist.

Proposal 1: No combinational peeks and pokes

Peeks always return the value at the beginning of the cycle. Alternatively phrased, pokes don’t take effect until just before the step. This provides both high performance (no need to update the circuit between clock cycles) and safety against race conditions with threaded concurrency (because poke effects can’t be seen until the next cycle, and all testers are synchronized to the clock cycle, but not synchronized in between).

One issue is that peeks can be written after pokes yet still return the pre-poke value. This can be handled with documentation and possibly optional runtime checks against “stale” peeks. Additionally, this makes it impossible to test combinational logic, but this can be worked around with register insertion.
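The Proposal 1 semantics can be sketched with a toy model in which pokes are buffered and only committed at the step. DeferredPokeSim and isStale are hypothetical names, wires are identified by strings, and no circuit evaluation is modeled; this only illustrates the ordering rules, including the “stale” peek case.

```scala
object DeferredPokeSketch {
  // Toy simulator (hypothetical): pokes are buffered and committed at step().
  final class DeferredPokeSim(initial: Map[String, BigInt]) {
    private var committed: Map[String, BigInt] = initial
    private var pending: Map[String, BigInt] = Map.empty

    // Record the poke; it is not visible to peek until the next step.
    def poke(wire: String, value: BigInt): Unit = { pending = pending.updated(wire, value) }

    // Always returns the value from the beginning of the current cycle.
    def peek(wire: String): BigInt = committed(wire)

    // Optional runtime check: true if a poke to this wire is still waiting.
    def isStale(wire: String): Boolean = pending.contains(wire)

    // Commit all buffered pokes just before advancing the clock.
    def step(): Unit = {
      committed = committed ++ pending
      pending = Map.empty
    }
  }

  // Usage: a peek written after a poke still sees the pre-poke value.
  def demo(): (BigInt, Boolean, BigInt) = {
    val sim = new DeferredPokeSim(Map("in" -> BigInt(0)))
    sim.poke("in", BigInt(4))
    val before = sim.peek("in")
    val stale = sim.isStale("in")
    sim.step()
    (before, stale, sim.peek("in"))
  }
}
```

Because committed values persist across steps, this model also retains a poked value until another poke replaces it.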

Note that it isn’t feasible to ensure all peeks are written before pokes for composition purposes. For example, Decoupled.pokeEnqueue may peek to check that the queue is ready before poking the data and valid, and calling pokeEnqueue twice on two different queues in the same cycle would result in a sequence of peek, poke, peek, poke.

Another previous proposal was to allow pokes to affect peeks, but to check that the results of peeks are still valid at the end of the cycle. While powerful, this potentially leads to brittle and nondeterministic testing libraries and is not desirable.

Proposal 2: Combinational peeks and pokes that do not cross threads

Peeks and pokes are resolved in the order written (combinational peeks and pokes are allowed and straightforward). Pokes may not affect peeks from other threads, and this is checked at runtime using reachability analysis.

This provides easy testing of combinational circuits while still allowing deterministic execution in the presence of threading. Since pokes affecting peeks is done by combinational reachability analysis (which is circuit-static, instead of ad-hoc value change detection), thread execution order cannot affect the outcome of a test. Note that clocks act as a global synchronization boundary on all threads.
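One way such a check could be structured is sketched below, assuming the combinational fan-out of the circuit is available as a plain graph. The Graph type, the signal names and crossThreadConflict are assumptions for illustration; how the real graph would be extracted from the circuit is not covered here.

```scala
object ReachabilitySketch {
  // Combinational fan-out: signal -> signals it drives with no register in between.
  type Graph = Map[String, Set[String]]

  // Hypothetical circuit: out = a + b, while regIn only feeds a register.
  val exampleGraph: Graph = Map(
    "a" -> Set("sum"),
    "b" -> Set("sum"),
    "sum" -> Set("out"),
    "regIn" -> Set.empty[String]
  )

  // All signals combinationally reachable from `start`, including `start` itself.
  def reachable(graph: Graph, start: String): Set[String] = {
    @scala.annotation.tailrec
    def loop(frontier: List[String], seen: Set[String]): Set[String] = frontier match {
      case Nil => seen
      case head :: tail =>
        val next = graph.getOrElse(head, Set.empty[String]).diff(seen)
        loop(next.toList ::: tail, seen ++ next)
    }
    loop(List(start), Set(start))
  }

  // `pokes` maps each wire poked this cycle to the id of the thread that poked it.
  // A peek is flagged if a different thread poked a wire that can reach the peeked wire.
  def crossThreadConflict(graph: Graph, pokes: Map[String, Int],
                          peekWire: String, peekThread: Int): Boolean =
    pokes.exists { case (wire, thread) =>
      thread != peekThread && reachable(graph, wire).contains(peekWire)
    }
}
```

The result depends only on the static graph and on which thread touched which wire, not on the order in which the threads happened to run.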

One possible issue is whether such reachability analysis will have a high false-positive rate. We don’t know right now, and this is something we basically have to implement and see.

Efficient simulation performance is possible by using reachability analysis to determine if the circuit needs to be updated between a poke and peek. Furthermore, it may be possible to determine if only a subset of the circuit needs to be updated.

Multiclock Support

This section is preliminary.

As testers only synchronize to an external clock, a separate thread can drive clocks in any arbitrary relationship.

This is the part which has seen the least attention and development (so far), but robust multiclock support is desired.

Backends

The first backend will be FIRRTerpreter, because Verilator compilation is slow (it probably accounts for a significant fraction of the time spent running chisel3 regressions) and doesn’t support all platforms well (namely, Windows).

High performance interfaces to Verilog simulators may be possible using Java JNI to VPI instead of sockets.

Conflicting Drivers

This section is preliminary.

Conflicting drivers (multiple pokes to the same wire from different threads on the same cycle, even if they have the same value) are prohibited and will error out.
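A minimal sketch of that rule, assuming the tester records every poke made in the current cycle together with the id of the thread that made it. Poke and conflicts are hypothetical names, and treating repeated pokes from the same thread as allowed is an assumption read off the wording above.

```scala
object DriverConflictSketch {
  // One recorded poke in the current cycle (hypothetical record type).
  final case class Poke(wire: String, value: BigInt, threadId: Int)

  // Wires poked by more than one thread in the same cycle.
  // Values are deliberately ignored: equal values from two threads still conflict.
  def conflicts(pokesThisCycle: Seq[Poke]): Set[String] =
    pokesThisCycle
      .groupBy(_.wire)
      .filter { case (_, pokes) => pokes.map(_.threadId).distinct.size > 1 }
      .keySet
}
```

A tester would run this check at the step boundary and raise an error if the returned set is non-empty.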

There will probably be some kind of priority system to allow overriding defaults, for example, pulling a Decoupled’s valid low when not in use.

Some test systems have a notion of wire ownership, specifying who can drive a wire to prevent conflicts. However, as this proposal doesn’t use an explicit driver model (theoretically saving on boilerplate code and enabling concise tests), this may not be feasible.

Misc

No backwards compatibility. As all of the current Chisel testers are extremely limited in capability, many projects have opted to use other testing infrastructure. Migrating existing test code to this new infrastructure will require rewriting. Existing test systems will be deprecated but may continue to be maintained in parallel.

It may be possible to create a compatibility layer that exposes the old API.

Mock construction and blackbox testing. This API may be sufficient to act as a mock construction API, and may enable testing of black boxes (in conjunction with a Verilog simulator).

Examples

Decoupled, linear style

implicit class DecoupledTester[T <: Data](in: DecoupledIO[T]) {
  // Alternatively, this could directly be in Decoupled
  def enqueue(data: T): Unit = {
    require(in.ready, true.B)  // ready is expected to be high before enqueuing
    in.valid.poke(true.B)
    in.bits.poke(data)
    step(1)
    in.valid.poke(false.B, priority=low)  // low-priority default, can be overridden
  }
}

// Testdriver is a subclass of Module, which must be called from a Tester environment.
// Example DUT-as-child structure
class MyTester extends Testdriver {
  val myDut = Module(new MyModule())
  // MyModule with IO(new Bundle {
  //  val in = Flipped(Decoupled(UInt(8.W)))
  //  val out = Decoupled(UInt(8.W))  // transaction of in + ctrl
  //  val in2 = Flipped(Decoupled(UInt(8.W)))
  //  val out2 = Decoupled(UInt(8.W))  // transaction of in + ctrl
  //  val ctrl = UInt(8.W)
  //} )

  myDut.io.in.enqueue(42.U)  // steps a cycle inside
  myDut.io.out.dequeueExpect(43.U)  // waits for output valid, checks bits, sets ready, step
  myDut.io.ctrl.poke(2.U)  // .poke added by PML to UInt
  myDut.io.in.enqueue(45.U)
  myDut.io.out.dequeueExpect(47.U)

  // or with parallel constructs
  myDut.io.ctrl.poke(4.U)

  join(fork {
    myDut.io.in.enqueue(44.U)
    myDut.io.out.dequeueExpect(48.U)
    myDut.io.in.enqueue(46.U)
    myDut.io.out.dequeueExpect(50.U)
  } .fork {  // can be called on a thread-list, spawns a new thread that runs in parallel with the threads on the list - lightweight syntax for spawning many parallel threads
    myDut.io.in2.enqueue(1.U)
    myDut.io.out2.dequeueExpect(5.U)
    myDut.io.in2.enqueue(7.U)
    myDut.io.out2.dequeueExpect(11.U)
  })
  // tester ends at the end of the Testdriver and when all spawned threads have completed
}

External Extensions

These items are related to testing, but are mostly orthogonal and can be developed separately. However, they will be expected to interoperate well with testers:

Nic30 commented 5 years ago

@Dolu1990 I have seen only https://github.com/SpinalHDL/VexRiscvSoftcoreContest2018/blob/master/test/common/testbench.h#L83 before. But I still do not know how clock synchronization of agents for clock generated from DUT works in your code.

I mean, how would this simple example work?

clkIn->                -> clkOut
input-> register -> +1 -> output

and register = 0, clkIn = 0

Let's have an agent which reads the value from the output signal on the rising edge of clkOut:

  1. clkIn=1, input=1
  2. Force deltacycle ? -> Y
  3. Verilator eval
  4. now output=2 but should be read when it was 1

But where is "dut signals write generated from the callbacks logic" in your code?

Dolu1990 commented 5 years ago

@Nic30 Oh, that stuff from VexRiscvSoftcoreContest isn't using SpinalSim; it was raw Verilator + C++ without Scala involved in the testbench.

I implemented your case above:

object SimPlayDeltaCycle2{
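  import spinal.core._      // Component, Bool, UInt, ClockDomain, ...
  import spinal.core.sim._  // SimConfig, #=, sleep, simTime, ...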

  class TopLevel extends Component {
    val clkIn = in Bool()
    val clkOut = out Bool()
    val input = in(UInt(8 bits))
    val output = out(UInt(8 bits))
    val register = ClockDomain(clock = clkIn, config = ClockDomainConfig(resetKind = BOOT)) (Reg(UInt(8 bits)) init(0))
    register := input
    val registerPlusOne = register + 1
    output := registerPlusOne
    clkOut := clkIn
  }

  def main(args: Array[String]): Unit = {
    SimConfig.withWave.compile(new TopLevel).doSim{dut =>
      def printState(header : String) = println(s"$header dut.clkIn=${dut.clkIn.toBoolean} dut.input=${dut.input.toInt} dut.output=${dut.output.toInt} dut.clkOut=${dut.clkOut.toBoolean} time=${simTime()} deltaCycle=${simDeltaCycle()}")

      dut.clkIn #= false
      dut.input #= 42
      printState("A")
      sleep(10)
      printState("B")
      dut.clkIn #= true
      dut.input #= 1
      printState("C")
      sleep(0) // A delta cycle is always forced, but the sleep(0) allows the thread to sneak into that forced delta cycle
      printState("D")
      sleep(0) //Let's go for another delta cycle
      printState("E")
      sleep(10)
      printState("F")
    }
  }
}

The wave is : image

Note, we can see input and clkIn going to one at the same time, because the stimulus did it, but that's probably not a good way of giving readable stimulus.

Its output is:

A dut.clkIn=false dut.input=220 dut.output=100 dut.clkOut=false time=0 deltaCycle=0
B dut.clkIn=false dut.input=42 dut.output=1 dut.clkOut=false time=10 deltaCycle=0
C dut.clkIn=false dut.input=42 dut.output=1 dut.clkOut=false time=10 deltaCycle=0
D dut.clkIn=true dut.input=1 dut.output=1 dut.clkOut=false time=10 deltaCycle=1
E dut.clkIn=true dut.input=1 dut.output=2 dut.clkOut=true time=10 deltaCycle=2
F dut.clkIn=true dut.input=1 dut.output=2 dut.clkOut=true time=20 deltaCycle=0

So, to be sure we understand each other, here is another sample written with the dev branch of SpinalSim: https://github.com/SpinalHDL/SpinalHDL/blob/36b7444c4397cde1e55967cde5579f9cff68df0d/tester/src/main/scala/spinal/tester/PlayDev.scala#L1412

Here is the produced wave: image

And here is the produced output:

Pre  ref init          dut.a=88 dut.b=158 dut.result=50 time=0 deltaCycle=1 // Fixed according to my next message
Pre  dut.b.randomize() dut.a=88 dut.b=158 dut.result=0 time=170 deltaCycle=1
Post dut.b.randomize() dut.a=88 dut.b=158 dut.result=0 time=170 deltaCycle=1
Pre  dut.a.randomize() dut.a=88 dut.b=158 dut.result=0 time=170 deltaCycle=1
Post dut.a.randomize() dut.a=88 dut.b=158 dut.result=0 time=170 deltaCycle=1
Post ref init          dut.a=88 dut.b=158 dut.result=0 time=170 deltaCycle=1
Pre  ref sampling      dut.a=88 dut.b=158 dut.result=0 time=170 deltaCycle=1
Pre  dut.b.randomize() dut.a=81 dut.b=67 dut.result=246 time=180 deltaCycle=1
Post dut.b.randomize() dut.a=81 dut.b=67 dut.result=246 time=180 deltaCycle=1
Pre  dut.a.randomize() dut.a=81 dut.b=67 dut.result=246 time=180 deltaCycle=1
Post dut.a.randomize() dut.a=81 dut.b=67 dut.result=246 time=180 deltaCycle=1
Post ref sampling      dut.a=81 dut.b=67 dut.result=246 time=180 deltaCycle=1
Pre  ref sampling      dut.a=81 dut.b=67 dut.result=246 time=180 deltaCycle=1
Pre  dut.b.randomize() dut.a=255 dut.b=170 dut.result=148 time=190 deltaCycle=1
Post dut.b.randomize() dut.a=255 dut.b=170 dut.result=148 time=190 deltaCycle=1
Pre  dut.a.randomize() dut.a=255 dut.b=170 dut.result=148 time=190 deltaCycle=1
Post dut.a.randomize() dut.a=255 dut.b=170 dut.result=148 time=190 deltaCycle=1
Post ref sampling      dut.a=255 dut.b=170 dut.result=148 time=190 deltaCycle=1
Pre  ref sampling      dut.a=255 dut.b=170 dut.result=148 time=190 deltaCycle=1
Pre  dut.b.randomize() dut.a=154 dut.b=157 dut.result=169 time=200 deltaCycle=1
Post dut.b.randomize() dut.a=154 dut.b=157 dut.result=169 time=200 deltaCycle=1
Pre  dut.a.randomize() dut.a=154 dut.b=157 dut.result=169 time=200 deltaCycle=1
Post dut.a.randomize() dut.a=154 dut.b=157 dut.result=169 time=200 deltaCycle=1
Post ref sampling      dut.a=154 dut.b=157 dut.result=169 time=200 deltaCycle=1
Pre  ref sampling      dut.a=154 dut.b=157 dut.result=169 time=200 deltaCycle=1
Pre  dut.b.randomize() dut.a=27 dut.b=219 dut.result=55 time=210 deltaCycle=1
Post dut.b.randomize() dut.a=27 dut.b=219 dut.result=55 time=210 deltaCycle=1
Pre  dut.a.randomize() dut.a=27 dut.b=219 dut.result=55 time=210 deltaCycle=1
Post dut.a.randomize() dut.a=27 dut.b=219 dut.result=55 time=210 deltaCycle=1
Post ref sampling      dut.a=27 dut.b=219 dut.result=55 time=210 deltaCycle=1
Pre  ref sampling      dut.a=27 dut.b=219 dut.result=55 time=210 deltaCycle=1
Pre  dut.b.randomize() dut.a=90 dut.b=163 dut.result=246 time=220 deltaCycle=1
Post dut.b.randomize() dut.a=90 dut.b=163 dut.result=246 time=220 deltaCycle=1
Pre  dut.a.randomize() dut.a=90 dut.b=163 dut.result=246 time=220 deltaCycle=1
Post dut.a.randomize() dut.a=90 dut.b=163 dut.result=246 time=220 deltaCycle=1
Post ref sampling      dut.a=90 dut.b=163 dut.result=246 time=220 deltaCycle=1

To me, it all looks fine, but it isn't easy to read, so I'm not saying I'm right; let me know if something doesn't look correct.

But where is "dut signals write generated from the callbacks logic" in your code?

It is at https://github.com/SpinalHDL/SpinalHDL/blob/dev/sim/src/main/scala/spinal/sim/SimManager.scala#L266 which is filled by write accesses: https://github.com/SpinalHDL/SpinalHDL/blob/dev/sim/src/main/scala/spinal/sim/SimManager.scala#L124

Dolu1990 commented 5 years ago

Just spotted two issues (fixed now): https://github.com/SpinalHDL/SpinalHDL/blob/dev/sim/src/main/scala/spinal/sim/SimManager.scala#L223 A delta cycle should also be forced when the command buffer is written from sensitive callbacks (this wasn't done properly before).

Also, the delta cycle calculation is now correct, which changes Pre ref init dut.a=88 dut.b=158 dut.result=50 time=0 deltaCycle=0 into Pre ref init dut.a=88 dut.b=158 dut.result=50 time=0 deltaCycle=1

(When you fork a thread, its execution starts on the next delta cycle.)

Updated diagram : image

Nic30 commented 5 years ago

@Dolu1990 I do not see any problem in your implementation.