chipsalliance / chisel

Chisel: A Modern Hardware Design Language

https://www.chisel-lang.org/

Apache License 2.0

[RFC] New Testers Proposal #725

Open ducky64 opened 6 years ago

ducky64 commented 6 years ago

This is a proposal for a new testers API, and supersedes issues #551 and #547. Nothing is currently set in stone, and feedback from the general Chisel community is desired. So please give it a read and let us know what you think!

Motivation

What’s wrong with Chisel BasicTester or HWIOTesters?

The BasicTester included with Chisel is a way to define tests as a Chisel circuit. However, as testvectors often are specified linearly in time (like imperative software), this isn’t a great match.

HWIOTesters provide a peek/poke/step API, which allows tests to be written linearly in time. However, there’s no support for parallelism (like a threading model), which makes composition of concurrent actions very difficult. Additionally, as it’s not in the base Chisel3 repository, it doesn’t seem to see as much use.

HWIOTesters also provides AdvancedTester, which allows limited background tasks to run on each cycle, supporting certain kinds of parallelism (for example, every cycle, a Decoupled driver could check if the queue is ready, and if so, enqueue a new element from a given sequence). However, the concurrent programming model is radically different from the peek-poke model, and requires the programmer to manage time as driver state.

And finally, having 3 different test frameworks really kind of sucks and limits interoperability and reuse of testing libraries.

Goal: Unified testing

The goal here is to have one standardized way to test in Chisel3. Ideally, this would be:

suitable for both unit tests and system integration tests
designed for composable abstractions and layering
able to target multiple backends and simulators (possibly requiring a link to Scala, if the testvector is not static, or using a limited test constructing API subset, when synthesizing to FPGA)
included in base chisel3, to avoid packaging and dependency nightmares
highly usable, encouraging unit tests by making it as easy, painless (avoiding boilerplate and other nonsense), and useful as possible to write them

Proposal

Testdriver Construction API

This will define an API for constructing testdriver modules.

Basic API

These are the basic conceptual operations:

Peek: returns the value of a circuit node
Check: asserts that a circuit node has some value, with similar semantics to peek (details below)
Poke: pokes a value into a circuit node
Step: blocks until the next rising edge of the specified clock (for single-clock designs, equivalent to stepping the clock) Note: A better name is desired for this...

A subset of this API (poke, check, step) is synthesizable, which allows the generation of testbenches that don't require Scala to run with the simulator.

Values are specified and returned as Chisel literals, which is expected to interoperate with the future bundle literal constructors feature. In the future, this may be relaxed to be any Chisel expression.

Peek, check, and poke will be defined as extensions of their relevant Chisel types using the PML (implicit extension) pattern. For example, users would specify io.myUInt.poke(4.U), or io.myUInt.peek() would return a Chisel literal containing the current simulation value.

This is to combine driver code with their respective Bundles, allowing these to be shared and re-used without being tied to some TestDriver subclass. For example, Decoupled might define a pokeEnqueue function which sequences the ready, valid, and bits wires and can be invoked with io.myQueue.pokeEnqueue(4.U). These can then be composed, for example, a GCD IO with Decoupled input and output might have gcd.io.checkRun(4, 2, 2) which will enqueue (4, 2) on the inputs and expect 2 on the output when it finishes.

Pokes retain their values until updated by another poke.
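As a rough illustration of the implicit extension pattern described above, here is a minimal sketch in plain Scala. `Wire`, `SimBackend`, and `TesterSyntax` are hypothetical stand-ins invented for this example and are not Chisel types; values are modeled as `BigInt` rather than Chisel literals.

```scala
// Hypothetical stand-ins for illustration only; these are NOT Chisel classes.
final class Wire(val name: String)

final class SimBackend {
  private val values = scala.collection.mutable.Map.empty[String, BigInt]
  // A written value stays in place until overwritten, mirroring
  // "pokes retain their values until updated by another poke".
  def write(w: Wire, v: BigInt): Unit = values(w.name) = v
  def read(w: Wire): BigInt = values.getOrElse(w.name, BigInt(0))
}

object TesterSyntax {
  // The implicit class adds poke/peek/check to Wire without modifying Wire,
  // so driver code is not tied to any particular tester subclass.
  implicit class WireTester(w: Wire)(implicit sim: SimBackend) {
    def poke(v: BigInt): Unit = sim.write(w, v)
    def peek(): BigInt = sim.read(w)
    def check(expected: BigInt): Unit =
      assert(peek() == expected, s"${w.name}: expected $expected, got ${peek()}")
  }
}
```

With `import TesterSyntax._` and an implicit `SimBackend` in scope, `wire.poke(4)` and `wire.peek()` read much like the `io.myUInt.poke(4.U)` calls in the proposal.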

Concurrency Model

Concurrency is provided by fork-join parallelism, to be implemented using threading. Note: Scala’s coroutines are too limited to be of practical use here.

Fork: spawns a thread that operates in parallel, returning that thread.
Join: blocks until all the argument threads are completed.
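One way fork and join could look on top of plain JVM threads is sketched below. `ForkJoin` and `ThreadList` are hypothetical names for this example, and the sketch lets threads run truly in parallel with no clock synchronization, which a real tester would have to add.

```scala
object ForkJoin {
  // A list of running tester threads; fork can be chained to add more.
  final class ThreadList(val threads: List[Thread]) {
    def fork(body: => Unit): ThreadList = {
      val t = new Thread(new Runnable { def run(): Unit = body })
      t.start()
      new ThreadList(t :: threads)
    }
  }

  // Spawns a thread running `body` and returns a handle to it.
  def fork(body: => Unit): ThreadList = new ThreadList(Nil).fork(body)

  // Blocks until every thread in the list has completed.
  def join(list: ThreadList): Unit = list.threads.foreach(_.join())
}
```

Usage follows the chained style of the proposal: `ForkJoin.join(ForkJoin.fork { ... }.fork { ... })`.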

Combinational Peeks and Pokes

There are two proposals for the combinational behavior of pokes. Debate is ongoing about which model to adopt, or whether both can coexist.

Proposal 1: No combinational peeks and pokes

Peeks always return the value at the beginning of the cycle. Alternatively phrased, pokes don’t take effect until just before the step. This provides both high performance (no need to update the circuit between clock cycles) and safety against race conditions with threaded concurrency (because poke effects can’t be seen until the next cycle, and all testers are synchronized to the clock cycle, but not synchronized in between).

One issue is that peeks can be written after pokes yet still return the pre-poke value. This can be handled with documentation and possibly optional runtime checks against “stale” peeks. Additionally, this makes it impossible to test combinational logic, but this can be worked around with register insertion.

Note that it isn’t feasible to ensure all peeks are written before pokes for composition purposes. For example, Decoupled.pokeEnqueue may peek to check that the queue is ready before poking the data and valid, and calling pokeEnqueue twice on two different queues in the same cycle would result in a sequence of peek, poke, peek, poke.

Another previous proposal was to allow pokes to affect peeks, but to check that the result of peeks are still valid at the end of the cycle. While powerful, this potentially leads to brittle and nondeterministic testing libraries and is not desirable.
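To make the Proposal 1 semantics concrete, here is a small model in plain Scala. `StalePeekSim` is a hypothetical class for this example: signals are map entries and the circuit is a caller-supplied next-state function, which is an assumption of the sketch rather than anything in the proposal.

```scala
// Hypothetical model of Proposal 1: pokes are staged and only applied
// just before the clock step, so peek always sees start-of-cycle values.
final class StalePeekSim(
    initial: Map[String, BigInt],
    next: Map[String, BigInt] => Map[String, BigInt]) {

  private var state = initial                      // values as of the start of the cycle
  private var pending = Map.empty[String, BigInt]  // pokes issued during this cycle

  def peek(name: String): BigInt = state(name)

  def poke(name: String, v: BigInt): Unit = pending += (name -> v)

  // Apply staged pokes, then advance the circuit by one clock edge.
  def step(): Unit = {
    state = next(state ++ pending)
    pending = Map.empty
  }
}
```

For a single register `q` that samples input `d`, the next-state function is `s => s + ("q" -> s("d"))`. A peek of `d` issued after `poke("d", 7)` in the same cycle still returns the old value, which is the "stale" peek the proposal warns about.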

Proposal 2: Combinational peeks and pokes that do not cross threads

Peeks and pokes are resolved in the order written (combinational peeks and pokes are allowed and straightforward). Pokes may not affect peeks from other threads, and this is checked at runtime using reachability analysis.

This provides easy testing of combinational circuits while still allowing deterministic execution in the presence of threading. Since pokes affecting peeks is done by combinational reachability analysis (which is circuit-static, instead of ad-hoc value change detection), thread execution order cannot affect the outcome of a test. Note that clocks act as a global synchronization boundary on all threads.

One possible issue is whether such reachability analysis will have a high false-positive rate. We don’t know right now, and this is something we basically have to implement and see.

Efficient simulation performance is possible by using reachability analysis to determine if the circuit needs to be updated between a poke and peek. Furthermore, it may be possible to determine if only a subset of the circuit needs to be updated.
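The cross-thread check in Proposal 2 can be sketched as a graph search over combinational dependencies. Everything here is an assumption made for illustration: the edge map (signal to the signals it combinationally drives, with registers breaking paths), the `Access` record, and the function names are not part of any existing API.

```scala
object Reachability {
  // All signals combinationally reachable from `from`, including itself.
  def reachable(edges: Map[String, Set[String]], from: String): Set[String] = {
    @annotation.tailrec
    def loop(frontier: List[String], seen: Set[String]): Set[String] = frontier match {
      case Nil => seen
      case head :: tail =>
        val fresh = edges.getOrElse(head, Set.empty[String]) -- seen
        loop(fresh.toList ::: tail, seen ++ fresh)
    }
    loop(List(from), Set(from))
  }

  // A poke or peek of `signal` issued by tester thread `thread`.
  final case class Access(thread: Int, signal: String)

  // (poke, peek) pairs where a poke could affect a peek in another thread.
  // The result depends only on circuit structure, not on values or ordering.
  def crossThreadConflicts(
      edges: Map[String, Set[String]],
      pokes: Seq[Access],
      peeks: Seq[Access]): Seq[(Access, Access)] =
    for {
      poke <- pokes
      reach = reachable(edges, poke.signal)
      peek <- peeks
      if peek.thread != poke.thread && reach.contains(peek.signal)
    } yield (poke, peek)
}
```

The same `reachable` set could also tell a simulator whether the circuit needs re-evaluation between a poke and a later peek.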

Multiclock Support

This section is preliminary.

As testers only synchronize to an external clock, a separate thread can drive clocks in any arbitrary relationship.

This is the part which has seen the least attention and development (so far), but robust multiclock support is desired.

Backends

First backend will be FIRRTerpreter, because Verilator compilation is slow (probably accounts for a significant fraction of time in running chisel3 regressions) and doesn’t support all platforms well (namely, Windows).

High performance interfaces to Verilog simulators may be possible using Java JNI to VPI instead of sockets.

Conflicting Drivers

This section is preliminary.

Conflicting drivers (multiple pokes to the same wire from different threads on the same cycle, even if they have the same value) are prohibited and will error out.

There will probably be some kind of priority system to allow overriding defaults, for example, pulling a Decoupled’s valid low when not in use.

Some test systems have a notion of wire ownership, specifying who can drive a wire to prevent conflicts. However, as this proposal doesn’t use an explicit driver model (theoretically saving on boilerplate code and enabling concise tests), this may not be feasible.
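A possible shape for conflict detection with priorities is sketched below. The `Poke` record, the integer priority (higher wins), and the rule that a later poke from the same thread replaces an earlier one are assumptions for this example; the proposal leaves the priority system open.

```scala
object PokeResolver {
  // One poke issued during a cycle; a higher `priority` overrides a lower one.
  final case class Poke(thread: Int, wire: String, value: BigInt, priority: Int)

  private def resolveWire(wire: String, pokes: Seq[Poke]): Either[String, (String, BigInt)] = {
    val top = pokes.map(_.priority).max
    val winners = pokes.filter(_.priority == top)
    // Two threads at the same priority conflict even if the values agree.
    if (winners.map(_.thread).distinct.size > 1)
      Left(s"conflicting drivers on $wire at priority $top")
    else
      Right(wire -> winners.last.value) // same thread: the last poke wins
  }

  // Resolves all pokes of one cycle into a value per wire, or an error.
  def resolve(pokes: Seq[Poke]): Either[String, Map[String, BigInt]] = {
    val results = pokes.groupBy(_.wire).toSeq.sortBy(_._1).map {
      case (wire, ps) => resolveWire(wire, ps)
    }
    results.collectFirst { case Left(err) => err } match {
      case Some(err) => Left(err)
      case None      => Right(results.collect { case Right(kv) => kv }.toMap)
    }
  }
}
```

A low-priority `valid = 0` default from one thread and a normal-priority `valid = 1` from another resolve to 1, while two equal-priority pokes from different threads are rejected.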

Misc

No backwards compatibility. As all of the current Chisel testers are extremely limited in capability, many projects have opted to use other testing infrastructure. Migrating existing test code to this new infrastructure will require rewriting. Existing test systems will be deprecated but may continue to be maintained in parallel.

It may be possible to create a compatibility layer that exposes the old API.

Mock construction and blackbox testing. This API may be sufficient to act as a mock construction API, and may enable testing of black boxes (in conjunction with a Verilog simulator).

Examples

Decoupled, linear style

implicit class DecoupledTester[T](in: Decoupled[T]) {
  // Alternatively, this could directly be in Decoupled
  def enqueue(data: T) {
    require(in.ready, true.B)
    in.valid.poke(true.B)
    in.bits.poke(data)
    step(1)
    in.valid.poke(false.B, priority=low)
  }
}

// Testdriver is a subclass of Module, which must be called from a Tester environment.
// Example DUT-as-child structure
class MyTester extends Testdriver {
  val myDut = Module(new MyModule())
  // MyModule with IO(new Bundle {
  //  val in = Flipped(Decoupled(UInt(8.W)))
  //  val out = Decoupled(UInt(8.W))  // transaction of in + ctrl
  //  val in2 = Flipped(Decoupled(UInt(8.W)))
  //  val out2 = Decoupled(UInt(8.W))  // transaction of in + ctrl
  //  val ctrl = UInt(8.W)
  //} )

  myDut.io.in.enqueue(42.U)  // steps a cycle inside
  myDut.io.out.dequeueExpect(43.U)  // waits for output valid, checks bits, sets ready, step
  myDut.io.ctrl.poke(2.U)  // .poke added by PML to UInt
  myDut.io.in.enqueue(45.U)
  myDut.io.out.dequeueExpect(47.U)

  // or with parallel constructs
  myDut.io.ctrl.poke(4.U)

  join(fork {
    myDut.io.in.enqueue(44.U)
    myDut.io.out.dequeueExpect(48.U)
    myDut.io.in.enqueue(46.U)
    myDut.io.out.dequeueExpect(50.U)
  } .fork {  // can be called on a thread-list, spawns a new thread that runs in parallel with the threads on the list - lightweight syntax for spawning many parallel threads
    myDut.io.in2.enqueue(1.U)
    myDut.io.out2.dequeueExpect(5.U)
    myDut.io.in2.enqueue(7.U)
    myDut.io.out2.dequeueExpect(11.U)
  })
  // tester ends at end of TestDriver and when all spawned threads completed
}

External Extensions

These items are related to testing, but are mostly orthogonal and can be developed separately. However, they will be expected to interoperate well with testers:

SystemVerilog Assertions (basically LTL on circuits)
Constrained random generation
Memory initialization

grebe commented 6 years ago

I think the proposal should say something about X propagation.

@jackkoenig talked about poison in the firrtl interpreter as a closely related idea to X. The idea could be formalized more. Verilog blackboxes that interface with the firrtl interpreter could interpret X's as poison and do their own randomization.

@albert-magyar had an interesting idea about being able to annotate individual registers as having different X behavior (i.e. pessimistic, optimistic or random, perhaps with random as the default). Firrtl could define semantics for how wires with different X behavior could be connected (i.e. random Xs can be assigned to any kind of X, optimistic+pessimistic should be mutually exclusive without some explicit cast).

ducky64 commented 6 years ago

Discussion on combinational vs. stale (beginning-of-cycle) peeks: expose both APIs, with combinational peek being the default (since it does what the programmer expects, and will fail noisily). Users can fall back to stale peeks if combinational peeks fail, and we may consider changing stale peeks to the default if the false positive rate from reachability analysis is too high. Both APIs are expected to coexist, with stale peeks not running reachability analysis, and returning the value the circuit had right after the rising edge (and before any pokes would have executed).

Resolution: implement stalePeek first, then peek. We don't think it's possible to implement (combinational) peek using stalePeek.

ducky64 commented 6 years ago

Multiclock semantics proposal: There will be some way to specify the timing relationships between clocks. Details TBD, but this may be a testdriver thread that just drives clocks. step() (optionally?) takes a clock as an argument and blocks until the next rising edge of that clock. Implications: peeks, pokes, etc. happen eagerly after a step (conceptually immediately after the rising edge of the argument clock). A testdriver / thread can call step on different clocks: for example, step(clk1) then step(clk2) would result in the second step waiting for the edge of clk2 after the edge of clk1.
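The step(clk1) then step(clk2) behavior can be illustrated with a little arithmetic, assuming each clock is described by a period and phase in some abstract time unit. That representation, and the names `ClockEdges`, `Clock`, and `nextRisingEdge`, are assumptions of this sketch, since the timing specification is still TBD.

```scala
object ClockEdges {
  // Rising edges occur at phase, phase + period, phase + 2 * period, ...
  final case class Clock(period: Long, phase: Long) {
    require(period > 0 && phase >= 0)
  }

  // Time of the first rising edge strictly after `now`.
  def nextRisingEdge(clk: Clock, now: Long): Long =
    if (now < clk.phase) clk.phase
    else clk.phase + ((now - clk.phase) / clk.period + 1) * clk.period
}
```

With clk1 at period 10, phase 0 and clk2 at period 4, phase 1, a thread starting at time 0 that calls step(clk1) resumes at time 10, and a following step(clk2) resumes at time 13, the first clk2 edge after that clk1 edge.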

jackkoenig commented 6 years ago

I think we should lay out what our primary concerns are:

This should be the framework rocket-chip uses and designers want to use
- This means unit testing and full system verification
It should be fast
- As a heuristic I think this means 1 cycle* of the whole testing infrastructure should be no slower than 1 cycle of evaluating the DUT
It should be deterministic
- This includes multithreading**

*By cycle of the tester I mean the runtime of tester logic that is required in between steps of the DUT

**Obviously in arbitrary Scala code people can do whatever thread unsafe stuff they want, but when it comes to the Tester APIs, there should be a requirement of determinism (eg. if thread A pokes an input that combinationally affects an output peeked by thread B, thread ordering cannot affect the outcome).

schoeberl commented 6 years ago

You want to use multithreading for testing, but need to synchronize at each clock tick? I think this will introduce a large performance overhead. Or is it just for a nice concurrent programming model for the testing code?

Having a nice programming model for testing concurrent clocked systems is in my opinion an interesting and challenging question. I worked a little bit on this while testing a multicore arbitration circuit, but I am far away from a decent elegant solution. I am still at the level of writing concurrent FSMs in software to simulate the clients :-(

ducky64 commented 6 years ago

Threading is mainly intended as the concurrency programming model. However, because Scala coroutines are kind of a mess and appear insufficient, threading will probably also be the implementation strategy.

The main reason for this programming model is to eliminate the need to write a custom FSM as a stand-in for a program counter when multi-cycle concurrent actions are needed. Instead, actions that span multiple cycles but are otherwise logically related can be written directly (imperative style, actions directly following the previous). One example would be testing a shift register, the action for each element can be specified directly as 'poke this value, step some cycles, expect that value out', with pipelining of elements achieved by forking a thread for each element.

True concurrency isn't needed, the tester will actually schedule one thread to be running at any time (without guarantees on ordering, though). Threads are only used as a mechanism to keep track of multiple program counters.

Overall, the goal is to be suitable for both unit testing (allowing cycle-accurate tests) and integration testing (using composition of abstractions).

Of course, it remains to be seen if this is a good idea - potential issues include pitfalls / complexity of a threading model and (as you've mentioned) threading performance.

wachag commented 6 years ago

For concurrency have you considered the actor model? It could help partitioning the simulation to smaller entities.

The dataflow-like reactive streams could also be useful.

I am experimenting with such an implementation in one of my projects.

Gabor

ducky64 commented 6 years ago

I think the actor model is quite similar to how the AdvancedTester (https://github.com/freechipsproject/chisel-testers/blob/master/src/main/scala/chisel3/iotesters/AdvTester.scala) works. The absence of threading means that the user needs to manually sequence multi-cycle actions using a FSM (or similar), which is a lot of programming overhead and may not compose well. Cycle-accurate unit tests may also be difficult to achieve, though it's more suitable for integration level system testing.

Partitioning into actors is an interesting thought for improving performance, but this proposal mainly looks at the programming interface (how tests are written / specified) as long as potential optimizations aren't precluded.

Dolu1990 commented 6 years ago

Have you checked the SpinalSim API? https://spinalhdl.github.io/SpinalDoc/spinal/sim/example/single_clock_fifo/

ducky64 commented 6 years ago

@Dolu1990 I haven't yet, thanks for bringing it up! At a high level, it looks structurally similar to this proposal, which I guess is good news in indicating this proposal is sane...

Some interesting comments after reading through the docs:

I see SpinalSim uses operator overloading #= for pokes. We proposed poke operators in a previous tester, though ultimately decided against it. I think testing is one of those cases where operator overloading makes sense, since users will likely be writing many of those statements (so the upfront investment makes sense). Of course, this does have to be balanced with learnability (so new users don't just ragequit if the code appears too unreadable). Thoughts everyone?
I see there is fork/join implemented using continuations underneath. Did you have any thoughts on performance issues with threading (which probably involves an expensive kernel call), or learnability and usability issues with continuations? Performance with threading is one of the unknowns in this proposal, but continuations also can result in non-intuitive error messages (which is something we've put much effort into getting rid of).
I see that SpinalSim has a real-time model, which has been discussed as a possible follow-on API to this proposal, especially for testing some clock-crossing hardware.
I see that SpinalSim is using JNI (JNR-FFI) for performance. JNI is also something we've discussed implementing (among other possibilities - for instance synthesizing stimulus into Verilog, or using Firrterpreter), as inter-process communication is expensive and has been a bottleneck.

Dolu1990 commented 6 years ago

Basically it is close to the COCOTB Python API; my inspiration came in part from it.

I think that in practice the #= adds readability, but it's true that it "looks weird" to new people.
The continuations concept is great, but its Scala implementation has some dirty sides. See https://github.com/scala/scala-continuations/issues/36 . It is not the only one. But all those dirty sides only appear in the driver/monitor side of the testbench. As soon as you are back in a function that doesn't suspend execution, everything is fine. See https://github.com/SpinalHDL/SpinalHDL/blob/master/tester/src/test/scala/spinal/tester/scalatest/SpinalSimStreamFifoTester.scala#L79 and https://github.com/SpinalHDL/SpinalHDL/blob/7aff4524d9afebc98cfd919bd9b96da7304b176b/lib/src/main/scala/spinal/lib/sim/Stream.scala#L11 (Streams are like the Chisel Decoupled interface). You can build useful abstractions really easily, and then move away from the time-based things (and away from Scala continuation issues).
Continuations are really fast, native threads are really slow. Basically, on my laptop, on Windows (I like gaming), switching threads and coming back takes 3.3 us without CPU core affinity and 1.6 us with it. In my Linux VM it takes about 20 us to do a context switch, with a lot of jitter. For big RTL tests it probably doesn't really matter, but for small to medium designs it is a big overhead. A continuation, on the other hand, is about 50 ns per switch, maybe even less; I haven't checked exactly.
Also, the real-time model makes it easy to define independent abstractions. The one implemented in SpinalSim also emulates simulation delta cycles on top of Verilator. When you write a DUT signal, you can only read the changed value after the end of the current delta cycle, which avoids race conditions.
I initially used JNR-FFI, but I had some issues unloading shared libraries to free the memory after a simulation. Now it's pure JNI plus some runtime Java compilation to bind things without conflicts in the shared library symbols. A simple JNI call has about 8 ns of overhead.
I tried to implement the coroutine feature in C bound through JNI, but I really had issues with the Java JNI stack management. I haven't found a workaround.

Dolu1990 commented 6 years ago

Oh, another thing you can't do with Scala continuations is suspend execution inside a Scala for loop. (You can work around it by having your own suspendable utilities, like Suspendable.repeat(count = 100){ ... }.)

ducky64 commented 6 years ago

Yeah, the limitations of continuations seem significant (also, rumor is that it's no longer being actively maintained - instead work is being put into scala-async). It's currently unclear how significant the threading limitation will be (for example, firrtl-interpreter can simulate GCD at 2MHz - so a 20us context switch would be a massive performance hit, but rocket-chip is going to simulate much slower to where the threading overhead may be negligible). One interesting direction would be to have users wrap test code in a macro annotation, so that the system could catch the common gotchas and rewrite it in an async / continuations friendly style. But it's probably worth seeing how bad threading is on a reasonably large system first.

Dolu1990 commented 6 years ago

Right, Scala continuations don't look actively maintained. But at least they are ported to Scala 2.12 and 2.13. Hopefully, in the worst case, moving from a continuation-based implementation to a native threading implementation would not require any change in the user code.

About the overhead, 20us multiplied by the number of agents/threads that you need to wake up in the TB could be significant.
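The figures quoted in the last few comments (a 2 MHz simulation, about 20 us per thread switch, about 50 ns per continuation switch) can be turned into a rough slowdown estimate. The cost model below is an assumption for illustration: it charges one switch per woken thread per simulated cycle and ignores everything else.

```scala
object SwitchOverhead {
  // Slowdown factor relative to a simulation with no switching cost:
  // (cycle time + threads * switch cost) / cycle time.
  def slowdown(simHz: Double, switchSeconds: Double, threadsPerCycle: Int): Double = {
    val cycleSeconds = 1.0 / simHz
    (cycleSeconds + threadsPerCycle * switchSeconds) / cycleSeconds
  }
}
```

Under this model, `slowdown(2e6, 20e-6, 1)` is about 41, while `slowdown(2e6, 50e-9, 1)` is about 1.1, and the thread case grows linearly with the number of threads woken each cycle.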

ducky64 commented 6 years ago

Yeah, fair point about the 20us per thread, it might have scaling issues. But the nice thing about code in development is that the API can still be fluid. I think it makes sense to test and benchmark with a nonoptimal version first, and if the results aren't reasonable, we can explore the use of continuations / async without breaking user code (because there's no official API yet). My main concern about both is the learning curve for users, especially where the user needs to type in some magic keywords, or where the compiler might not give a helpful error message. But if performance is a massive concern, there are many possibilities.

ducky64 commented 6 years ago

Also, I've gotten a basic system up. Check out the test code in https://github.com/freechipsproject/chisel3/blob/testers2/src/test/scala/chisel3/tests/BasicTest.scala As always, comments are welcome.

Interesting notes: the global context is split between the tester backend (Firrterpreter/Verilator/VCS/whatever) and test environment (like ScalaTest) to allow customizations for both. It also turns out that ScalaTest has an API for specifying user code location, so it can properly report the c.io.out.check(...) line that failed, instead of the line internal to check. However, if layering test functions becomes a thing (and I hope it does), we'll need some kind of mechanism to report multiple relevant locations, for example, the line in (the hypothetical) DecoupledTester that calls check, and the line in user code that calls DecoupledTester.dequeue.

ducky64 commented 6 years ago

We discussed details at the meeting today, notes:

Overall examples BasicTest.scala, QueueTester.scala look reasonable
Possible driving examples: rocket-chip unit tests, concurrency for CRAFT
Possible multiclock examples: AsyncFifo. CRAFT may have additional applications that are more analog, it's unclear if we want to target those for this version (or if they're feasible to model in Chisel at all).
Interesting features to explore (future work): avoiding unnecessary re-elaborations, bundle literals
[ ] Multiclock semantics: a single dedicated clock thread is probably sufficient (independent of testing logic, which can only wait on a clock edge, but not trigger one). Inter-thread communication mechanisms may allow data-dependent clock.
[ ] Multithreaded poke resolution: poke semantics will be:
- pokes must remain in effect throughout an entire clock cycle, unless overridden by another poke of higher priority (in any thread) or a poke of equivalent priority (in the same thread). It is an error (and the test bench will fail) if these semantics are violated, because test results can be thread-execution-order-dependent.
- no additional syntax will be necessary to associate a poke with a clock in the single-clock case
- multiclock semantics TBD, potentially inferring from the preceding or following step? Overall "make the common case easy, make the hard case possible"
[x] in the common case, tests should be backend independent (use default backend)

ducky64 commented 6 years ago

Merge strategy was discussed at today's meeting, this will be in a chisel3.experimentals.testers2._ package. When it goes mainline (chisel3.testers2._), that import will either break, or will give a deprecation warning, based on what we can do. It's experimental, so it's fine if users need to update their code on a major release.

Target is for a merge in 2-4 weeks. Dependent on literal types.

Also, someone please come up with a better name than testers2.

ducky64 commented 6 years ago

Literal types turned out to be a bust, so we're going to go with runtime checks.

Anyways, the discussion has now turned towards timing semantics, or attaching durations to tester actions like pokes.

Current latching semantics

Currently, uninitialized inputs are randomized, and poke is defined to set the value on the wire until it is overridden. If there are multiple pokes on the same timestep from different threads, that is an error (because the result is thread order dependent). A priority system exists to resolve pokes from different threads on the same time step, mainly used to revert a wire to a default value (for example, to set valid low after a transaction, but also allow it to be overridden for another transaction that cycle).

However, it seems more natural to instead specify a default value, then let a poke override that for some duration, like a clock cycle, and reverting automatically when the duration is over. Additionally, having an explicit duration can have signals revert to invalid (X-ish) and prevent certain bugs caused by values latching for longer than they were expected to.

Several poke duration proposals

In order, with the most promising (in my opinion) first:

Duration scopes

Idea: pokes last until the end of their duration, delineated by some kind of scope. For example, a Decoupled transaction might look like:

io.valid.weakPoke(false.B)
timeScope {
  io.valid.poke(true.B)
  io.bits.poke(myBits)
  io.clock.step(1)
}

so when the timeScope falls off the edge, the (higher priority) pokes on valid and bits end, and they revert to the lower priority poke where valid is false and bits defaults to invalid.

Sequential pokes within a timescope would continue to have current semantics (latching until overridden), but they would all be invalidated / cleared at the end of the timescope.
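The revert-on-scope-exit behavior can be modeled with a stack of poke layers, as in the sketch below. `ScopedPokes` is a hypothetical class for this example: it has no clock or step, the weak poke is approximated by a poke made outside the scope, and `None` stands in for an invalid (undriven) wire.

```scala
// Hypothetical model of duration scopes: each open timeScope gets its own
// layer of pokes, and the layer is discarded when the scope ends.
final class ScopedPokes {
  // Head of the list is the innermost open scope.
  private var layers: List[Map[String, BigInt]] = List(Map.empty[String, BigInt])

  // Within a scope, pokes latch until overridden by a later poke.
  def poke(wire: String, v: BigInt): Unit =
    layers = (layers.head + (wire -> v)) :: layers.tail

  // The innermost scope that drives the wire wins; None means invalid.
  def value(wire: String): Option[BigInt] =
    layers.collectFirst { case layer if layer.contains(wire) => layer(wire) }

  // Runs `body` in a fresh scope and clears its pokes afterwards.
  def timeScope[A](body: => A): A = {
    layers = Map.empty[String, BigInt] :: layers
    try body
    finally layers = layers.tail
  }
}
```

Poking `valid` to 0 outside a scope, then poking `valid` to 1 and `bits` to some value inside `timeScope { ... }`, leaves `valid` at 0 and `bits` undriven once the scope closes, matching the Decoupled example above.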

Pros:

peek continues to have existing semantics without timescope
handles combinational logic well
poke duration is defined without additional overhead (because you probably needed to step there anyways)

Cons:

unknown how to associate time durations with peeks, where it's rarer to need something held for longer - maybe timeScope only applies to pokes, and peeks are held until the end of the clock cycle?
attention needs to be given to nesting timescopes, particularly with libraries, though defining the timescope statement as not ending until its contents end may be fine, and explicit forks can be done otherwise
attention needs to be given to Bundle level library functions which might not have access to a Module level clock - either a clock needs to be passed in and tracked, or some way to get a clock from wires (perhaps examining registers on the combinational paths?)

Latching / non-latching constructs

Idea: Separate latching poke and nonlatching poke constructs. The nonlatching poke construct would have a time duration associated with it, and the wire value would revert to a lower priority value after its duration. The current thought is that the latching poke would have lower priority (used to set a default value).

Pros:

allows both modes, the familiar latching mode and the safer / stricter nonlatching explicit-duration mode
the latching low priority pattern fits the use case of default / temporary overrides well

Cons:

potential confusing interactions with top-level code and library code written by different people using latching or nonlatching mode, though this is only a problem when wires are being driven from multiple places, which is probably an antipattern
some way needed to associate a clock with a number-of-cycles duration, to avoid the boilerplate of explicitly specifying a clock with each poke

Unified semantics

Idea: Only have a single poke construct with a duration, which can either default to one cycle or infinite (latching). Sequential pokes from the same thread would override earlier pokes regardless of duration (or maybe only if at the default duration?). A priority system (like weakPoke) can be used for default/override.

Pros:

single construct
elegant unified implementation

Cons:

rules may be unclear, and are somewhat separated from usage patterns (implementation-focused rather than usage-focused)
also need some ways to associate a clock with a number-of-cycles duration

Thoughts?

seldridge commented 6 years ago

I'll give the rest of this a read and provide some feedback.

@ducky64: This came up when going through the generator bootcamp with some questions related to multiclock testing. This was the solution that I came up with. It defines a MultiClockTester with an abstract member def clocks: Seq[ClockInfo]. ClockInfo defines a mapping of each clock to a period and phase. Based on the period/phases specified, this generates clocks for you and connects them.

Multiclock module:

import chisel3._
import chisel3.experimental.withClock

class MultiClockModule extends Module {
  val io = IO(
    new Bundle {
      val clockA = Input(Clock())
      val clockB = Input(Clock())
      val clockC = Input(Clock())
      val a = Output(Bool())
      val b = Output(Bool())
      val c = Output(Bool())
    })

  /* Make each output (a, b, c) toggle using their respective clocks
   * (clockA, clockB, clockC) */
  Seq(io.clockA, io.clockB, io.clockC)
    .zip(Seq(io.a, io.b, io.c))
    .foreach{ case (clk, out) => { withClock(clk) { out := RegNext(~out) } }}
}

Multiclock test:

import chisel3._
import chisel3.experimental.RawModule
import chisel3.util.Counter
import chisel3.testers.{BasicTester, TesterDriver}
import chisel3.iotesters.{PeekPokeTester, ChiselFlatSpec}

/** A description of the period and phase associated with a specific
  * clock */
case class ClockInfo(signal: Clock, period: Int, phase: Int = 0)

/** A clock generator of a specific period and phase */
class ClockGen(period: Int, phase: Int = 0) extends Module {
  require(period > 0)
  require(phase >= 0)
  val io = IO(
    new Bundle {
      val clockOut = Output(Bool())
    })

  println(s"Creating clock generation with period $period, phase $phase")

  val (_, start) = Counter(true.B, phase)
  val started = RegInit(false.B)
  started := started | start

  val (count, _) = Counter(started, period)
  io.clockOut := count >= (period / 2).U
}

trait MultiClockTester extends BasicTester {
  self: BasicTester =>

  /* Abstract method (you need to fill this in) that describes the clocks */
  def clocks: Seq[ClockInfo]

  /* The finish method is called just before elaboration by TesterDriver.
   * This is used to generate and connect the clocks defined by the
   * ClockInfo of this module. */
  override def finish(): Unit = {
    val scale = clocks
      .map{ case ClockInfo(_, p, _) => p % 2 == 0 }
      .reduce( _ && _ ) match {
        case true => 1
        case false => 2 }
    clocks.foreach{ case ClockInfo(c, p, ph) =>
      c := Module(new ClockGen(p * scale, ph * scale)).io.clockOut.asClock }
  }
}

class MultiClockTest(timeout: Int) extends BasicTester with MultiClockTester {

  /* Instantiate the design under test */
  val dut = Module(new MultiClockModule)

  /* Define the clocks */
  val clocks = Seq(
    ClockInfo(dut.io.clockA, 3),
    ClockInfo(dut.io.clockB, 7),
    ClockInfo(dut.io.clockC, 7, 2))

  val (countA, _) = Counter(dut.io.a =/= RegNext(dut.io.a), timeout)
  val (countB, _) = Counter(dut.io.b =/= RegNext(dut.io.b), timeout)
  val (countC, _) = Counter(dut.io.c =/= RegNext(dut.io.c), timeout)

  val (_, timeoutOccurred) = Counter(true.B, timeout)
  when (timeoutOccurred) {
    printf(p"In ${timeout.U} ticks, io.a ticked $countA, io.b ticked $countB, io.c ticked $countC\n")
    stop()
  }
}

class MultiClockSpec extends ChiselFlatSpec {

  "ClockGen" should "throw exceptions on bad inputs" in {
    Seq(() => new ClockGen(0, 0),
        () => new ClockGen(1, -1))
      .foreach( gen =>
        intercept[IllegalArgumentException] { Driver.elaborate(gen) } )
  }

  "MultiClockTest" should "work" in {
    TesterDriver.execute(() => new MultiClockTest(128))
  }
}
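
The intent of the scale computation in finish() appears to be: use 1 when every period is even, otherwise 2, so that `period / 2` in ClockGen divides exactly. A minimal plain-Scala sketch of that reading follows. `ClockScale`, `scaleFor` and `levels` are hypothetical names that are not part of the test above, and `levels` is an idealized waveform that ignores phase and start-up behavior.

```scala
object ClockScale {
  /** 1 if every period is even, otherwise 2 (so every scaled period is even). */
  def scaleFor(periods: Seq[Int]): Int =
    if (periods.forall(_ % 2 == 0)) 1 else 2

  /** Idealized clock levels: low for the first period / 2 ticks, then high. */
  def levels(period: Int, ticks: Int): Seq[Boolean] =
    (0 until ticks).map(t => (t % period) >= period / 2)
}
```

For the periods used in the test (3, 7, 7) this picks a scale of 2. Without scaling, a period of 3 would be low for one tick and high for two, which is why odd periods are doubled.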

[Screenshot: 2018-03-09-181626_1000x600_scrot]

chick commented 6 years ago

@ducky64 This is a late comment, but it would be nice for this development to include the ability to test a DUT against a golden model. The golden model might be an earlier version of the DUT whose behavior the new version must match. The golden model should also be implementable in Scala, perhaps as some sort of mock.

ducky64 commented 6 years ago

More discussion on testers happened at today's meeting, mostly focusing on allowing combinational peek-after-poke behavior across threads. The driving use case is various interlocked Decoupled-to-Decoupled topologies: one-to-one (straightforward), many-to-one (output fires when all inputs are valid, a reduction operation is applied to inputs), one-to-many (replication across many outputs), and many-to-many. In all cases, a transaction happens only when all outputs are ready and all inputs are valid, and testdrivers are organized as two phases: poking input lines on the first phase, and peeking output lines on the second phase for scoreboarding.

Proposed solutions:

Having phases associated with threads, for example, a scoreboarding phase might follow the main testdriver phase. There is an implicit global synchronization barrier between phases within the same timestep, so peeks after pokes crossing a phase barrier are well-ordered. Threads cannot be associated with an earlier phase than the parent - though if there is a good use case we can look into this. This is the current plan.
- For now, there will be a small list of pre-defined phases (testdriver, scoreboard - in that order). In the future, we could look into user-defined phases, and how to make them interact well.
Have phases, but move between phases with a barrier construct. A slightly less restrictive version of the above. However, it's unclear whether this finer granularity is needed, so we'll implement the simpler case first.
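
As a rough illustration of the phase idea (a plain-Scala sketch only, not the planned implementation), the model below runs every testdriver action of a timestep before any scoreboard action, which is what makes a cross-thread peek-after-poke well-ordered. `PhaseSketch`, `TestDriver`, `Scoreboard` and `runTimestep` are hypothetical names, and actions are plain closures rather than real threads.

```scala
object PhaseSketch {
  sealed trait Phase
  case object TestDriver extends Phase
  case object Scoreboard extends Phase

  // Fixed phase order within one timestep.
  val order: Seq[Phase] = Seq(TestDriver, Scoreboard)

  /** Run one timestep: all actions of a phase finish before the next phase starts. */
  def runTimestep(actions: Seq[(Phase, () => Unit)]): Unit =
    order.foreach { phase =>
      actions
        .filter { case (p, _) => p == phase }
        .foreach { case (_, act) => act() }
    }
}
```

Even if a scoreboard action is registered before the testdriver action that pokes the wire, the scoreboard still observes the poked value, because registration order does not matter across the phase barrier.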

These other solutions were also discussed but did not gain significant traction:

Use some sort of constraint construct to sequence actions between threads. Very powerful and fine-grained, but it's unclear whether this will be cumbersome or confusing.
Allow combinational dependencies based on lexical order in which threads were spawned. Also powerful, but possibly error-prone and unintuitive.
Build a custom testdriver construct for the specific queue topology that avoids cross-thread read-after-write, either by doing everything in one thread, or avoiding the read action. No action is required from the testers infrastructure, but limits the goal of testdriver reuse.

ducky64 commented 6 years ago

@chick I've pushed a failing clock-crossing test case to the testers2 branch

...
    test(new Module {
      val io = IO(new Bundle {
        val divIn = Input(UInt(8.W))
        val mainOut = Output(UInt(8.W))
      })

      val divClock = RegInit(true.B)
      divClock := !divClock

      val divRegWire = Wire(UInt())
      withClock(divClock.asClock) {
        val divReg = RegNext(io.divIn, 1.U)
        divRegWire := divReg
      }

      val mainReg = RegNext(divRegWire, 0.U)
      io.mainOut := mainReg
    }) { c =>
...

sbt:chisel3> testOnly chisel3.tests.ClockCrossingTest
...
[info] [0.000] Elaborating design...
[info] [0.172] Done elaborating.
Total FIRRTL Compile Time: 327.0 ms
[info] ClockCrossingTest:
[info] Testers2 with clock crossing signals
[info] - should test crossing from a 2:1 divider domain *** FAILED ***
[info]   firrtl.passes.CheckChirrtl$UndeclaredReferenceException: @[:@3.2]: [module ClockCrossingTestanonfun1anonfunapplymcVsp1anon2] Reference _T_14 is not declared.
[info]   at firrtl.passes.CheckChirrtl$.firrtl$passes$CheckChirrtl$$checkChirrtlE$1(CheckChirrtl.scala:74)
[info]   at firrtl.passes.CheckChirrtl$$anonfun$firrtl$passes$CheckChirrtl$$checkChirrtlS$1$3.apply(CheckChirrtl.scala:103)
[info]   at firrtl.passes.CheckChirrtl$$anonfun$firrtl$passes$CheckChirrtl$$checkChirrtlS$1$3.apply(CheckChirrtl.scala:103)
[info]   at firrtl.ir.DefRegister.mapExpr(IR.scala:224)
[info]   at firrtl.Mappers$StmtMagnet$$anon$2.map(Mappers.scala:19)
[info]   at firrtl.Mappers$StmtMap$.map$extension(Mappers.scala:33)
[info]   at firrtl.passes.CheckChirrtl$.firrtl$passes$CheckChirrtl$$checkChirrtlS$1(CheckChirrtl.scala:103)
[info]   at firrtl.passes.CheckChirrtl$$anonfun$firrtl$passes$CheckChirrtl$$checkChirrtlS$1$5.apply(CheckChirrtl.scala:104)
[info]   at firrtl.passes.CheckChirrtl$$anonfun$firrtl$passes$CheckChirrtl$$checkChirrtlS$1$5.apply(CheckChirrtl.scala:104)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[info]   ...
[info] ScalaTest
[info] Run completed in 1 second, 302 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
[info] *** 1 TEST FAILED ***
[error] Failed: Total 1, Failed 1, Errors 0, Passed 0
[error] Failed tests:
[error]         chisel3.tests.ClockCrossingTest
[error] (Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 6 s, completed Jun 12, 2018 12:34:37 PM

But if I comment out the withClock(...) { and the corresponding closing brace, the test starts fine.

chick commented 6 years ago

This is due to Firrtl Issue #749

ducky64 commented 6 years ago

Updates from the last few meetings:

Autoclock

One issue with clock.step is that any abstraction that needs to step time also needs a reference to the clock. What ends up happening a lot is creating custom testdrivers like

class ReadyValidSource[T <: Data](x: ReadyValidIO[T], clk: Clock) {

which is a thin layer over the bundle of interest and the clock, and really just boilerplate.

Proposal is to infer the clock from any signal (or a group of signals, assuming they have a common clock), so you can do something like mySignal.step, without needing an explicit reference to the clock. The mechanism could be similar to combinational loop detection: get all the registers backwards and forwards from a signal, check that they all have the same clock, and return that clock.

When multiple clocks are associated with a wire, explicit clock specification will still be required. But this is expected to be an uncommon use case (standard Chisel paradigm is single-clock Modules), so it makes sense to provide an optimization for the common case.
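
A minimal sketch of the inference mechanism, assuming a toy netlist instead of real FIRRTL: walk through combinational nodes forwards and backwards from the signal, stop at registers, and return the clock only if all of those registers share one. `AutoClockSketch`, `Netlist` and `inferClock` are hypothetical names, and signals and clocks are identified by plain strings.

```scala
object AutoClockSketch {
  // Toy netlist: a hypothetical stand-in for an elaborated circuit.
  final case class Netlist(
    edges: Set[(String, String)],   // (driver, sink) connections
    regClock: Map[String, String]   // register name -> clock name
  )

  /** Registers reachable from `start` through combinational nodes, in one direction. */
  private def reach(n: Netlist, start: String, forward: Boolean): Set[String] = {
    def next(node: String): Set[String] =
      n.edges.collect {
        case (from, to) if forward && from == node => to
        case (from, to) if !forward && to == node => from
      }
    @annotation.tailrec
    def loop(frontier: Set[String], seen: Set[String], regs: Set[String]): Set[String] =
      if (frontier.isEmpty) regs
      else {
        // Stop at registers; keep walking through combinational nodes.
        val (hitRegs, comb) = frontier.partition(n.regClock.contains)
        val newSeen = seen ++ frontier
        loop(comb.flatMap(next) -- newSeen, newSeen, regs ++ hitRegs)
      }
    loop(next(start), Set(start), Set.empty)
  }

  /** The single clock shared by all neighbouring registers, or None if none or ambiguous. */
  def inferClock(n: Netlist, signal: String): Option[String] = {
    val self = Set(signal).filter(n.regClock.contains)
    val regs = self ++ reach(n, signal, forward = true) ++ reach(n, signal, forward = false)
    val clocks = regs.map(n.regClock)
    if (clocks.size == 1) clocks.headOption else None
  }
}
```

Returning `None` for the ambiguous case is where a real implementation would need to fail loudly and ask for an explicit clock.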

We discussed this at the meeting last week, and consensus was to try it. The main concern was that this actually needs to work almost all the time, and fail clearly when it doesn't.

Test library style, whether to fork by default

If the common case is sequential actions, it makes sense for library functions to not fork by default; instead, they would block until completion (but probably have internal pokes be duration-scoped). Where parallelism is desired, fork (or a more concise syntax for multiple forks, like parallel(..., ..., ...)) should be used explicitly.

Using joins to advance time was stylistically confusing and discouraged (except, perhaps, where a library function is variable-latency). Having blocking library functions and explicit parallelism could prevent this, and better match expectations from standard imperative programming.

Development roadmap

(a loosely prioritized list of features that need to be implemented)

combinational poke / peek-after-poke conflict detection
refactor utilities to not fork by default, add concise parallel(..., ..., ...) construct
implicit clock resolution
phases
anything I missed?

jackkoenig commented 5 years ago

https://github.com/ucb-bar/berkeley-hardfloat is a possible target for the new testers API. It is currently compiled with Chisel2 because its tests (written in C) are written against the Chisel2-generated C API. The whole codebase might be able to be migrated to Chisel3 if we can update the tests to work with chisel3. The Chisel code itself is written in the union of Chisel2 and Chisel3-compatibility-mode (and is currently compiled using Chisel 3 in rocket-chip).

ducky64 commented 5 years ago

One problem while adding the combinational path checking to a non-toy design: weakPokes / default values are done in the main thread (and last indefinitely) and cause an error when something combinationally dependent on them is read from another thread.

Potential solutions I can think of right now, none of them great:

Relax checking so that only thread-order-dependent operations are affected - that is, a poke can only cause a conflict with a peek on the timestep the poke starts, as opposed to current behavior where the poke will cause a conflict as long as it is in effect. After one timestep, there should be no more thread order dependence effects, but depending on cross-thread effects might still be bad. This also requires at least one timestep between the peek and poke, which is problematic.
Create a special case construct, like pull to set default values. Pulls must be specified before any other operation takes place (so there's no order issues) - this can either be runtime checked (nasty) or done in another scope (outside the main testdriver block - also nasty).
- Related issue: the default pulls / weakPokes are the main reason we need to have wrapper classes, like ValidSourceDriver (the other reason is to specify a clock, though implicit clock resolution should address that). Having some way to associate a default value with a wire could eliminate the need for wrapper classes.
Don't check peeks against pokes of parent threads that were in effect before the child thread was spawned, on the idea that the operations are well-ordered in code. This creates a special case, but is arguably intuitive in that the lexical order is followed.

ducky64 commented 5 years ago

Resolutions from today's dev meeting:

For default values: go with option 3, don't check peeks against pokes (even combinational ones) in the parent thread before the thread spawned. Potential performance implications, but shouldn't be significant for small (unit) tests.
For parallel: return a list of threads, like the fork syntax. Replication of the parallel to emulate varargs is acceptable, because Scala does not have a good syntax for parameter-less anonymous functions.
- This syntax was considered to be hot garbage:
```
parallel(
() => a.enqueue(...),
() => b.enqueue(...),
() => c.enqueue(...)
).join
```
- This syntax would be ideal
```
parallel(
{ a.enqueue(...) },
{ b.enqueue(...) },
{ c.enqueue(...) }
).join
```
  however, the issue is that {...} doesn't create an anonymous function, but just returns the result of executing the block, so call-by-name is still needed. In this case, the { } are optional, but could be useful for uniformity if running sequential items in a thread.
- Another option (which I will implement) is just to make fork varargs, instead of having an alternative parallel construct.
A use case that needs to be supported is re-using elaborated / compiled modules. The current implementation shouldn't preclude it, since IOs are resolved to string names before the command is communicated to the backend. This allows for testers2 to run with a blackbox that provides the module interface, and the backend could refer to some other circuit, even a non-Chisel circuit (like post-syn / post-PAR verilog).
Wrappers (like ValidSourceDriver(...)) should be avoided if possible. The Decoupled adapter can be eliminated with implicit clock resolution, and something like myDecoupled.init() can poke the default values. More implicit initialization of default values was not discussed.
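
For reference, the "replicate to emulate varargs" option can be sketched with call-by-name overloads. This is a plain-JVM illustration under stated assumptions, not the testers2 scheduler: `ParallelSketch` and `ThreadGroupHandle` are hypothetical names, and real `java.lang.Thread`s stand in for tester threads. Scala 2 rejects by-name repeated parameters, hence one overload per arity.

```scala
object ParallelSketch {
  /** Handle over a group of spawned threads that can be joined together. */
  final class ThreadGroupHandle(threads: Seq[Thread]) {
    def join(): Unit = threads.foreach(_.join())
  }

  private def spawn(body: () => Unit): Thread = {
    val t = new Thread(new Runnable { def run(): Unit = body() })
    t.start()
    t
  }

  // One overload per arity, since by-name varargs are not available.
  def parallel(a: => Unit, b: => Unit): ThreadGroupHandle =
    new ThreadGroupHandle(Seq(spawn(() => a), spawn(() => b)))

  def parallel(a: => Unit, b: => Unit, c: => Unit): ThreadGroupHandle =
    new ThreadGroupHandle(Seq(spawn(() => a), spawn(() => b), spawn(() => c)))
}
```

At the call site this reads like the "ideal" syntax above, `parallel({ ... }, { ... }, { ... }).join()`, because each argument block is passed by name and only evaluated inside its own thread.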

ducky64 commented 5 years ago

So the 'threads can run at any time' idea turns out to have some pitfalls, like when running back-to-back operations:

fork { a.enqueue(0.U) }  // poke for one cycle
clk.step(1)
fork { a.enqueue(1.U) }  // poke for one cycle, but the above enqueue also ends on this cycle

Ideally, on the clock cycle after the step, the 0.U should fall out of scope before the 1.U poke starts, but if thread execution order is arbitrary, then this doesn't hold. The main issue that this causes is with peeks, since that is partially dependent on execution order ("clock low" circuit propagation is fine, since the active pokes at the end of a timestep need to be unambiguous).

Proposal: impose partial thread execution order, specifically:

a newly spawned thread executes at the fork point or after
otherwise, a thread always executes before its parent
there are no ordering constraints between sibling-level threads
inter-thread dependent operations are still checked and flagged as errors:
- pokes must follow overlapping timescope semantics (pokes must be strictly encapsulated in its parent timescope, and simultaneous sibling-level pokes are invalid)
- there must be an unambiguous poke for each peek, defined as either a preceding poke in the same thread, or a preceding poke in a parent thread that happened before its immediate child spawned (effectively lexically unambiguous). These must not be overridden, except within the same thread, or for a child thread where the immediate child spawned after the peek.

This kind of matches structured programming semantics (child threads are encapsulated, in a sense, by its parent thread). The main downside would be constraints on inter-thread communications (for example, joining on a parent thread or parent-to-child communication mechanisms could create a one-timestep delay, though those aren't things I've given a lot of thought to either).
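
One way to picture the "child before parent" rule is a post-order walk of the thread tree. The sketch below is a plain-Scala model with hypothetical names (`ThreadOrderSketch`, `TestThread`, `executionOrder`); it returns just one legal order, picks spawn order between siblings even though the proposal leaves that unconstrained, and does not model the rule for newly spawned threads.

```scala
object ThreadOrderSketch {
  /** A test thread and the children it has forked (hypothetical model). */
  final case class TestThread(name: String, children: Seq[TestThread] = Nil)

  /** One legal execution order: every thread runs after all of its descendants. */
  def executionOrder(root: TestThread): Seq[String] =
    root.children.flatMap(executionOrder) :+ root.name
}
```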

Thoughts?

oharboe commented 5 years ago

I would like to have a way to peek() just after step(1) that reads out the value of a combinatorial output just before the positive edge of the clock, which is what would be clocked into a register on an FPGA.

I don't need combinatorial peek() and poke(); I just find them confusing (which is bad enough for me, as someone learning Chisel/FPGAs, but I'd say worse for complete beginners).

Today I have used the following workaround to be able to read out the value just before the rising edge of the clock using peek() immediately after a step(1):

https://groups.google.com/forum/#!topic/chisel-users/5qx9MQQQuRg

ducky64 commented 5 years ago

Why do you think combinational peek/poke are confusing? It seems straightforward (at least given the RTL abstraction): when you poke something, the effects can be seen.

I don't think it makes sense to read out the value before the rising edge right after a step: step means to fire a rising edge, so whatever happens after the step would happen after the rising edge. Wouldn't it make sense to peek out the value right before the step?

For composing actions in parallel, the proposal is to split a timestep into phases, so there would be the main phase (where most testdriver actions happen) and also a monitor phase, where you could peek into the circuit after all the main phase actions happen but before the step.
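
To illustrate the point about peeking before the step, here is a toy plain-Scala model of a design with one register. It is not Chisel or testers code, and `RegDut`, `peekComb` and `peekReg` are hypothetical names; `step()` stands for the rising edge.

```scala
/** Toy DUT: a combinational path (in + 1) and a register that samples `in`. */
final class RegDut {
  private var in: Int = 0
  private var reg: Int = 0

  def poke(value: Int): Unit = { in = value }
  def peekComb: Int = in + 1        // combinational view of the input
  def peekReg: Int = reg            // register output
  def step(): Unit = { reg = in }   // rising edge: register samples its input
}
```

After `poke(3)`, `peekComb` already reads 4 while `peekReg` still reads the old value. Only after `step()` does `peekReg` read 3, so a check on "what will be clocked in" belongs right before the step.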

oharboe commented 5 years ago

The reason why I find combinatorial peek/poke confusing, is because it's not what I need to test.

What I need to test is that the correct value would be clocked into a register connected to an output that I peek on the rising clock edge of a step(1).

combinatorial peek() and poke() are straightforward, but they can't be used to write the tests I need to write.

If that is unclear, I guess it underscores my point: it's confusing.

I've explained in more detail on the mailing list: https://groups.google.com/forum/#!topic/chisel-users/5qx9MQQQuRg

Regarding composing actions in parallel, that's not a big concern for me currently. I've looked briefly at cocotb, which looks like an easy to use, well thought out and powerful framework. It can be used together with Chisel, because it can test the Verilog. I like Chisel iotesters for simple tests, because they can easily run within the comfort of my IDE.

ducky64 commented 5 years ago

So you want peeks to be the output of an implicit register on the wire being peeked? That sounds like pretty nonintuitive behavior (unless this is actually industry standard practice for whatever reason - but I'd like examples and a rationale). Apparently this may have been the case in chisel2, though a lot of things weren't done in the greatest way in chisel2.

I looked at your example, and in the absence of parallel actions where you need a total ordering, is there any reason you can't put the peeks and expects right before the step (it might also help to think of step as clock rising edge)? I don't see why adding an implicit register would be less confusing or more intuitive?

shunshou commented 5 years ago

I agree that step should be considered as clock rising edge. When you test registers, you expect outputs to be available slightly after the rising edge (not exactly at the rising edge). When I'm verifying that things are functionally correct (ignoring any critical path timing issues, for Chisel or Verilog or VHDL designs...), I use a TB to feed in data some time after a rising edge (so it'll be registered on the next rising edge), and I expect outputs to be valid some small time after a rising edge. In this case, having the simulator peek and poke (starting at) falling edges actually makes the most sense for functional verification--and I think if you look at waveforms from Chisel tests, that's actually what it does? Although I haven't stared at them in a while.

chisel2 registering and peek/poke simulation was actually fundamentally incorrect (mostly from how things were registered IIRC). It generated bad Verilog in some cases that didn't match Chisel c++ simulations.

oharboe commented 5 years ago

I haven't been using Chisel and FPGAs for very long, I'm still learning, but here is an explanation to the best of my abilities:

I think you are asking me for an example of how to write a test-bench in an "industry standard tool" (probably something like ModelSim). I will ask my colleague, who's much more knowledgeable in FPGAs than me, if we can put together an example.

It disturbs me that the test-bench has to have intimate knowledge about implementation details, so that I can know if I need to place the peek() before or after the step(). That doesn't sound like a robust abstraction to me.

My understanding is that in an FPGA it doesn't make sense to talk about how combinatorial logic is implemented. poke() and peek() give you the ability to "see" what's going on when signals are being changed, which you can't know in an FPGA. The FPGA has a programming model where it can do combinatorial logic however it wants. All we can know in an FPGA is what would be clocked into a register on the rising edge.

oharboe commented 5 years ago

@shunshou I think you are saying that if I create a testbench that causes the input to the device under test to be the output of a register, then my problem with not having a single unambiguous location to put the expect/peek() goes away.

I gave it a try and it seems to work!

As a bonus, the wavetraces become much easier to read, as the signals only change on the rising edge. This matches what my FPGA colleague uses in his testbenches and when he explains things, and also what I find in e.g. the Altera manuals.

Everything then acts as I expect, and there's a single unambiguous location to put the peek()/expect() statements that does not rely on knowing implementation details.

Thanks!

Now... for Chisel Testers2, my vote would be for a model where this is how things work out of the box, as my best understanding is that it matches the industry standard expected behavior of a test-bench.

[Screenshot from 2018-10-06 11-50-00]

I bet any Chisel/Scala expert would be able to make a generic testbench wrapper utility function that would do this automatically, removing the need to write specific test-bench code to achieve this.

FiddlyBobTests.zip

ducky64 commented 5 years ago

I think there are multiple ideas / interpretations here: is what you actually want for the testers poke to take effect immediately after the rising edge, which makes the dumped waveforms more consistent with what you would see on an FPGA? This would be a separate issue from adding an implicit register stage on peeks, where it would read out the value on the previous cycle. I think the first could make sense, but the second doesn't. And for the second, you still need to know where the peek takes place (before or after the edge, implicitly registered or not); your proposal just has different semantics with more magic under the hood.

If you're writing a pure Chisel design (specifically, no negedge triggered logic), then functionality wise, poking on negedge or right after posedge are equivalent, since nothing in the circuit happens on the negative edge.

As for implementation details, you can't have a testbench that knows absolutely nothing about the circuit. In some cases, cycle-level timing may be important (and you may want to test that), and in others, you might be working at the transaction level. Testers2 aims to provide the former, but gives you the pieces to write abstractions that work at the latter. (alternatively phrased, you can use timing-aware semantics to build a timing-oblivious transaction library, but not the other way around)

As for FPGA optimization, the synthesis tools may remap your logic to be more optimal, but testers focuses on testing the RTL as you wrote it. There are cases where you may want to test combinational circuits or subblocks, even if they're going to get completely mangled by the tools. And since you really don't necessarily know how the tools might mangle your design, you can only test the design as you wrote it. If the tools are competent, the externally visible behavior (for some definition of that) should be equivalent to your design anyways.

shunshou commented 5 years ago

“If you're writing a pure Chisel design (specifically, no negedge triggered logic), then functionality wise, poking on negedge or right after posedge are equivalent, since nothing in the circuit happens on the negative edge.”

I think this is a subtle point that people new to Chisel might not understand (definitely took me a while to get a feel for testing when I first learned Chisel...). Also, people new to RTL design and functional verification might not understand why they’d want to “peek”/“poke” at the negative edge (or some delta from positive edge) for positive edge triggered designs, but once they stare at a correct waveform and remember that registers have clock to Q and setup time requirements, things make a lot more sense. No matter the abstraction you use, that’s something you can’t forget as a hardware designer.

oharboe commented 5 years ago

Near as I can understand, @shunshou nailed it. Her approach of creating a wafer-thin wrapper that registers the inputs before they are peeked and poked is exactly what's needed when working on an FPGA:

It exactly matches the simplest abstraction for an FPGA: positive-edge-triggered registers. I have been led to understand that there are other ways to trigger registers on an FPGA, but at the end of the day they are equivalent. There needs to be a good reason to have multiple abstractions in a single code base.
The waveforms match the literature. Signals change only on the positive edge. This reduces cognitive load when reading waveforms and, just as importantly, eyestrain. I find eyestrain is a concern to be taken very seriously when working with digital logic, more so than in software development.
Reduced cognitive load and a clear separation of concerns (physics vs. logic) by having the right abstraction with simple rules: always poke just before the step, always peek just after the step. When looking at the wave trace, only look at the positive edges: left is the past, right is the future as seen from that positive edge.

The only fly in the ointment with this approach is that I have to manually create the wafer-thin wrapper.

Not a huge deal, but it is a source of error, and less typing is more.

I'd like to have a utility function like "RegisterInput(Module(new Foo))" that would drill down into the io bundle, find all leaf input Data objects, and add registers to them. My understanding is that this utility function would need access to private members of, for instance, the Data class.
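
The rule above (poke just before the step, peek just after it) can be illustrated with a toy model in plain Scala. This is only a sketch and not Chisel or testers code: `RegisteredInc`, `poke`, `step`, and `peek` are hypothetical names, and the "DUT" is just an incrementer behind an input register, roughly what a hand-written wrapper that registers each input would give you.

```scala
// Toy model (plain Scala, no Chisel): an incrementer whose input is registered.
// All names here are hypothetical and exist only for illustration.
final class RegisteredInc {
  private var pendingIn = 0 // value driven by the testbench, not yet sampled
  private var inReg = 0     // input register, updated only on the positive edge

  def poke(v: Int): Unit = pendingIn = v // drive the input pin
  def step(): Unit = inReg = pendingIn   // positive clock edge: sample the input
  def peek: Int = inReg + 1              // output depends only on the register
}

object RegisteredIncDemo {
  // Returns the output seen before and after the clock edge.
  def run(): (Int, Int) = {
    val dut = new RegisteredInc
    dut.poke(4)
    val before = dut.peek // still reflects the old register value
    dut.step()
    val after = dut.peek  // now reflects the poked value
    (before, after)
  }
}
```

Because the output depends only on the register, a poke never becomes visible until the next step, which is why signals in the resulting waveform change only at positive edges.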

Nic30 commented 5 years ago

Hello, I looked into this problem a while back. I found that there are some minimum requirements on simulators in order to use UVM-style interface agents with readable code.

It must be possible to wait for the end of the combinational update.
It must be possible to wait for the end of the sequential update for a specified clock signal.
A system thread for each simulation thread is not an option.
A generator (the data structure) fits best for describing simulator processes.
UVM interface agents usually have to be implemented as multiple synchronized simulation processes to keep the code readable.

I have a framework similar to chisel3. I would like to see an ultimate meta-HDL language someday, and I think that chisel3 has the best potential.

Agents in my framework look like https://github.com/Nic30/hwt/blob/master/hwt/interfaces/agents/vldSynced.py#L33

There are 5 methods, which are enough: wait(time), read(signal), write(signal, val), waitOnCombUpdate(), waitOnSeqUpdate().
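
As a rough illustration of the generator idea on the JVM, here is a minimal sketch in plain Scala. It is not the hwt API and not a testers proposal: Scala has no native generators, so a lazy `Iterator` stands in for one, and `GeneratorSim`, `Wait`, `Process`, `run`, and `demo` are all hypothetical names. Only the `wait(time)` primitive is modeled; reads and writes are plain variable accesses, and `waitOnCombUpdate()` / `waitOnSeqUpdate()` are left out.

```scala
import scala.collection.mutable

object GeneratorSim {
  // The only command in this sketch: suspend the process for `ticks` time units.
  final case class Wait(ticks: Long)

  // A process is a lazy iterator: each next() runs the process up to its next Wait.
  type Process = Iterator[Wait]

  // Runs all processes on the calling thread and returns the final simulation time.
  def run(procs: Seq[Process]): Long = {
    // Min-heap ordered by (wake-up time, insertion order).
    val ord = Ordering.by[(Long, Int, Process), (Long, Int)](e => (e._1, e._2)).reverse
    val queue = mutable.PriorityQueue.empty[(Long, Int, Process)](ord)
    var seq = 0
    var now = 0L
    def schedule(at: Long, p: Process): Unit = { queue.enqueue((at, seq, p)); seq += 1 }

    procs.foreach(p => schedule(0L, p))
    while (queue.nonEmpty) {
      val (t, _, p) = queue.dequeue()
      now = t
      if (p.hasNext) schedule(now + p.next().ticks, p) // resume until the next Wait
    }
    now
  }

  // Example: a driver writes a signal every 10 ticks; a monitor samples it 5 ticks later.
  def demo(): (List[Int], Long) = {
    var data = 0 // the "signal"
    val seen = mutable.ArrayBuffer.empty[Int]
    val driver: Process = Iterator(1, 2, 3).map { v => data = v; Wait(10) }
    val monitor: Process =
      Iterator.single(Wait(5)) ++ Iterator.continually { seen += data; Wait(10) }.take(3)
    val end = run(Seq(driver, monitor))
    (seen.toList, end)
  }
}
```

Everything runs on one thread, so the two processes never need locks; the scheduler simply resumes whichever process has the earliest wake-up time.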

I can help you if you are still working on it.

ducky64 commented 5 years ago

So this proposal isn't meant to be UVM in Chisel, since UVM has some drawbacks, including (from a non-user / outsider's perspective) high verbosity and excessive separation of concerns (spaghetti-with-meatballs code). The focus here is more on lightweight unit tests, and on figuring out how a core set of simple abstractions might compose into something more powerful that could be used for integration testing.

I think this proposal has equivalents to most of the methods you require, though with slightly different semantics. The idea here is that combinational logic operates infinitely fast, but concurrent actions can only influence others in limited ways (to avoid a huge source of bugs while allowing concurrent sequences, since writing sequences is much less annoying than transforming them into FSMs). Unfortunately this proposal has also changed significantly and is in the midst of another rewrite, but hopefully examples (to come soon) will make things a bit clearer. Feedback is always welcome, though!

Why do you say that a system thread for each simulation thread is not an option? Is this because of the potential for concurrency bugs (which we try to avoid here by detecting potential race conditions and imposing a partial thread run order)? Or is it because of performance issues from expensive OS scheduler calls? (Put another way, are coroutines a good solution, and if so, what is their most important advantage over threads?)

Note that Scala coroutine support is pretty bad overall, so unless that improves, we're limited in what we can do.

Nic30 commented 5 years ago

OS threads are unsuitable for both reasons.

Concurrency bugs: simulation processes very often have to communicate not only with the simulation and the test but also with each other. This generates a large number of critical sections, which complicates randomization and verification significantly. But I do not have Scala parallel programming experience.
Number of threads: it may seem that there are not that many simulation processes, but if you want to translate AXI4 writes and reads to actual memory accesses, that is 5+ simulation threads. For tri-state interfaces it is 3x that. I mean that the number of simulation threads can easily grow to 1000+ for just an AXI DMA test, because simulation processes are used to describe nearly all parts of the test.

Nic30 commented 5 years ago

@ducky64 Implementing the whole of UVM is too big a step. But UVM-style simulation agents are just a bunch of simulation processes packed into one object. That is nothing more than what you want to implement.

And they are extremely useful because the user does not have to know the protocol of the interface in order to use it. (For example, you can just use the push()/pop() methods on an interface agent instead of setting signals manually on a FIFO interface.)
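
To make the push() idea concrete, here is a minimal sketch in plain Scala. It is not UVM, hwt, or testers code: `ToySink` is a made-up cycle-based model of a ready/valid sink that is ready only on odd cycles, and `SinkAgent` is a hypothetical agent that hides the handshake from the test writer.

```scala
import scala.collection.mutable

// Toy cycle-based model of a ready/valid sink that is ready only on odd cycles.
final class ToySink {
  var valid = false
  var bits = 0
  private var cycle = 0
  val received = mutable.ArrayBuffer.empty[Int]

  def ready: Boolean = cycle % 2 == 1
  def cycles: Int = cycle
  def step(): Unit = {
    if (valid && ready) received += bits // transfer happens when both are high
    cycle += 1
  }
}

// Agent: hides the handshake behind a single push() call.
final class SinkAgent(dut: ToySink) {
  def push(v: Int): Unit = {
    dut.valid = true
    dut.bits = v
    while (!dut.ready) dut.step() // hold valid and data until the sink is ready
    dut.step()                    // the cycle in which the transfer happens
    dut.valid = false
  }
}
```

A test then reads as `agent.push(1); agent.push(2)` without mentioning valid or ready at all, which is the encapsulation being described here.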

ducky64 commented 5 years ago

That's actually a really good point: what are the features of UVM you like and dislike the most, so that we can take the best of it without being tied down to the worst of it?

We're already planning to support user-defined higher levels of abstraction (e.g. enqueue / dequeue functions onto a Decoupled IO), so there is going to be a bit of that separation of interface and implementation / encapsulation of test details. Anything else you particularly want to see, or don't want to see?

edcote commented 5 years ago

Hey Richard, I can speak at length on the topic. Drop me a direct email if you'd like to discuss further.

UVM simulation phasing http://www.learnuvmverification.com/index.php/2016/04/29/uvm-phasing/
A uniform end of test http://blog.verificationgentleman.com/2016/03/an-overview-of-uvm-end-of-test-mechanisms.html mechanism (objections)
Method for specifying tests via base test classes (uvm_test)

This would be over and above SystemVerilog features such as fork/join/join_none/join_any and some "thread" synchronization primitives (mailbox, semaphore).


Nic30 commented 5 years ago

I think UVM simulation phasing has to be implemented in the simulation framework (easy to do); the rest can be implemented later, as a separate library. Connection phases can also be implemented after the simulator core.

For now it is important to have a simulator core with a high enough abstraction level that it will not restrict us in the future.

I think that randomization, coverage checking, model, and scoreboard logic can be implemented in the future without any problems (and also as a separate library).

Dolu1990 commented 5 years ago

Just some useful data: using JVM threads to emulate coroutines takes about 3 us to do the whole of the following: from the main thread, resume a sim thread and wait until the sim thread suspends itself before going further.

These 3 us were obtained on my laptop (3.3 GHz i7) on both native Windows and a Linux VM, using java-thread affinity to lock the main thread and the sim thread to the same logical CPU core. Without locking the thread affinity, it is 6 us on the Windows host and about 30 us on the Linux guest VM.

So using regular JVM threads is a viable way to provide coroutines in a simulation context. Also, to provide faster simulation utilities without the threading overhead, I added the possibility to have a "sensitive process", which executes a body of code at each delta cycle of the simulation and allows emulating many different things at nearly no cost.

Threadless example (runs at 1000 kHz on my laptop): https://github.com/SpinalHDL/SpinalHDL/blob/dev/tester/src/test/scala/spinal/tester/scalatest/SpinalSimPerfTester.scala#L44

Threadful example (runs at 200 kHz on my laptop): https://github.com/SpinalHDL/SpinalHDL/blob/dev/tester/src/test/scala/spinal/tester/scalatest/SpinalSimPerfTester.scala#L73

Dolu1990 commented 5 years ago

Oh, and about JVM threads in SpinalSim: only one is running at a time, and they always handshake when switching from one to another, so there are no concurrency issues.
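
A minimal sketch of this kind of handshake, in plain Scala, might look as follows. It is not SpinalSim's implementation: `SimThread`, `resume`, `suspend`, and `SimThreadDemo` are hypothetical names, and two semaphores are assumed as the handshake mechanism. The point it shows is that the main thread blocks while the sim thread runs and vice versa, so only one of them is ever active.

```scala
import java.util.concurrent.Semaphore
import scala.collection.mutable

// A JVM thread used as a coroutine: exactly one side runs at any time.
final class SimThread(body: SimThread => Unit) {
  private val resumeSignal = new Semaphore(0)  // main -> sim: "you may run"
  private val suspendSignal = new Semaphore(0) // sim -> main: "I have stopped"
  @volatile private var finished = false

  private val thread = new Thread(new Runnable {
    def run(): Unit = {
      resumeSignal.acquire() // do nothing until the first resume()
      try body(SimThread.this)
      finally { finished = true; suspendSignal.release() }
    }
  })
  thread.setDaemon(true)
  thread.start()

  def isDone: Boolean = finished

  // Called from the main thread: run the sim thread until it suspends or ends.
  def resume(): Unit = {
    require(!finished, "sim thread already finished")
    resumeSignal.release()
    suspendSignal.acquire()
  }

  // Called from inside the sim thread: hand control back to the main thread.
  def suspend(): Unit = {
    suspendSignal.release()
    resumeSignal.acquire()
  }
}

object SimThreadDemo {
  def run(): List[String] = {
    val log = mutable.ArrayBuffer.empty[String]
    val sim = new SimThread(self => {
      log += "sim: drive inputs"
      self.suspend()
      log += "sim: check outputs"
    })
    log += "main: start"
    sim.resume()
    log += "main: advance clock"
    sim.resume()
    log += "main: done"
    log.toList
  }
}
```

The shared `log` needs no lock because each semaphore hand-off also orders memory accesses between the two threads; the entries always appear in the same interleaved order.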

Nic30 commented 5 years ago

Maybe it is possible to perform "process switching" on C++ level, it is much faster. https://www.boost.org/doc/libs/1_54_0/libs/coroutine/doc/html/coroutine/performance.html

Currently I am working on a simulator which uses Verilator and Boost.Coroutine: https://github.com/Nic30/pycocotb/tree/master/pycocotb (For now it is just a prototype, and I will finish it after January. It is for Python only, but integration with the JVM should not be a problem. Also, the problem is not the simulation API itself; the problem is how to modify Verilator to keep its speed while improving the reusability of simulation parts and readability. Because Verilator is not a discrete event simulator, it adds extra problems.)

Dolu1990 commented 5 years ago

@Nic30

Maybe it is possible to perform "process switching" on C++ level, it is much faster. https://www.boost.org/doc/libs/1_54_0/libs/coroutine/doc/html/coroutine/performance.html

That is my hope. I made some attempts, but the JVM wasn't really happy with that kind of context manipulation when jumping from the C context to the Java context via JNI. Maybe/probably I did something wrong.

Also problem is not simulation API itself, problem is how to modify Verilator to keep its speed while improving re-usability of simulation parts and readability. Because as Verilator is not discrete event simulator it adds an extra problems.

It isn't really an issue; the behaviour of a real simulator can be emulated. I'm currently documenting it. There is the main simulation loop, which emulates an event-driven simulator by using cocotb + some tricks: https://github.com/SpinalHDL/SpinalHDL/blob/dev/sim/src/main/scala/spinal/sim/SimManager.scala#L203

There is a diagram of it :

Basically, the simulation loop provides 2 primitives: sensitive callbacks (a function called on each emulated delta cycle) and delayed callbacks (a function called when the simulation reaches a given time).

The threading model is then another layer on top.
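
The two primitives can be sketched in a few lines of plain Scala. This is a simplified illustration, not the SpinalSim `SimManager`: `SimLoop`, `onEachDelta`, `after`, and `run` are hypothetical names, and as a simplifying assumption one "delta cycle" is taken to be the work done after each delayed callback fires.

```scala
import scala.collection.mutable

final class SimLoop {
  private var now = 0L
  private var counter = 0L
  private val sensitive = mutable.ArrayBuffer.empty[() => Unit]
  // Min-heap ordered by (time, insertion order).
  private val delayed = mutable.PriorityQueue.empty[(Long, Long, () => Unit)](
    Ordering.by[(Long, Long, () => Unit), (Long, Long)](e => (e._1, e._2)).reverse)

  def time: Long = now

  // Sensitive callback: called on every emulated delta cycle.
  def onEachDelta(cb: () => Unit): Unit = sensitive += cb

  // Delayed callback: called once when simulation time reaches now + delay.
  def after(delay: Long)(cb: () => Unit): Unit = {
    delayed.enqueue((now + delay, counter, cb))
    counter += 1
  }

  // Process delayed callbacks in time order up to and including `until`.
  def run(until: Long): Unit =
    while (delayed.nonEmpty && delayed.head._1 <= until) {
      val (t, _, cb) = delayed.dequeue()
      now = t
      cb()
      sensitive.foreach(s => s()) // one emulated delta cycle
    }
}

object SimLoopDemo {
  // A clock built from a self-rearming delayed callback, plus an edge counter
  // built from a sensitive callback. Returns (rising edges, final time).
  def run(): (Int, Long) = {
    val sim = new SimLoop
    var clk = false
    var lastClk = false
    var risingEdges = 0
    def toggle(): Unit = { clk = !clk; sim.after(5)(() => toggle()) }
    sim.after(5)(() => toggle())
    sim.onEachDelta(() => { if (clk && !lastClk) risingEdges += 1; lastClk = clk })
    sim.run(until = 50)
    (risingEdges, sim.time)
  }
}
```

Neither primitive needs a thread, which fits the point that the threading model can be layered on top: a thread's "wait until time T" could be expressed as a delayed callback that resumes it.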