ChiselSim tests run orders of magnitude slower than chiseltest

While converting some tests to ChiselSim, I noticed that they run about 20x slower than with chiseltest.

As a comparison, running on ChiselSim:

❯ time ./mill chiselv.test.testOnly chiselv.ALUSpec
[86/86] chiselv.test.testOnly
ALUSpec:
- should ADD
- should ADDI
- should SUB
- should AND
- should ANDI
- should OR
- should ORI
- should XOR
- should XORI
- should SRA
- should SRAI
- should SRL
- should SRLI
- should SLL
- should SLLI
- should SLT
- should SLTI
- should SLTU
- should SLTIU
- should EQ
- should NEQ
- should GT
- should GTU
Run completed in 1 minute, 37 seconds.
Total number of tests run: 23
Suites: completed 1, aborted 0
Tests: succeeded 23, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
./mill chiselv.test.testOnly chiselv.ALUSpec  0.99s user 0.79s system 1% cpu 1:40.39 total

and on chiseltest:

❯ time ./mill chiselv.test.testOnly chiselv.ALUSpec
[86/86] chiselv.test.testOnly
ALUSpec:
- should ADD
- should ADDI
- should SUB
- should AND
- should ANDI
- should OR
- should ORI
- should XOR
- should XORI
- should SRA
- should SRAI
- should SRL
- should SRLI
- should SLL
- should SLLI
- should SLT
- should SLTI
- should SLTU
- should SLTIU
- should EQ
- should NEQ
- should GT
- should GTU
Run completed in 3 seconds, 661 milliseconds.
Total number of tests run: 23
Suites: completed 1, aborted 0
Tests: succeeded 23, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
./mill chiselv.test.testOnly chiselv.ALUSpec  0.31s user 0.10s system 8% cpu 4.846 total

I followed the migration guide, which resulted in the following simple changes:

❯ gdpatch chiselv/test/src/ALUSpec.scala
diff --git a/chiselv/test/src/ALUSpec.scala b/chiselv/test/src/ALUSpec.scala
index ac56a81..b4bafd0 100644
--- a/chiselv/test/src/ALUSpec.scala
+++ b/chiselv/test/src/ALUSpec.scala
@@ -1,7 +1,7 @@
 package chiselv

 import chisel3._
-import chiseltest._
+import chisel3.simulator.EphemeralSimulator._
 import com.carlosedp.riscvassembler.ObjectUtils.NumericManipulation
 import org.scalatest._

@@ -9,7 +9,7 @@ import Instruction._
 import flatspec._
 import matchers._

-class ALUSpec extends AnyFlatSpec with ChiselScalatestTester with should.Matchers {
+class ALUSpec extends AnyFlatSpec with should.Matchers {
   val one        = BigInt(1)
   val max        = (one << 32) - one
   val min_signed = one << 32 - 1
@@ -124,12 +124,12 @@ class ALUSpec extends AnyFlatSpec with ChiselScalatestTester with should.Matcher
     dut.io.a.poke(i.to32Bit)
     dut.io.b.poke(j.to32Bit)
     dut.clock.step()
-    dut.io.x.peekInt() should be(out)
+    dut.io.x.peek().litValue should be(out)
   }
   def testCycle(
       op: Type
     ) =
-    test(new ALU) { c =>
+    simulate(new ALU) { c =>
       cases.foreach { i =>
         cases.foreach { j =>
           testDut(i, j, aluHelper(i, j, op).to32Bit, op, c)

The file is from https://github.com/carlosedp/chiselv/blob/main/chiselv/test/src/ALUSpec.scala

Type of issue: Bug Report

Please tell us about your environment:

Chisel 6.4.0 on macOS Sonoma 14.5. Verilator 5.024 2024-04-05 rev UNKNOWN.REV

Maybe after #4158 lands, we can find some other high performance solution w/ DPI.

I ran a few experiments on my local workstation (Ubuntu 22.04, 5950X).

Configuration	Time
chiselsim (baseline)	16 sec
chiseltest (default (=treadle?))	2.5 sec
chiseltest (verilator)	12 sec
chiselsim (with Files.createTempDirectory removed, commit)	6.5 sec
chiselsim (with all tests fused into a single test, commit)	2.5 sec

It took 16 sec to run ./mill chiselv.test.testOnly chiselv.ALUSpec, which is 6-7x slower than chiseltest (treadle) in my environment. When I removed createTempDirectory in EphemeralSimulator, it dropped to 6.5 sec. Even though my environment is Linux, I wouldn't be surprised if macOS had a similar file I/O issue.
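One way to narrow this down is to time the directory creation in isolation. The following is a minimal sketch using only java.nio; the object name TempDirCost and the "tmpdir-cost-" prefix are made up for illustration, and it measures only the createTempDirectory call itself, not anything ChiselSim does with the directory afterwards.

```scala
import java.nio.file.{Files, Path}

object TempDirCost {
  /** Creates and deletes `n` temporary directories and returns the
    * average time of the create call alone, in nanoseconds.
    */
  def averageCreateNanos(n: Int): Double = {
    require(n > 0, "n must be positive")
    var total = 0L
    var i = 0
    while (i < n) {
      val start = System.nanoTime()
      val dir: Path = Files.createTempDirectory("tmpdir-cost-")
      total += System.nanoTime() - start
      Files.delete(dir) // cleanup happens outside the timed region
      i += 1
    }
    total.toDouble / n
  }

  def main(args: Array[String]): Unit = {
    val avg = averageCreateNanos(100)
    println(f"createTempDirectory: ${avg / 1000.0}%.1f us per call")
  }
}
```

If the per-call cost turns out to be small, that would suggest the time is lost in something that follows from using a fresh directory rather than in the call itself; that is a hypothesis to check, not a conclusion.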

One issue here is that Verilog is generated and compiled every time (in this case 13 times), so 6.5 sec seems like a reasonable overhead. When I fused all tests into a single test, it took 2.5 sec to run. For comparison, I also timed chiseltest with the Verilator backend, which took 12 sec (it uses SFC, though, so this is not an apples-to-apples comparison).

So actionable items for us would be:

Check whether createTempDirectory is actually causing the regression and, if so, fix it.
Provide a way to compile once and run multiple tests from ChiselSim (if it does not already exist).
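The "compile once, run many" idea from the second item can be sketched with a small cache. This is not ChiselSim's API: FakeCompiler stands in for the expensive elaborate-and-build step, and CompileCache and the string "binary" handle are hypothetical names used only to show the shape of the technique.

```scala
import scala.collection.mutable

object CompileOnce {
  /** Stand-in for an expensive elaboration + simulator build; counts invocations. */
  final class FakeCompiler {
    var compileCount = 0
    def compile(design: String): String = {
      compileCount += 1
      s"binary-for-$design" // hypothetical handle to a compiled simulation
    }
  }

  /** Caches the compiled artifact per design key so each design is built at most once. */
  final class CompileCache(compiler: FakeCompiler) {
    private val cache = mutable.Map.empty[String, String]
    def get(design: String): String = synchronized {
      cache.getOrElseUpdate(design, compiler.compile(design))
    }
  }

  def main(args: Array[String]): Unit = {
    val compiler = new FakeCompiler
    val cache    = new CompileCache(compiler)
    // Each "test" asks for the same design; only the first request compiles.
    Seq("ADD", "SUB", "AND", "OR", "XOR").foreach { op =>
      val sim = cache.get("ALU")
      println(s"running $op against $sim")
    }
    println(s"compiles: ${compiler.compileCount}")
  }
}
```

Fusing all tests into a single test, as in the experiment above, has the same effect by construction; a cache would keep the individual test names in the report.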

I have also noticed that a single-threaded chiselsim testbench runs ~4 times slower than a similar testbench using chiseltest, with or without multithreaded tasks. This is the test execution alone, not including compilation time. A Python cocotb testbench with the same functionality and test iterations, but with concurrent drivers/monitors and additional checks, also runs 4-5 times faster than a more primitive chiselsim version. At least on my system (macOS, arm64, fast NVM drive), I realized that the file I/O for generating the execution script is partly to blame. Currently, the execution script is always enabled, and the simulator keeps filling the script file with unuseful messages even when an executionScriptLimit of, say, 0 is specified. Disabling the script gives me about a 1.5x-2x speedup, which is still not nearly as good as cocotb or chiseltest. Another thing to investigate is svsim's choice of a text-based protocol and its use of stdio for communication between Scala and the simulation executable. Depending on the amount of data exchanged, the data and conversion overhead could be significant. I wouldn't be surprised if the impact varied across platforms and operating systems.
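To illustrate what gating the script on a flag buys, here is a minimal sketch. CommandLog, drive, and the "poke"/"step" strings are hypothetical and do not reflect svsim's actual classes or command format; the point is only that a disabled log should skip both the string construction and the write.

```scala
import java.io.{StringWriter, Writer}

object ScriptGate {
  /** Hypothetical command log: records a command only when `enabled` is true. */
  final class CommandLog(out: Writer, enabled: Boolean) {
    var written = 0
    // `line` is by-name, so the string is not even built when logging is off.
    def record(line: => String): Unit =
      if (enabled) {
        out.write(line)
        out.write("\n")
        written += 1
      }
  }

  /** Sends `cycles` fake poke/step commands through the log; returns lines written. */
  def drive(log: CommandLog, cycles: Int): Int = {
    var i = 0
    while (i < cycles) {
      log.record(s"poke a $i")
      log.record("step 1")
      i += 1
    }
    log.written
  }

  def main(args: Array[String]): Unit = {
    val on  = drive(new CommandLog(new StringWriter, enabled = true), 1000)
    val off = drive(new CommandLog(new StringWriter, enabled = false), 1000)
    println(s"lines written: enabled=$on, disabled=$off")
  }
}
```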

PS: this WIP PR includes adding an executionScriptEnabled flag to disable the execution script, which can easily be made into a quick standalone PR.

Thanks for looking into this everyone and for all of your efforts to improve it!

SVSim as the engine underneath ChiselSim definitely has some overhead, some of which we should fix, some of which we can mitigate.

Excellent observation on the execution script, @kammoh; we should probably disable that by default. I'll nit at the characterization of it as "unuseful messages", since they are intended for simulation replay, which is a pretty neat debugging feature: https://github.com/chipsalliance/chisel/tree/main/svsim#make-replay.

Making the protocol more efficient might help some. One of the design decisions of SVSim is to use inter-process communication (see README) which will have an overhead if there is a lot of communication, especially every cycle. SVSim is optimized for having some amount of decoupling and the raw API supports essentially clocking the design until some simple "port equals value" condition is met. This does not lend itself well to peek-poke style testing (thus the measured slowdowns) but works well if there is some decoupling. At SiFive we decouple it quite a bit (doing some testing logic in Chisel itself) and we get no slowdown, but this is not necessarily the best API for ChiselSim.
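The difference between the two styles can be made concrete by counting messages that cross the process boundary. The sketch below uses a mock, not SVSim: MockSim, peekReady, and stepUntilReady are hypothetical names, and each method call is assumed to cost exactly one IPC message.

```scala
object RoundTrips {
  /** Mock simulator process; every method call counts as one IPC message. */
  final class MockSim(readyAtCycle: Int) {
    var messages = 0
    private var cycle = 0

    def step(): Unit = { messages += 1; cycle += 1 }

    def peekReady(): Boolean = { messages += 1; cycle >= readyAtCycle }

    /** One message: the simulator clocks itself until ready or `maxCycles` elapse. */
    def stepUntilReady(maxCycles: Int): Int = {
      messages += 1
      var n = 0
      while (cycle < readyAtCycle && n < maxCycles) { cycle += 1; n += 1 }
      n
    }
  }

  /** Tightly coupled style: peek every cycle, step if not ready. */
  def peekPokeStyle(sim: MockSim): Int = {
    while (!sim.peekReady()) sim.step()
    sim.messages
  }

  /** Decoupled style: a single "run until condition" request. */
  def decoupledStyle(sim: MockSim): Int = {
    sim.stepUntilReady(maxCycles = 1000000)
    sim.messages
  }

  def main(args: Array[String]): Unit = {
    println(s"peek/poke messages: ${peekPokeStyle(new MockSim(1000))}")
    println(s"decoupled messages: ${decoupledStyle(new MockSim(1000))}")
  }
}
```

With a condition that becomes true after 1000 cycles, the peek/poke loop in this mock sends 2001 messages and the decoupled call sends 1, which is the kind of gap that per-message IPC overhead multiplies.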

If we want highly coupled peek/poke-style tests to work well, we probably just need to avoid IPC. That may require an alternative backend to SVSim. We need an alternative backend eventually to support Arcilator (CIRCT's native simulator), which may itself be another source of speedups. Another possibility could be making it convenient to express "agents" that are pure Chisel (so they can be compiled into the simulation) but seamlessly interoperate with ChiselSim to give some decoupling.

Just throwing some ideas out there; thanks, everyone, for looking into this!

chipsalliance / chisel

ChiselSim tests runs orders of magnitude slower than chiseltest #4207