emu-russia / dmgcpu

DMG CPU Reverse Engineering
Creative Commons Zero v1.0 Universal

Speed up simulation by ~35 times by using Yosys and Verilator #292

Closed. Rodrigodd closed this 1 week ago.

Rodrigodd commented 1 week ago

For the past couple of weeks, I have been developing a way to speed up the simulation. My approach uses Yosys to automatically rewrite the Verilog code: it optimizes away the tri-states (along with a few other small changes) and produces code that Verilator can simulate.
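A tri-state bus can usually be replaced by an AND-OR multiplexer of its drivers, as long as at most one driver is enabled at a time. The Python model below illustrates that equivalence. It is a hypothetical sketch (`resolve_tristate` and `and_or_bus` are made-up names), not the Yosys passes used in this PR. Note that the rewrite turns a floating bus ('z') into 0, so it is only safe where the design never depends on the value of a floating bus.

```python
from itertools import product


def resolve_tristate(drivers):
    """Reference model of a tri-state bus.

    drivers: list of (enable, value) pairs. Returns the driven value,
    None when nobody drives the bus (high impedance, 'z'), and raises
    if enabled drivers disagree (bus contention).
    """
    active = {value for enable, value in drivers if enable}
    if len(active) > 1:
        raise ValueError("bus contention: several enabled drivers disagree")
    return active.pop() if active else None


def and_or_bus(drivers, width=8):
    """Tri-state-free replacement: OR of each driver's value gated (AND) by its enable."""
    mask = (1 << width) - 1
    bus = 0
    for enable, value in drivers:
        bus |= value & (mask if enable else 0)
    return bus


# Check every enable pattern in which at most one driver is active.
values = [0x12, 0xA5, 0xFF]
for enables in product([False, True], repeat=len(values)):
    if sum(enables) > 1:
        continue  # contention: not a valid state for the original tri-state bus
    drivers = list(zip(enables, values))
    expected = resolve_tristate(drivers)
    # A floating bus (None, i.e. 'z') reads as 0 after the rewrite.
    assert and_or_bus(drivers) == (0 if expected is None else expected)
print("AND-OR bus matches the tri-state bus whenever at most one driver is enabled")
```

The one-hot condition is the key design constraint: if two drivers were ever enabled with different values, the tri-state model reports contention while the AND-OR version silently ORs the values together.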

I use Yosys to:

I had to implement some features in Yosys to achieve what I needed, and they are currently in this branch. The changes mentioned above are:

In the next couple of days, I will try to clean up and upstream those changes. For now, you need to build Yosys from my fork to use it.

With these changes, I can now run the simulation about 35 times faster than before. Note that I also fixed the clock frequency used in the simulation. It was previously set to 20 MHz, about 4.8 times faster than the original Game Boy, so the actual speedup over my previous measurement is about 168 times: from around 8 hours per simulated second to 2.35 minutes per simulated second.
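The 168x figure combines two factors: the measured ~35x wall-clock speedup and the ~4.8x fewer clock cycles now needed per simulated second. A quick check, assuming the DMG's 4.194304 MHz (2**22 Hz) master clock as the "original Game Boy" frequency:

```python
# Assumption: the DMG master clock, 4.194304 MHz (2**22 Hz).
GB_CLOCK_HZ = 4_194_304
# Clock frequency the testbench used before the fix.
OLD_SIM_CLOCK_HZ = 20_000_000
# Wall-clock speedup of Verilator over the original Icarus (vvp) run.
MEASURED_SPEEDUP = 35

# Before the fix, one simulated second contained this many times
# as many clock cycles as one real Game Boy second.
clock_ratio = OLD_SIM_CLOCK_HZ / GB_CLOCK_HZ
# Both factors reduce wall-clock time per simulated second, so they multiply.
effective_speedup = MEASURED_SPEEDUP * clock_ratio

print(f"clock ratio:       {clock_ratio:.2f}")
print(f"effective speedup: {effective_speedup:.0f}x")
```

With the exact ratio (about 4.77) the product is about 167x; rounding the ratio to 4.8, as in the text, gives 168x.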

This means that I can simulate the full cpu_instrs.mem test in around 2h30min, which is a much more manageable timeframe. However, I still haven't run it. I plan to make a tool for comparing wave files, and then compare the simulated run with wave files generated by my emulator (which I also need to finish implementing).
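A wave-file comparison tool could start from something like the sketch below. It assumes both sides produce VCD files (Verilator and Icarus can both write VCD traces; whether the emulator will use VCD is an assumption), it handles only a subset of the VCD format, and every name here (`parse_vcd`, `value_at`, `first_mismatch`) is hypothetical, not the author's planned tool.

```python
def _normalize(value):
    # Binary vectors may be written with or without leading zeros; compare them as ints.
    return int(value, 2) if value and set(value) <= {"0", "1"} else value.lower()


def parse_vcd(text):
    """Parse a subset of VCD into {hierarchical_name: [(time, value), ...]}."""
    names = {}   # identifier code -> hierarchical names sharing that code
    scope = []   # current $scope path
    waves = {}
    now = 0
    toks = iter(text.split())

    def skip_to_end():
        while next(toks) != "$end":
            pass

    def record(code, value):
        for name in names.get(code, []):
            waves.setdefault(name, []).append((now, _normalize(value)))

    for tok in toks:
        if tok == "$scope":
            next(toks)                # scope kind, e.g. "module"
            scope.append(next(toks))  # scope name
            skip_to_end()
        elif tok == "$upscope":
            scope.pop()
            skip_to_end()
        elif tok == "$var":
            next(toks)                # variable type, e.g. "wire"
            next(toks)                # width
            code, ref = next(toks), next(toks)
            names.setdefault(code, []).append(".".join(scope + [ref]))
            skip_to_end()             # also skips an optional "[7:0]" range
        elif tok in ("$comment", "$date", "$version", "$timescale"):
            skip_to_end()
        elif tok.startswith("$"):
            continue                  # $enddefinitions, $dumpvars, $end, ...
        elif tok.startswith("#"):
            now = int(tok[1:])
        elif tok[0] in "bBrR":
            record(next(toks), tok[1:])  # vector/real value, then identifier
        else:
            record(tok[1:], tok[0])      # scalar: value character followed by identifier
    return waves


def value_at(history, t):
    """Last value recorded at or before time t (None if the signal has no value yet)."""
    value = None
    for when, v in history:  # linear scan keeps the sketch simple
        if when > t:
            break
        value = v
    return value


def first_mismatch(vcd_a, vcd_b, signals):
    """Return (time, signal, value_a, value_b) for the first difference, or None."""
    a, b = parse_vcd(vcd_a), parse_vcd(vcd_b)
    times = sorted({when for wave in (a, b) for s in signals for when, _ in wave.get(s, [])})
    for t in times:
        for s in signals:
            va, vb = value_at(a.get(s, []), t), value_at(b.get(s, []), t)
            if va != vb:
                return t, s, va, vb
    return None


# Tiny hand-written example: the two dumps differ only in "data" at time 20.
VCD_A = """
$timescale 1ns $end
$scope module top $end
$var wire 1 ! clk $end
$var wire 8 % data [7:0] $end
$upscope $end
$enddefinitions $end
#0
0!
b0 %
#10
1!
b101 %
#20
0!
b110 %
"""
VCD_B = VCD_A.replace("b110 %", "b111 %")
print(first_mismatch(VCD_A, VCD_B, ["top.clk", "top.data"]))
```

Stopping at the first mismatch is a deliberate choice: once the simulated CPU and the emulator diverge, later differences are usually consequences of the first one.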

Below are my benchmark results for simulating 4917 µs. I also tried compiling the Verilator simulation with Profile-Guided Optimization (PGO), but it didn't significantly speed up the simulation (about 1% in the results below).

```
HDL/sm83/Icarus $ make yosys verilator verilator-pgo run ROM=roms/01-special.mem CYCLES=20480
HDL/sm83/Icarus $ sleep 30 && hyperfine --setup='sleep 30' -- 'verilator_pgo_build/VSM83_Run +verilator+rand+reset+1' 'verilator_build/VSM83_Run +verilator+rand+reset+1' 'vvp sm83.yosys.run' 'vvp sm83.run'
Benchmark 1: verilator_pgo_build/VSM83_Run +verilator+rand+reset+1
  Time (mean ± σ):     696.2 ms ±   4.7 ms    [User: 686.0 ms, System: 5.8 ms]
  Range (min … max):   691.8 ms … 706.3 ms    10 runs

Benchmark 2: verilator_build/VSM83_Run +verilator+rand+reset+1
  Time (mean ± σ):     703.9 ms ±   5.8 ms    [User: 694.0 ms, System: 5.7 ms]
  Range (min … max):   694.6 ms … 712.3 ms    10 runs

Benchmark 3: vvp sm83.yosys.run
  Time (mean ± σ):      7.035 s ±  0.803 s    [User: 6.406 s, System: 0.060 s]
  Range (min … max):    6.444 s …  8.544 s    10 runs

Benchmark 4: vvp sm83.run
  Time (mean ± σ):     24.983 s ±  0.225 s    [User: 24.734 s, System: 0.108 s]
  Range (min … max):   24.664 s … 25.397 s    10 runs

Summary
  verilator_pgo_build/VSM83_Run +verilator+rand+reset+1 ran
    1.01 ± 0.01 times faster than verilator_build/VSM83_Run +verilator+rand+reset+1
   10.10 ± 1.16 times faster than vvp sm83.yosys.run
   35.88 ± 0.40 times faster than vvp sm83.run
```

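The summary ratios can be reproduced from the mean times alone, and dividing each mean by the 4917 µs of simulated time converts it to wall-clock time per simulated second. This sketch only re-does that arithmetic on the numbers above (hyperfine's ± uncertainties are ignored):

```python
SIMULATED_SECONDS = 4917e-6  # simulated time covered by each benchmark run

# Mean wall-clock times in seconds, copied from the hyperfine output above.
MEAN_WALL_SECONDS = {
    "verilator_pgo_build": 0.6962,
    "verilator_build": 0.7039,
    "vvp sm83.yosys.run": 7.035,
    "vvp sm83.run": 24.983,
}


def summarize(means, simulated):
    """Map each command to (slowdown vs. fastest, wall-clock minutes per simulated second)."""
    fastest = min(means.values())
    return {name: (t / fastest, t / simulated / 60) for name, t in means.items()}


for name, (slowdown, minutes) in summarize(MEAN_WALL_SECONDS, SIMULATED_SECONDS).items():
    print(f"{name:<20} {slowdown:6.2f}x  {minutes:7.2f} min per simulated second")
```

For the PGO build this gives about 2.36 minutes per simulated second, in line with the 2.35 minutes quoted earlier.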
ogamespec commented 1 week ago

I ran the PR and everything is fine. Thanks for the detailed descriptions you write; future generations will be grateful :) I'll add that I'm pleased the results are going into an emulator, since, as you've probably noticed, that is one of the topics I'm interested in. I have also started writing an SM83 simulator in C; the small part I've finished so far (the decoder) is already in the repository. But first I want to add all the material I've accumulated on the SoC.