
VUnit is a unit testing framework for VHDL/SystemVerilog
http://vunit.github.io/

NVC performance with VUnit #1036

Open Blebowski opened 5 days ago

Blebowski commented 5 days ago

Hi,

I am trying to port my project https://github.com/Blebowski/CTU-CAN-FD, which I run with VUnit on GHDL, to NVC. I managed to get everything working, and I can run the regression with the same results in GHDL and NVC.

The issue I face is that the NVC run-time is almost double the GHDL run-time (when executed via VUnit; I have not tried to profile "raw" commands typed into the command line).

The analysis time of NVC is much smaller, but since analysis takes only a fraction of the overall regression, GHDL wins in total run-time.

I would be curious to find out why this occurs. AFAICT, NVC is the faster simulator.

My design contains large constant arrays of large structs that presumably take a long time to elaborate. When I run a single test with -v, I see that VUnit always elaborates the design with NVC. Therefore, each test requires a separate elaboration, since NVC does not support setting generics at run time.

Does VUnit elaborate the design for each test in GHDL too? Or does VUnit in some smart manner pass the generics, such as test name, seed, or other numeric parameters, only to the GHDL simulation?

LarsAsplund commented 5 days ago

I would suggest that you try first without VUnit to see if the differences are related to VUnit, the design you have, or differences in the simulators.

I can't think of any obvious reason for why VUnit would have this impact on performance. There are no special tricks related to how we pass generics. In both cases, we use the -g option.
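For reference, comparing the raw simulators outside VUnit could look roughly like the commands below. The testbench name and the generic are placeholders, and the exact option spelling may differ from what VUnit generates; check each simulator's manual.

# GHDL: analyze, elaborate once, override generics at run time
ghdl -a tb_example.vhd
ghdl -e tb_example
ghdl -r tb_example -gseed=123

# NVC: analyze, then elaborate with the generic override before each run
nvc -a tb_example.vhd
nvc -e tb_example -gseed=123
nvc -r tb_example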

Elaboration can take some time. I've noticed with GHDL that some of its advantages in terms of startup performance go away if many contexts are included in a testbench.

Generally speaking, I've experienced significant performance differences between simulators but it varies a lot with the design. X can be twice as fast as Y on one design and then Y is twice as fast as X for another.

For many (fast) unit tests, the raw simulator performance is irrelevant. It is the startup time of the simulator that dominates.

It should also be noted that the simulator interface we have for NVC is a contribution from @nickg so I would assume that there are no major performance-killing issues in that code.

nickg commented 5 days ago

@Blebowski can you provide some instructions for running the testbench? I found a run.py in that repository but it looks like it needs some additional arguments.

Blebowski commented 5 days ago

Hi, I will prepare a reproducer that compares these two.

Blebowski commented 4 days ago

Hi,

the steps to run the comparison:

git clone https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core.git ctu_can_fd_benchmark
cd ctu_can_fd_benchmark
git checkout port-regression-to-nvc
export CTU_TB_TOP_TARGET="tb_ctu_can_fd_rtl_vunit" 
cd test
VUNIT_SIMULATOR=ghdl ./run.py tb_rtl_test_fast_asic -p`nproc`

To run with NVC, just change the value of the VUNIT_SIMULATOR variable.
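For example, to run the same regression with NVC:

VUNIT_SIMULATOR=nvc ./run.py tb_rtl_test_fast_asic -p`nproc`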

On my PC, nproc reports 20 cores. You can see that the individual run-times of each simulation are much longer with NVC than with GHDL.

I first thought this was due to VUnit reusing a single elaboration for GHDL and passing generics to the elaborated binary (the GHDL docs claim this is supported).

When I use only a single core at a time (no -p), I get better performance with NVC. The overall run-time in that case is of course terrible, since all the simulations are executed one by one.

My results are:

ghdl.txt nvc_jit.txt nvc_no_jit.txt nvc_no_parallel_runs.txt

I use an NVC build from a couple of days ago. My GHDL and VUnit installations are from autumn of last year; I hope this should not cause the issue.

Could this be caused by a mutex that NVC holds on the compiled libraries during elaboration? So if multiple elaborations are done at the same time due to -p, only a single elaboration can read the code compiled into the libraries at a time?

nickg commented 4 days ago

Can you try setting the environment variable NVC_MAX_THREADS=1 with -p$(nproc)? NVC will create up to 8 background threads for JIT compilation, which is probably too many here. Another combination to try might be NVC_MAX_THREADS=2 with -p$(($(nproc) / 2)).
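Using the reproduction steps above, those two combinations would look roughly like this (same run.py arguments as in the earlier instructions):

NVC_MAX_THREADS=1 VUNIT_SIMULATOR=nvc ./run.py tb_rtl_test_fast_asic -p$(nproc)
NVC_MAX_THREADS=2 VUNIT_SIMULATOR=nvc ./run.py tb_rtl_test_fast_asic -p$(($(nproc) / 2))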

Could this be caused by a mutex that NVC holds on the compiled libraries during elaboration?

Yes, that might be an issue too. NVC uses a reader/writer lock on the library, which means writing to the library requires that there are no concurrent readers (but a library can be read concurrently by multiple processes). However, I think VUnit passes --no-save when elaborating, which should avoid this. Anyway, thanks for the instructions, I'll try it at the weekend.
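As a rough illustration of the locking concern (the testbench name is a placeholder and the option placement may differ slightly from what VUnit generates):

# Saves the elaborated design into the library, so it takes the writer side of the lock
nvc -e tb_example
nvc -r tb_example

# Elaborates and runs in one process without writing back to the library
nvc -e --no-save tb_example -r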

Blebowski commented 4 days ago

Hi, setting the variable helps; the run-times are better. With JIT I now get to 15000 seconds of total run-time instead of 18000. Without JIT, I get to around 14000.

nvc_max_threads_1.txt

nickg commented 1 day ago

The reference_data_set_* packages take a long time to compile. I think the difference with GHDL is that the GCC/LLVM backends compile these packages once, whereas NVC generates code for them each time it runs a simulation (GHDL mcode should have similar behaviour). I've made some improvements to speed this up; can you try with the latest master branch? I'll also do something to make NVC aware of how many concurrent jobs VUnit is running so that it can scale its worker threads accordingly.
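If it helps, a rough sketch of building NVC from the master branch (check the NVC README for the exact prerequisites on your platform):

git clone https://github.com/nickg/nvc
cd nvc
./autogen.sh
mkdir build && cd build
../configure
make
sudo make install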

nickg commented 1 day ago

Please also try with the VUnit changes in #1037.