VUnit is a unit testing framework for VHDL/SystemVerilog
http://vunit.github.io/

NVC performance with Vunit #1036

Closed: Blebowski closed this issue 4 months ago

Blebowski commented 4 months ago

Hi,

I am trying to port my project https://github.com/Blebowski/CTU-CAN-FD, which I run with VUnit on GHDL, over to NVC. I managed to get everything working, and I can run the regression with the same results in GHDL and in NVC.

The issue I face is that the NVC run-time is almost double the GHDL run-time (when executed via VUnit; I have not tried to profile "raw" commands typed into the command line).

NVC's analysis time is much shorter, but since analysis takes only a fraction of the overall regression, GHDL wins on total run-time.

I would be curious to find out why this occurs. AFAICT, NVC is the faster simulator.

My design contains large constant arrays of large structs that presumably take a long time to elaborate. When I run a single test with -v, I see that VUnit always elaborates the design with NVC. Therefore, each test requires a separate elaboration, since NVC does not support setting generics at run time.

Does VUnit elaborate each test in GHDL too? Or does VUnit in some smart manner pass the generics, such as test name, seed, or other numeric parameters, to the already elaborated GHDL simulation?

LarsAsplund commented 4 months ago

I would suggest that you first try without VUnit to see whether the differences are related to VUnit, to the design you have, or to differences between the simulators.

I can't think of any obvious reason for why VUnit would have this impact on performance. There are no special tricks related to how we pass generics. In both cases, we use the -g option.
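To illustrate the -g option (the entity and generic names below are made up, and the actual commands VUnit builds contain more flags than this):

# GHDL: the generic can be given at elaboration or, per the GHDL docs, overridden when running.
ghdl -e tb_example
ghdl -r tb_example -gtest_cfg=some_test

# NVC: generics are bound at elaboration, so each test gets its own -e before -r.
nvc -e tb_example -gtest_cfg=some_test
nvc -r tb_example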

Elaboration can take some time. I've noticed with GHDL that some of its advantages in terms of startup performance go away if many contexts are included in a testbench.

Generally speaking, I've experienced significant performance differences between simulators but it varies a lot with the design. X can be twice as fast as Y on one design and then Y is twice as fast as X for another.

For many (fast) unit tests, the raw simulator performance is irrelevant. It is the startup time of the simulator that dominates.

It should also be noted that the simulator interface we have for NVC is a contribution from @nickg so I would assume that there are no major performance-killing issues in that code.

nickg commented 4 months ago

@Blebowski can you provide some instructions for running the testbench? I found a run.py in that repository, but it looks like it needs some additional arguments.

Blebowski commented 4 months ago

Hi, I will prepare a reproducer that compares these two.

Blebowski commented 4 months ago

Hi,

the steps to run the comparison:

git clone https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core.git ctu_can_fd_benchmark
cd ctu_can_fd_benchmark
git checkout port-regression-to-nvc
export CTU_TB_TOP_TARGET="tb_ctu_can_fd_rtl_vunit" 
cd test
VUNIT_SIMULATOR=ghdl ./run.py tb_rtl_test_fast_asic -p`nproc`

To run with NVC, just change the value of the VUNIT_SIMULATOR variable.
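i.e. something like:

VUNIT_SIMULATOR=nvc ./run.py tb_rtl_test_fast_asic -p`nproc`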

On my PC, nproc reports 20 cores. You can see that the individual run-times of each simulation are much longer in NVC than in GHDL.

I first thought this was due to VUnit reusing a single elaboration for GHDL and passing the generics to the elaborated binary (the GHDL docs claim this is supported).

When I use only a single core at a time (no -p), I get better performance from NVC. The overall run-time in that case is of course terrible, since all the simulations are executed one by one.

My results are:

ghdl.txt nvc_jit.txt nvc_no_jit.txt nvc_no_parallel_runs.txt

I use an NVC build from a couple of days ago. My GHDL and VUnit installations are from autumn of last year; I hope this should not cause the issue.

Could this be caused by a mutex that NVC holds on the compiled libraries during elaboration? That is, if multiple elaborations run at the same time due to -p, can only a single elaboration read the code compiled into the libraries at a time?

nickg commented 4 months ago

Can you try setting the environment variable NVC_MAX_THREADS=1 with -p$(nproc)? NVC will create up to 8 background threads for JIT compilation, which is probably too many here. Another combination to try might be NVC_MAX_THREADS=2 with -p$(($(nproc) / 2)).
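With the reproducer command from earlier in this thread, those two combinations would be something like:

NVC_MAX_THREADS=1 VUNIT_SIMULATOR=nvc ./run.py tb_rtl_test_fast_asic -p$(nproc)
NVC_MAX_THREADS=2 VUNIT_SIMULATOR=nvc ./run.py tb_rtl_test_fast_asic -p$(($(nproc) / 2))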

Could this be caused by mutex that is held by NVC on compiled libraries during the elaboration ?

Yes, that might be an issue too. NVC uses a reader/writer lock on the library, which means that writing to the library requires there to be no concurrent readers (but a library can be read concurrently by multiple processes). However, I think VUnit passes --no-save when elaborating, which should avoid this. Anyway, thanks for the instructions, I'll try it at the weekend.
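The pattern is roughly this (testbench name made up; NVC lets you chain the elaborate and run commands in one invocation):

# Elaborate without writing the design back to the library, then run it directly,
# so concurrent jobs only ever take the reader side of the library lock.
nvc -e tb_example --no-save -r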

Blebowski commented 4 months ago

Hi, setting the variable helps; the run-times are better. With JIT I now get to 15000 seconds of total run-time instead of 18000. Without JIT, I get to around 14000.

nvc_max_threads_1.txt

nickg commented 4 months ago

The reference_data_set_* packages take a long time to compile. I think the difference with GHDL is that the GCC/LLVM backends compile these packages once, whereas NVC generates code for them each time it runs a simulation (GHDL mcode should have similar behaviour). I've made some improvements to speed this up; can you try with the latest master branch? I'll also do something to make NVC aware of how many concurrent jobs VUnit is running so it can scale its worker threads accordingly.

nickg commented 4 months ago

Please also try with the VUnit changes in #1037.

Blebowski commented 4 months ago

Hi @nickg,

I will give it a try next weekend (currently vacationing) and let you know.

I confirm that GHDL is using the LLVM backend.

It makes sense that the reference data set packages compile slowly; these are huge constant arrays. GHDL takes much longer than NVC to analyse them, which fits with what you are saying: if GHDL emits the code for these packages during analysis, but NVC does so during elaboration, that is the logical outcome.

Blebowski commented 4 months ago

Hi,

@nickg , I have tested with the latest NVC and the VUnit commit you referenced. The results are much better: nvc_jit_after_fix.txt

Now the overall runtime of the regression with -p 20 is just one and a half minutes longer than with GHDL. Clearly, this is caused by the longer elaborations and the code being emitted for the constants from the reference_data_set_* packages. Short tests (e.g. device_id) take less time in GHDL, while long tests (e.g. data_set_*) take less time in NVC, showing that NVC simulation is indeed faster.

Also, it makes sense that GHDL emits code for reference_data_set_* only once; likely because of this, it takes longer to analyse those packages in GHDL.

The reference_data_set_* packages contain only long arrays of records (some golden CAN frame data). I originally had this data in text files read by the TB, but I converted them to packages to make TB bring-up simpler in other frameworks or simulators (no file paths need to be provided, no relative/absolute differences, etc.).

These long arrays of constants are only used in the data_set_* tests: when a process gets triggered, one of these arrays is assigned to a variable, iterated through, and sent to the DUT: reference_test_agent.vhd

Do you think it would be possible, with --jit, to emit the code for such long constant arrays only once the constant actually gets accessed? Then the code for these would be generated only in the data_set_* tests. Sure, once the constants affect the hierarchy, or the width of a signal, which always affects the run-time model, it would not be possible. But in my TB this is not the case; it is a plain copy from a constant into a variable of the same type, selected based on the test name.

Blebowski commented 4 months ago

Either way, I will close this issue and thank you for your help.

This has actually unblocked me on porting my project to NVC as well, and finally trying out the coverage feature :)

nickg commented 4 months ago

Have you tried using a smaller number of concurrent jobs like -p 10? In my testing with current NVC/VUnit master it's faster than ghdl-gcc on all -p values up to 16 on a machine with 24 logical cores. I can't test more because there's not enough memory. Have you checked it isn't swapping with -p 20? One unfortunate issue with the LLVM JIT is that it uses a lot of memory: I saw each process was using about 1.5 GB.
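A quick way to check for swapping while the regression is running would be something like:

free -h     # watch the Swap row during the -p 20 run
vmstat 5    # non-zero si/so columns mean pages are being swapped in/out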

Blebowski commented 4 months ago

Hi,

my CPU is a 12th-gen Intel with big+little cores (6 performance + 8 efficient). The performance cores have hyperthreading, so the remaining 6 logical cores come from there. I have 64 GB of RAM, and even with -p20 I have about 20 GB left.

I re-ran with various -p values. The shortest "Elapsed time" is with 14 cores. Could that be explained by there being only 14 physical cores? My guess would be yes.
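For reference, the physical vs. logical split can be checked on Linux with something like:

nproc                                  # logical CPUs (20 here)
lscpu | grep -E 'Thread|Core|Socket'   # threads per core and cores per socket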

As the number of cores grows, the total time spent in simulation also grows. The differences are somewhat flaky though, so at least 5 iterations of each would be good. If I have some more time I will try to write a script to profile it and produce some charts.

nvc_jit_after_fix_p10.txt nvc_jit_after_fix_p12.txt nvc_jit_after_fix_p14.txt nvc_jit_after_fix_p16.txt nvc_jit_after_fix_p18.txt nvc_jit_after_fix_p20.txt
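The script I have in mind would be something along these lines (testbench name and -p values as used above; purely a sketch, not written yet):

#!/bin/sh
# Sweep -p values and repeat each run a few times to average out the noise.
for p in 10 12 14 16 18 20; do
  for i in 1 2 3 4 5; do
    start=$(date +%s)
    VUNIT_SIMULATOR=nvc ./run.py tb_rtl_test_fast_asic -p"$p" > "nvc_p${p}_run${i}.log" 2>&1
    echo "p=$p run=$i elapsed=$(( $(date +%s) - $start ))s"
  done
done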

Blebowski commented 4 months ago

The runtime of a test likely depends on the kind of core it executes on. So device_id running for around 4 seconds with -p10 and -p12, and 12-14 seconds at higher -p values, can be explained by that. Beyond that, the cache can have an influence, but I don't know how to measure it.

Either way, -p14 is the best in my case and comparable to GHDL. Actually better, because GHDL takes about 10 seconds to analyze each reference_data_set_*.vhd, so its overall regression run-time is higher.

nickg commented 4 months ago

Are you using a build with --enable-debug? There's at least a 2x slow-down for elaboration with that on due to extra checking.

The shortest "Elapsed time" is with 14 cores. Could that be explained by only 14 physical cores ? My guess would be yes.

Each pair of hyperthreads shares an L1/L2 cache, so if only one thread is running it can use all of the cache, whereas if both are running each is only guaranteed half of it. So you should see a higher rate of cache misses when both are active, and VHDL simulations tend to be quite memory bound (i.e. they don't tend to stress the compute resources of the CPU).
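On Linux you can see which logical CPUs share a core and an L2 cache through sysfs, e.g. for logical CPU 0 (index2 is usually the L2 cache):

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list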

Do you think it would be possible to emit the code for such long constant arrays only once the constant gets accessed with --jit ?

At some point I want to implement a cache for the JIT so that it can re-use machine code if the source code hasn't changed. But it's quite complex to get right so I probably won't do it soon.

Blebowski commented 4 months ago

No, I configure without --enable-debug.

At some point I want to implement a cache for the JIT so that it can re-use machine code if the source code hasn't changed. But it's quite complex to get right so I probably won't do it soon.

I am looking forward to it.