How to apply the "3-step flow" using Questa

jfrensch commented 1 year ago

Hi all,

my understanding is, that the VUnit framework (always?) uses the "2-step flow" (-> vcom & vsim) of Questa, where the 'vopt' step is automatically applied during 'vsim'. Unfortunaly, I need some intermediate result of the vopt step to be able to use the 'Visuallizer' to analyze the results of a simulation:

  "vopt -debug -designfile design.bin -o tb_opt tb"

not only provides an optimized design 'tb_opt' of the original testbench 'tb' for faster simulation, but also a database (-> design.bin) required by the 'Visualizer' to correlate the simulation results from 'tb_opt' to the design being simulated.

How can/should I apply that 3rd step (vopt) within the VUnit framework?

Many thanks for advice Jochen

tasgomes commented 1 month ago

@SzymonHitachi Currently, VUnit does not distinguish Modelsim from Questa. To use Questa you need to define modelsim as simulator and then provide the path to the Questa installation folder. For instance:

environ["VUNIT_SIMULATOR"] = "modelsim"
environ["VUNIT_MODELSIM_PATH"] = "C:/intelFPGA_pro/21.2/questa_fe/win64"

outdoorsweden commented 1 month ago

Mikael Andersson, Siemens EDA here! I have not read the whole thread. I've just browsed through it. I'd recommend to use the three step flow (compile, optimize, simulate).

Here you need to make some choices when it comes to optimizations. The alternatives are:

NOTE! I've tried the first alternative and that is not working. So it is the second alternative that is the best option.

Optimize each test. Pro: Best performance on each simulation. Drawback: You might get issues with locks when multiple vopt are started at the same time. But this can be handled by using the undocumented switch "-nolock" to vopt and by making sure that the output from vopt has a unique name. So for each test it would run:

>vopt tb -nolock -g <test generics> -o tb_<testname>_opt -debug -designfile design_<test>.bin
>vsim tb_<testname>_opt -qwave=+signal+msg+assertion=pass ....

Create one common optimized version. Pro: Just one optimization. Drawback: Could affect performance significantly if you apply floatgenerics to aggressive. Also you will need to use a hybrid elaboration flow for Visualizer (Not hard).

>vcom .... 
>vopt tb -floatgenerics runner_cfg -g <default test generics> -o tb_dbg_opt -debug -designfile design.bin

And then foreach test:

>vsim tb_dbg_opt -g <test generics> -qwave=+signal+msg+assertion=pass ....

You can also generate different optimizations with vopt. Like in the second case, you could generate one for performance in regression only and one for debug:

>vcom .... 
>vopt tb -floatgenerics runner_cfg -g <default test generics> -o tb_dbg_opt -debug -designfile design.bin
>vopt tb -floatgenerics runner_cfg -g <default test generics> -o tb_opt

And if you want best performance, you use the tb_opt version without logging: foreach test:

>vsim tb_opt -g <test generics> ....

Hope this is helpful! BR Mikael Siemens EDA Siemens Digital Industries Software

Evenemangsgatan 21 SE-169 03 Solna, Sweden Mobile: +46 70 932 9516 andersson.mikael@siemens.com www.sw.siemens.com

outdoorsweden commented 1 month ago

Mikael Andersson, Siemens EDA here again! I have now used our tools to show the effect of using qrun + the flow "Create one common optimized version" described above.

I have used the axi_dma example. The only change is that the "Random AXI configuration" test has been extended to run a bit longer. And I choose to measure the effect on a "clean" start. The way you would do in a Continuous Integration environment.

This picture is a screenshot from Questa Run Manager where I defined a flow that is suitable for regression of Vunit based testbenches: Notice that I compile and optimize for each testbench. Since the compile step is using qrun, the compile step is so fast that it really does not matter that we do three parallel compile. The fact that we use a library location for each testbench also allowes optimization in parallel without any lock files. The actual compile and optimize command I used was this:

qrun -64 -f .../qrun.files -work axi_dma_lib -optimize -top tb_axi_dma  -snapshot tb_axi_dma_opt 
-vopt.options  -floatgenerics+runner_cfg  +cover=bcesf+axi_dma. -end -outdir .../VRMDATA/my_run/testbench~tb_axi_dma/compile_and_optimize/qrun.out

or from the Makefile:

And in simulations I used:

qrun -64 -work axi_dma_lib -simulate -snapshot tb_axi_dma_opt -vsim.options -f .../tests/Perform_split_transfers.vunit.args 
-coverage -end -onfinish stop -do "coverage save Perform_split_transfers.ucdb -onexit;run -all;exit -f" -t ps -error vsim-3040

or from the Makefile:

So what about performance comparison with current Vunit flow and the flow above? The blue is first run, the orange one is the second run.

The main reason that multiple cpu's does not make a bigger difference for Vunit is that each simulation does optimization and creates a lock file which next simulation needs to wait until it is removed.

A comparison on only the compile time qrun vs Vunit:

Hope this is helpful! BR Mikael Siemens EDA Siemens Digital Industries Software

Evenemangsgatan 21 SE-169 03 Solna, Sweden Mobile: +46 70 932 9516 andersson.mikael@siemens.com www.sw.siemens.com

LarsAsplund commented 1 month ago

Hi @outdoorsweden,

The main reason that multiple cpu's does not make a bigger difference for VUnit is that each simulation does optimization and creates a lock file which next simulation needs to wait until it is removed.

VUnit currently runs vopt as part of the internal "simulation" step as it is a simpler first modification of the current VUnit structure. It is still reusing previous vopt runs though. If we have a testbench running 5 times (with different generics), there will only be one vopt run. That is visible in the debug logs of @tasgomes tests.

Currently, I wait for the lock file to be removed before releasing the thread such that a new simulation can begin. That can be improved by checking for lock files before beginning a simulation instead. If the next simulation is towards another library, there will be no wait time at all. Is that what you mean with

The fact that we use a library location for each testbench also allowes optimization in parallel without any lock files.

Or are you actively using the -nolock feature?

Regarding the difference between the 1 CPU run and the 5 CPU run. If I interpret your measurements correctly, there is no difference between qrun and VUnit for the simulation runs. In both cases, the orange bars becomes 24 time units faster in the 5 CPU case. In the best of worlds, the 5 CPU run would be 5x faster but in short tests like these, the simulator startup time becomes dominant.

I think you've confirmed that simultaneous vopts on the same library isn't possible but how deep does an optimization go? If testbench A and B are optimized towards different libraries but they both use module C, will they both try to optimize C? Or is vopt limited the the top-level and whatever C design already existing (optimized or not optimized) is the one being used?

outdoorsweden commented 1 month ago

VUnit currently runs vopt as part of the internal "simulation" step as it is a simpler first modification of the current VUnit structure. It is still reusing previous vopt runs though. If we have a testbench running 5 times (with different generics), there will only be one vopt run. That is visible in the debug logs of @tasgomes tests.

This does indeed look like a race condition. I have never seen anything like it. How do you check that vopt has finished before you start vsim? Because what you normally would do is:

compile
optimize
Launch simulations in parallel when optimize has finished If you can recreate the problem, file a SR on https://support.sw.siemens.com/en-US/ and provide the test case.

Or are you actively using the -nolock feature? No , it is undocumented and does not work the way I expected it.

The fact that we use a library location for each testbench also allowes optimization in parallel without any lock files.

So this is the directory structure that I get with Questa Run Manager (I have filtered away some stuff):

So each testbench have a qrun.out directory that contains all the libraries. And this is why I can run all the optimizations in parallel.

I think you've confirmed that simultaneous vopts on the same library isn't possible but how deep does an optimization go? If testbench A and B are optimized towards different libraries but they both use module C, will they both try to optimize C? Or is vopt limited the the top-level and whatever C design already existing (optimized or not optimized) is the one being used?

No, so all the machine code generated ends up in the optimized version. So if you want to optimize to different libraries, this might work (have not tested if it generates a lock file in the design lib or not:

I will update my Questa vrun application so that it tests this concept.

BR Mikael

LarsAsplund commented 1 month ago

@outdoorsweden Just to be clear. There were race conditions that we fixed and none of us experience any problems at this point. However, since we weren't sure about the inner workings it was hard to be fully confident that it would work for everyone.

Initially I used OS synchronization mechanisms to make sure that a vopt operation performed by one thread returns before another thread starts doing a new vopt in the same lib. A mistake in that code was the cause of one race condition.

That approach was not enough since the lock file may remain on the file system after the vopt call has returned. Probably due to delays in the file system. A second vopt to the same lib may see that lock file before it is deleted and then it fails. That was the cause of the other race condition we've seen. It was fixed by also checking for the presence of the lock file and wait for it to disappear before letting the next thread run.

I don't think checking the lock file alone is enough. I've observed that vopt can create and delete a lock file several times during its execution. For that reason, VUnit will use an OS lock to prevent other threads from starting a vopt on a lib already used by another vopt. That OS lock is removed when the first vopt call returns and there are no remaining lock files.

This means that we do follow your suggestion of 1. compile, 2. optimize, and 3 simulate for each testbench. However, step 2 and 3 are performed in several concurrent threads. No constraint are placed on what can be simulated in parallel. The only constraint now is that no concurrent vopt is allowed on the same lib.

We discussed compiling all libraries into multiple directories or simply make multiple copies of the original set of compiled libs. However, that doesn't feel like a nice solution considering we can have hundreds of testbenches that need their own library copy. I much rather give up the idea of concurrent vopts and have a single library set.

Optimizing each design to a separate lib sounds more interesting as it doesn't involve multiple copies. I will try that and see what lock files are being generated.

Final question about vopt. I feel I don't quite understand the basics. What is being optimized? From what I understand we only call vopt on the testbench (test_counter in your example) but never on the design being tested (the counter)

LarsAsplund commented 1 month ago

@outdoorsweden I made a quick test where the design is optimized into another directory. It looks like the lock files only appear in the new directory. I think it would work as a workaround if the current solution, which seems to work right now, proves to have some yet to be seen issues.

One issue that remains though is the issue with the slow test as discussed earlier in this thread:

Sometimes a random test case takes much longer time than it should but sometimes all tests run as expected. This problem is present with multiple threads even if I completely disable optimization. I looked briefly at it before and concluded that it is vsim that adds the extra time. Do you have any idea why that is? Currently all threads running vsim is doing that from the same working directory. I recall we have had discussions about that in the past. Is that a problem? Should all threads run in separate directories?

outdoorsweden commented 1 month ago

Final question about vopt. I feel I don't quite understand the basics. What is being optimized? From what I understand we only call vopt on the testbench (test_counter in your example) but never on the design being tested (the counter)

Vopt will optimize testbench and everything in it, including the design.

outdoorsweden commented 1 month ago

@LarsAsplund

Sometimes a random test case takes much longer time than it should but sometimes all tests run as expected. This problem is present with multiple threads even if I completely disable optimization. I looked briefly at it before and concluded that it is vsim that adds the extra time. Do you have any idea why that is? Currently all threads running vsim is doing that from the same working directory. I recall we have had discussions about that in the past. Is that a problem? Should all threads run in separate directories? Do you have the example so I can play with it?

LarsAsplund commented 1 month ago

@outdoorsweden This happens on the dummy testbench used in this thread https://github.com/VUnit/vunit/blob/three-step-flow/examples/vhdl/three_step_flow/tb_example.vhd which we run with 5 different settings for the value generic:

for value in range(5):
    test.add_config(name=f"{value}", generics=dict(value=value))

I don't think the test itself is significant so you could also test with the DMA example you had. Run all simulations in parallel in the same directory but remove optimization so that it doesn't play a role.

@tasgomes hasn't experience this so you may not experience anything either. That would suggest that there is another timing depending conflict over shared resources that has to be taken into account. modelsim.ini?

LarsAsplund commented 1 month ago

@outdoorsweden I forgot to "check the plug". Initially I got two one-month eval licenses from Innofour which was later "converted" to one-year licenses... I thought. Looks like my license file only contains a single license so the delay I see is simply that one vsim call gets stuck while waiting for a license. I will get in touch with Innofour and see if I can get this fixed.

LarsAsplund commented 3 weeks ago

I now got another license and the problems I saw disappeared as expected. I think our concept is good to go. I will review the work and take it from prototype to release quality. If nothing new shows up I will release it.

outdoorsweden commented 2 weeks ago

Sounds good! Will you include support for visualizer as well?

LarsAsplund commented 1 week ago

@outdoorsweden Eventually we should have the visualizer fully supported but I will release this feature first.

LarsAsplund commented 6 days ago

@outdoorsweden I tried to invoke Visualizer post-simulation. This works if I add the following vopt and vsim flags to the run script such that the input files needed by Visualizer are generated:

vu.set_sim_option("modelsim.vsim_flags", ["-qwavedb=+signal"])
vu.set_sim_option("modelsim.vopt_flags", ["-debug", "-designfile", "design.bin"])

When trying the live-simulation mode, I run into some problems. Our normal approach to GUI-based simulations is to call vsim like this:

vsim -gui -l path_to_transcript_file -do source path_to_a_do_file

The do file contains the actual vsim call with the design to simulate. One of the reasons for this approach is that the user has the option to define do/tcl files to be sourced before the simulation starts, i.e. such files are sourced before the vsim call in the do file.

If I try running Visualizer by replacing -gui with -visualizer, I get the following error:

An error occured while processing the vsim arguments, a design unit was not specified.

Looks like the -visualizer option needs the design to be specified on the same command line and not be "hidden" in a do file.

If I keep -gui and move -visualizer to the vsim call within the do file, there is no complaint but there is no Visualizer GUI popping up (only the VSIM GUI).

We use the same do file approach when running batch mode simulations (-gui is replaced by -c). I tested that with the same result. Visualizer doesn't start but the simulation completes as normal.

Any idea how this can be fixed?

LarsAsplund commented 5 days ago

I updated the example to emulate what embedded post-simulation Visualizer support would look like. If you run the run script with the --gui option every simulation will start the Visualizer after completion. Works without problems when running multiple threads in parallel

VUnit / vunit

How to apply the "3-step flow" using Questa #899