VUnit / vunit

VUnit is a unit testing framework for VHDL/SystemVerilog
http://vunit.github.io/

regression works when running sequentially but fails when using multiple threads #788

Open vperrin59 opened 2 years ago

vperrin59 commented 2 years ago

I'm using VUnit with Verilog/SystemVerilog and Xcelium, and I have a strange issue. When I run my setup with multiple threads using the option:

python run.py --num-threads 5

Some tests are randomly failing. After some investigation, it looks like the issue is coming from the vunit_results report. In the failing tests the file is empty, and that's why they are parsed as skipped by the script.

Did somebody face a similar issue?

LarsAsplund commented 2 years ago

It sounds like a buffer that is not flushed properly. When a VUnit simulation ends properly, the VHDL part of the tool writes test_suite_done to the result file and then closes it: https://github.com/VUnit/vunit/blob/f344c8b5642d7aa13db2e16f6fc7151585ca96d0/vunit/vhdl/core/src/core_pkg.vhd#L43-L49 One would assume that the implementation of file_close also flushes any buffer, but I don't think the VHDL standard is explicit about this. What happens if you add a new line before line 48:

flush(test_results);

Does the error go away? I don't have access to Xcelium so I can't test it myself.

Could also be related to the operating system. Which one are you using?

vperrin59 commented 2 years ago

I forgot to mention: I added $fclose to vunit_pkg.sv (I'm using SystemVerilog), hoping it would magically fix the issue, but it did not.

  $fclose(trace_fd);
  exit_without_errors = 1;
  $finish(1);

I don't know of other ways to force flushing of the buffer in Verilog.

The operating system is CentOS

cmarqu commented 2 years ago

A way to get closer to finding out whether flushing is indeed the issue is to run xrun with the -unbuffered option (probably easiest by exporting XRUNOPTS). This will be slower, so it is not recommended for production use.

vperrin59 commented 2 years ago

I added $fflush(trace_fd) before the $fclose and it seems to have made it work. I'm running more iterations to confirm that it fixes the issue.
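
For reference, this is roughly what the end of the exit path looks like with that change in place (a sketch based on the snippet quoted earlier, not the exact vunit_pkg.sv code):

  // Flush any buffered result data before closing, then terminate as before.
  $fflush(trace_fd);
  $fclose(trace_fd);
  exit_without_errors = 1;
  $finish(1);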

vperrin59 commented 2 years ago

A way to get closer to finding out whether flushing is indeed the issue is to run xrun with the -unbuffered option (probably easiest by exporting XRUNOPTS). This will be slower, so it is not recommended for production use.

Thanks for the tip, useful to know

LarsAsplund commented 2 years ago

@vperrin59 I'm not really a SystemVerilog user, but I think the code we have looks a bit dangerous. There is no $fclose, so it looks like it relies on implicit $fclose rules in SystemVerilog. Are there such rules? Even if that is implemented, it doesn't seem to work, since your explicit $fclose doesn't fix the problem. I would expect $fclose to also do a flush, but if your fix works, that is not the case.

vperrin59 commented 2 years ago

@LarsAsplund I agree with you. I would have thought $fclose would fix the issue, even though it looks like the simulator automatically closes the file handle when the simulation finishes. On my side, it seems only the $fflush fixes the issue.

LarsAsplund commented 2 years ago

@vperrin59 I looked into the standard and I couldn't find any guidance. I think we should add both $fflush and $fclose. Want to make a PR?

vperrin59 commented 2 years ago

So I've run more iterations during the night, and the issue still appears (though less frequently, it seems). I have to debug it further.

vperrin59 commented 2 years ago

Ok, I made some progress. It looks like the issue is coming from the fact that the runner_cfg parameter gets assigned the wrong value: it gets assigned the value intended for another test. That is really strange, because the content of xrun_simulate.args looks correct.

LarsAsplund commented 2 years ago

Do you have a way to distinguish between randomly failing to write to the file and randomly failing to complete the simulation for other reasons before VUnit writes to the file? I've had experiences with other simulators failing to complete without any message whatsoever. You could try to write something to another file before completing. If flushing fails completely at random, you would not expect both writes to fail, at least not all the time.
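
A rough SystemVerilog sketch of that debug idea (the task and file names are made up for illustration):

  // Hypothetical helper: drop a marker file just before the normal exit path,
  // so a missing vunit_results file can be told apart from a simulation that
  // never reached the end.
  task automatic write_debug_marker;
    integer dbg_fd;
    dbg_fd = $fopen("sim_completed.marker", "w");
    $fdisplay(dbg_fd, "simulation reached the normal exit path");
    $fflush(dbg_fd);
    $fclose(dbg_fd);
  endtask

If the marker file shows up while vunit_results is still empty, the simulation did complete and the problem is in writing or flushing the results; if neither file appears, the simulation died earlier.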

vperrin59 commented 2 years ago

I still haven't found the root cause of the issue. It happens randomly, and it is not always the same test that fails. Something I noticed, though:

For example, if the tests start in the following order:

test1, test2, test3, test4

and the issue appears in test3, then test3 gets the runner_cfg param value intended for test2.

vperrin59 commented 2 years ago

It may look like an Xcelium simulator issue when using the -gpg argument with multiple threads. I could not find any explanation. I found a workaround by assigning runner_cfg with the $value$plusargs function. Unfortunately, I do not have another simulator to test with. It would be interesting to know whether someone with a similar setup (Xcelium, Verilog, multiple threads) sees the same issue.
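
For illustration, the workaround boils down to something like this in the testbench (a sketch only; module and plusarg names are mine, and the stock flow passes runner_cfg as a module parameter instead):

module tb_runner_cfg_plusarg;
  // Read runner_cfg from a plusarg, e.g. xrun ... +runner_cfg="...",
  // instead of from an elaboration-time parameter override.
  string runner_cfg;
  initial begin
    if (!$value$plusargs("runner_cfg=%s", runner_cfg))
      $fatal(1, "No +runner_cfg plusarg given");
    $display("runner_cfg = %s", runner_cfg);
  end
endmodule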

vperrin59 commented 2 years ago

@LarsAsplund I finally root-caused the issue (that was a tough debug). It looks like there is a fundamental issue with the VUnit Incisive/Xcelium flow when running multiple threads and overriding parameters. I could fix it locally. The issue is related to the cds.lib file: the parallel sims were pointing to the same work library, and that caused the issue.

LarsAsplund commented 2 years ago

What is your workaround? I'm working on another feature which will require parallel simulations to start from different directories. I have no experience with Xcelium and the cds.lib, so it might not make any difference for your issue. @cmarqu, any thoughts on this?

vperrin59 commented 2 years ago

I'm actually surprised that this issue hasn't been reported before. Maybe not that many people use the same config as me. This is the patch I used to fix the issue:

diff --git a/vunit/sim_if/xcelium.py b/vunit/sim_if/xcelium.py
index 6196a5ee..c6ff6153 100644
--- a/vunit/sim_if/xcelium.py
+++ b/vunit/sim_if/xcelium.py
@@ -311,6 +311,12 @@ define work "{2}/libraries/work"
         else:
             steps = ["elaborate", "simulate"]

+        sim_cds_lib = os.path.join(script_path, "cds.lib")
+
+        cds = CDSFile.parse(self._cdslib)
+        # Change work path
+        cds["work"] = f"{script_path}/work"
+        cds.write(sim_cds_lib)
         for step in steps:
             cmd = str(Path(self._prefix) / "xrun")
             args = []
@@ -342,7 +348,7 @@ define work "{2}/libraries/work"
             #     '-xmlibdirname "%s"' % (str(Path(self._output_path) / "libraries"))
             # ]  # @TODO: ugly
             args += config.sim_options.get("xcelium.xrun_sim_flags", [])
-            args += ['-cdslib "%s"' % self._cdslib]
+            args += ['-cdslib "%s"' % sim_cds_lib]
             args += self._hdlvar_args()
             args += ['-log "%s"' % str(Path(script_path) / ("xrun_%s.log" % step))]
             if not self._log_level == "debug":

cmarqu commented 2 years ago

@vperrin59 What this is doing (unless I'm wrong) is compiling and elaborating the full snapshot with your design several times. While this works, it's more space- and time-efficient to do something like this in the elaborate step:

xrun -cdslib cds.lib mymodule.vhdl -elaborate -xmlibdirname ./snapshot

and this in the simulate step:

mkdir -p run1 && xrun -cdslib cds.lib -r mymodule -xmlibdirname ./snapshot -cds_implicit_tmpdir run1 -l run1.log ...
mkdir -p run2 && xrun -cdslib cds.lib -r mymodule -xmlibdirname ./snapshot -cds_implicit_tmpdir run2 -l run2.log ...
mkdir -p run3 && xrun -cdslib cds.lib -r mymodule -xmlibdirname ./snapshot -cds_implicit_tmpdir run3 -l run3.log ...

(Some of the options already exist in the vunit code.) There is a tutorial for this in the docs somewhere.

vperrin59 commented 2 years ago

@cmarqu I don't think you can use only one elab command. If you have one tb with a 2-bit param and you have 4 tests that set 4 different values for the param, then you need to call elaborate 4 times. Do you agree?
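
To make that concrete, a parameter like the hypothetical 2-bit one below is baked in at elaboration time, so four different override values mean four elaborated snapshots:

module tb_example #(
  parameter [1:0]  mode       = 2'd0,  // hypothetical 2-bit parameter set per test
  parameter string runner_cfg = ""     // VUnit passes runner_cfg as a parameter in this flow
);
  initial $display("elaborated with mode = %0d", mode);
endmodule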

Moreover, with that being said, it would be more flexible to have runner_cfg defined not as a param but as a string variable assigned with $value$plusargs. That way we wouldn't need to call elaborate for each test.

cmarqu commented 2 years ago

Indeed, the example I was taking this from uses plusargs to differentiate between parallel runs. But it might still work without plusargs: it is possible to stack elaborated snapshots (at least with what they call "Multi-snapshot incremental elaboration (MSIE)" and the related options), though I don't know offhand if this also works with -cds_implicit_tmpdir.