bespoke-silicon-group / bsg_replicant

BSG Replicant: Cosimulation and Emulation Infrastructure for HammerBlade
BSD 3-Clause "New" or "Revised" License
26 stars 20 forks source link

Updating to VCS 2020 causes simulation to fail #638

Closed drichmond closed 3 years ago

drichmond commented 4 years ago
        Offending '(((reset_r !== 1'b0) | (~link_sif_i_cast.rev.v)) | ({return_packet.y_cord, return_packet.x_cord} == {my_y_i, my_x_i}))'
Error: "/mnt/users/spin0/no_backup/drichmond/Research/repositories/git/bsg_bladerunner/bsg_manycore/v/bsg_manycore_endpoint_standard.v", 373: manycore_tb_top.mc_dpi.mc_ep_to_fifos.epsd.unnamed$$_0: at time 1000000 ps
## errant credit packet v=x for YX=x,x landed at YX=1,0 (manycore_tb_top.mc_dpi.mc_ep_to_fifos.epsd)
$finish called from file "/mnt/users/spin0/no_backup/drichmond/Research/repositories/git/bsg_bladerunner/bsg_manycore/v/bsg_manycore_endpoint_standard.v", line 380.
$finish at simulation time              1000000

Steps to reproduce:

  1. Checkout 2020_bug branch in this repository (basejump and manycore are master)
  2. Change VCS_HOME to Q-2020.03-SP1 (now installed on kk9)
  3. Run make test_rom.log

(I have a VPD file but git wont let it be attached)

tommydcjung commented 4 years ago

Does this bug show up with spmd tests as well?

drichmond commented 4 years ago

Unclear -

If I run make in spmd/fib I get this error during compilation:

Error-[MTL] Memory Too Large
/mnt/users/spin0/no_backup/drichmond/Research/repositories/git/bsg_bladerunner/bsg_manycore/testbenches/common/v/bsg_nonsynth_manycore_axi_mem.v, 50
"ram"
  The size of memory is too large to be handled. The maximum is 2GB - 1.
tommydcjung commented 4 years ago

Any particular reason why we need to switch to vcs 2020?

drichmond commented 4 years ago

It's the new hotness.

drichmond commented 4 years ago

2020 is the version of VCS provided on a different system where I am trying to simulate HammerBlade.

If the fix is too onerous then we should discuss it, but if this will appear in the future we should know why.

dpetrisko commented 4 years ago

I'd check if flags have changed? Either way, let me know what the problem/fix ends up being, I'm confident we'll encounter the same thing with BP

drichmond commented 4 years ago

OK, I was able to reproduce an issue in spmd.

Using the 2020_bug branch in bsg_manycore (https://github.com/bespoke-silicon-group/bsg_manycore/tree/2020_bug)

inside of spmd/fib, I switched BSG_MACHINE_PATH to 4x4_fast_fake to get around the MTL error. (This is code change is in the branch)

Running make in spmd/fib produces this error:

"/mnt/users/spin0/no_backup/drichmond/Research/repositories/git/bsg_bladerunner/bsg_manycore/../basejump_stl/bsg_noc/bsg_mesh_router.v", 96: spmd_testbench.nnet.DUT.y[1].x[0].tile.rtr.fwd.bmr.dor_decoder.XY_dor.unnamed$$_0: started at 3000ps failed at 3000ps
        Offending '((reset_i !== 1'b0) | (~((v_i[N] & (~x_eq[N])) | (v_i[S] & (~x_eq[S])))))'
Error: "/mnt/users/spin0/no_backup/drichmond/Research/repositories/git/bsg_bladerunner/bsg_manycore/../basejump_stl/bsg_noc/bsg_mesh_router.v", 96: spmd_testbench.nnet.DUT.y[1].x[0].tile.rtr.fwd.bmr.dor_decoder.XY_dor.unnamed$$_0: at time 3000 ps
spmd_testbench.nnet.DUT.y[1].x[0].tile.rtr.fwd.bmr.dor_decoder.XY_dor:Y dim to X dim routing. XY_order_p = 00000000000000000000000000000001
$finish called from file "/mnt/users/spin0/no_backup/drichmond/Research/repositories/git/bsg_bladerunner/bsg_manycore/../basejump_stl/bsg_noc/bsg_mesh_router.v", line 104.

Different error, but perhaps related?

ptpan commented 3 years ago

We have upgraded to VCS 2020.12 since our license of the old VCS expired. I'm seeing the same error message from examples/cuda/test_vec_add:

    Offending '(((reset_r !== 1'b0) | (~link_sif_i_cast.rev.v)) | ({return_packet.y_cord, return_packet.x_cord} == {my_y_i, my_x_i}))'
Error: "/work/global/pp482/hammerblade/bsg_bladerunner/bsg_manycore/v/bsg_manycore_endpoint_standard.v", 373: manycore_tb_top.mc_dpi.mc_ep_to_fifos.epsd.unnamed$$_0: at time 1000000 ps
## errant credit packet v=x for YX= x, x landed at YX= 1, 0 (manycore_tb_top.mc_dpi.mc_ep_to_fifos.epsd)

Any idea why this happened?

drichmond commented 3 years ago

No idea why it's happening (I wasn't able to diagnose, yet), but uncommenting the assertion works as a quick workaround.

drichmond commented 3 years ago

Current theory is a race condition and/or glitching.

@ptpan can you try switching this to assert final instead of assert? (@tommydcjung, @taylor-bsg is this the right application of final, like you suggested)

I will try locally as well

ptpan commented 3 years ago

Current theory is a race condition and/or glitching.

@ptpan can you try switching this to assert final instead of assert? (@tommydcjung, @taylor-bsg is this the right application of final, like you suggested)

I will try locally as well

I tried assert final on that line but I got the same error message. I looked at the waveform and it seems like the reset generator doesn't produce a HIGH async_reset_o from t = 0?

https://github.com/bespoke-silicon-group/basejump_stl/blob/9549455072fb849381da7e0a3157a54d18ef2654/bsg_test/bsg_nonsynth_reset_gen.v#L16

^ I was able to get one test program (make test_vec_add) working by passing in 1 to the reset_cycles_lo_p parameter. Not sure if that's the ultimate solution though...

drichmond commented 3 years ago

Does setting reset_cycles_lo_p to 1 make async_reset_o = 1 at t = 0?

ptpan commented 3 years ago

Does setting reset_cycles_lo_p to 1 make async_reset_o = 1 at t = 0?

Yes it does that with the new VCS... Maybe this is because the new VCS somehow triggers this negedge block at time zero?

https://github.com/bespoke-silicon-group/basejump_stl/blob/9549455072fb849381da7e0a3157a54d18ef2654/bsg_test/bsg_nonsynth_reset_gen.v#L40

ptpan commented 3 years ago

After some discussion with @drichmond we found that inside the async reset generator, the new VCS yields a different value for signal ctr_lo_r as shown in the following screen shot:

VCS 2020.12 waveform

This waveform confirms that ctr_lo_r has a value of 1 with the new VCS, whereas it is expected to have a value of 0 to correctly generate a reset signal. This can be explained if there is a negative edge of clk_i[i] during simulation time zero, which triggers the negedge always block in the reset generator. According to this article, some early version of VCS did treat 2-state variable initialization as a transition to value zero (essentially a negative edge):

VCS version 2006.06 does not fully adhere to the IEEE System Verilog
standard for 2-state types.  The standard says that 2-state types should
begin simulation with a value of zero.  In VCS, 2-state variable types
automatically transition to zero sometime during simulation time zero.  This
will cause a negative edge transition at simulation time zero, which might,
depending on event ordering, trigger the design procedural blocks that are
looking for a negative edge transition.

So the current hypothesis is that VCS 2020 somehow has the same behavior as earlier VCSs -- registering (2-state) variable initializations as pos/neg edges.

taylor-bsg commented 3 years ago

Can you post the screenshot of those same value in the older working version of VCS?

On Thu, Jan 7, 2021 at 3:03 PM Peitian Pan notifications@github.com wrote:

After some discussion with @drichmond https://github.com/drichmond we found that inside the async reset generator https://github.com/bespoke-silicon-group/basejump_stl/blob/9549455072fb849381da7e0a3157a54d18ef2654/bsg_test/bsg_nonsynth_reset_gen.v#L15, the new VCS yields a different value for signal ctr_lo_r as shown in the following screen shot:

[image: VCS 2020.12 waveform] https://user-images.githubusercontent.com/31896789/103952176-f2829180-510d-11eb-8365-87647277e091.png

This waveform confirms that ctr_lo_r has a value of 1 with the new VCS, whereas it is expected to have a value of 0 to correctly generate a reset signal. This can be explained if there is a negative edge of clk_i[i] during simulation time zero, which triggers the negedge always block in the reset generator. According to this http://www.deepchip.com/items/0466-03.html article, some early version of VCS did treat 2-state variable initialization as a transition to value zero (essentially a negative edge):

VCS version 2006.06 does not fully adhere to the IEEE System Verilog standard for 2-state types. The standard says that 2-state types should begin simulation with a value of zero. In VCS, 2-state variable types automatically transition to zero sometime during simulation time zero. This will cause a negative edge transition at simulation time zero, which might, depending on event ordering, trigger the design procedural blocks that are looking for a negative edge transition.

So the current hypothesis is that VCS 2020 somehow has the same behavior as earlier VCSs -- registering (2-state) variable initializations as pos/neg edges.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/bespoke-silicon-group/bsg_replicant/issues/638#issuecomment-756439421, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEFG5AE3WXDP3W6VNCF2UNTSYY4TVANCNFSM4SPTC74A .

dpetrisko commented 3 years ago

Two tangential points:

1) there is a verilator option to deal with this, I wonder if VCS has a similar option to emulate other VCS versions.

--x-initial-edge
Enables emulation of event driven simulators which generally trigger an edge on a transition from X to 1 (posedge) or X to 0 (negedge). Thus the following code, where rst_n is uninitialized would set res_n to 1'b1 when rst_n is first set to zero:

    reg  res_n = 1'b0;

    always @(negedge rst_n) begin
       if (rst_n == 1'b0) begin
          res_n <= 1'b1;
       end
    end
In Verilator, by default, uninitialized clocks are given a value of zero, so the above always block would not trigger.

While it is not good practice, there are some designs that rely on X → 0 triggering a negedge, particularly in reset sequences. Using --x-initial-edge with Verilator will replicate this behavior. It will also ensure that X → 1 triggers a posedge.

Note. Some users have reported that using this option can affect convergence, and that it may be necessary to use --converge-limit to increase the number of convergence iterations. This may be another indication of problems with the modeled design that should be addressed.
  1. This is a specialized case of assertion problems during reset. Often, what you really want to do is to disable assertions before reset has lowered. We do this in BP here: https://github.com/black-parrot/black-parrot/blob/master/bp_top/test/tb/bp_tethered/test_bp.sv
initial
  begin
    $assertoff();
    @(posedge clk);
    @(negedge reset);
    $asserton();
  end

Of course, for something like ASIC bringup, you may want those assertions, but most tools let you selectively enable certain assertions (or just use logging messages in those specialized modules)

ptpan commented 3 years ago

Can you post the screenshot of those same value in the older working version of VCS?

Here is the waveform of the same signals with the older but working VCS

waveform-vcs2018

ctr_lo_r has value 0 here. The difference between ctr_lo_r[0:0][0:0] and ctr_lo_r[0:0][-1:0] doesn't seem to be related to the issue.

taylor-bsg commented 3 years ago

Fixed by https://github.com/bespoke-silicon-group/basejump_stl/pull/378