Closed drichmond closed 3 years ago
Does this bug show up with spmd tests as well?
Unclear -
If I run make
in spmd/fib I get this error during compilation:
Error-[MTL] Memory Too Large
/mnt/users/spin0/no_backup/drichmond/Research/repositories/git/bsg_bladerunner/bsg_manycore/testbenches/common/v/bsg_nonsynth_manycore_axi_mem.v, 50
"ram"
The size of memory is too large to be handled. The maximum is 2GB - 1.
Any particular reason why we need to switch to vcs 2020?
It's the new hotness.
2020 is the version of VCS provided on a different system where I am trying to simulate HammerBlade.
If the fix is too onerous then we should discuss it, but if this will appear in the future we should know why.
I'd check if flags have changed? Either way, let me know what the problem/fix ends up being, I'm confident we'll encounter the same thing with BP
OK, I was able to reproduce an issue in spmd.
Using the 2020_bug branch in bsg_manycore (https://github.com/bespoke-silicon-group/bsg_manycore/tree/2020_bug)
inside of spmd/fib, I switched BSG_MACHINE_PATH to 4x4_fast_fake to get around the MTL error. (This is code change is in the branch)
Running make
in spmd/fib produces this error:
"/mnt/users/spin0/no_backup/drichmond/Research/repositories/git/bsg_bladerunner/bsg_manycore/../basejump_stl/bsg_noc/bsg_mesh_router.v", 96: spmd_testbench.nnet.DUT.y[1].x[0].tile.rtr.fwd.bmr.dor_decoder.XY_dor.unnamed$$_0: started at 3000ps failed at 3000ps
Offending '((reset_i !== 1'b0) | (~((v_i[N] & (~x_eq[N])) | (v_i[S] & (~x_eq[S])))))'
Error: "/mnt/users/spin0/no_backup/drichmond/Research/repositories/git/bsg_bladerunner/bsg_manycore/../basejump_stl/bsg_noc/bsg_mesh_router.v", 96: spmd_testbench.nnet.DUT.y[1].x[0].tile.rtr.fwd.bmr.dor_decoder.XY_dor.unnamed$$_0: at time 3000 ps
spmd_testbench.nnet.DUT.y[1].x[0].tile.rtr.fwd.bmr.dor_decoder.XY_dor:Y dim to X dim routing. XY_order_p = 00000000000000000000000000000001
$finish called from file "/mnt/users/spin0/no_backup/drichmond/Research/repositories/git/bsg_bladerunner/bsg_manycore/../basejump_stl/bsg_noc/bsg_mesh_router.v", line 104.
Different error, but perhaps related?
We have upgraded to VCS 2020.12 since our license of the old VCS expired. I'm seeing the same error message from examples/cuda/test_vec_add
:
Offending '(((reset_r !== 1'b0) | (~link_sif_i_cast.rev.v)) | ({return_packet.y_cord, return_packet.x_cord} == {my_y_i, my_x_i}))'
Error: "/work/global/pp482/hammerblade/bsg_bladerunner/bsg_manycore/v/bsg_manycore_endpoint_standard.v", 373: manycore_tb_top.mc_dpi.mc_ep_to_fifos.epsd.unnamed$$_0: at time 1000000 ps
## errant credit packet v=x for YX= x, x landed at YX= 1, 0 (manycore_tb_top.mc_dpi.mc_ep_to_fifos.epsd)
Any idea why this happened?
No idea why it's happening (I wasn't able to diagnose, yet), but uncommenting the assertion works as a quick workaround.
Current theory is a race condition and/or glitching.
@ptpan can you try switching this to assert final
instead of assert
? (@tommydcjung, @taylor-bsg is this the right application of final, like you suggested)
I will try locally as well
Current theory is a race condition and/or glitching.
@ptpan can you try switching this to
assert final
instead ofassert
? (@tommydcjung, @taylor-bsg is this the right application of final, like you suggested)I will try locally as well
I tried assert final
on that line but I got the same error message. I looked at the waveform and it seems like the reset generator doesn't produce a HIGH async_reset_o
from t = 0?
^ I was able to get one test program (make test_vec_add
) working by passing in 1
to the reset_cycles_lo_p
parameter. Not sure if that's the ultimate solution though...
Does setting reset_cycles_lo_p
to 1
make async_reset_o = 1
at t = 0?
Does setting
reset_cycles_lo_p
to1
makeasync_reset_o = 1
at t = 0?
Yes it does that with the new VCS... Maybe this is because the new VCS somehow triggers this negedge block at time zero?
After some discussion with @drichmond we found that inside the async reset generator, the new VCS yields a different value for signal ctr_lo_r
as shown in the following screen shot:
This waveform confirms that ctr_lo_r
has a value of 1 with the new VCS, whereas it is expected to have a value of 0 to correctly generate a reset signal. This can be explained if there is a negative edge of clk_i[i]
during simulation time zero, which triggers the negedge always block in the reset generator. According to this article, some early version of VCS did treat 2-state variable initialization as a transition to value zero (essentially a negative edge):
VCS version 2006.06 does not fully adhere to the IEEE System Verilog
standard for 2-state types. The standard says that 2-state types should
begin simulation with a value of zero. In VCS, 2-state variable types
automatically transition to zero sometime during simulation time zero. This
will cause a negative edge transition at simulation time zero, which might,
depending on event ordering, trigger the design procedural blocks that are
looking for a negative edge transition.
So the current hypothesis is that VCS 2020 somehow has the same behavior as earlier VCSs -- registering (2-state) variable initializations as pos/neg edges.
Can you post the screenshot of those same value in the older working version of VCS?
On Thu, Jan 7, 2021 at 3:03 PM Peitian Pan notifications@github.com wrote:
After some discussion with @drichmond https://github.com/drichmond we found that inside the async reset generator https://github.com/bespoke-silicon-group/basejump_stl/blob/9549455072fb849381da7e0a3157a54d18ef2654/bsg_test/bsg_nonsynth_reset_gen.v#L15, the new VCS yields a different value for signal ctr_lo_r as shown in the following screen shot:
[image: VCS 2020.12 waveform] https://user-images.githubusercontent.com/31896789/103952176-f2829180-510d-11eb-8365-87647277e091.png
This waveform confirms that ctr_lo_r has a value of 1 with the new VCS, whereas it is expected to have a value of 0 to correctly generate a reset signal. This can be explained if there is a negative edge of clk_i[i] during simulation time zero, which triggers the negedge always block in the reset generator. According to this http://www.deepchip.com/items/0466-03.html article, some early version of VCS did treat 2-state variable initialization as a transition to value zero (essentially a negative edge):
VCS version 2006.06 does not fully adhere to the IEEE System Verilog standard for 2-state types. The standard says that 2-state types should begin simulation with a value of zero. In VCS, 2-state variable types automatically transition to zero sometime during simulation time zero. This will cause a negative edge transition at simulation time zero, which might, depending on event ordering, trigger the design procedural blocks that are looking for a negative edge transition.
So the current hypothesis is that VCS 2020 somehow has the same behavior as earlier VCSs -- registering (2-state) variable initializations as pos/neg edges.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/bespoke-silicon-group/bsg_replicant/issues/638#issuecomment-756439421, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEFG5AE3WXDP3W6VNCF2UNTSYY4TVANCNFSM4SPTC74A .
Two tangential points:
1) there is a verilator option to deal with this, I wonder if VCS has a similar option to emulate other VCS versions.
--x-initial-edge
Enables emulation of event driven simulators which generally trigger an edge on a transition from X to 1 (posedge) or X to 0 (negedge). Thus the following code, where rst_n is uninitialized would set res_n to 1'b1 when rst_n is first set to zero:
reg res_n = 1'b0;
always @(negedge rst_n) begin
if (rst_n == 1'b0) begin
res_n <= 1'b1;
end
end
In Verilator, by default, uninitialized clocks are given a value of zero, so the above always block would not trigger.
While it is not good practice, there are some designs that rely on X → 0 triggering a negedge, particularly in reset sequences. Using --x-initial-edge with Verilator will replicate this behavior. It will also ensure that X → 1 triggers a posedge.
Note. Some users have reported that using this option can affect convergence, and that it may be necessary to use --converge-limit to increase the number of convergence iterations. This may be another indication of problems with the modeled design that should be addressed.
initial
begin
$assertoff();
@(posedge clk);
@(negedge reset);
$asserton();
end
Of course, for something like ASIC bringup, you may want those assertions, but most tools let you selectively enable certain assertions (or just use logging messages in those specialized modules)
Can you post the screenshot of those same value in the older working version of VCS? …
Here is the waveform of the same signals with the older but working VCS
ctr_lo_r
has value 0
here. The difference between ctr_lo_r[0:0][0:0]
and ctr_lo_r[0:0][-1:0]
doesn't seem to be related to the issue.
Steps to reproduce:
make test_rom.log
(I have a VPD file but git wont let it be attached)