StanfordAHA / lassen

The PE for the second generation CGRA (garnet).
15 stars 3 forks source link

Switch to CW floats #178

Open Kuree opened 3 years ago

Kuree commented 3 years ago

Due to the PD environment change, we no longer support DW float IP. As a result, all the tests need to migrate to CW IPs.

Kuree commented 3 years ago

@jack-melchert Running regressions with CW IP. Found couple errors with current lassen. There are several failed tests. My guess is they are all related to the same bug.

p = op(inst=Inst(alu=ALU_t.FP_sub, signed=Signed_t.unsigned, lut=0, cond=Cond_t.Z, rega=Mode_t.BYPASS, data0=0, regb=Mode_... rege=Mode_t.BYPASS, bit1=Bit(False), regf=Mode_t.BYPASS, bit2=Bit(False)), func=<function <lambda> at 0x7f7af442fcb0>)
args = (<class 'hwtypes.fp_vector.FPVector[8,7,RoundingMode.RNE,False]'>(-0.0), <class 'hwtypes.fp_vector.FPVector[8,7,RoundingMode.RNE,False]'>(0.0))

    @pytest.mark.parametrize("op", [
        op(asm.fp_add(), lambda x, y: x + y),
        op(asm.fp_sub(), lambda x, y: x - y),
        op(asm.fp_mul(), lambda x, y: x * y)
    ])
    @pytest.mark.parametrize("args",
        [(BFloat16.random(), BFloat16.random()) for _ in range(NTESTS)] +
        list(product(fp_sign_vec+fp_zero_vec, fp_sign_vec+fp_zero_vec))
    )
    def test_fp_binary_op(op, args):
        inst = op.inst
        in0 = args[0]
        in1 = args[1]
        out = op.func(in0, in1)
        data0 = BFloat16.reinterpret_as_bv(in0)
        data1 = BFloat16.reinterpret_as_bv(in1)
        res, res_p, _ = pe(inst, data0, data1)
        assert res == BFloat16.reinterpret_as_bv(out)
        if CAD_ENV:
>           rtl_tester(op, data0, data1, res=res)

tests/test_pe.py:209: 

Here is an easy way to reproduce:

def test_fp_mul():
    # Regression test for https://github.com/StanfordAHA/lassen/issues/111
    inst = asm.fp_add()
    data0 = Data(0x8000)
    data1 = Data(0x8000)
    res, res_p, _ = pe(inst, data0, data1)
    if CAD_ENV:
        rtl_tester(inst, data0, data1, res=res)
    else:
        pytest.skip("Skipping since DW not available")
jack-melchert commented 3 years ago

Thanks for letting me know, I'll take a look.

Kuree commented 3 years ago

Here is the diff if you want to switch to CW IP:

diff --git a/tests/rtl_utils.py b/tests/rtl_utils.py
index bdf3a24..9629148 100644
--- a/tests/rtl_utils.py
+++ b/tests/rtl_utils.py
@@ -50,10 +50,10 @@ test_dir = "tests/build"
 # the coreir context causing a "redefinition of module" error
 magma.backend.coreir_.CoreIRContextSingleton().reset_instance()
 magma.compile(f"{test_dir}/WrappedPE", pe_circuit, output="coreir-verilog",
-              coreir_libs={"float_DW"})
+              coreir_libs={"float_CW"})

 # check if we need to use ncsim + cw IP
-cw_dir = "/cad/synopsys/dc_shell/J-2014.09-SP3/dw/sim_ver/"   # noqa
+cw_dir = "/cad/cadence/GENUS_19.10.000_lnx86/share/synth/lib/chipware/sim/verilog/CW/"   # noqa
 CAD_ENV = shutil.which("ncsim") and os.path.isdir(cw_dir)

@@ -109,13 +109,14 @@ def rtl_tester(test_op, data0=None, data1=None, bit0=None, bit1=None, bit2=None,
         tester.circuit.O1.expect(res_p)
     if CAD_ENV:
         # use ncsim
-        libs = ["DW_fp_mult.v", "DW_fp_add.v", "DW_fp_addsub.v"]
+        libs = ["CW_fp_mult.v", "CW_fp_add.v"]
         for filename in libs:
             copy_file(os.path.join(cw_dir, filename),
                       os.path.join(test_dir, filename))
         tester.compile_and_run(target="system-verilog", simulator="ncsim",
                                directory="tests/build/",
                                include_verilog_libraries=libs,
+                               flags=["-sv"],
                                skip_compile=True)
     else:
         libs = ["DW_fp_mult.v", "DW_fp_add.v"]
jack-melchert commented 3 years ago

https://github.com/rdaly525/coreir/issues/972

CoreIR has the ieee_complicance bit set to false when instantiating the CW and DW fp units. Changing this to 1 fixes this lassen issue. I'm not sure why it was set to 0 in the first place, but hopefully that will get resolved soon.

rdaly525 commented 3 years ago

Email message from Alex back in June 2019 about this issue:

"Ran synthesis with ieee_compliance on… it increases pe tile area by about 10% , and it also affects our critical path for the PE. It did not meet timing at 454 MHz (2.2ns period), whereas the core without ieee_compliance did.

Alex"

I suspect the reason we wanted to disable denormal numbers was due to the increase in area and longer critical paths.

rdaly525 commented 3 years ago

The PD team should decide if this area/critical path increase is possible before we switch it. Otherwise we should disable any tests that contain denormalized numbers.

Kuree commented 3 years ago

I don't think the test case here is related to denorm issue. 0x8000 is negative 0, a behavior specified by IEEE standard, as seen here: https://en.wikipedia.org/wiki/Signed_zero

The test case basically means if we have -0 + -0, we will get x.

Kuree commented 3 years ago

Based on CW fp_add documentation, when ieee_compliance=0, this is expected behavior:

A sA.expA.sigA        B sB.expB.sigB      Infinitely Precision Result (F)
± Zero                      ± Zero                     ± Zero

The documentation also states that -0 is not a denormal number.


Sign (s)    Exponent (exp)  Significand (sig)   Value               Short Name

s               0                0                   (-1)s0                  ±Zero

s               0                    0 < sig             (-1)s 21-bias0.sig Denormal
rdaly525 commented 3 years ago

I see. Are the X's originating from the CW_fp_add module itself? Or are we getting Xs as a result of some logic in lassen

Kuree commented 3 years ago

I did cross-examination on the documentation and waveforms, and I'm leaning towards stating that this is a bug in the CW_fp_add simulation IP.

For those of you who have access to Cadence support, this is the link to the documentation.

Here is the signals coming in/out from the IP: image

The status flag is set properly: 01 means zero. However, the z is set to x.

rdaly525 commented 3 years ago

I do not have access to the documentation, but I wonder if this is specified in the documentation. It might say that if the status indicates 0, then the z output could be anything (although that is a very stupid design decision)

Kuree commented 3 years ago

I don't think it's specified anywhere in the documentation. 0x0000 + 0x000 gives correct z and status.

Here is my take on how to fix this:

  1. turn on ieee_compliance. this is ideal if timing/area allowed
  2. extra logic to gate output to zero if status indicates zero.
rdaly525 commented 3 years ago

I suspect 2) would be a cheaper (area/timing) solution. I can stamp out a mux (with +0) in the CoreIR implementation. Tests still might break if they are looking for a -0 though.

Kuree commented 3 years ago

I asked Kathleen to synthesize PE with ieee_compliance on. I will report back the result once it's finished.

Did more testing and it seems like 0 + 0 returns x as well. -0 + 0, however, returns the correct result.

Kuree commented 3 years ago

Targeting at 1.1ns clock speed, here is the timing/area comparison. Notice that both CW_fp_add and CW_fp_mult are modified.

TL;DR; timing is similar; area is less than 2% increase.

ieee_compliance=1:

ieee_compliance=0:

Kuree commented 3 years ago

@rdaly525 I got green light to turn on ieee_compliance. Would you please make necessary changes sometime today or tomorrow so that we can test more? Thanks.

rdaly525 commented 3 years ago

https://github.com/StanfordAHA/lassen/pull/179 and https://github.com/rdaly525/coreir/pull/973 have the changes