CommonEvaluationPlatform / CEP

The Common Evaluation Platform (CEP), based on UCB's Chipyard Framework, is an SoC design that contains only license-unencumbered, freely available components.
BSD 3-Clause "New" or "Revised" License

COSIM: XCellium *occasional* failures on RHEL7 #7

Closed bchetwynd closed 1 year ago

bchetwynd commented 1 year ago

Some bareMetal tests "occasionally" fail when run on XCellium on RHEL7.

The initial theory, unintentional make parallelism, was proven false: repeated one-off runs of regTest still eventually produced an error.

The error seems to be that the baremetal executable does not properly get loaded into main memory. ...

bchetwynd commented 1 year ago

Some tests occasionally get hung up in the CEP bootrom on the "wfi" instruction. The issue is that smp_pause/smp_resume does not work as expected. For some reason, core #0 gets "locked up" in the main function of sd.c and thus never re-enables the other cores.

Others seem to cause a trap in the vprintfmt bare metal function, even when KPUTC is disabled.

bchetwynd commented 1 year ago

cpuId[X].driver C_LOG doesn't correctly print the simulation time under XCellium; it always prints 0 (see log snippet below)

...
       0 cep_tb.cpuId[2].driver C_LOG: check_bare_status: NOT DONE Status: cpuId = 2, i = 4
       0 cep_tb.cpuId[1].driver C_LOG: check_bare_status: NOT DONE Status: cpuId = 1, i = 4
       0 cep_tb.cpuId[2].driver C_LOG: access::RunClk mSlotId=0 mLocalId=2 numClk=1000
       0 cep_tb.cpuId[1].driver C_LOG: access::RunClk mSlotId=0 mLocalId=1 numClk=1000
...

bchetwynd commented 1 year ago

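Comparing two runs of macro4Mix, one that fails and one that passes, shows that in the failing run the four CPU drivers do NOT all process the "Program loading is completed" event at the same time: cpuId[2] reports it, and releases its tile reset, much later than the other three.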

Pass example:

       0 cep_tb.system_driver C_LOG: loadMemory: flushing cache line
       0 cep_tb.system_driver C_LOG: loadMemory: Setting program loaded flag
INFO:   384915 cep_tb.system_driver Program is now loaded
       0 cep_tb.cpuId[1].driver C_LOG: access::RunClk mSlotId=0 mLocalId=1 numClk=1000
       0 cep_tb.cpuId[2].driver C_LOG: access::RunClk mSlotId=0 mLocalId=2 numClk=1000
       0 cep_tb.cpuId[0].driver C_LOG: access::RunClk mSlotId=0 mLocalId=0 numClk=1000
       0 cep_tb.cpuId[3].driver C_LOG: access::RunClk mSlotId=0 mLocalId=3 numClk=1000
       0 cep_tb.cpuId[1].driver C_LOG: OK: Program loading is completed
       0 cep_tb.cpuId[2].driver C_LOG: OK: Program loading is completed
       0 cep_tb.cpuId[0].driver C_LOG: OK: Program loading is completed
       0 cep_tb.cpuId[3].driver C_LOG: OK: Program loading is completed
INFO:   400835 cep_tb.cpuId[1].driver.release_tile_reset Releasing Tile #1 reset...
INFO:   400835 cep_tb.cpuId[2].driver.release_tile_reset Releasing Tile #2 reset...
       0 cep_tb.cpuId[1].driver C_LOG: check_bare_status: cpuId = 1, maxTimeOut = 500
       0 cep_tb.cpuId[2].driver C_LOG: check_bare_status: cpuId = 2, maxTimeOut = 500
INFO:   400845 cep_tb.cpuId[0].driver.release_tile_reset Releasing Tile #0 reset...
INFO:   400845 cep_tb.cpuId[3].driver.release_tile_reset Releasing Tile #3 reset...
       0 cep_tb.cpuId[0].driver C_LOG: check_bare_status: cpuId = 0, maxTimeOut = 500
       0 cep_tb.cpuId[3].driver C_LOG: check_bare_status: cpuId = 3, maxTimeOut = 500
       0 cep_tb.cpuId[1].driver C_LOG: check_bare_status: NOT DONE Status: cpuId = 1, i = 1
       0 cep_tb.cpuId[2].driver C_LOG: check_bare_status: NOT DONE Status: cpuId = 2, i = 1
       0 cep_tb.cpuId[0].driver C_LOG: check_bare_status: NOT DONE Status: cpuId = 0, i = 1
       0 cep_tb.cpuId[1].driver C_LOG: access::RunClk mSlotId=0 mLocalId=1 numClk=1000
       0 cep_tb.cpuId[2].driver C_LOG: access::RunClk mSlotId=0 mLocalId=2 numClk=1000
       0 cep_tb.cpuId[3].driver C_LOG: check_bare_status: NOT DONE Status: cpuId = 3, i = 1
       0 cep_tb.cpuId[0].driver C_LOG: access::RunClk mSlotId=0 mLocalId=0 numClk=1000
       0 cep_tb.cpuId[3].driver C_LOG: access::RunClk mSlotId=0 mLocalId=3 numClk=1000

Fail example:

       0 cep_tb.system_driver C_LOG: loadMemory: flushing cache line
       0 cep_tb.system_driver C_LOG: loadMemory: Setting program loaded flag
INFO:   388405 cep_tb.system_driver Program is now loaded
       0 cep_tb.cpuId[1].driver C_LOG: access::RunClk mSlotId=0 mLocalId=1 numClk=1000
       0 cep_tb.cpuId[0].driver C_LOG: access::RunClk mSlotId=0 mLocalId=0 numClk=1000
       0 cep_tb.cpuId[3].driver C_LOG: access::RunClk mSlotId=0 mLocalId=3 numClk=1000
       0 cep_tb.cpuId[2].driver C_LOG: access::RunClk mSlotId=0 mLocalId=2 numClk=1000
       0 cep_tb.cpuId[1].driver C_LOG: OK: Program loading is completed
       0 cep_tb.cpuId[0].driver C_LOG: OK: Program loading is completed
       0 cep_tb.cpuId[3].driver C_LOG: OK: Program loading is completed
INFO:   400835 cep_tb.cpuId[1].driver.release_tile_reset Releasing Tile #1 reset...
       0 cep_tb.cpuId[1].driver C_LOG: check_bare_status: cpuId = 1, maxTimeOut = 500
INFO:   400845 cep_tb.cpuId[0].driver.release_tile_reset Releasing Tile #0 reset...
INFO:   400845 cep_tb.cpuId[3].driver.release_tile_reset Releasing Tile #3 reset...
       0 cep_tb.cpuId[0].driver C_LOG: check_bare_status: cpuId = 0, maxTimeOut = 500
       0 cep_tb.cpuId[3].driver C_LOG: check_bare_status: cpuId = 3, maxTimeOut = 500
       0 cep_tb.cpuId[1].driver C_LOG: check_bare_status: NOT DONE Status: cpuId = 1, i = 1
       0 cep_tb.cpuId[0].driver C_LOG: check_bare_status: NOT DONE Status: cpuId = 0, i = 1
       0 cep_tb.cpuId[1].driver C_LOG: access::RunClk mSlotId=0 mLocalId=1 numClk=1000
       0 cep_tb.cpuId[3].driver C_LOG: check_bare_status: NOT DONE Status: cpuId = 3, i = 1
       0 cep_tb.cpuId[0].driver C_LOG: access::RunClk mSlotId=0 mLocalId=0 numClk=1000
       0 cep_tb.cpuId[3].driver C_LOG: access::RunClk mSlotId=0 mLocalId=3 numClk=1000
INFO:   401025 cep_tb.dut.system.tile_prci_domain_1.tile_reset_domain.tile.core  C1: 19 [1] pc=[0000000000010000] W[r2=0000000000080000][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[00080137] DASM(00080137)
INFO:   401035 cep_tb.dut.system.tile_prci_domain_1.tile_reset_domain.tile.core  C1: 20 [1] pc=[0000000000010004] W[r2=000000000008007f][1] R[r2=0000000000080000] R[r0=0000000000000000] inst=[07f1011b] DASM(07f1011b)
INFO:   401045 cep_tb.dut.system.tile_prci_domain_1.tile_reset_domain.tile.core  C1: 21 [1] pc=[0000000000010008] W[r2=000000008007f000][1] R[r2=000000000008007f] R[r0=0000000000000000] inst=[00c11113] DASM(00c11113)
INFO:   401055 cep_tb.dut.system.tile_prci_domain_1.tile_reset_domain.tile.core  C1: 22 [1] pc=[000000000001000c] W[r18=0000000000000008][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[00800913] DASM(00800913)
INFO:   401065 cep_tb.dut.system.tile_prci_domain_1.tile_reset_domain.tile.core  C1: 23 [1] pc=[0000000000010010] W[r0=0000000000000000][1] R[r18=0000000000000008] R[r0=0000000000000000] inst=[30491073] DASM(30491073)
INFO:   401115 cep_tb.dut.system.tile_prci_domain_1.tile_reset_domain.tile.core  C1: 28 [1] pc=[0000000000010014] W[r9=0000000000000000][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[00000493] DASM(00000493)
INFO:   401115 cep_tb.dut.system.tile_prci_domain.tile_reset_domain.tile.core  C0: 27 [1] pc=[0000000000010000] W[r2=0000000000080000][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[00080137] DASM(00080137)
INFO:   401125 cep_tb.dut.system.tile_prci_domain_1.tile_reset_domain.tile.core  C1: 29 [1] pc=[0000000000010018] W[r18=0000000000000001][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[f1402973] DASM(f1402973)
INFO:   401125 cep_tb.dut.system.tile_prci_domain.tile_reset_domain.tile.core  C0: 28 [1] pc=[0000000000010004] W[r2=000000000008007f][1] R[r2=0000000000080000] R[r0=0000000000000000] inst=[07f1011b] DASM(07f1011b)
INFO:   401135 cep_tb.dut.system.tile_prci_domain.tile_reset_domain.tile.core  C0: 29 [1] pc=[0000000000010008] W[r2=000000008007f000][1] R[r2=000000000008007f] R[r0=0000000000000000] inst=[00c11113] DASM(00c11113)
INFO:   401145 cep_tb.dut.system.tile_prci_domain.tile_reset_domain.tile.core  C0: 30 [1] pc=[000000000001000c] W[r18=0000000000000008][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[00800913] DASM(00800913)
INFO:   401155 cep_tb.dut.system.tile_prci_domain_1.tile_reset_domain.tile.core  C1: 32 [1] pc=[000000000001001c] W[r0=0000000000000000][0] R[r9=0000000000000000] R[r18=0000000000000001] inst=[03249263] DASM(03249263)
INFO:   401155 cep_tb.dut.system.tile_prci_domain.tile_reset_domain.tile.core  C0: 31 [1] pc=[0000000000010010] W[r0=0000000000000000][1] R[r18=0000000000000008] R[r0=0000000000000000] inst=[30491073] DASM(30491073)
INFO:   401195 cep_tb.dut.system.tile_prci_domain_3.tile_reset_domain.tile.core  C3: 35 [1] pc=[0000000000010000] W[r2=0000000000080000][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[00080137] DASM(00080137)
INFO:   401205 cep_tb.dut.system.tile_prci_domain.tile_reset_domain.tile.core  C0: 36 [1] pc=[0000000000010014] W[r9=0000000000000000][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[00000493] DASM(00000493)
INFO:   401205 cep_tb.dut.system.tile_prci_domain_3.tile_reset_domain.tile.core  C3: 36 [1] pc=[0000000000010004] W[r2=000000000008007f][1] R[r2=0000000000080000] R[r0=0000000000000000] inst=[07f1011b] DASM(07f1011b)
INFO:   401215 cep_tb.dut.system.tile_prci_domain.tile_reset_domain.tile.core  C0: 37 [1] pc=[0000000000010018] W[r18=0000000000000000][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[f1402973] DASM(f1402973)
INFO:   401215 cep_tb.dut.system.tile_prci_domain_3.tile_reset_domain.tile.core  C3: 37 [1] pc=[0000000000010008] W[r2=000000008007f000][1] R[r2=000000000008007f] R[r0=0000000000000000] inst=[00c11113] DASM(00c11113)
INFO:   401225 cep_tb.dut.system.tile_prci_domain_3.tile_reset_domain.tile.core  C3: 38 [1] pc=[000000000001000c] W[r18=0000000000000008][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[00800913] DASM(00800913)
INFO:   401235 cep_tb.dut.system.tile_prci_domain_3.tile_reset_domain.tile.core  C3: 39 [1] pc=[0000000000010010] W[r0=0000000000000000][1] R[r18=0000000000000008] R[r0=0000000000000000] inst=[30491073] DASM(30491073)
INFO:   401245 cep_tb.dut.system.tile_prci_domain.tile_reset_domain.tile.core  C0: 40 [1] pc=[000000000001001c] W[r0=0000000000000000][0] R[r9=0000000000000000] R[r18=0000000000000000] inst=[03249263] DASM(03249263)
INFO:   401255 cep_tb.dut.system.tile_prci_domain.tile_reset_domain.tile.core  C0: 41 [1] pc=[0000000000010020] W[r1=0000000000010024][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[18c000ef] DASM(18c000ef)
INFO:   401285 cep_tb.dut.system.tile_prci_domain_3.tile_reset_domain.tile.core  C3: 44 [1] pc=[0000000000010014] W[r9=0000000000000000][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[00000493] DASM(00000493)
INFO:   401295 cep_tb.dut.system.tile_prci_domain_3.tile_reset_domain.tile.core  C3: 45 [1] pc=[0000000000010018] W[r18=0000000000000003][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[f1402973] DASM(f1402973)
INFO:   401325 cep_tb.dut.system.tile_prci_domain_3.tile_reset_domain.tile.core  C3: 48 [1] pc=[000000000001001c] W[r0=0000000000000000][0] R[r9=0000000000000000] R[r18=0000000000000003] inst=[03249263] DASM(03249263)
INFO:   401385 cep_tb.dut.system.tile_prci_domain_1.tile_reset_domain.tile.core  C1: 55 [1] pc=[0000000000010040] W[r0=0000000000000000][0] R[r0=0000000000000000] R[r0=0000000000000000] inst=[10500073] DASM(10500073)
INFO:   401475 cep_tb.dut.system.tile_prci_domain.tile_reset_domain.tile.core  C0: 63 [1] pc=[00000000000101ac] W[r2=000000008007efb0][1] R[r2=000000008007f000] R[r0=0000000000000000] inst=[fb010113] DASM(fb010113)
INFO:   401485 cep_tb.dut.system.tile_prci_domain.tile_reset_domain.tile.core  C0: 64 [1] pc=[00000000000101b0] W[r15=0000000070100000][1] R[r0=0000000000000000] R[r0=0000000000000000] inst=[701007b7] DASM(701007b7)
INFO:   401555 cep_tb.dut.system.tile_prci_domain_3.tile_reset_domain.tile.core  C3: 71 [1] pc=[0000000000010040] W[r0=0000000000000000][0] R[r0=0000000000000000] R[r0=0000000000000000] inst=[10500073] DASM(10500073)
       0 cep_tb.cpuId[2].driver C_LOG: OK: Program loading is completed
INFO:   401595 cep_tb.cpuId[2].driver.release_tile_reset Releasing Tile #2 reset...

bchetwynd commented 1 year ago

The issue seems to be that once a bare metal program has been "backdoor" loaded into main memory, the mechanism that releases the four rocket tiles from reset would occasionally release only 3 out of 4 at the same time, with the last one being released 1000+ clock cycles later. This resulted in some internal logic locking up within the rocket chip.
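
The release skew described above can be checked directly from a simulation log. The following is a minimal sketch, assuming the `release_tile_reset` INFO line format seen in the logs in this thread. The names `release_times` and `release_skew` and the two sample strings are illustrative only and are not part of the CEP testbench. The timestamps are in whatever units the simulator prints, which are not necessarily clock cycles.

```python
import re

# Matches testbench lines such as:
#   INFO:   400835 cep_tb.cpuId[1].driver.release_tile_reset Releasing Tile #1 reset...
RELEASE_RE = re.compile(
    r"INFO:\s+(\d+)\s+\S+release_tile_reset\s+Releasing Tile #(\d+) reset"
)


def release_times(log_text):
    """Return {tile_id: timestamp} for every tile reset release in the log."""
    times = {}
    for line in log_text.splitlines():
        match = RELEASE_RE.search(line)
        if match:
            times[int(match.group(2))] = int(match.group(1))
    return times


def release_skew(log_text):
    """Return latest minus earliest release timestamp, or None if none found."""
    times = release_times(log_text)
    if not times:
        return None
    return max(times.values()) - min(times.values())


# Release lines copied from the pass and fail examples earlier in this thread.
PASS_LOG = """\
INFO:   400835 cep_tb.cpuId[1].driver.release_tile_reset Releasing Tile #1 reset...
INFO:   400835 cep_tb.cpuId[2].driver.release_tile_reset Releasing Tile #2 reset...
INFO:   400845 cep_tb.cpuId[0].driver.release_tile_reset Releasing Tile #0 reset...
INFO:   400845 cep_tb.cpuId[3].driver.release_tile_reset Releasing Tile #3 reset...
"""

FAIL_LOG = """\
INFO:   400835 cep_tb.cpuId[1].driver.release_tile_reset Releasing Tile #1 reset...
INFO:   400845 cep_tb.cpuId[0].driver.release_tile_reset Releasing Tile #0 reset...
INFO:   400845 cep_tb.cpuId[3].driver.release_tile_reset Releasing Tile #3 reset...
INFO:   401595 cep_tb.cpuId[2].driver.release_tile_reset Releasing Tile #2 reset...
"""

if __name__ == "__main__":
    print("pass skew:", release_skew(PASS_LOG))  # pass skew: 10
    print("fail skew:", release_skew(FAIL_LOG))  # fail skew: 760
```

Run against a full log, a skew much larger than the one in a known-good run points at the late tile. The threshold to use is a judgment call and is not defined anywhere in this thread.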

The testbench has been modified to ensure that once the program has been loaded, ALL the tile resets are released at the same time. This fix has been rolled into the v4.5_development_bchetwynd branch and will be included in a v4.42 release.
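
The actual testbench change is not shown in this thread. As an illustration of the general idea only, the sketch below uses a barrier so that no per-tile driver proceeds to its release step until every driver has seen the program-loaded flag. It assumes one thread per tile driver; `run_drivers`, the `delays` argument, and the bookkeeping dictionary are hypothetical names, and Python's `threading.Barrier` stands in for whatever synchronization the real testbench uses.

```python
import threading
import time


def run_drivers(delays):
    """Simulate one driver thread per tile.

    delays[i] is a hypothetical time (seconds) until driver i notices the
    program-loaded flag. Returns {tile_id: number of drivers that had seen
    the flag by the time this tile was released}.
    """
    barrier = threading.Barrier(len(delays))
    lock = threading.Lock()
    seen_flag = []       # tile ids that have observed the loaded flag
    seen_at_release = {}

    def driver(tile_id, delay):
        time.sleep(delay)            # this driver notices the flag late or early
        with lock:
            seen_flag.append(tile_id)
        barrier.wait()               # block until ALL drivers have seen the flag
        with lock:
            # Stand-in for "release tile reset": record how many drivers
            # were ready at the moment this tile is released.
            seen_at_release[tile_id] = len(seen_flag)

    threads = [
        threading.Thread(target=driver, args=(tile_id, delay))
        for tile_id, delay in enumerate(delays)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return seen_at_release


if __name__ == "__main__":
    # Driver 2 is slow, as in the fail example, yet no tile is released early.
    print(run_drivers([0.0, 0.0, 0.05, 0.0]))
```

With the barrier in place, every tile sees all four drivers ready at release time regardless of which driver was slow. Without it, the three fast drivers would release their tiles first, which is the pattern in the failing log.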