chipsalliance / caliptra-sw

Caliptra software (ROM, FMC, runtime firmware), and libraries/tools needed to build and test
Apache License 2.0
94 stars 44 forks source link

MEPC goes to zero when waking from clock gating on FPGA #1220

Open jlmahowa-amd opened 10 months ago

jlmahowa-amd commented 10 months ago

With the firmware changes WFI in #1193 after the CPU wakes back up MEPC is zero instead of the interrupted instruction. image image (1)

interrupt_valid causes mepc to sample npc_r. The most relevant seeming driver for npc_r is tlu_flush_path_r_d1 (delayed one clock from the version captured above) when tlu_flush_lower_r_d1 is high.

At this point tlu_flush_path_r_d1 has zero, which was loaded from the interrupt_valid_r assertion at clock 513. This wipes out the value of 0x20003161, which had previously captured the PC value.

jlmahowa-amd commented 10 months ago

tlu_flush_path_r_d1 does not reflect the value of 0x20003161 one clock after dec_tlu_flush_path_r. tlu_flush_lower_r is the "enable" to ungate the clock to a DFF. Just drawing out the waveform I suspect that the DFF would latch the data a cycle late with the current implementation. There is a good reason RV_FPGA_OPTIMIZE cuts out that logic but I don't know if re-enabling RV_FPGA_OPTIMIZE will prevent us from hitting the conditions we wanted to test. image

jhand2 commented 10 months ago

@kgugala would you be able to help debug this isse we're seeing with on VeeR (on FPGA) when existing sleep states?

jlmahowa-amd commented 10 months ago

The CG module of the FPGA only captures en_ff on the negedge of the clock. In the case where the input clock to the cg module is itself gated and the enable goes high first, I think that explains why using the FPGA implementation doesn't capture tlu_flush_path_r like I expected. Below is not a capture, but a diagram I created to plot out the transitions. I'll check if the TEC_RV_ICG implementation from beh_lib.sv is able to work in Vivado. image

jlmahowa-amd commented 10 months ago

Changing the FPGA back to TEC_RV_ICG massively increases the FPGA utilization to the point where it has no chance of fitting on the ZCU104.

jlmahowa-amd commented 10 months ago

By changing flush_lower_ff to use free_l2_clk instead of the gated clock the scenario from Arthur passes.

There are a number of other rvdff's instantiated in el2_dec_tlu_ctl that also use free_l2clk instead of clk. In the case of flush_lower_ff using free_l2clk removes the dependency of en_ff on the gated clock and allows Q to toggle at the expected cycles.

jlmahowa-amd commented 10 months ago

Captured a waveform with the W/A: image tlu_flush_path_r_d1 (not shown) captures the value of npc_r_d1 (0x20003161) instead of zero, which then is used to set npc_r. npc_r is what gets captured in mepc_ns when interrupt_valid goes high.

jlmahowa-amd commented 10 months ago

Similar behavior noted in a VEER issue: https://github.com/chipsalliance/Cores-VeeR-EL2/issues/88

jhand2 commented 10 months ago

Once we address this, should we enable clock gating in the CI FPGA bitstreams to make sure all the tests run against this fix?

Nitsirks commented 10 months ago

It appears that the VeeR issue with verilator is related to how the TEC_RV_ICG file attempts to convert the clock gate latch to a negedge flip flop for verilator runs. This causes the delayed sampling.

Are you running on FPGA without RV_FPGA_OPTIMIZE? What's the reasoning for turning this off? If so, I suspect that your issue is similar. Probably something to do with the clock gate latch not playing nicely on FPGA?

jlmahowa-amd commented 10 months ago

@jhand2 I haven't wanted to enable it by default, but we could enable it in an additional FPGA run. I'll talk to Kor about it, I worry about adding too many FPGA builds in CI. @Nitsirks You might be missing some of the context. I am disabling RV_FPGA_OPTIMIZE when trying to allow clock gating since since that flag disables it, but I am normally using that flag. I have to replace the latch with a FF because Vivado cannot synthesize reasonable logic (either fails synthesis or blows up the resource utilization). The FF logic, not being logically identical to the latch is the root cause of the issue.