chipsalliance / caliptra-sw

Caliptra software (ROM, FMC, runtime firmware), and libraries/tools needed to build and test
Apache License 2.0

Mailbox driver needs to handle error FSM states correctly. #718

Open korran opened 1 year ago

korran commented 1 year ago

While fixing some driver test cases to pass with the latest RTL, I noticed that there are cases where the SoC can force the mailbox internal mbox_fsm_ps to the MBOX_ERROR state by writing to the "wrong" registers. The driver does not handle this, and the uC firmware can get stuck after SoC bad behavior (noticed during a uC->SoC transaction, but there are likely other scenarios).

Also, we need better hardware documentation in this area.

calebofearth commented 1 year ago

@korran The mailbox is only intended to enter an ERROR state if both of these are true:

  1. SoC has acquired the lock using a valid PAUSER
  2. SoC performs an access outside of the defined order (e.g. write to DLEN prior to write to CMD).

If the uC holds the lock because it is running a uC->SoC transaction, the mailbox should never be able to enter the ERROR state. If you have a test case that shows the mailbox entering the ERROR state while the uC holds the lock, please share it with the HW team and we'll debug.
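The two conditions above can be sketched as a small software model. This is only an illustration, not the RTL: the `SocWrite` enum, the `enters_error` function, and the assumed legal order (CMD, then DLEN, then zero or more DATAIN writes, then EXECUTE) are hypothetical names and simplifications introduced here. The model stops checking once EXECUTE is written and does not cover the uC->SoC exception described in the EDIT.

```rust
/// Registers an SoC agent writes while sending a command (simplified, hypothetical names).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SocWrite {
    Cmd,
    Dlen,
    DataIn,
    Execute,
}

/// What the model expects the next write to be.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Expect {
    Cmd,
    Dlen,
    DataOrExecute,
}

/// Returns true if the modeled mailbox would enter ERROR.
///
/// `soc_holds_lock_with_valid_pauser` is condition 1 from the comment above;
/// an out-of-order entry in `writes` is condition 2. Both must hold.
pub fn enters_error(soc_holds_lock_with_valid_pauser: bool, writes: &[SocWrite]) -> bool {
    if !soc_holds_lock_with_valid_pauser {
        // Condition 1 not met: this model never reports ERROR.
        return false;
    }
    let mut expect = Expect::Cmd;
    for w in writes {
        expect = match (expect, *w) {
            (Expect::Cmd, SocWrite::Cmd) => Expect::Dlen,
            (Expect::Dlen, SocWrite::Dlen) => Expect::DataOrExecute,
            (Expect::DataOrExecute, SocWrite::DataIn) => Expect::DataOrExecute,
            // Sequence completed in order; the model stops checking here.
            (Expect::DataOrExecute, SocWrite::Execute) => return false,
            // Anything else is an access outside the assumed order.
            _ => return true,
        };
    }
    false
}
```

For example, `enters_error(true, &[SocWrite::Dlen, SocWrite::Cmd])` reports a violation because DLEN is written before CMD, which is the example given in condition 2.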

EDIT: There is one exception to the above for a uC->SoC transaction, which is probably the case you hit. When the execute bit is set, the FSM transitions to MBOX_EXECUTE_SOC. At that point the SoC is expected to read dataout, then write to the Status register to indicate that command processing is complete. In this state, any write to a register other than the Status register causes a transition to the ERROR state. There is no specific APB agent (i.e. PAUSER value) associated with the transfer, so any valid APB agent can cause this. A valid APB agent is one whose PAUSER value matches one of the values programmed into the register CPTRA_MBOX_VALID_PAUSER. APB agents whose PAUSER value is not programmed into that register cannot trigger a protocol violation.
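The rule for MBOX_EXECUTE_SOC described above can be written down as a tiny decision function. This is a sketch under stated assumptions, not the hardware behavior verbatim: `Reg`, `Outcome`, and `write_in_execute_soc` are hypothetical names, and treating every write from a non-valid PAUSER as having no effect on the FSM is an assumption (the comment only says such agents cannot trigger a protocol violation).

```rust
/// Mailbox registers an APB agent might write (hypothetical, simplified set).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Reg {
    Cmd,
    Dlen,
    DataIn,
    Execute,
    Status,
}

/// Effect of a single write while the FSM is in MBOX_EXECUTE_SOC (model only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Outcome {
    /// Writer's PAUSER is not in the valid list; assumed to have no FSM effect.
    Ignored,
    /// Valid agent wrote Status: the expected way to finish the command.
    StatusWritten,
    /// Valid agent wrote any other register: FSM goes to ERROR.
    ProtocolError,
}

/// Models one APB write arriving while the mailbox is in MBOX_EXECUTE_SOC.
/// `valid_pausers` stands in for the values programmed into CPTRA_MBOX_VALID_PAUSER.
pub fn write_in_execute_soc(valid_pausers: &[u32], pauser: u32, reg: Reg) -> Outcome {
    if !valid_pausers.contains(&pauser) {
        return Outcome::Ignored;
    }
    match reg {
        Reg::Status => Outcome::StatusWritten,
        _ => Outcome::ProtocolError,
    }
}
```

In this model, a valid agent writing the execute register during a uC->SoC transaction, as in `write_in_execute_soc(&[0x1], 0x1, Reg::Execute)`, yields `Outcome::ProtocolError`.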

korran commented 1 year ago

Yes, that sounds like the scenario I'm hitting. In a uC->SoC transaction, the SoC is [incorrectly] clearing the execute bit while the FSM is in the MBOX_EXECUTE_SOC state.

What should the uC be doing when it sees the FSM go into the error state as part of a uC->SoC transaction? I assume it should eventually write to the unlock register to get things back to normal?

calebofearth commented 1 year ago

I broke out a new section in the Integration Specification to discuss Mailbox protocol errors and handling. There are two ways to recover from the error - soft reset (by the SOC), or uC writing mbox_unlock to reset the mailbox FSM as you mentioned.

While in the ERROR state, we haven't prescribed any tasks for the uC to undertake. This state essentially just exists to enforce the protocol rules and prevent SoC agents from interacting with the mailbox erroneously. I'm not sure what security-related tasks FW may want to perform here; in the future there may be some interest in logging protocol violations to create notifications about misbehaving agents.
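A minimal sketch of how a driver might act on this, assuming the uC-side recovery path (writing mbox_unlock) and a simple violation counter for the logging idea. Everything here is hypothetical: the `MailboxRegs` trait, `recover_if_error`, and `FakeMailbox` are illustrative names, not the caliptra-sw driver API, and the fake returning to `Idle` after unlock is an assumption about what "reset the mailbox FSM" means.

```rust
/// Subset of FSM states relevant to this sketch (hypothetical enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MboxFsm {
    Idle,
    ExecuteSoc,
    Error,
}

/// Hypothetical register access layer; a real driver would wrap MMIO here.
pub trait MailboxRegs {
    /// Reads the current FSM state (mbox_fsm_ps).
    fn fsm_state(&self) -> MboxFsm;
    /// Writes mbox_unlock to reset the mailbox FSM.
    fn write_unlock(&mut self);
}

/// If the mailbox is in ERROR, counts the violation and writes mbox_unlock.
/// Returns true when a recovery was performed.
pub fn recover_if_error<M: MailboxRegs>(mbox: &mut M, violations: &mut u32) -> bool {
    if mbox.fsm_state() != MboxFsm::Error {
        return false;
    }
    // Record the protocol violation so misbehaving agents can be reported later.
    *violations = violations.saturating_add(1);
    mbox.write_unlock();
    true
}

/// In-memory stand-in for the hardware, for illustration only.
pub struct FakeMailbox {
    pub state: MboxFsm,
}

impl MailboxRegs for FakeMailbox {
    fn fsm_state(&self) -> MboxFsm {
        self.state
    }
    fn write_unlock(&mut self) {
        // Assumption: unlock returns the FSM to idle.
        self.state = MboxFsm::Idle;
    }
}
```

A driver waiting on a uC->SoC transaction could call something like `recover_if_error` in its polling loop so it does not spin forever once the FSM has left MBOX_EXECUTE_SOC for the ERROR state.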

rusty1968 commented 1 year ago

@wmaroneAMD

andreslagarcavilla commented 1 year ago

Is this fixed?

mhatrevi commented 11 months ago

Moving this task to 2.0