antmicro / alkali-csd-fw

Apache License 2.0
2 stars 2 forks source link

nvme reset spec non-compliant #17

Open twilfredo opened 1 year ago

twilfredo commented 1 year ago

Hey guys,

Looking at the NVME 1.4 specification, it looks like when a reset is issued, the controller does not do all that it should for compliance. The following is from the spec.

Enable (EN): When set to ‘1’, then the controller shall process commands
based on Submission Queue Tail doorbell writes. When cleared to ‘0’, then the
controller shall not process commands nor post completion queue entries to
Completion Queues. When this bit transitions from ‘1’ to ‘0’, the controller is
reset (i.e., a Controller Reset). That reset deletes all I/O Submission Queues
and I/O Completion Queues, resets the Admin Submission Queue and
Completion Queue, and brings the hardware to an idle state. That reset does
not affect PCI Express registers (including MMIO MSI-X registers), nor the
Admin Queue registers (AQA, ASQ, or ACQ). All other controller registers
defined in this section and internal controller state (e.g., Feature values defined
in section 5.21.1 that are not persistent across power states) are reset to their
default values. The controller shall ensure that there is no data loss for
commands that have had corresponding completion queue entries posted to an
I/O Completion Queue prior to that Controller Reset. Refer to section 7.3.
When this bit is cleared to ‘0’, the CSTS.RDY bit is cleared to ‘0’ by the controller
once the controller is ready to be re-enabled. When this bit is set to ‘1’, the
controller sets CSTS.RDY to ‘1’ when it is ready to process commands.
CSTS.RDY may be set to ‘1’ before namespace(s) are ready to be accessed.
Setting this bit from a ‘0’ to a ‘1’ when CSTS.RDY is a ‘1’ or clearing this bit from
a '1' to a '0' when CSTS.RDY is cleared to '0' has undefined results. The Admin
Queue registers (AQA, ASQ, and ACQ) shall only be modified when EN is
cleared to ‘0’

So looking at the controller side https://github.com/antmicro/alkali-csd-fw/blob/bb0e116874087753572988c4c753c7595fc75a1d/rpu-app/src/tc.c#L61 we aren't doing any of these.

I've noticed this as I was testing the device through reboots, it never works if the host machine has rebooted (device will typically become unresponsive). Only works when gone through a clean power cycle. Any ideas on how best to address this?

Also this looks like it should be a bitwise and https://github.com/antmicro/alkali-csd-fw/blob/bb0e116874087753572988c4c753c7595fc75a1d/rpu-app/src/tc.c#L57

rw1nkler commented 1 year ago

Yes, it seems that we do not handle the reset procedure correctly. Although there exist commands responsible for deleting submission and completion queues on corresponding admin commands, we can't directly use them when reset is detected.

This problem should be fixed by separating the code responsible for queue creation/deletion into dedicated functions. Which in turn should be called both on the reset and on the admin commands.

Here are the functions responsible for deleting and creating submission/completion queues:

Regarding the cc && NVME_TC_REG_CC_EN, yes, it should use bitwise and (&) instead of logical and (&&)