SPI CTRL0 register access unsafe - race condition can cause spurious transmissions

gavanfantom commented 10 months ago

The SPIn_CTRL0 register contains the '''start''' field, which triggers the start of a transaction. The user guide cautions that this bit should not be set until pending transactions have finished. The SPI driver performs a read/modify/write operation on this register while a transaction is in progress, in order to change different fields.

This read/modify/write causes a 1 bit to be written to the '''start''' field if a transaction was in progress when the read was performed. It's not clear what happens if the 1 bit is written back while the transaction is still in progress, but it appears to be harmless. However, if the transaction finishes after the register is read but before it is written back, this will start a new transaction in the hardware.

It just so happens that on the MAX32670, if the instruction cache is enabled, and a 1 byte read/write transaction is performed, it hits this race condition and causes 2 bytes to be sent. Disabling the instruction cache works around the issue, but the underlying problem is still present. It is not safe to do a read/modify/write on this register when a SPI transaction is in progress, unless the '''start''' field is cleared to zero before the value is written back.

While I've found this with the spi_reva1.c driver, a quick read of the reva2 driver suggests that this issue may affect that driver too.

Jake-Carter commented 10 months ago

Thanks for reporting this.

The most likely culprit is MXC_SPI_RevA1_MasterTransHandler.

This logic seems terrible...

The start bit is labeled as R/W1O, so clearing it to 0 as a solution should be safe and have no effect on device operation.

We're working on improved SPI drivers in #662. These v2 drivers have done away with the poorly placed ctrl0 interactions for SS assertion/deassertion, though we should fix this for v1 too.

@sihyung-maxim FYI v2 might still be susceptible to this here and here, but since we clear the EN bit beforehand it seems a lot less likely.

Jake-Carter commented 10 months ago

@sihyung-maxim I'm linking the PR as a potential candidate to close this. We should test the reported condition with v2:

ICC enabled
TX length = 1 byte
RX length = 1 byte

Bug: TX byte is transmitted twice

gavanfantom commented 10 months ago

The most likely culprit is MXC_SPI_RevA1_MasterTransHandler.

This logic seems terrible...

That is exactly the problem, yes.

The start bit is labeled as R/W1O, so clearing it to 0 as a solution should be safe and have no effect on device operation.

That's right. I tested exactly that in those two places, and the fix worked:

    spi->ctrl0 = (spi->ctrl0 & ~MXC_F_SPI_REVA_CTRL0_START) | MXC_F_SPI_REVA_CTRL0_SS_CTRL;

    spi->ctrl0 &= ~(MXC_F_SPI_REVA_CTRL0_START | MXC_F_SPI_REVA_CTRL0_SS_CTRL);

sihyung-maxim commented 10 months ago

@sihyung-maxim I'm linking the PR as a potential candidate to close this. We should test the reported condition with v2:

ICC enabled

TX length = 1 byte

RX length = 1 byte

Bug: TX byte is transmitted twice

This test case passes for v2. TX byte is only transmitted once. However, I will add a check to ensure a transaction is not in progress before setting the start bit.

As a side note: The v1 (non-DMA) implementation works by restarting the transaction every 32 messages. The MXC_SPI_RevA1_MasterTransHandler(...) function can only handle up to 32 messages per function call (the 32 messages limitation being the size of the SPI FIFO). This results in MXC_SPI_RevA1_MasterTransHandler(...) getting called several times for transaction lengths greater than 32 messages which means the SPI_CTRL0.ss_ctrl and SPI_CTRL0.start bits are repeatedly set or cleared every 32 messages.

The v2 does not restart the SPI block over the course of a transaction, so the SPI_CTRL0.ss_ctrl and SPI_CTRL0.start fields are only set/cleared once at the beginning of each transaction.

khpeterson commented 10 months ago

Are there other read/modify/writes of spi->ctrl0 that are vulnerable to this problem? I'm thinking of a case where a master DMA transaction follows a master transaction, but maybe clearing START as described above is enough.

khpeterson commented 10 months ago

FWIW, this fix resolves a case of "stuck DMA" that I have been fighting for a while. Occasionally, my app ends up in a state where SPI thinks its done but DMA is waiting to complete:

(gdb) p/x *MXC_SPI1
$35 = {{data32 = 0x0, data16 = {0x0, 0x0}, data8 = {0x0, 0x0, 0x0, 0x0}}, ctrl0 = 0x10003, ctrl1 = 0x1820182, ctrl2 = 0x800, ss_time = 0x10101, clk_cfg = 0x202, rsv_0x18 = 0x0, dma = 0x80408041, int_fl = 0x8807, int_en = 0x0, wake_fl = 0x0, wake_en = 0x0, stat = 0x0}
(gdb) p/x *MXC_DMA1
$36 = {cn = 0x3, intr = 0x0, rsv_0x8_0xff = {0x0 <repeats 62 times>}, ch = {{cfg = 0x80040210, st = 0x0, src = 0x20003886, dst = 0x14, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x80400011, st = 0x1, src = 0x20003724, dst = 0x20003a04, cnt = 0x4, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x0, st = 0x0, src = 0x0, dst = 0x0, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x0, st = 0x0, src = 0x0, dst = 0x0, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x0, st = 0x0, src = 0x0, dst = 0x0, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x0, st = 0x0, src = 0x0, dst = 0x0, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x0, st = 0x0, src = 0x0, dst = 0x0, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x0, st = 0x0, src = 0x0, dst = 0x0, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}}}

Without the fix this my app gets into this state around 4 times in 8 hours of continuous operation; with it I have been able to run for nearly 24 hours with a single occurrence.

Thank you @gavanfantom!

analogdevicesinc / msdk

SPI CTRL0 register access unsafe - race condition can cause spurious transmissions #713