esp-rs / esp-hal

no_std Hardware Abstraction Layers for ESP32 microcontrollers
https://docs.esp-rs.org/esp-hal/
Apache License 2.0
721 stars 195 forks

esp32c3: trouble initiating DMA-driven SPI transfers #489

Closed sethp closed 1 month ago

sethp commented 1 year ago

Apologies for the vague title and the long issue! I'm grateful for all the help you've already given me in getting to this point with your excellent work on the HAL. I very much appreciate your efforts, and my hope for this long letter is that it's received in that spirit of appreciation.

We've recently discovered something strange: our SPI DMA transfers would produce some meaningful data for a few bytes, and then more or less stop long before the configured datalen. We'd get maybe a half dozen to a dozen somewhat distinct values at the beginning before getting "stuck" transmitting the same byte over and over again. Initially, we suspected that something in our code was incompatible with #444, since we happened to pull those changes in at about the same time, but so far that seems to have been a coincidence.

Indeed, this has been a particularly interesting one to track down, because it seems unrelated to (logical) code changes at all. We only need to transmit data, so we ran into this problem using dma_write[^1]. However, changing to using dma_transfer with an empty receive buffer seems to avoid the problem entirely. From what I can tell, dma_transfer has an identical logical flow to dma_write, except for some additions that more or less no-op right out with an empty receiver (foreshadowing!). In fact, no amount of perturbing the source (e.g. git bisect, both on our project's code and the HAL's) revealed a difference in the behavior tied to any particular logic change. No matter the configuration of the peripheral, it seemed, dma_write would work inconsistently but dma_transfer would always behave as we expected. A big clue came when I discovered that we had a "load bearing" #[inline(always)]: removing that annotation would produce the expected DMA transfer with dma_write, and including it would "break" the output.

I believe the main difference there was that forcing inlining acted as a de-optimization. The function in question, transmit_chunk, is called from two places, so inlining it raised the "cost" of space-wise optimizations and thus de-inlined much of its body, in particular dma_write and its transitive callees. This, along with the behavior difference when doing more "non-functional" work with dma_transfer, is what first suggested a timing problem to me.

Some further context that's probably helpful at this point: we've got a hard-real-time bound on getting our DMA transfer out the door (we're emitting pixel data for a video frame), so we've very aggressively optimized the code path through the HAL that initiates transfers for consistency & speed. Mostly that's been ensuring everything on the hot path is memory-resident so we aren't coupled through the cache: touching the flash, even just once or twice, seems to introduce enough variability that we blow our targets.

We've achieved that mainly by a combination of #[inline(always)] and #[link_section = ".rwtext"][^2]. The latter is a spelling of #[ram] that still allows LLVM to decide to inline the function (which will be important a little later). We've also turned on LTO and are always running in release mode (though, with debug symbols and -C force-frame-pointers). The upshot is that we've gotten the whole "start dma write" down to about 2000 cycles[^3] with almost no variability: either within a given execution or across whole reset/power cycles. It's even remarkably consistent across reflashes, when code "away" from the hot path is changed.

Ok, so that's the setup, now for the punchline: the most reliable way I've found to produce the desired operation of the DMA-driven SPI peripheral is to add a single nop after tx.prepare_transfer(..):

--- a/esp-hal-common/src/spi.rs
+++ b/esp-hal-common/src/spi.rs
@@ -1904,6 +1906,8 @@ where
         reset_dma_before_load_dma_dscr(reg_block);
         tx.prepare_transfer(self.dma_peripheral(), false, ptr, len)?;

+        unsafe { core::arch::asm!("nop") }
+
         self.clear_dma_interrupts();
         reset_dma_before_usr_cmd(reg_block);

Suffice to say it took a lot of reading disassembly and failed attempts to figure that out, but so far it does seem resilient in the face of code changes in a way that none of my other attempts have been. The problem seems to only present itself when LLVM maximally inlines all the various traits and register access blocks &c, which only happens some of the time. The main variable, I believe, is those functions we said to "either inline or place in RAM, your choice." The outcome is very stable for a given pile of code: it seems to happen 100% of the time or 0% of the time for a given binary. However, I find it very hard to predict what changes to that code will cause LLVM to optimize differently, so none of my other attempts have been robust in terms of getting LLVM to optimize enough that we meet our performance targets, but not so much as to break the output.

The positioning of the nop is meant to act as a "buffer" between the DMA and SPI peripherals during CPU programming. When it does optimize "fully," LLVM produces disassembly that looks like this:

40381784:   lui     a1,0x6003f # set a1 to 0x6003_f000 , the gdma peripheral base address
# ...                          # ... most setup elided (it all seems to match the steps in the TRM) ...
403817c6:   lw      a0,224(a1) # read a0 from gdma_base + 224 (0xe0), which is GDMA_OUT_LINK_CH0_REG
403817ca:   lui     a2,0x200   # set bit 21 in a2
403817ce:   or      a0,a0,a2   # set bit 21 in a0
403817d0:   sw      a0,224(a1) # write a0 back to GDMA_OUT_LINK_CH0_REG, setting GDMA_OUTLINK_START_CH0
403817d4:   lw      a0,0(a1)   # immediately read GDMA_INT_RAW_CH0_REG
403817d6:   andi    a0,a0,64   # check bit 6 (GDMA_OUT_DSCR_ERR_CH0_INT_RAW)
403817da:   bnez    a0,...     # if its set, bail
403817de:   nop                
403817e0:   lui     a0,0x61    # this pair...
403817e4:   addi    a0,a0,3    # ... sets a0 to 0x0006_1003
403817e6:   lui     s1,0x60024 # set s1 to 0x6002_4000 , the SPI2 peripheral base address
403817ea:   sw      a0,56(s1)  # writes a0 to spi2_base + 56 (0x38), SPI_DMA_INT_CLR_REG
                               # setting bits 18, 17, 12, 1, and 0, i.e. clears:
                               #   18: SPI_MST_TX_AFIFO_REMPTY_ERR_INT_CLR
                               #   17: SPI_MST_RX_AFIFO_WFULL_ERR_INT_CLR
                               #   12: SPI_TRANS_DONE_INT_CLR
                               #    1: SPI_DMA_OUTFIFO_EMPTY_ERR_INT_CLR
                               #    0: SPI_DMA_INFIFO_FULL_ERR_INT_CLR
403817ec:   lw      a0,48(s1)  # read a0 from spi2_base + 48 (0x30), SPI_DMA_CONF_REG
403817ee:   lui     a1,0xe0000 # set bits 31, 30, 29 in a1
403817f2:   or      a0,a0,a1   # set bits 31, 30, 29 in a0
403817f4:   sw      a0,48(s1)  # write a0 back to spi2_base + 48 (0x30), SPI_DMA_CONF_REG
                               # i.e. triggers
                               #    31: SPI_DMA_AFIFO_RST
                               #    30: SPI_BUF_AFIFO_RST
                               #    29: SPI_RX_AFIFO_RST
403817f6:   lw      a0,0(s1)   # read a0 from spi_base + 0 ,  SPI_CMD_REG
403817f8:   lui     a1,0x1000  #  ..
403817fc:   or      a0,a0,a1   #  ..
403817fe:   sw      a0,0(s1)   # set bit 24 in SPI_CMD_REG ( SPI_USR )

The nop slots right in between the writes to addresses at the 0x6003_f000 (DMA) base and the subsequent interactions with the hardware registers based at 0x6002_4000 (SPI2). I started with a much larger number of nops (128), which also functioned as I hoped, and was more than a little surprised to find that even a single nop makes the difference.

But, now I'm not sure where to go next. Probably adding a single nop isn't the right answer, but it does show how sensitive to change the issue is: it's very easy to "fix" it by accident. From what I can tell, the machine code faithfully implements the procedures described in the TRM for the esp32c3, but maybe there's a missing step? Some bit that needs to be cleared? There's no mention (nor mechanism) that I can find for waiting on the DMA's internal programming like there is with the update bit for SPI; is that not a concern in this case? It does seem strange that we immediately check the DMA's interrupt register for descriptor errors (is that guaranteed to be set within a single cycle of turning on the outlink?), but that can't be our problem: the descriptors are valid, they're just not being used.

I've also got little to no intuition about how broad this problem is: either across projects or esp32* platforms. It's plausible that cranking the optimization settings for one of the bog-standard examples would show the same behavior (I don't think we removed anything that would prohibit LLVM from inlining, just added mandates)?

Thanks again!

[^1]: We haven't yet switched over to the new SpiDataMode + write functionality from #444, but we're looking forward to it! We did run an experiment early on that suggested it suffers from the same problem, though I haven't gone back and checked.

[^2]: As an aside, we'd be glad to take any suggestions y'all have for how to make this process a little more robust/less intrusive. So far the most reliable way I've found to do it is manually tracing through the disassembly for memory addresses that start with 0x42... for functions (or 0x3c... for data) and then manually annotating the things I discover. This works, but isn't exactly error-free or particularly reproducible (though I am eternally grateful to whichever wonderful person taught riscv objdump to recognize lui/addi pairs and put the resultant address in a comment). What's more, I have no idea how to implement that without forking whichever dependency I find myself in to add the annotations, which naturally comes with its own set of costs too.

[^3]: 1783 as of the time of this writing, as measured by a static cycle counter and mpccr (which adds ~16 cycles of measurement overhead).

bjoernQ commented 1 year ago

Wow that is really weird! The fact that a single "nop" makes the difference between working and non-working is quite scary.

I also had a brief look at esp-idf to see if they do some magical things in this direction but I haven't seen anything (which doesn't mean it is not there)

The only thing I could imagine is that writing to the DMA registers somehow affects the SPI interrupt flags and without the "nop" the clearing of the interrupts collide with that change ... but also, that sounds strange

Regarding checking the descriptor-error ... probably it should be checked "during" the transfer since I guess the check is done while reading the descriptors during the transfer

Regarding disassembling code: You probably already know it but anyways: Passing the `-S` option to objdump annotates the disassembly with the original source code that produced it

sethp commented 1 year ago

Regarding disassembling code: You probably already know it but anyways: Passing the `-S` option to objdump annotates the disassembly with the original source code that produced it

:open_mouth: I did not know that, that's so useful! Thank you for sharing!

I also had a brief look at esp-idf to see if they do some magical things in this direction but I haven't seen anything (which doesn't mean it is not there)

ohhh, that's a good thought: I'll see if I can spy anything over that way too.

The only thing I could imagine is that writing to the DMA registers somehow affects the SPI interrupt flags and without the "nop" the clearing of the interrupts collide with that change ... but also, that sounds strange

Yeah, that's kind of what I was thinking when I decided to slot in the nop there? Some sort of asynchronous communication between the two peripherals that needed to happen? But I was more than a little surprised that a single nop covers it: I guess it means that the write to the SPI happens ~at the 8th instruction instead of the 7th following the write to the DMA, and 8 is a multiple of 4? (We run the SPI clock at 40 MHz, so that factor of 4 at least feels possibly relevant?). Can you think of a way to track changes that the DMA would be making to the SPI registers "behind the scenes?" Would the debug assist dealio help us with that?

Also, something I forgot to mention: we don't have an easy way to test the DMA rx behaviors, but I'm really curious whether they've got the same problem or not, since in dma_transfer they're the ones set up immediately before the SPI.

Regarding checking the descriptor-error ... probably it should be checked "during" the transfer since I guess the check is done while reading the descriptors during the transfer

Oh, good point: I went ahead and split that out into https://github.com/esp-rs/esp-hal/issues/491 for us to track separately.

bjoernQ commented 1 year ago

Would the debug assist dealio help us with that?

I think unfortunately debug-assist would only help with the CPU or actual DMA transfers accessing RAM. I don't yet have a good idea how to debug this

sethp commented 1 year ago

Well, also lacking better ideas, I've applied brute force: I reviewed spi_new_trans over in the esp-idf by manually flattening all the definitions until I had one big block o' C, then compared it to the esp-rs hal flow for a half-duplex DMA transmit starting from SpiDma::write_dma, going operation by operation. I am fun at parties, I promise. The full result of that's over here if you're curious, but the upshot is that there's a timing difference between enabling the channel0's outlink and issuing cmd.usr = 1 that seems to be crucial:

esp-rs vs esp-idf

| esp-rs (links to callsites) | esp-idf (links to callsites) | notes |
| --- | --- | --- |
| | `hw->dma_int_raw.trans_done = 0` | this is the very first interaction with the SPI peripheral esp-idf has (if SPI device init really is lazily performed once, as the comment claims). I'm not sure if this is equivalent to setting the corresponding bit in `dma_int_clr`, but I guess so? |
| ... | ... | ... |
| `::start_out` | `gdma_ll_tx_start` | sets `GDMA_OUTLINK_START_CH0`, esp-idf by way of a macro |
| check `::has_out_descriptor_error` | | as discussed |
| | `spi_ll_enable_mosi(.., 1)` `spi_ll_enable_miso(.., 0)` | esp-rs does this much earlier as part of `init_half_duplex` |
| `spi::Instance::clear_dma_interrupts` | | clears a bunch of interrupts, including `trans_done` and `outfifo_empty_err`; esp-idf only clears `outfifo_empty_err` & `trans_done` (above) |
| `spi::reset_dma_before_usr_cmd` | | toggles the `dma_conf.dma_afifo_rst` bit |
| | `spi_hal_user_start` [...] | the equivalent of `spi::Instance::update` occurs here |
| `.. w.usr().set_bit() ..` | [...] `spi_hal_user_start` | sets `SPI_USR` (initiating the transaction) |

Definitions

Discussion

Initially, I thought that clearing the transaction done interrupt so much earlier than we do might be having an effect, but I tested both that and the possibility of other timing dependencies by (re)moving everything from between enabling the outlink and setting SPI_USR and adding a bunch of delays "buffering" that pair of operations, which seemed to have no effect.

So, next, I tried playing with the amount of delay by varying the number of nops; I also inserted a core::arch::asm!("csrwi 0x7E2, 0x0"); just before R::start_out(); to reset mpccr[^1] and core::arch::asm!("csrwi 0x7E1, 0x0"); just after setting SPI_USR to pause counting (so I could more easily aggregate it later). What I found was very interesting: when the cycle counter reads 66 elapsed cycles, the DMA transaction works as expected; when it reads 64[^2], the DMA transaction behaves poorly. I also (loosely) measured writes to peripheral registers: they're fairly expensive, maybe a dozen clock cycles each, so it takes a lot more very nicely pipelined nops to replace a single write than I expected.

What I came up with was this:

The four variants I compared (super-optimized nops, espressif style, update-only, baseline) share a common prologue:

```rust
unsafe {
    // reset cycle count
    core::arch::asm!("csrwi 0x7E2, 0x0");
}
R::start_out();
```

**super-optimized nops**

```rust
unsafe {
    use core::arch::asm;
    asm!("nop");
    asm!("nop");
    asm!("nop");
    asm!("nop");
    for _ in 0..4 {
        asm!("nop");
        asm!("nop");
        asm!("nop");
        asm!("nop");
        asm!("nop");
        asm!("nop");
        asm!("nop");
        asm!("nop");
    }
}
```

**espressif style**

```rust
reg_block
    .user
    .modify(|_, w| w.usr_miso().clear_bit().usr_mosi().set_bit());
self.update();
```

**update-only**

```rust
self.update();
```

**baseline**

```rust
// set SPI_USR immediately
```

and a common epilogue:

```rust
reg_block.cmd.modify(|_, w| w.usr().set_bit());
unsafe {
    // pause counting cycles
    core::arch::asm!("csrwi 0x7E1, 0x0");
}
```

Results:

| variant | `mpccr` | outcome |
| --- | --- | --- |
| super-optimized nops | 66 cycles | ✔️ |
| espressif style | 76 cycles | ✔️ |
| update-only | 56 cycles | ❌ |
| baseline | 30 cycles | ❌ |

(unsatisfying) conclusion

So, uh, what do? It seems to me that the esp-idf works almost entirely by accident,[^3] due to how a loosely related chip (the ESP32) happens to be configured. The nop-driven approach has the advantage of explicitly modeling the timing dependency, I suppose; would y'all be happy with a PR that shuffles the SPI order of operations and adds 36 cycles of nops (specifically on the esp32c3)? I don't know if I would, in your position.

I was pawing through the different hardware-set interrupt bits to see if there's a way we could identify and wait for whatever operation takes ~225ns; the only thing I've found so far is mst_tx_afifo_rempty_err_int_raw, which doesn't appear to be set in this case (though I'm not sure if I need to enable it on the DMA side, maybe?).

The only other bad idea I had was to try and figure out what a DMA segmented transfer looks like; presumably, if we bake the SPI configuration into a CONF buffer before the data buffers, the hardware will handle it appropriately? I'm very suspicious we'll still have the same problem, though: from what I can see in the TRM, initiating a segmented transfer looks almost identical to a single transfer relative to the OUTLINK & SPI_USR bits.

The only other thing a close reading of the TRM revealed was that in between turning on the OUTLINK and setting SPI_USR they've documented a step to reset the SPI's afifos (though esp-idf does it a fair bit earlier, and it didn't make a difference when I tried it), and a step to "wait for SPI [secondary] to get ready for transfer," which I don't believe is specifically meant to wait for the SPI machinery within the chip itself (and it lacks sufficient clarity to do so, anyway).

I haven't tried to minimize the reproduction case yet, because it's very easy to tell when things go wrong in our (complex) setup, but I suspect it's possible to demonstrate with the loopback, if the SPI is set up in receive-mode ahead of time before issuing a dma_write, so I suppose that at least would be a useful next step to try.

If you do get a hankerin' to try this out yourself, feel free to poke at our fork, we're working off the vgaterm-perf branch at the moment. I'm fairly sure the relevant changes that expose the problem are ~this pair of commits in tandem with LTO and opt-level=2; that's what makes it possible to reliably get <64 cycles of delay between turning on the outlink and starting the transaction.

[^1]: Having already configured it earlier to count cycles with asm!("csrwi 0x7E0, 0x1", "csrwi 0x7E1, 0x1");

[^2]: I couldn't get it to 65; something something clock domains?

[^3]: Presuming it even does; I didn't try it

bjoernQ commented 1 year ago

Thanks for the thoroughly done analysis. Probably wasn't much fun

I also had a brief look at the TRM (again) and wasn't able to see anything we can reliably wait for, but that would obviously be a saner solution than the NOPs, especially since we don't really know how things look on other chips.

Maybe @MabezDev and @jessebraham should also have a look at this (if they didn't already) - any ideas from your side?

sethp commented 1 year ago

I appreciate you, thanks for taking the time to read it! I do have fun with this kind of stuff, but it's definitely more fun when the conclusion is a nice neat "here's a good idea" rather than "here's a bunch of nops" 🙃

Part of what's got me stumped is that my instinct is to go down a layer at this point; in other contexts, I've found that e.g. reading a syscall implementation in the kernel can reveal the difference between "I'm holding it wrong" and "this ought to be improved," and may even suggest leverage points.

I'm not sure what that'd look like in this domain though; finding a big pile of VHDL and running it through a simulator a whole bunch? Reaching out to the espressif technical support folks? I suppose I'm looking for a way to bound that uncertainty around "what chips will this affect?"

I do want to say: I think this is actually a great problem to have. It really validates y'all's approach and speaks to the excellence of the tools you've built that I'm able to push up against the limits of the hardware like this, and to have done so relatively easily! I mean, I added a handful of linker directives, and now we're talking about introducing targeted de-optimizations because the esp-rs HAL is too good!

MabezDev commented 1 month ago

I believe this is now resolved via #1856, thanks for your investigation @sethp. In the end we just added a microsecond delay before starting the DMA to allow the FIFO to contain at least one item. It's not perfect, but I believe this is a limitation of the hw and something we need to work around. I hope this solution ends up being viable for your use-case.

sethp commented 1 month ago

Ah the classic "fixing a timing issue with a sleep call."

I'm not really set up to look at this too closely, but IIRC a microsecond delay is probably a bit more than necessary but should still work fine. Thanks for the follow up!