carlk3 / no-OS-FatFS-SD-SDIO-SPI-RPi-Pico

A FAT filesystem with SDIO and SPI drivers for SD card on Raspberry Pi Pico
Apache License 2.0

Performance improvements #41

Open santolucito opened 1 month ago

santolucito commented 1 month ago

I'm trying to understand the discussion at https://github.com/ZuluSCSI/ZuluSCSI-firmware/issues/269 and I don't follow all the details, but I do see performance numbers around 25 MB/s for SDIO, and maybe even 40 MB/s if I'm reading it correctly. The README here shows numbers closer to 10 MB/s. Do you have a sense of where the mismatch comes from? This is something I'd be happy to try to address if I had a better sense of where the gap is coming from.

carlk3 commented 1 month ago

There are a bunch of possible factors.

  1. FAT file system: there is a big difference if you're comparing raw block read/write speeds with file read/write speeds on FAT.
  2. SdFAT vs FatFs. ZuluSCSI uses SdFAT.
  3. System clock frequency: I ran my benchmarks at the default Pico system clock frequency (clk_sys) of 125 MHz.
  4. Hardware: even at the default Pico system clock frequency (clk_sys) of 125 MHz, a baud_rate of 31250000 Hz is possible. However, some of my boards get flaky over 20833333 Hz.
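To illustrate item 4: on the RP2040, the SPI clock is the peripheral clock divided by an integer, and in the PL022 block that divider is the product of an even prescaler and a post-divider, so it is always even. The following is only a sketch, assuming clk_peri equals clk_sys; `achievable_baud` is a hypothetical helper, not part of this library or the Pico SDK.

```c
#include <stdint.h>

/* Hypothetical helper: highest SPI clock <= requested_hz, assuming the
 * total divider is an even integer from 2 to 512 (a simplification of
 * the PL022 prescale / post-divide pair). Returns 0 if nothing fits. */
static uint32_t achievable_baud(uint32_t clk_peri_hz, uint32_t requested_hz) {
    for (uint32_t div = 2; div <= 512; div += 2) {
        uint32_t hz = clk_peri_hz / div;
        if (hz <= requested_hz) {
            return hz;
        }
    }
    return 0;
}
```

With a 125 MHz clock, a request for 31250000 Hz is met exactly (divider 4), while a request for 25000000 Hz falls back to 20833333 Hz (divider 6), which is where the two figures in item 4 come from.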

Re: CMD12: I've made some effort to exploit that trick. (See Performance improvements, https://github.com/carlk3/no-OS-FatFS-SD-SDIO-SPI-RPi-Pico/pull/30.)

I have some ideas for speeding things up:

  1. First of all, essentially the same SD driver runs faster in FreeRTOS-FAT-CLI-for-RPi-Pico, maybe because it uses both cores. Currently, no-OS-FatFS-SD-SDIO-SPI-RPi-Pico itself only uses one core. There is some locking in the library code that isn't necessary if only one core is used, but is necessary if the application uses the file system from both cores. Perhaps the locking could be disabled with conditional compilation when it is not needed.
  2. There is some room for improvement in the CRC calculations. In particular, I've been looking at slicing-by-8. (There is also DMA Sniffer, but there are reasons I'm avoiding that [see Suggestion for calculating CRC/etc #63], and I don't think it can be made to work for 4-bit SDIO, anyway).
  3. ZuluSCSI's PIO optimizations: rp2040_highspeed_sdio
  4. I'm looking at eMMC. That has additional bus modes that could open up some possibilities:
    • By using the 8 bit interface instead of the 4 bit SD interface, the transfer rate could potentially be doubled.
    • Dual Data Rate (DDR): clocks the data on the rising AND falling edges of the clock. SD has DDR, but only for 1.8 V signaling. eMMC has a DDR mode that uses 3.3 V signaling. Another potential doubling of the transfer rate.
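As a rough illustration of the slicing-by-8 idea in item 2, here is a sketch for the CRC16 that protects SD data blocks in SPI mode (polynomial 0x1021, initial value 0, no bit reflection). It is not code from this library; all names are hypothetical, and it does not cover the per-line CRC needed for 4-bit SDIO.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC16_POLY 0x1021u

static uint16_t slice_table[8][256];

/* Reference implementation: one bit at a time. */
static uint16_t crc16_bitwise(const uint8_t *data, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; ++i) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ CRC16_POLY)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* slice_table[k][b] = CRC of byte b followed by k zero bytes. */
static void crc16_slice8_init(void) {
    for (unsigned b = 0; b < 256; ++b) {
        uint8_t byte = (uint8_t)b;
        slice_table[0][b] = crc16_bitwise(&byte, 1);
    }
    for (unsigned k = 1; k < 8; ++k) {
        for (unsigned b = 0; b < 256; ++b) {
            uint16_t prev = slice_table[k - 1][b];
            slice_table[k][b] =
                (uint16_t)((prev << 8) ^ slice_table[0][prev >> 8]);
        }
    }
}

/* Slicing-by-8: eight independent table lookups per 8 input bytes,
 * then a byte-at-a-time loop for any remainder. */
static uint16_t crc16_slice8(const uint8_t *p, size_t len) {
    uint16_t crc = 0;
    while (len >= 8) {
        crc = (uint16_t)(slice_table[7][p[0] ^ (crc >> 8)] ^
                         slice_table[6][p[1] ^ (crc & 0xFFu)] ^
                         slice_table[5][p[2]] ^ slice_table[4][p[3]] ^
                         slice_table[3][p[4]] ^ slice_table[2][p[5]] ^
                         slice_table[1][p[6]] ^ slice_table[0][p[7]]);
        p += 8;
        len -= 8;
    }
    while (len--) {
        crc = (uint16_t)((crc << 8) ^ slice_table[0][(crc >> 8) ^ *p++]);
    }
    return crc;
}
```

Call `crc16_slice8_init()` once before use. The tables cost 4 KiB of RAM (8 x 256 x 2 bytes), which is the price paid for replacing a chain of dependent lookups with independent ones; whether that is a net win on the RP2040 would need measuring.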

I'd welcome any contributions! Just make a Pull Request.

matsobdev commented 3 weeks ago

I'm new to SD cards and have just started using the great page http://elm-chan.org/docs/mmc/mmc_e.html where, as well as in the SD spec, they talk about using ACMD23 before writing (for write speed improvement). This is the concept I have in mind, since I'm OK with having no file system :D A never-ending write (or read). E.g. for logging data, a CMD25 could be started, and then 512 bytes pushed once in a while without stopping it prematurely with CMD12. So, e.g., a write lasting days or months. There could be fewer rewrites of sectors. And by the way, in CSD 1.0 there is an erase size (for SD cards of 2 GiB or less), while in CSD 2.0 (SDHC, SDXC) erase_size is hard-coded to 64 KiB with an annotation that the AU (or a multiple of the AU) is the unit of erase size. The README seems to mix that up a bit (I might have done so as well). So, depending on the type of card, long-running writes/logging might get the equivalent of 4 MiB buffers in writes and less busy time.

About tuning and reading (might apply to writing as well): I use only blocking SPI functions right now for starters, and a higher clock doesn't mean a faster read, since there will just be more loops checking for the 0xFE token. So maybe just use a precalculated/learned sleep_us() to wait for the card to be ready, without using horsepower in the meantime.

carlk3 commented 3 weeks ago

> ACMD23 before writing (for write speed improvement)

Previously, I used ACMD23, but I was never able to measure any write speed improvement. In fact, it slowed down writes by the extra time that it took to send the ACMD23. So, I removed it.

> E.g. for logging data, a CMD25 could be started, and then 512 bytes pushed once in a while without stopping it prematurely with CMD12. So, e.g., a write lasting days or months. There could be fewer rewrites of sectors.

Keep in mind that if there is an unexpected interruption, e.g. power failure or SD card removal, you might lose days, months of data.

> And by the way, in CSD 1.0 there is an erase size (for SD cards of 2 GiB or less), while in CSD 2.0 (SDHC, SDXC) erase_size is hard-coded to 64 KiB with an annotation that the AU (or a multiple of the AU) is the unit of erase size. The README seems to mix that up a bit (I might have done so as well).

I agree; I should remove that from the README. I don't know why they even kept SECTOR_SIZE in the specification since it is apparently meaningless now, and they were defining a whole new CSD Version 2.0 structure anyway.

One difficulty with AU_SIZE is that you can only get that in SD mode, not SPI mode. However, 4 MiB is probably a safe guess for today's SD cards.

> About tuning and reading (might apply to writing as well): I use only blocking SPI functions right now for starters

Of course, all the tuning in the world will not make SPI approach 4-bit SD speeds.

matsobdev commented 3 weeks ago

I'm just making my own SPI, let's call it a library, and the combination of CPU clock and SPI clock matters: 144/24 MHz, and for example 200/33.(3) MHz, almost fully saturates the SPI. SPI polarity 1 and phase 1. Maybe 16-bit SPI can lower the required CPU speed. Now it is 8-bit. It might not be related at all, but when I was playing with an SPI display and sending raw pictures directly from the Pico's flash, 16-bit SPI required a way lower CPU clock than 8-bit mode. Maybe it is worth trying; at some point I will, and I'll report back. At least it is easy for the 514-byte transfer of data payload and CRC.

> Keep in mind that if there is an unexpected interruption, e.g. power failure or SD card removal, you might lose days, months of data.

These might be just my guesses, but let's say we have gigabytes' worth of buffer on the Pico and are doing a continuous write. I would guess the SD card is busy after the data response for a reason. I wouldn't expect a giant, card-sized buffer inside the SD card. My conclusion is that the worst-case scenario wouldn't be losing all the data, like 200 GiB or so. But to be on the safe side, a CMD12 could be sent with every last block of 4 MiB or so, depending on the card. Then, if a malfunction happens, there is less to worry about. When I'm ready with the writing part, I'll test it. But even with smaller writes of, for example, 64 KiB ended with CMD12, aiming for performance, disaster might also happen in the middle of a transfer.

carlk3 commented 3 weeks ago

I can only speculate, since SD card manufacturers don't disclose the internal workings of their products, but an AU_SIZE of 4 MiB might mean that the SD card has a 4 MiB volatile cache. In a Multiple Block write operation when the transmission is stopped by sending the 'Stop Tran' token (or CMD12) the card might commit the data in the volatile cache to non-volatile flash memory. So, if you never (or rarely) send 'Stop Tran' or CMD12, you could lose up to 4 MiB of data that has been written to the volatile cache but not yet committed to non-volatile flash. Of course, whether or not that is an acceptable loss is up to the application. For some applications, it might be better to use raw flash memory instead of SD cards.
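To make the trade-off concrete: if the card really does commit its cache when the transfer is stopped (which is only the speculation above), then ending the multiple-block write once per AU would bound the loss at one AU. A minimal sketch of such a policy; the function name is hypothetical and the 4 MiB figure is the guess discussed in this thread, not a guarantee.

```c
#include <stdbool.h>
#include <stdint.h>

#define SD_BLOCK_SIZE 512u

/* Hypothetical policy helper: returns true when the block just written
 * (blocks_written counts from 1) completes an allocation unit, i.e. when
 * the caller would send Stop Tran / CMD12 and then start a new
 * multiple-block write. */
static bool should_stop_transmission(uint64_t blocks_written, uint32_t au_bytes) {
    uint32_t blocks_per_au = au_bytes / SD_BLOCK_SIZE;
    if (blocks_per_au == 0) {
        return true; /* degenerate AU size: stop after every block */
    }
    return blocks_written != 0 && blocks_written % blocks_per_au == 0;
}
```

For a 4 MiB AU this stops once every 8192 blocks, so the stop overhead is amortised over a large transfer while the data at risk stays bounded under the stated assumption.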

Note: many SD cards support "stream recording" with CMD20.

matsobdev commented 3 weeks ago

> Note: many SD cards support "stream recording" with CMD20.

Good to know; I need to get familiar with that, but for now, for SPI, it just says "No". The good news is that with 16-bit SPI transfers (just for the 512 B data payload and CRC) it is easier to utilise nearly all of the SPI bandwidth. With the default CPU clock of 125 MHz and SPI at 31250000 Hz, reading 64 KiB of data takes 17.810 ms with 16-bit transfers and 25.055 ms with 8-bit. So 3.50 MiB/s for 16-bit under the same clock conditions as in your library. But there is no CRC checking and no file system.

PS. More like 3.10 MiB/s at 16-bit, including the byte swap of the little-endian output from 16-bit SPI.

PS2. For example, at 200 MHz CPU and 33.(3) MHz SPI, 8-bit SPI is faster than 16-bit, due to the byte swap overhead, so it depends, as always.

PS3. Using 16-bit DMA and waiting for it to finish the transfer is as fast as a blocking 16-bit SPI transfer, and the output is the same (little-endian), so no go there. Fortunately, 8-bit DMA with 8-bit SPI is as fast as the 16-bit one, so there is no need to change endianness, and it is even a bit faster at the end of the day, since there is no need to switch the SPI format settings twice, from 8 to 16 bit and back to 8. So 17.772 ms, or 3.52 MiB/s, under the conditions in the first paragraph. But since 25 MHz is the maximum for SPI SD, this counts as overclocking, I guess.
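The figures above can be checked with a little arithmetic, and compared against the ceiling of a 1-bit SPI bus. This is only a sketch with hypothetical helper names; the ceiling ignores command, token and CRC overhead.

```c
#include <stdint.h>

/* Throughput in MiB/s for `bytes` moved in `millis` milliseconds. */
static double throughput_mib_s(uint64_t bytes, double millis) {
    return ((double)bytes / (1024.0 * 1024.0)) / (millis / 1000.0);
}

/* Upper bound for a 1-bit SPI bus: one bit per SCK cycle. */
static double spi_ceiling_mib_s(uint32_t sck_hz) {
    return ((double)sck_hz / 8.0) / (1024.0 * 1024.0);
}
```

For 64 KiB, `throughput_mib_s(65536, 17.810)` gives about 3.51 MiB/s and `throughput_mib_s(65536, 25.055)` about 2.49 MiB/s, against `spi_ceiling_mib_s(31250000)` of about 3.73 MiB/s, so the faster case is already within a few percent of what the bus can carry.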

carlk3 commented 3 weeks ago

Are you using channel_config_set_bswap?

I haven't looked into 16 bit SPI transfer at all. Looks like it has some potential. CRC is optional for SPI-attached SD cards, but for many applications I think it is worthwhile. For something like media streaming, maybe not. SDIO mandates CRC but it is so much faster anyway.

Now, I wonder if there is some clever way that slicing-by-8 CRC calculation could be combined with the byte swapping.

carlk3 commented 3 weeks ago

In particular, on the SDIO side, sdio_crc16_4bit_checksum is doing

            // Each 32-bit word contains 8 bits per line.
            // Reverse the bytes because SDIO protocol is big-endian.
            uint32_t data_in = __builtin_bswap32(*data++);

I think there might be some room for improvement there.
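For readers unfamiliar with the layout that comment refers to: on the 4-bit bus, each clock cycle moves one nibble, one bit per DAT line, so a 32-bit word spans 8 clocks and holds 8 bits for each line. The sketch below is not the library's code; the helper names are hypothetical, and it assumes the usual wide-bus mapping in which DAT3 carries the most significant bit of each nibble.

```c
#include <stdint.h>

/* Assemble a big-endian 32-bit word from 4 bytes in wire order. On a
 * little-endian CPU this is what a plain 32-bit load followed by
 * __builtin_bswap32 yields. */
static uint32_t load_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Collect the 8 bits that one DAT line (0..3) carried for this word,
 * earliest bit in the MSB of the result. */
static uint8_t sdio_line_bits(uint32_t word_be, unsigned line) {
    uint8_t out = 0;
    for (int nibble = 7; nibble >= 0; --nibble) {
        uint32_t n = (word_be >> (nibble * 4)) & 0xFu;
        out = (uint8_t)((out << 1) | ((n >> line) & 1u));
    }
    return out;
}
```

The byte swap exists only to put the nibbles into transmission order before the per-line CRC is updated, so any scheme that consumes the nibbles in the right order without swapping would be a candidate for the improvement mentioned above.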

carlk3 commented 3 weeks ago

> Note: many SD cards support "stream recording" with CMD20.
>
> Good to know; I need to get familiar with that, but for now, for SPI, it just says "No".

If the reason that you're using SPI is to save GPIO pins, 1-bit wide SDIO might be another option, and that should support CMD20 (with the right SD card).

It is also free from the 25 MHz speed limit.

matsobdev commented 3 weeks ago

SPI seems simple, so that is mostly why :D I need to get familiar with PIO. An official SD interface is expected in SDK 1.7.0 (https://github.com/raspberrypi/pico-sdk/issues/1663), so maybe that will be a nice solution. I settled on 8-bit SPI (and 8-bit DMA) since it solved my problem and was as fast as 16-bit SPI. Tested by reading 1 GiB of data (starting from a 1 GiB offset) and overwriting a 64 KiB buffer, it was 3.62 MiB/s, so even higher than with just 64 KiB. As for CRC, I'll try to utilise that DMA stuff, but right now I'm sending the data over USB serial to test its integrity, and it is intact (apart from a bump described here: https://forums.raspberrypi.com/viewtopic.php?p=2226642#p2226642 but that is mostly USB related, possibly TinyUSB: when I wait for DMA to finish, 100% of serial transfers are intact).

I just peeked into the code and there is already channel_config_set_bswap(), so maybe use another DMA channel just to copy the data (SRAM to SRAM) to data_in, utilising the DMA byte swap once again.

matsobdev commented 1 week ago

There is data loss with the strategy described above, e.g. when writing a whole 4 GiB card without Stop Tran (I had mistaken that for CMD12 before): the last 2 MiB was never written (I tried a couple of times, starting from about the last 100 MiB), and simply waiting did nothing (as expected). At 125 MHz CPU and 32.25 MHz SPI, 512-byte transfers (including token, CRC, response and busy wait) ran most of the time at about 3.62 MiB/s, sometimes a bit lower, and occasionally as low as 3.05 MiB/s, but the average for 1 GiB (starting from a 1 GiB offset) and for the whole card was 3.52 MiB/s, so the slowdowns don't happen very often. Some applications might not block while the card is busy and instead let it finish in the background for consistent timing, while doing e.g. data gathering from sensors.

PS: More accurate data. The first two writes are slow, around 170 and 60 ms; after that a write takes about 138 us, but every 4096 blocks (2 MiB) it peaks a little above 2.5 ms. That may be where the actual write from RAM to flash happens, and it might further explain the unwritten 2 MiB, since that would be the AU size if I understand it correctly. So for a logger-like scenario there should be a countermeasure: a non-blocking release that lets the SD card busy-wait in the background while the main work continues (as described above). The first two writes are removed for clarity:

[Image: untitled]
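The reported timings can be cross-checked with a simple model in which one write per period stalls and the rest take the typical time. This is only a sketch; the function name is hypothetical and the inputs are the figures quoted in the PS above.

```c
#include <stdint.h>

/* Average MiB/s when one 512-byte write in every `period_blocks` takes
 * `stall_us` and all the others take `typical_us`. */
static double average_mib_s(uint32_t period_blocks, double typical_us,
                            double stall_us) {
    if (period_blocks == 0) {
        return 0.0;
    }
    double total_us = (double)(period_blocks - 1) * typical_us + stall_us;
    double mib = (double)period_blocks * 512.0 / (1024.0 * 1024.0);
    return mib / (total_us / 1e6);
}
```

`average_mib_s(4096, 138.0, 2500.0)` comes out at roughly 3.52 MiB/s, in line with the measured average, which suggests the periodic stall costs well under one percent of throughput and matters mainly for timing consistency.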

carlk3 commented 1 week ago

> Stop Tran (I had mistaken that for CMD12 before)

They seem to do about the same thing, but I think Stop Tran is more efficient.

> 2 MiB, since that would be the AU size if I understand it correctly

You can query it with "SD Status" (ACMD13). For all of the modern cards I have, it returns 4 MiB.
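A sketch of decoding that field, assuming the 64-byte SD Status block returned by ACMD13 has already been read into a buffer, most significant byte first. The field position (bits 431:428) and the size codes follow the Physical Layer Simplified Specification as understood here and should be checked against the version you target; `sd_status_au_bytes` is a hypothetical name.

```c
#include <stdint.h>

/* Decode AU_SIZE from the SD Status block. Returns the AU size in
 * bytes, or 0 when the card reports the "not defined" code. */
static uint32_t sd_status_au_bytes(const uint8_t status[64]) {
    /* AU_SIZE codes 0x0..0xF, expressed in KiB. */
    static const uint32_t au_kib[16] = {
        0,    16,   32,   64,    128,   256,   512,   1024,
        2048, 4096, 8192, 12288, 16384, 24576, 32768, 65536
    };
    /* Bits 431:428 of the 512-bit status are the high nibble of byte 10. */
    uint8_t code = (uint8_t)(status[10] >> 4);
    return au_kib[code] * 1024u;
}
```

A card that reports code 9 decodes to 4 MiB, the value mentioned above.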

Be wary of optimizing too much for one model of one vendor's SD cards (unless you don't mind being locked in) *. Other cards may behave differently. The only guarantees are in the SD Card Association's Physical Layer Specification.

* Just look at the changes in SD cards since 1999!

matsobdev commented 3 days ago

Yes, I'm cutting corners a bit, but I don't mind for personal use. For example, with a fairly recent, cheap card, the 32 GB version of https://www.goodram.com/produkty/goodram-m1a0-m1aa-microcard/ : 28258. It is inconsistent, and the last 25k-ish blocks are written even without pauses. And the pauses do not even come at even numbers of blocks, but at e.g. 3136, and the interval varies. No CRC is implemented, so there are no retransmissions, but the written data is intact. The card was full before formatting, though. Maybe there is some CRC mismatch internally when writing to flash, and the card retries. I'm giving it a break.