ZuluSCSI / ZuluSCSI-firmware

Firmware for the ZuluSCSI advanced SCSI emulator
https://zuluscsi.com

Increasing SD card write performance RP2040 #269

Open juico opened 1 year ago

juico commented 1 year ago

Hey, I don't personally own a ZuluSCSI device, but I am using the 4-bit SDIO implementation of ZuluSCSI. I ran into a limitation in write speed and got stuck around 10 MB/s, and the latency of each write seemed inconsistent, with spikes every 2 or 3 multi-block writes. The SD spec states that sending a CMD12 after a multi-block write can take quite some time. So I changed the code a bit so it can enter a multi-block write mode in which it pre-erases the number of blocks that will be written and sets an internal state so that the CMD12 is not sent after the multi_block_write function.

This dramatically changed the performance: latencies are now very consistent throughout the write procedure. The initial or final block write does have a higher latency, but for my application this is not a problem. The write speeds I have been able to reach are around 26-28 MB/s (with a Samsung 128 GB EVO Plus) with the SD card clocked at 62.5 MHz.

Commit #253 does mention that it is possible to use stopTrans tokens, but I haven't used them in my implementation, which can be found at: https://github.com/juico/pico-sdio-example. The current code is a bit messy. I also tried to implement some commands to change to HS mode using CMD6 and changed the PIO code to try to achieve higher frequencies, although I am not convinced this helped a lot. As I am just using an SD card to microSD card adapter soldered to some headers that are plugged into a breadboard, I can imagine that higher speeds could be possible on a proper PCB.

Not sure if the write/read performance is a bottleneck for ZuluSCSI, but if needed there seems to be a lot of headroom left.

Edit: I completely missed your high-speed and cache-enabled branches; I might give those a try as well. I ran into issues with clk_div 3 and a frequency of 150 MHz, but with clk_div 4 I can reach 250 MHz, resulting in a higher SD card frequency.
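As a rough sanity check on these numbers, the SD clock and the bus ceiling can be worked out from the system clock and the divider. The sketch below assumes that the SD clock is simply the system clock divided by clk_div and that 4-bit SDIO moves 4 bits per clock with no command, CRC, or busy overhead. The function names are made up for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: SD clock in Hz, assuming sd_clk = sys_clk / clk_div.
constexpr std::uint32_t sd_clock_hz(std::uint32_t sys_clk_hz, std::uint32_t clk_div) {
    return sys_clk_hz / clk_div;
}

// Upper bound on 4-bit SDIO payload throughput in bytes per second:
// four data lines carry 4 bits per clock, i.e. one byte every two clocks.
constexpr std::uint32_t sdio4_peak_bytes_per_sec(std::uint32_t sd_clk_hz) {
    return sd_clk_hz / 2;
}
```

Under these assumptions, a 250 MHz system clock with clk_div 4 gives 62.5 MHz and a ceiling of 31.25 MB/s, so the 26-28 MB/s reported above is already close to what the bus can carry at that clock.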

morio commented 1 year ago

Thanks juico! I was hoping to implement your code on the ZuluSCSI. Does your code require 7 pins for SDIO? I noticed you used sideset on pin 22 here. I don't have any PIO experience and was wondering what pin 22 is doing. It kinda looks like it is the SD clock, but you still have that defined here.

juico commented 1 year ago

Hi Morio,

The changes in the PIO are just an attempt to increase clock speeds but are not essential. I moved the clock to a separate SM using an interrupt. I left the old clock in the code and rerouted it to pin 22 for debugging, but it can be removed.

The main contribution to the write performance is omitting the CMD12 in the stopTransmission function if a continuous write is happening. At the moment I solve it with the writeStart function that I call before writing a large chunk of data. This sets the multi_write variable and the boundary of the write with the multi_write_end variable. For my application I only write large continuous files of around 1 GB, but for ZuluSCSI the way the data is written is quite different. I can imagine that keeping track of when a continuous write is happening and when it is ending could be done internally without defining it beforehand. This would make it harder to issue the pre-erase command, but from some tests this does not seem to affect performance that much.
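A minimal sketch of the mechanism described here, not the code from the repository: a write range is declared up front, and CMD12 is only counted once the range is complete. The class, the mock card, and the command counters are hypothetical and exist only so the behaviour can be checked on a PC.

```cpp
#include <cstdint>

// Mock card that only counts commands; it stands in for the real SDIO driver.
struct MockCard {
    int cmd25_count = 0;  // WRITE_MULTIPLE_BLOCK commands issued
    int cmd12_count = 0;  // STOP_TRANSMISSION commands issued
};

// Hypothetical wrapper mirroring the multi_write / multi_write_end idea.
class RangeWriter {
public:
    explicit RangeWriter(MockCard& card) : card_(card) {}

    // Declare the sector range [first, first + count) that will be written.
    void writeStart(std::uint32_t first, std::uint32_t count) {
        multi_write_ = true;
        started_ = false;
        next_sector_ = first;
        multi_write_end_ = first + count;
    }

    // Write `count` sectors at `sector`; the data transfer itself is omitted.
    void writeSectors(std::uint32_t sector, std::uint32_t count) {
        const bool in_range = multi_write_ && sector == next_sector_ &&
                              sector + count <= multi_write_end_;
        if (!in_range) {
            // Any write outside the declared range cancels the range.
            if (multi_write_ && started_) ++card_.cmd12_count;  // close open transfer
            multi_write_ = false;
            ++card_.cmd25_count;  // ordinary write: CMD25 ... CMD12
            ++card_.cmd12_count;
            return;
        }
        if (!started_) {          // open the transfer only once
            ++card_.cmd25_count;
            started_ = true;
        }
        next_sector_ = sector + count;
        if (next_sector_ == multi_write_end_) {  // range finished: stop now
            ++card_.cmd12_count;
            multi_write_ = false;
        }
    }

private:
    MockCard& card_;
    bool multi_write_ = false;
    bool started_ = false;
    std::uint32_t next_sector_ = 0;
    std::uint32_t multi_write_end_ = 0;
};
```

With a declared range of 8 sectors written in three calls, the mock sees one CMD25 and one CMD12 instead of three of each.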

I merged some changes from the SD card cache branch of ZuluSCSI. Performance does not change for continuous writes, but I haven't tested smaller writes.

greiman commented 1 year ago

@juico and @morio

I am the author of SdFat and am interested in adding a SDIO mode for RP2040.

I looked at all the versions of SDIO for RP2040 that have evolved from this demo on the Raspberry Pi github site.

None of the existing implementations provide improved performance for the way most applications use SD cards.

Before I start doing yet another version, I decided to see if you are interested in developing a version that is fast for all sizes of transfers, not just very large transfers.

First let me describe how SD cards have evolved. The first SD cards were 8 or 16 MB FAT12 and truly had 512 byte flash sectors.

Now cards have huge flash pages and emulate 512 byte sectors. There are large RAM caches in modern cards.

Here are two definitions from the SD standard for how flash is managed in a card.

Allocation Unit (AU): The User Area is divided into units called "Allocation Unit (AU)" (Refer to Figure 4-47). AU is physical boundary in User Area of a card and is not defined by the file system boundary. Each card has its own fixed AU Size (SAU) and the maximum AU Size is defined depending on the card's capacity.

Recording Unit (RU): Each AU is divided into units called "Recording Unit (RU)" (Refer to Figure 4-47). The unit of RU Size (SRU) is 16KByte. The RU Size is a multiple of 16KByte and shall not span across an AU boundary. Larger RU size may improve performance.

Here are sizes of AU and RU for different classes of cards.

[Image: AU and RU sizes for SDHC cards]

[Image: AU and RU sizes for SDXC cards]

Every time you do a single-block transfer or end a multiple-block transfer, an RU is programmed. Eventually flash becomes fragmented and data must be moved. Here is a description from the standard.

Write Performance: Figure 4-48 shows the typical data management of the card when the host writes RUs of an AU. When the host writes to a fragmented AU, the card prepares a new AU by copying the used RUs and writing the new RUs. The location A is at the start of the AU boundary and location B is at the end of the AU boundary. From A to B, the host shall write data to free RUs contiguously and skip used RUs (shall not skip any free RU). The card may indicate busy to the host, so the host can wait, during the time the card controller is writing and moving data. The total write time from A to B can be calculated by summing up the write time of free RUs and the moving time of the used RUs. The number of used RUs (Nu) is available by counting it over one AU and number of free RUs is expressed by (NRU - Nu).
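The sum described in the quoted passage can be written out as a small helper. The per-RU write and move times are parameters of this sketch, not values from the standard, and the function and parameter names are assumptions.

```cpp
#include <cstdint>

// Time to write one AU from A to B: free RUs are written, used RUs are moved.
// t_write_us and t_move_us are assumed per-RU times in microseconds.
constexpr std::uint64_t au_write_time_us(std::uint32_t n_ru, std::uint32_t n_used,
                                         std::uint64_t t_write_us,
                                         std::uint64_t t_move_us) {
    return static_cast<std::uint64_t>(n_ru - n_used) * t_write_us +
           static_cast<std::uint64_t>(n_used) * t_move_us;
}
```

As a purely illustrative example, a 4 MB AU with 16 KB RUs has 256 RUs. If writing an RU took 500 µs and moving one took 2000 µs, an empty AU would take 128 ms, while the same AU with 64 used RUs would take 224 ms.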

[Image: Write]

This means that many things in your implementation, like pre-erase commands, only apply to twenty-year-old cards.

I don't want to lose the above to an edit error, so I will start another post to explain the two modes for card access in SdFat.

greiman commented 1 year ago

SdFat has two modes for SD card access. I have two modes since it is not always possible to implement or use the faster mode.

The fast mode attempts to use the largest read or write transfer possible. For SPI this requires a dedicated SPI interface.

For SDIO I have only been able to implement the fast mode on the NXP processors in Teensy 3.x and 4.x boards. I think there is an implementation for a few STM32 chips, but it's not possible to use it with the standard STM32 board support package.

The fast mode on SDIO has the unfortunate name FIFO_SDIO since I couldn't get DMA to work in this mode on NXP so DMA_SDIO is the slow mode.

The fast version depends on implementing these member functions for multi-block transfers:

bool readData(uint8_t* dst);
bool readStart(uint32_t sector);
bool readStop();

bool writeData(const uint8_t* src);
bool writeStart(uint32_t sector);
bool writeStop();
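To make the role of these six functions concrete, here is a hedged sketch that wraps them in an abstract interface and builds a multi-sector write on top. The interface name, the helper, and the in-memory card are hypothetical and are not SdFat's actual class layout. This simple version ends the transfer after every call.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// The six multi-block primitives listed above, as an abstract interface.
struct MultiBlockCard {
    virtual ~MultiBlockCard() = default;
    virtual bool readStart(std::uint32_t sector) = 0;
    virtual bool readData(std::uint8_t* dst) = 0;
    virtual bool readStop() = 0;
    virtual bool writeStart(std::uint32_t sector) = 0;
    virtual bool writeData(const std::uint8_t* src) = 0;
    virtual bool writeStop() = 0;
};

// Write n consecutive 512-byte sectors as one multi-block transfer.
inline bool writeSectors(MultiBlockCard& card, std::uint32_t sector,
                         const std::uint8_t* src, std::size_t n) {
    if (!card.writeStart(sector)) return false;
    for (std::size_t i = 0; i < n; ++i) {
        if (!card.writeData(src + 512 * i)) return false;
    }
    return card.writeStop();
}

// In-memory stand-in for a card, only for trying the helper on a PC.
class RamCard : public MultiBlockCard {
public:
    explicit RamCard(std::uint32_t sectors)
        : mem_(static_cast<std::size_t>(sectors) * 512) {}
    bool readStart(std::uint32_t s) override { return seek(s); }
    bool readData(std::uint8_t* dst) override {
        if (pos_ + 512 > mem_.size()) return false;
        std::memcpy(dst, &mem_[pos_], 512);
        pos_ += 512;
        return true;
    }
    bool readStop() override { return true; }
    bool writeStart(std::uint32_t s) override { return seek(s); }
    bool writeData(const std::uint8_t* src) override {
        if (pos_ + 512 > mem_.size()) return false;
        std::memcpy(&mem_[pos_], src, 512);
        pos_ += 512;
        return true;
    }
    bool writeStop() override { return true; }

private:
    // Position the cursor at the start of sector s; fails if out of range.
    bool seek(std::uint32_t s) {
        pos_ = static_cast<std::size_t>(s) * 512;
        return pos_ < mem_.size();
    }
    std::vector<std::uint8_t> mem_;
    std::size_t pos_ = 0;
};
```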

Here is an example of the difference for 512-byte SPI transfers on Pico with Earle Philhower's package at 133 MHz.

The slow SHARED_SPI mode:

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
418.24,21648,1096,1223
418.10,21641,1099,1224

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
821.56,2438,470,620
823.32,1982,470,619

The fast DEDICATED_SPI mode:

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2269.63,249,222,224
2268.60,249,222,225

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2288.33,235,221,223
2290.43,292,221,222

Even the Teensy 4.1 SDIO is slow for 512 byte DMA transfers:

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
622.35,16841,720,822
612.22,17905,719,836

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2139.50,1254,232,239
2138.58,1255,233,239

Here is Teensy 4.1 FIFO_SDIO for 512-byte transfers at 50 MHz SD clock. People have obtained faster rates by overclocking.

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
22123.89,53,22,22
22123.89,53,22,22

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
22624.43,1104,22,22
22624.43,134,22,22

I am hoping there is a way to implement the faster mode on RP2040. I have tried on ESP32 and gave up. The SDIO controller or board support packages for other MCUs have stopped me.

Let me know if you are interested.

Edit: Another problem with SDIO for SD cards is the requirement of 32-bit alignment for most controllers. This means tmp buffers and memcpy. Depending on how a file is written, you can't solve the problem with buffer alignment.

Here is one case; there are many more. If you write 511 bytes in the first write, then in the next write one byte will be moved to the cache and the cache will be written. Now the remaining data in the second write is not 32-bit aligned.

All these problems make fast SDIO for apps difficult. Use of direct access to the FIFO eliminated the 32-bit alignment and DMA problems on the Teensy NXP MCU.
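The 511-byte case can be checked with a little arithmetic. This is a sketch under the assumption of a single 512-byte sector cache that must be topped up before whole sectors are sent directly from the user buffer. The helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes of the next write() that must first top up the partially filled
// 512-byte sector cache before any whole sector can be sent directly.
constexpr std::size_t bytes_to_fill_cache(std::size_t cache_used) {
    return cache_used == 0 ? 0 : 512 - cache_used;
}

// True if the part of the user buffer that follows the cache top-up starts
// on a 32-bit boundary. buf_addr is the buffer address as an integer.
constexpr bool remainder_is_aligned(std::uintptr_t buf_addr, std::size_t cache_used) {
    return (buf_addr + bytes_to_fill_cache(cache_used)) % 4 == 0;
}
```

With 511 bytes already in the cache, one byte of an aligned user buffer goes to the cache and the rest starts at an odd address, so it is not 32-bit aligned. With 508 bytes in the cache the remainder would still be aligned, which is why writes that are multiples of four bytes fare better.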

greiman commented 1 year ago

Here is a good test of whether an SDIO implementation will provide an improvement over dedicated SPI for most apps.

Try a test with 511 byte file read/write. Dedicated SPI doesn't degrade much on RP2040. This size causes all data to be copied to/from the internal SdFat cache.

FILE_SIZE_MB = 5
BUF_SIZE = 511 bytes

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2055.92,278,5,247
2055.08,278,5,248

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2074.69,260,29,245
2074.69,277,29,245

Here is what the SPI transfer looks like for a read. There is a bit of a gap between bytes and the clock seems near 24 MHz.

[Image: RP2040SPI, capture of the SPI read transfer]

juico commented 1 year ago

Hey @greiman, thanks for all the information. It clearly explains why sending the CMD12 in the middle of an AU creates such latency spikes. I tried my code with a smaller buffer size and smaller file size.

FILE_SIZE_MB = 10
BUF_SIZE = 1024 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
20197.2,2646,48,49
19616.9,7770,48,51

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3282.05,540,240,311
3283.10,540,240,311

Done

In the case above the read commands are still being terminated using CMD12, but I am not quite sure if this causes the speed difference. I used the 1024-byte buffer because I only implemented my speedup in the multi-sector write and it does not work with a single-sector write.

It seems likely that SDIO on the RP2040 could be faster than the SPI mode, although maybe not as fast as the FIFO mode on the NXP MCUs. At the moment I am mostly interested in high speeds on large files, as I am reading out a linear CCD from a scanner and writing it directly to an SD card. The ADC of the CCD can generate up to 40 MB/s, although I am running it at 10 MB/s at the moment. The sensor is used to make a scanning camera that produces images of around 100 MP in raw 48-bit TIFF format, so that needs a lot of throughput on the SD card.

I can take a look at implementing these functions, although I am no expert, as the code I am using is mostly not written by me.

greiman commented 1 year ago

> At the moment I am mostly interested in high speeds on large files, as I am reading out a linear CCD from a scanner.

I use multi-block writes always. I just don't terminate the write.

The implementation with infinite transfers is as fast or faster for large writes. Unless you have a buffer for a whole AU, there will be increased latency occasionally.

People with Teensy 4.1 no longer use the DMA mode even with large writes.

To achieve max SD performance you need to write at least 4MB as a single multi-block transfer. SdFat does this and if you do writes of multiples of 512 bytes there are no memcpy calls.

juico commented 1 year ago

> I use multi-block writes always. I just don't terminate the write.

I assume you end the write using the sync function, which does send the CMD12 and automatically gets called when a sector is written that does not start where the last write finished. In the benchmark results I posted, the write also is not terminated until the end of the file. So it is a bit like what you are doing, but a bit hacky and not really a neat implementation.

greiman commented 1 year ago

I only use the sync() function when I am switching between read and write mode, when the transfer is not contiguous, or when file close is called.

If multiple files are open, large transfers are required for high performance, since interleaved access will cause sync() to be called.
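The policy described here can be sketched as a small state machine that only ends a transfer on a direction change, a gap in sector numbers, or an explicit sync. This is an illustration that counts start and stop commands, not SdFat's implementation, and all names are hypothetical.

```cpp
#include <cstdint>

// Counts the transfer starts and stops (CMD12) a lazy-sync policy would issue.
class LazySyncCard {
public:
    enum class Mode { Idle, Read, Write };

    void readSector(std::uint32_t sector)  { access(Mode::Read, sector); }
    void writeSector(std::uint32_t sector) { access(Mode::Write, sector); }

    // End any open transfer, for example on file close.
    void sync() {
        if (mode_ != Mode::Idle) {
            ++stops_;
            mode_ = Mode::Idle;
        }
    }

    int starts() const { return starts_; }
    int stops() const { return stops_; }

private:
    void access(Mode m, std::uint32_t sector) {
        if (mode_ != m || sector != next_) {  // direction change or gap
            sync();
            ++starts_;                        // new multi-block command
            mode_ = m;
        }
        next_ = sector + 1;                   // transfer stays open
    }

    Mode mode_ = Mode::Idle;
    std::uint32_t next_ = 0;
    int starts_ = 0;
    int stops_ = 0;
};
```

Contiguous writes cost one start and no stop until something else happens, while writes that alternate between two distant regions (as with two open files) pay a stop and a start on every switch.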

Most MCU SDIO controllers use CMD12. You can't send a stop token like SPI.

juico commented 1 year ago

That makes sense, for the write part it seems doable then.

I was wondering about the read part. The Teensy has a FIFO which is filled with data when readStart is called, if I read correctly. So when the read function gets called it grabs the data from the FIFO and returns it. This seems to work a lot faster with small buffer sizes, as I assume the overhead of initializing and ending a read is causing the low read speeds seen in my benchmark. Perhaps it is possible to mimic this on the RP2040 with a small internal buffer and a way to pace the reading of the sectors as the buffer gets full.

greiman commented 1 year ago

> I assume the overhead of initializing and ending a read is causing the low read speeds seen in my benchmark. Perhaps it is possible to mimic this on the RP2040 with a small internal buffer and a way to pace the reading of the sectors as the buffer gets full.

Yes, this is why read is fast on Teensy. I was afraid of this on RP2040. This is also why I have not implemented SDIO on STM32 chips. Only recent STM32 chips have a large FIFO, so on the others you can't pause reads.

Attempts to use SDIO on most MCUs have not resulted in improved performance for typical Arduino users.

On Teensy it only takes about 5 μs to fill the FIFO and the same if the read FIFO is full. Big overlap of I/O and processing is possible.

juico commented 1 year ago

I can imagine that most of the time the speed of the SD card is not that crucial anyway while using an Arduino. After searching through the non-simplified SD card spec, it seems you can pause the read operation by disabling the clock signal between blocks. This is specified in "4.12.5.2 Read Block Gap"; not sure if this only works for UHS or also for the slower SD protocol. I guess this is possible on the RP2040, as we have full control over the clock signal and can stop/start it any time we want.

greiman commented 1 year ago

> I guess this is possible on the RP2040, as we have full control over the clock signal and can stop/start it any time we want.

Yes. It does not work for most MCU SDIO controllers.

greiman commented 1 year ago

I really hope there will be an improved RP2040 SDIO driver. I hate to take on maintenance of another custom driver.

I can't get Arduino to improve any SPI drivers. They won't accept any changes to the Standard Arduino SPI API.

juico commented 1 year ago

I wrote some code that should work similarly to your write functions. I don't have the Pi Pico with SD card here, so I will test it when I get home and see how it manages with the 511-byte buffer size.

In terms of the reading part I am a bit uncertain what the right approach would be, as the current implementation writes the received data directly to the dst buffer using DMA. When introducing a FIFO buffer in software it would write to that location, but it would mean that the data has to be transferred from the software FIFO to the dst buffer, basically doubling the memory access. Maybe I am overthinking it, but I would imagine that a neater solution with some kind of queue could be possible: initially the data would be read from the FIFO, but as the FIFO gets empty the requested transfers would go into a queue, and the DMA would be configured to write directly to the addresses in the queue, skipping the FIFO.
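A minimal sketch of the software FIFO option, assuming whole 512-byte sectors are queued and the SD clock is only allowed to run while there is room for another block. It shows the extra copy mentioned above. It is not interrupt-safe and does not model DMA, and the class name and methods are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fixed-size queue of 512-byte sectors between the receive side and the reader.
template <std::size_t N>
class SectorFifo {
public:
    bool full() const { return count_ == N; }
    bool empty() const { return count_ == 0; }

    // The SD clock may keep running only while another block would fit.
    bool clockMayRun() const { return !full(); }

    // Called when a block has been received; returns false if there is no room.
    bool push(const std::uint8_t* sector) {
        if (full()) return false;
        std::memcpy(slots_[head_].data(), sector, 512);
        head_ = (head_ + 1) % N;
        ++count_;
        return true;
    }

    // Called by the reader; copies the oldest block out (the extra memcpy).
    bool pop(std::uint8_t* dst) {
        if (empty()) return false;
        std::memcpy(dst, slots_[tail_].data(), 512);
        tail_ = (tail_ + 1) % N;
        --count_;
        return true;
    }

private:
    std::array<std::array<std::uint8_t, 512>, N> slots_{};
    std::size_t head_ = 0;
    std::size_t tail_ = 0;
    std::size_t count_ = 0;
};
```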

The other hurdle would be stopping and starting the clock at the right time, as the timing seems quite crucial. I will have to see if I can attach my logic analyzer to get the timing right. Can I ask which logic analyzer you are using? I only have a cheap 24 MHz one from AliExpress, so I will probably have to reduce the clock speed quite a bit. Do you know if the NXP MCU stops the clock when the read FIFO gets full, or does it just stop the transfer?

Benchmark with the new code for 511 bytes:

FILE_SIZE_MB = 5
BUF_SIZE = 511 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
11280.4,8884,3,44
11666.7,79,3,42

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1926.12,288,15,265
1926.85,288,15,264

Done

With the 1024-byte buffer:

FILE_SIZE_MB = 10
BUF_SIZE = 1024 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
16569.6,7551,58,60
16925.6,93,58,59

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
3103.97,378,245,329
3105.85,363,245,329

Done

Somehow the performance decreased a bit, but at least the 511-byte write is still faster than the SPI mode. The performance loss seems related to the fact that WriteSectors calls WriteData in a loop and WriteData has some initialization, so there is some overhead. Moving the initialization to WriteStart and allowing the writes to be queued would reduce the overhead a bit.

greiman commented 1 year ago

Looks good for write. I expected read to be difficult.

> In terms of the reading part I am a bit uncertain what the right approach would be, as the current implementation writes the received data directly to the dst buffer using DMA. When introducing a FIFO buffer in software it would write to that location, but it would mean that the data has to be transferred from the software FIFO to the dst buffer.

I was also thinking a software FIFO would be a possible solution for read.

I expect memcpy will be needed for read and write. Users read/write data that is not 32-bit aligned. Also unless all read/write calls are for a multiple of four bytes it is possible that part of the transfer completes a sector in the cache and the remainder is not 32-bit aligned.

This is not a problem with NXP since Teensy is Cortex M4 or M7. I can transfer data in 32-bit chunks between the FIFO and user buffers in a loop. At 600 MHz this takes about 5 μs.
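For a core that cannot do unaligned word stores (the RP2040's Cortex-M0+ cores are in that group), the same FIFO-to-buffer loop can be written with a 4-byte memcpy per word. This is a generic sketch, not the Teensy driver code: read_word stands in for reading a hardware FIFO register and is an assumption of the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy n_words 32-bit words from a FIFO-like source into a byte buffer that
// may not be 32-bit aligned. The 4-byte memcpy avoids unaligned word stores.
template <typename ReadWord>
void fifo_to_buffer(ReadWord read_word, std::uint8_t* dst, std::size_t n_words) {
    for (std::size_t i = 0; i < n_words; ++i) {
        const std::uint32_t w = read_word();
        std::memcpy(dst + 4 * i, &w, sizeof w);
    }
}
```

A full 512-byte sector is 128 words, so the loop body runs 128 times per block.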

> Can I ask which logic analyzer you are using?

I recently bought a high-end Saleae Logic Pro 8. It can do 500 MS/s digital with 100 MHz bandwidth. Expensive, but I really like it. I have a 200 MHz 2 GS/s mixed-signal scope, but I rarely use it after buying the Saleae.

> Do you know if the NXP MCU stops the clock when the read FIFO gets full, or does it just stop the transfer?

The NXP has a 512 byte FIFO and stops the clock at the end of a block. Really simple and reliable to use.

The numbers you are getting look good. I will offer both SPI and SDIO on RP2040. For most apps SPI will be fine.

After some thought I decided that even if SDIO only offers high speed for big transfers that are 32-bit aligned it will be valuable for sophisticated users.

Some users just pre-allocate a contiguous file and write by doing raw writes. I make it easy with this member function:

 /** \return Pointer to SD card object. */
  SdCard* card() { return m_card; }

They then can do calls like this:

  SdFs sd;
  ...
  sd.card()->writeSectors(sector, buf, n);

LinusHeu commented 1 year ago

> After some thought I decided that even if SDIO only offers high speed for big transfers that are 32-bit aligned it will be valuable for sophisticated users.

I'm not completely sure, but I think I have an application like that. For audio, I always read 1024 bytes at a time (but the exact size could be reconfigured). Or does that not apply when using the Arduino SdFat & File classes?

Thank you so much for looking into SDIO!

PetteriAimonen commented 1 year ago

Just back from vacation, so I haven't taken a deep look into what was written above.

But this branch may also be of interest, raising SDIO clock rate to 42 MHz: https://github.com/ZuluSCSI/ZuluSCSI-firmware/tree/rp2040_highspeed_sdio It still lacks negotiation with the card about which clock rates are usable.

In ZuluSCSI there are other factors currently limiting speed to around 10 MB/s, so I haven't focused on improving SDIO performance further.

juico commented 1 year ago

> But this branch may also be of interest, raising SDIO clock rate to 42 MHz: https://github.com/ZuluSCSI/ZuluSCSI-firmware/tree/rp2040_highspeed_sdio It still lacks negotiation with the card about which clock rates are usable.

I have already implemented something like that in: https://github.com/juico/pico-sdio-example/blob/78bbb70ba843d1cb6095a4c64a30250720d1bb26/src/sdio/sd_card_sdio.cpp#L267 although I am actually running it a bit above 50 MHz.

Also, I tried adding the code to get the cache working (from the cache branch), but I get the feeling that writing to the extended registers does not work completely; perhaps I messed something up myself. The bit enabling the cache reads back as 0 after writing a 1 to it, which is a bit weird. Did you end up getting the cache to work?

> In ZuluSCSI there are other factors currently limiting speed to around 10 MB/s, so I haven't focused on improving SDIO performance further.

I already assumed something like that. But it seems for writing the solution can be quite simple and does yield lower variations in write latency for the bigger files and faster write speeds when smaller chunks are written.

The branch https://github.com/juico/pico-sdio-example/tree/fast_test shows the infinite write mode without specifying how many bytes are to be written. It does not pre-erase the sectors, but it seems that this does not matter that much in terms of write speed. It is perhaps not the neatest implementation, but hopefully you can see the mechanism.

greiman commented 1 year ago

> It does not pre-erase the sectors, but it seems that this does not matter that much in terms of write speed.

Pre-erase is not used in modern cards. See my post above about AUs and RUs. AUs are not related to file system allocation. AUs are about flash management, the mapping of emulated blocks to physical flash, and wear leveling.

Modern cards maintain a pre-erased cache of AUs. AUs can be as large as 64 MB in large SDXC cards. Understanding modern card flash management is important to get performance.

Small writes cause incredible amounts of data copying and flash wear. Small reads cause excessive re-reads of huge flash pages into internal RAM buffers in the card. This depends on the card's buffering strategy. Some cards define different strategies by AU. If small writes were used, on read the card will optimize for small reads. The amount of RAM buffering and cache policy varies by card class/product.

I am now trying to get a basic SDIO driver working with this Arduino board package. I just took files from this repository and got them to compile with SdFat. That was simple, but when I tried to init an SD it was not reliable. I think it is a clocking problem. I need to look with a logic analyzer. I just tried the default 133 MHz CPU speed.

I plan to start over with just code from this repository for a test with the board package so I don't mix in any SdFat code. I will then try to understand any problems.

I need to decide which GPIOs to use with Earle Philhower's Arduino Pico package. I probably should offer options.

Any suggestions which GPIOs to use?

PetteriAimonen commented 1 year ago

> Also, I tried adding the code to get the cache working (from the cache branch), but I get the feeling that writing to the extended registers does not work completely; perhaps I messed something up myself. The bit enabling the cache reads back as 0 after writing a 1 to it, which is a bit weird. Did you end up getting the cache to work?

Only A2 class SD cards support the SD-card cache. But it didn't seem to help much for write performance in my tests.

greiman commented 1 year ago

> Only A2 class SD cards support the SD-card cache. But it didn't seem to help much for write performance in my tests.

All cards have internal RAM buffers and an internal cache policy. A2 cards expose an API.

I worked with Teensy developers to find the best SDs for multi-track audio recording. At the time CANVAS Go! Plus 256GB provided the best performance. I just let the card use its default management.

This card is capable of recording 16 streams to 16 open files with a max latency of 2022 μs.

nf: 16 maxLat: 2022 total KB/sec: 17458.01 file KB/sec: 1091.13

What you are doing is fairly simple, so just using big transfers and, if possible, writing the file as a single multi-block transfer should work with most cards.

Actually the transfer size doesn't matter much. It's the infinite multi-block write that matters. I just put the card in write mode and write GB size files as a single transfer.

The Teensy Audio library is an impressive accomplishment - pro audio at low cost.

It has graphical design tools, so you just draw your recording setup and it generates the code.

greiman commented 1 year ago

Here is single file performance for 512-byte write with a CANVAS Go! Plus 256GB clocked at 100 MHz:

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
40647.80,102,11,11
40980.98,103,11,11

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
42370.17,237,11,11
43100.69,237,11,11

juico commented 1 year ago

> Any suggestions which GPIOs to use?

The only crucial thing about the GPIOs is that the D0-D3 pins are mapped to ascending, neighboring GPIO pins, as this is required for the way the PIO operates when sending parallel data. Not sure if you are taking the code from my repo or directly from ZuluSCSI, but I applied a pull-down to the clock pin, which could be a problem in your case. Sometimes it helps to add some additional sleep commands in the initialization. Not sure if you are familiar with the pico-sdk, as you could just compile my code and play around with it.
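The pin constraint is easy to check in code before configuring the PIO. This is a small sketch with a hypothetical helper name. It assumes the RP2040's user GPIO range of 0 to 29, and the pin numbers used with it are examples only.

```cpp
#include <array>
#include <cstdint>

// True if D0..D3 sit on four consecutive, ascending GPIO numbers within
// the RP2040 user GPIO range (GPIO0 to GPIO29).
constexpr bool data_pins_ok(const std::array<std::uint8_t, 4>& d) {
    for (unsigned i = 1; i < 4; ++i) {
        if (static_cast<unsigned>(d[i]) != static_cast<unsigned>(d[0]) + i) {
            return false;
        }
    }
    return d[3] <= 29;
}
```

For example, D0-D3 on GPIO 18, 19, 20, 21 passes, while 18, 19, 21, 22 or a descending order does not.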

> Only A2 class SD cards support the SD-card cache. But it didn't seem to help much for write performance in my tests.

Yes, I am using an A2 card and the registers show that the cache can be enabled. But after enabling the cache by writing a 1 to byte[260] of the performance register, it does not seem to be enabled. As I copied the code from the cache branch of ZuluSCSI, I was just wondering whether the cache showed as enabled in your case, as you also didn't notice any performance increase.

This is what I see during startup:

SD card cache support: 1, command queue support: 31
SD card cache state: 0

I don't expect a lot of improvement in throughput with the cache enabled, but for my application I would like to get the data off the RP2040 as quickly as possible. So I am mostly worried about sudden latency spikes during writes that might cause the buffer capturing the data to fill up, as stopping the acquisition is not really possible.

> What you are doing is fairly simple, so just using big transfers and, if possible, writing the file as a single multi-block transfer should work with most cards.

At the moment it seems like the current solution works, so I will not really focus on getting the cache working, but I was just curious.

greiman commented 1 year ago

> As I copied the code from the cache branch of ZuluSCSI, I was just wondering whether the cache showed as enabled in your case, as you also didn't notice any performance increase.

When I discovered how adaptive high-end cards are, I just let the card decide. Did you see above what the Canvas Go can do? Over 40 MB/sec with 512-byte transfers. That is with only the internal SdFat cache and a 512-byte user buffer.

> At the moment it seems like the current solution works, so I will not really focus on getting the cache working, but I was just curious.

I think I will just offer the DMA solution that I can put together for existing code.

I will also look into a higher SPI clock. The board package currently limits the rate to less than 25 MHz. On other boards I can get over 4 MB/sec with SPI.

> Not sure if you are familiar with the pico-sdk, as you could just compile my code and play around with it.

Yes that's how I started using Pico before there was Arduino support.

juico commented 1 year ago

> Here is single file performance for 512-byte write with a CANVAS Go! Plus 256GB clocked at 100 MHz:
>
> FILE_SIZE_MB = 5
> BUF_SIZE = 512 bytes
> Starting write test, please wait.
>
> write speed and latency
> speed,max,min,avg
> KB/Sec,usec,usec,usec
> 40647.80,102,11,11
> 40980.98,103,11,11
>
> read speed and latency
> speed,max,min,avg
> KB/Sec,usec,usec,usec
> 42370.17,237,11,11
> 43100.69,237,11,11

Those are some impressive speeds you are getting. I have not been able to push past 67.5 MHz, which resulted in around 29 MB/s. Hopefully the RP2040 can reach 100 MHz on SDIO; at the moment I am probably limited by my breadboard setup. For large transfers I do seem to have a large latency spike, probably at the end of the write, but that is not really an issue.

Picture of the breadboard: [Image: PXL_20230802_164154938]

greiman commented 1 year ago

> Picture of the breadboard:

Thanks that helps. That's the config I was setting up with the same GPIOs for my next test. Same short wire to the SD.

I am definitely going to play with this for a while before releasing anything for Arduino. Too many people use SdFat and Pico. I would be overwhelmed with issues if it isn't solid for beginners.

Here is proof that card policy matters.

greiman commented 1 year ago

One last thing: a counterfeit card kills performance. Here is a bad Evo Select card at 100 MHz:

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
9651.89,53198,11,48
8976.09,59161,11,51

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
22725.82,36703,11,22
22220.80,38423,11,22

Here is a $10 real Evo Select. For one file almost any card works.

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
39062.50,6502,11,12
39062.50,6482,11,12

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
42735.04,175,11,11
42735.04,175,11,11

greiman commented 1 year ago

@juico

When you switch to High Speed mode, do you change how you read card output? I have always thought the change in valid time was strange.

Default speed:

[Image: DefaultSpeed timing diagram]

High Speed:

[Image: HighSpeed timing diagram]

Clock is low for most of the Card Output valid.

juico commented 1 year ago

I leave the timing the same. I tried switching the PIO code when going into high-speed mode, but I removed it, as with the default timings it seems to work fast enough. Maybe with some better measurements the timings could be improved a bit. From the timing diagrams one would say to read the data a bit earlier in high-speed mode. If you could get it working, it would be nice to see the difference in card output on the logic analyzer.

greiman commented 1 year ago

@juico

Your comments and examples have been very helpful.

I am making progress with SDIO and Earle Philhower's Arduino board package. I see why it gets 1.3k stars. There have been over 90 contributors to the repo. I think it is mostly used with PlatformIO. So you get existing Arduino libraries, Pico SDK features, and PlatformIO.

The first SDIO break-out I tried had weak pull-ups and the code I was using had the default internal pull-downs enabled. So the lines were at about 1.3V and noise caused failures.

I changed to weak internal pull-ups. I am testing with three popular SD breakouts to make sure it will work for most users. They vary from no pull-ups to 10K pull-ups.
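The roughly 1.3 V level described above is what a resistive divider would produce when a weak external pull-up fights an internal pull-down. The sketch below shows the arithmetic only. The resistor values are hypothetical and chosen for illustration, since the thread does not give the actual values for that breakout or for the internal pull-downs.

```cpp
// Voltage at a signal line that is pulled up to vcc through r_up and
// pulled down to ground through r_down (simple two-resistor divider).
constexpr double divider_voltage(double vcc, double r_up, double r_down) {
    return vcc * r_down / (r_up + r_down);
}

// Hypothetical example values, not measurements from the thread:
// a 100 kOhm pull-up on the breakout and a 65 kOhm internal pull-down.
constexpr double kExampleVcc = 3.3;
constexpr double kExamplePullUpOhms = 100e3;
constexpr double kExamplePullDownOhms = 65e3;
constexpr double kExampleLineVolts =
    divider_voltage(kExampleVcc, kExamplePullUpOhms, kExamplePullDownOhms);
```

With values in that range the idle line sits near mid-rail instead of near 3.3 V, so a small amount of noise can cross the input threshold. Switching the internal pulls to pull-ups removes the divider.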

I will need to offer a number of pin configurations. The above package supports over 50 boards. I will pick a few popular boards and try to make adapting simple. pioasm is in the package.

juico commented 1 year ago

@greiman

You might also want to check out https://github.com/carlk3/no-OS-FatFS-SD-SPI-RPi-Pico/tree/sdio which is also based on the ZuluSCSI SDIO code base but did release the code as a standalone library.

I have messed around a bit with disabling the clock during read commands. At the moment it does not use a FIFO or anything; it just starts the clock again when it needs more data. I can't say it is working well, as the timings are probably messed up. It somehow works when I print some text after the clock is disabled. Without the print it fails with a checksum error, so it does receive some data, just not the correct data. I don't have the proper gear at the moment to measure the signal timing, so perhaps you could check it out if you want to.

As the print function itself consumes a lot of time, I can't really test whether performance is increased. You can find the code in the https://github.com/juico/pico-sdio-example/tree/clock_test tree.

greiman commented 1 year ago

@juico

I downloaded your clock_test tree and will look at it.

I am now doing tests to see what constraints the Arduino board support package presents. I also need to refresh my understanding of the RP2040 SDK. I started using the RP2040 with the SDK as soon as it was released and was one of the people to discover this fault in the ADC INL. The ENOB is about 8.6, which ruled it out for my project.

[image: RP2040ADC]

I am also considering PIO SPI. There is a hard limit of 24 MHz with the board support package, since SPI is shared.

If I could get close to 50 MHz PIO SPI, I would have over 4 MB/sec for small read/write transfers.

I am also looking for any RP2040 boards with built-in SDIO. SparkFun has one but the wiring is wrong for PIO. I am on the list to get this Adafruit board. The Adafruit board looks good for Arduino users who started with Uno and need more performance.

I want to rewrite the basic low level PIO for SDIO to look more like the hardware SDIO controllers I have used on STM32, NXP and other chips.

If I can get PIO SPI to go fast, I may just limit SDIO to reliable large transfers, at least for a first release.

greiman commented 1 year ago

@juico

I got a version of full-duplex PIO SPI to work fairly fast but I don't need full-duplex.

I found this, so I may be able to get more than 50 MHz write for an SD in SPI mode.

; This is just a simple clocked serial TX. At 125 MHz system clock we can
; sustain up to 62.5 Mbps.
; Data on OUT pin 0
; Clock on side-set pin 0

.wrap_target
    out pins, 1   side 0 ; stall here if no data (clock low)
    nop           side 1
.wrap
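
The 62.5 Mbps figure in the comment follows from the loop using two single-cycle instructions per bit. A minimal sketch of that arithmetic, assuming a state-machine clock divider of 1 and no extra delay cycles (the function names are made up for illustration):

```cpp
#include <cstdint>

// Bit rate of a PIO loop that spends cycles_per_bit state-machine cycles
// on every bit, with the state machine running at the system clock.
constexpr uint32_t pio_bit_rate_hz(uint32_t sys_clk_hz, uint32_t cycles_per_bit) {
    return sys_clk_hz / cycles_per_bit;
}

// Upper bound on payload bytes per second for a 1-bit serial link,
// ignoring command overhead, CRC and card busy time.
constexpr uint32_t serial_bytes_per_sec(uint32_t bit_rate_hz) {
    return bit_rate_hz / 8u;
}
```

For a 125 MHz system clock and the two-instruction loop above, this gives 62.5 Mbps, or about 7.8 MB/sec as a ceiling before protocol overhead.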

greiman commented 1 year ago

@juico

I finished the PIO SPI and get good performance for 512 byte buffers.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
7173.60,91,70,71
7173.60,91,70,71

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
7142.86,108,70,71
7173.60,94,70,71

I am now writing new SDIO PIO code. I decided not to use DMA so I can overlap transfer with CRC calculation. I must allow for stalls to prevent FIFO overruns.

Here is the time to receive and checksum the MBR sector of an SD.

[image: SdioRx]

512 bytes in 17.3 μs is about 29 MB/sec so I should get very good speed for small buffers.
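As a rough check of that estimate, the sketch below computes the throughput from the measured time and compares it with the time a 4-bit data block needs on the wire. The 62.5 MHz SCK used in the comparison is an assumption (it is the SD clock mentioned in the opening post, not a value stated for this measurement), and the function names are illustrative.

```cpp
#include <cstdint>

// Payload throughput in MB/sec (1 MB = 1e6 bytes). Bytes per microsecond
// is numerically the same as millions of bytes per second.
constexpr double throughput_mb_per_s(double bytes, double usec) {
    return bytes / usec;
}

// SCK clocks for one data block in 4-bit mode:
// start bit + 2 clocks per byte + 16 CRC clocks + end bit.
constexpr uint32_t block_clocks_4bit(uint32_t bytes) {
    return 1u + 2u * bytes + 16u + 1u;
}

// Wire time in microseconds for one block at the given SCK in MHz.
constexpr double block_time_us(uint32_t bytes, double sck_mhz) {
    return block_clocks_4bit(bytes) / sck_mhz;
}
```

With these assumptions a 512 byte block takes 1042 clocks, or about 16.7 μs at 62.5 MHz, so a measured 17.3 μs would leave little time that is not spent on the wire.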

Here are the state machines:

.program rx_data
    wait 1 irq RX_IRQ
.wrap_target
    wait 0 gpio SCK_PIN
    wait 1 gpio SCK_PIN
    in pins, 4 
.wrap

.program rx_sck
.side_set 1
wait_d0:
    nop              side 0 [5]
    jmp PIN wait_d0  side 1 [3]
    irq RX_IRQ       side 0    
.wrap_target
    out null, 1      side 0 [1]  ; Stall if done or read is too slow.
    nop              side 1 [1]
.wrap

Here is the main loop for read sector:

uint64_t crc = 0;
for (uint i = 0; i < N-M; i++) {
  // Wait for the next 32-bit word; give up once the deadline m has passed.
  while (*fstat & rxFifoEmpty) {
    if (micros() > m) {
      Serial.println("rxFifo timeout bug");
      return false;
    }
  }
  uint32_t tmp = *rxFifo;
  *txFifo = 0XFF;  // enable SCK for another 32-bit word.
  buf[i] = __builtin_bswap32(tmp);  // maybe return bytes instead of words.
  crc = crc16(crc, tmp);  // update the checksum while the next word is clocked in.
}

I should get close to 25 MB/sec for 512 byte buffers.
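For reference, in 4-bit mode each DAT line carries its own CRC16 (polynomial x^16 + x^12 + x^5 + 1, initial value 0) over the bits sent on that line. The sketch below is a slow bit-serial reference for those four values, useful for checking a faster routine. It is not the `crc16()` used in the loop above, whose implementation is not shown in this thread, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Advance a CRC16 (polynomial 0x1021, MSB first) by one input bit.
static uint16_t crc16_step(uint16_t crc, unsigned bit) {
    unsigned feedback = ((crc >> 15) & 1u) ^ (bit & 1u);
    crc = static_cast<uint16_t>(crc << 1);
    return feedback ? static_cast<uint16_t>(crc ^ 0x1021u) : crc;
}

// Compute the four per-line CRC16 values for a block sent in 4-bit mode.
// out[0] belongs to DAT0 and out[3] to DAT3. All CRCs start at zero.
void sdio_crc16_per_line(const uint8_t *data, size_t len, uint16_t out[4]) {
    for (int line = 0; line < 4; ++line) {
        out[line] = 0;
    }
    for (size_t i = 0; i < len; ++i) {
        const unsigned byte = data[i];
        // The high nibble is clocked out first, then the low nibble.
        const unsigned nibbles[2] = { byte >> 4, byte & 0x0Fu };
        for (unsigned nibble : nibbles) {
            // Bit 0 of the nibble travels on DAT0, bit 3 on DAT3.
            for (int line = 0; line < 4; ++line) {
                out[line] = crc16_step(out[line], (nibble >> line) & 1u);
            }
        }
    }
}
```

A routine that processes a whole 32-bit word per step, as the loop above does, has to produce the same four values packed into its 64-bit accumulator, although the packing order depends on that implementation.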

juico commented 11 months ago

@greiman That looks good for the SPI, probably more than enough for most use cases. Have you been able to get your hands on the new Adafruit board yet?

The PIO version seems promising as well, with it automatically stopping the clock. Not using DMA would allow for an easier overlap of the CRC calculation, but I wonder if the slight advantage is worth keeping the CPU busy while reading. The overlapping CRC calculation could also be triggered by the end of a DMA transfer. I haven't checked how many cycles it costs the RP2040 to calculate the CRC, but if it is fast enough it could be started after the data is transferred and be finished before the end of the CRC transfer. I haven't had any time lately to look into it as I am writing my master's thesis, but I hope to have some more time in a few weeks.

greiman commented 11 months ago

I have an Adafruit Metro but have not tried it yet. I have been using a Pico for development. I may try the Metro soon.

I have a first version of SDIO working but wasted a lot of time looking for a timing problem. I tested over 50 SD cards and found three that were flaky. Finally I realized they were the only cards I have with proprietary DDR208 clocking.

I looked at the errors with a logic analyzer and discovered the slightest spike on SCK caused the card to clock out an extra nibble on read.

I have the wires piled together on a breadboard and data-out from the card may be putting spikes on SCK.

I will try the Metro and try to make a Pico setup that allows the UHS-I DDR208 cards to run.

Here are first results at 250 MHz.

For 512 byte transfers:

Type is FAT32
Card size: 32.01 GB (GB = 1E9 bytes)

Manufacturer ID: 0X1B
OEM ID: SM
Product: 00000
Revision: 1.0
Serial number: 0XE30F5501
Manufacturing date: 10/2015

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
23474.18,3187,20,21
23923.45,63,20,20

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
23809.52,712,20,21
23696.68,1258,20,21

For 8192 byte transfers:

Type is FAT32
Card size: 32.01 GB (GB = 1E9 bytes)

Manufacturer ID: 0X1B
OEM ID: SM
Product: 00000
Revision: 1.0
Serial number: 0XE30F5501
Manufacturing date: 10/2015

FILE_SIZE_MB = 5
BUF_SIZE = 8192 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
25773.20,344,306,315
25641.03,344,306,315

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
25125.63,1920,320,325
25380.71,458,320,323

greiman commented 11 months ago

I tried the Metro and it works with SDIO. The bad news is my board won't overclock at 175 MHz. The USB COM port doesn't even show up after load and reset.

[image: metro]

The COM port shows up at 200 MHz but the program fails to run.

It will run at 150 MHz with this result:

Type is FAT32
Card size: 31.44 GB (GB = 1E9 bytes)

Manufacturer ID: 0X1B
OEM ID: SM
Product: 00000
Revision: 1.0
Serial number: 0XA4820278
Manufacturing date: 5/2014

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
14326.65,7045,33,35
14285.71,7047,33,35

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
14326.65,163,34,35
14285.71,163,34,35

Hope I just have a board with a poor RP2040.

earlephilhower commented 11 months ago

> Hope I just have a board with a poor RP2040.

This might just be a flash QSPI speed problem. You can try changing the flash boot stage 2 to SPI/4 if you use the Generic RP2040 board option.

greiman commented 11 months ago

@juico

I am slowly making progress. I have been testing many apps and about 60 SD cards.

I finally have most DDR208 cards working. I will try DMA to see if it is more reliable. These cards seem to be very sensitive to noise on SCK. They are extremely fast so I suspect they see glitches in clock when an occasional stall happens.

Here is a result with a DDR208 card that, in DDR208 mode, is capable of 180 MB/sec read and 130 MB/sec write.

Type is exFAT
Card size: 250.40 GB (GB = 1E9 bytes)

Manufacturer ID: 0XAD
OEM ID: LS
Product: USD00
Revision: 1.0
Serial number: 0X35744F7B
Manufacturing date: 10/2022

FILE_SIZE_MB = 50
BUF_SIZE = 32768 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
26822.97,15084,1209,1220
26736.87,19011,1209,1223

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
26953.18,1918,1212,1215
26938.65,1916,1213,1215