carlk3 / no-OS-FatFS-SD-SPI-RPi-Pico

A FAT filesystem with SPI driver for SD card on Raspberry Pi Pico
Apache License 2.0
272 stars 47 forks source link

Suggestion for calculating CRC/etc #63

Open callous4567 opened 1 year ago

callous4567 commented 1 year ago

Hi;

This is just a suggestion regarding the calculation of CRC values in the library (I'm unsure as to whether this has been implemented in the SDIO port you produced but... anyway, I digress;)

The manual calculation of the CRC by the library could be replaced by (what I assume, at least) the hardware implementation the RP2040 provides- in my current use case I've changed spi_transfer (explicitly for TX-purposes- I'm strictly writing) via

// Enable the sniffer dma_sniffer_enable( pSPI->tx_dma, 0x02, true ); channel_config_set_sniff_enable( &pSPI->tx_dma_cfg, true ); dma_hw->sniff_data = 0; // Get CRC *crc = dma_hw->sniff_data;

I'm currently trying to make a specific case of using "full hardware" for multiple block write to minimize power use and have just come across this method for hardware evaluation of the CRC- thought it useful.

Thanks again for the library!

(Also- if you remember the previous topic I had here- I got my SPI up to 50 MHz! It writes crazy fast now ^_^)

-Sebastian

carlk3 commented 1 year ago

Interesting idea! I now know of three ways to do the CRC:

  1. Algorithm in software
  2. Lookup table (this is what is currently used)
  3. DMA sniffer

It would be interesting to do a comparison of the three methods. I've had my doubts about whether 1 or 2 is faster on the RP2040, and 2 takes significantly more memory. 3 is probably hard to beat for both speed and memory consumption.

Are you looking at https://github.com/raspberrypi/pico-extras/blob/master/src/rp2_common/pico_sd_card/sd_card.c? Are you doing any specific configuration of the sniffer?

callous4567 commented 1 year ago

I never noticed that page- I was skulking around the code trying to turn anything and everything I could into DMA chains when I hit that CRC roadblock and found the sniffer in the SDK- useful to see it used by them though!

At the moment I've just done this-


// With DMA sniffy sniff 
bool crc_spi_transfer(
    spi_t *pSPI, 
    const uint8_t *tx,
    uint8_t *rx, 
    uint16_t *crc, 
    size_t length) {

    myASSERT(tx || rx);

    // tx write increment is already false
    if (tx) {
        channel_config_set_read_increment(&pSPI->tx_dma_cfg, true);
        // Enable the sniffer
        dma_sniffer_enable(
            pSPI->tx_dma,
            0x02,
            true
        );
        channel_config_set_sniff_enable(
            &pSPI->tx_dma_cfg,
            true
        );
        dma_hw->sniff_data = 0;
    } else { // NULL=FALSE
        static const uint8_t dummy = SPI_FILL_CHAR;
        tx = &dummy;
        channel_config_set_read_increment(&pSPI->tx_dma_cfg, false);
    }

    // rx read increment is already false
    if (rx) {
        // Enable the sniffer
        dma_sniffer_enable(
            pSPI->rx_dma,
            0x02,
            true
        );
        channel_config_set_sniff_enable(
            &pSPI->rx_dma_cfg,
            true
        );
        dma_hw->sniff_data = 0;
        channel_config_set_write_increment(&pSPI->rx_dma_cfg, true);
    } else {
        static uint8_t dummy = 0xA5;
        rx = &dummy;
        channel_config_set_write_increment(&pSPI->rx_dma_cfg, false);
    }

    // Clear the interrupt request.
    dma_hw->ints0 = 1u << pSPI->rx_dma;

    // Launch tx
    dma_channel_configure(pSPI->tx_dma, &pSPI->tx_dma_cfg,
                        &spi_get_hw(pSPI->hw_inst)->dr,  // write address
                        tx,                              // read address
                        length,  // element count (each element is of
                                // size transfer_data_size)
                        false);  // start
    dma_channel_configure(pSPI->rx_dma, &pSPI->rx_dma_cfg,
                        rx,                              // write address
                        &spi_get_hw(pSPI->hw_inst)->dr,  // read address
                        length,  // element count (each element is of
                                // size transfer_data_size)
                        false);  // start

    // start them exactly simultaneously to avoid races (in extreme cases
    // the FIFO could overflow)
    dma_start_channel_mask((1u << pSPI->tx_dma) | (1u << pSPI->rx_dma));

    // sleep appropriate time 
    sleep_us(0.5*BLOCK_SAMPLE_TIME_US);

    /* Timeout 1 sec */
    uint32_t timeOut = 1000;
    /* Wait until master completes transfer or time out has occured. */
    bool rc = sem_acquire_timeout_ms(
        &pSPI->sem, timeOut);  // Wait for notification from ISR
    if (!rc) {
        // If the timeout is reached the function will return false
        DBG_PRINTF("Notification wait timed out in %s\n", __FUNCTION__);
        return false;
    }

    // Shouldn't be necessary:
    dma_channel_wait_for_finish_blocking(pSPI->tx_dma);
    dma_channel_wait_for_finish_blocking(pSPI->rx_dma);

    myASSERT(!dma_channel_is_busy(pSPI->tx_dma));
    myASSERT(!dma_channel_is_busy(pSPI->rx_dma));

    // Get CRC
    *crc = dma_hw->sniff_data;

    return true;

}

as a substitute for the spi_transfer function call in spi.c; it hasn't failed me yet (I did a 12 hour test run of writes with the Pico entering dormancy between recording sessions/etc) so clearly the CRC's it's generating are fine- 0x02 sets CRC to CRC16-CCITT (there are others available, too.)

I only put the bit of code under the RX area too because I have no clue if the library would need to use CRCs in that case- it might not be correct to do so (I really just need performance for long write sessions so the RX is inconsequential and I've not even bothered testing it at all...)

It'd be cool to know if you have any ideas on "fully DMA'ing" multiple-block write- at the moment I can see that a block is TX'd, then an RX needs to compare the CRC and get the response- I can't see any way on the Pico to non-CPU compare the CRC's and fire off interrupts if they're not identical. I had a thought of getting an external logic AND gate + using PIO for automatic CRC comparison (where it would fire an IRQ if the AND doesn't work to logic 1) but honestly I don't know about how critical that "response" bit is- and then there's the issue of getting an AND gate :(

carlk3 commented 11 months ago

Actually, CRC is optional for SPI. You should be able to turn it off by putting something like

add_compile_definitions(SD_CRC_ENABLED=0)

in CMakeLists.txt.

Anyway, I looked into using the DMA Sniffer. The Sniffer is a single, global resource. So, using that would reduce opportunities for concurrency. For example, if I have two SD cards on two SPIs (or an SPI and an SDIO) that are trying to read or write simultaneously, only one can use the Sniffer. So, the other will have to wait (or fall back to software CRC calculation). So, I'd need to have some kind of mutual exclusion lock. That's fine within the library, but what if some other DMA user in the system is using the Sniffer? There does not seem to be any sort of global reservation mechanism in the SDK for the Sniffer. This might seem irrelevant to what you're doing, but have you considered using multiple SD cards and striping the data across them? I'm not sure what that would do to power consumption.

Anyway, the ZuluSCSI-firmware SDIO code overlaps the CRC calculation with the DMA transfer, and I realized that the same idea could be applied to the SPI. I've been able to get a significant speedup by overlapping CRC calculation with the DMA transfer. The time to write a 512 byte block went from 377 us to 313 us. The DMA transfer of the block data takes about 244 us, but the CRC16 calculation takes only about 66 us, so there is plenty of extra time.

Here's what it looks like when writing multiple blocks:

Old code: image

New code with overlapped CRC calculation: image

With this change, for writing there is nothing to be gained by using the Sniffer to calculate the CRC unless the processor cores have something better to do during that 66 us than calculating the CRC for the block while the DMA is transferring it. For reading, one can't calculate the CRC until the data has been received, but for multi-block transfers the CRC check of a block can be delayed until the DMA completion wait time of the following block, so it is possible to overlap the processing for all but one block.

These are the results I'm getting now with system clock at default 125 MHz and SPI baud rate 20833333 Hz:

> bench 0:
Type is FAT32
Card size: 31.95 GB (GB = 1E9 bytes)

Manufacturer ID: 0x0
OEM ID:
Product: USD
Revision: 1.0
Serial number: 0x301e
Manufacturing date: 8/2022

FILE_SIZE_MB = 5
BUF_SIZE = 20480
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1299.7,79466,12675,15741
1500.1,34339,12669,13641

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1743.6,12907,11520,11743
1744.1,12562,11520,11740

Done

> big_file_test bf 20 3
Writing...
Elapsed seconds 15.9
Transfer rate 1.26 MiB/s, or 1290 KiB/s (1321 kB/s) (10570 kb/s)
Reading...
Elapsed seconds 14.8
Transfer rate 1.35 MiB/s, or 1381 KiB/s (1415 kB/s) (11317 kb/s)
carlk3 commented 11 months ago

I checked the changes into branch dev_sdio if you'd like to try them.