espressif / esp-hosted

Hosted Solution (Linux/MCU) with ESP32 (Wi-Fi + BT + BLE)
Other
695 stars 163 forks source link

About CS pin control in SPI mode #311

Closed Evlers closed 9 months ago

Evlers commented 9 months ago

About CS pin control in SPI mode

I am using the master branch as the driver and have finished porting the SPI interface. However, when the device executes iperf -c(that is, TCP TX), the ESP chip will display the following error log:

E (29296) SPI_DRIVER: rx_pkt len[32864]>max[1600], dropping it
E (29306) SPI_DRIVER: rx_pkt len[57307]>max[1600], dropping it
E (32166) SPI_DRIVER: process_spi_rx: cal_chksum[5086] != exp_chksum[46388], drop len[118] offset[13312]
E (32166) SPI_DRIVER: rx_pkt len[26881]>max[1600], dropping it
E (33686) SPI_DRIVER: rx_pkt len[20695]>max[1600], dropping it
E (33696) SPI_DRIVER: process_spi_rx: cal_chksum[57193] != exp_chksum[33635], drop len[995] offset[17187]
E (35126) SPI_DRIVER: rx_pkt len[27276]>max[1600], dropping it
E (35126) SPI_DRIVER: rx_pkt len[18502]>max[1600], dropping it
E (35126) SPI_DRIVER: rx_pkt len[6691]>max[1600], dropping it
E (36666) SPI_DRIVER: rx_pkt len[41646]>max[1600], dropping it
E (38166) SPI_DRIVER: rx_pkt len[64607]>max[1600], dropping it
E (38166) SPI_DRIVER: rx_pkt len[7966]>max[1600], dropping it
E (39686) SPI_DRIVER: rx_pkt len[53584]>max[1600], dropping it
E (39686) SPI_DRIVER: rx_pkt len[58981]>max[1600], dropping it
E (39686) SPI_DRIVER: rx_pkt len[5560]>max[1600], dropping it
E (41176) SPI_DRIVER: rx_pkt len[49681]>max[1600], dropping it
E (42656) SPI_DRIVER: rx_pkt len[12577]>max[1600], dropping it
E (42656) SPI_DRIVER: rx_pkt len[39577]>max[1600], dropping it
E (42666) SPI_DRIVER: rx_pkt len[58049]>max[1600], dropping it
E (44156) SPI_DRIVER: rx_pkt len[5120]>max[1600], dropping it
E (44186) SPI_DRIVER: rx_pkt len[44038]>max[1600], dropping it
E (44196) SPI_DRIVER: rx_pkt len[48897]>max[1600], dropping it
E (44196) SPI_DRIVER: rx_pkt len[11266]>max[1600], dropping it
E (45586) SPI_DRIVER: rx_pkt len[37376]>max[1600], dropping it
E (45586) SPI_DRIVER: rx_pkt len[12585]>max[1600], dropping it
E (47156) SPI_DRIVER: rx_pkt len[13766]>max[1600], dropping it
E (47166) SPI_DRIVER: rx_pkt len[31866]>max[1600], dropping it
E (47166) SPI_DRIVER: process_spi_rx: cal_chksum[63619] != exp_chksum[576], drop len[240] offset[57924]
E (48646) SPI_DRIVER: rx_pkt len[22446]>max[1600], dropping it
E (48646) SPI_DRIVER: rx_pkt len[35594]>max[1600], dropping it
E (48656) SPI_DRIVER: process_spi_rx: cal_chksum[30715] != exp_chksum[21323], drop len[168] offset[17155]

I think there's something wrong with the SPI transmission!

But before the SPI transmission, I first pull the CS pin low, and then delay 1ms to perform the spi transmission is much better. At the end of the SPI transmission, another 1ms delay to pull the CS pin high is completely fine!

Obviously, this is not feasible, and the network speed will slow down a lot.

Does this CS pin need to wake up wait time? But why can't the CS pin be released directly after the data transfer is complete!

mantriyogesh commented 9 months ago

There is very clearly something wrong at the SPI timing for host and slave spi transactions. Unless this is corrected, you will fall into packet loss issues.

Some checks:

  1. SPI mode should be used same at both host and slave
  2. Change SPI mode at both places to find out which spi mode suites you better.
  3. Are you using GPIO matrix instead of IOMUX pins (These pins depend highly on which ESP chipset being used)
    • If GPIO matrix is being used for SPI, Depending upon the ESP chipset used, there will be expected delays for the signal. Check TRM for that ESP chipset.
  4. Have you checked the porting_guide already?
  5. Very long or less quality wires used as jumper cables
  6. Boot strapping pins used

Typically, we suggest to get the SPI timings correct in between your host and slave. Also these issues generally can be clearly understood using the logic analyzer.

mantriyogesh commented 9 months ago

Also if you are using something like STM32 as host, you might want to double-check NSS is correct. We last time (some time ago) checked and found NSS is not working very correctly.

So instead of using NSS, we relied on the normal GPIO as CS. While doing so, we removed the NSS functionality of SPI CS for same GPIO pin number and used it as CS as manual GPIO.

mantriyogesh commented 9 months ago

Check SPI version 2 IOC: GPIO is normal GPIO and not NSS.

https://github.com/espressif/esp-hosted/blob/504210623ec632efae1d39ad2636e235485ab3d4/esp_hosted_fg/host/stm32/proj/spi/stm_spi_host_v2.ioc#L87-L91

GPIO is used as CS: https://github.com/espressif/esp-hosted/blob/504210623ec632efae1d39ad2636e235485ab3d4/esp_hosted_fg/host/stm32/driver/transport/spi/spi_drv.c#L553-L557

Evlers commented 9 months ago

Thanks for your guidance!

On the host side, I used manual GPIO instead of SPI NSS. And I also specially checked the ESP32 side code, which is consistent with the SPI mode used by the host, which is SPI_MODE_2. I also tried three other modes:

In addition, I used the default GPIO of the master branch for ESP32c3 and did not change to other ports, there should be no matrix problem. And our PCB line is within 3cm, although there is no isometric treatment, but the length is not far apart.

Our company logic analyzer is broken, borrow one tomorrow to analyze it again!

mantriyogesh commented 9 months ago

Screenshot_2024-01-10-15-03-57-815_com brave browser_beta-edit Slave side mode 0 is anyway should not be used as pointed in Porting guide.

mantriyogesh commented 9 months ago

https://docs.espressif.com/projects/esp-idf/en/latest/esp32c6/api-reference/peripherals/spi_slave.html#speed-and-timing-considerations

Although next link is for esp32 (not C6), but the concepts for timing are similar. https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/peripherals/spi_slave.html#restrictions-and-known-issues

mantriyogesh commented 9 months ago

The logic analyser will be helpful to understand which transitions are not falling correctly. In second link above there is explanation of timing as well.

Evlers commented 9 months ago

Okay, I'll take a look at the link you sent me. Thank you very much!

mantriyogesh commented 9 months ago
  1. Also have you waited till SPI transaction is complete,
    while( hspi1.State == HAL_SPI_STATE_BUSY ); 

    in host?

You can further use DMA for this. but anyway you need to make sure that the SPI is only initiated once prior transaction is complete.

  1. Are you getting HS down immediately in your slave code like this: https://github.com/espressif/esp-hosted/blob/504210623ec632efae1d39ad2636e235485ab3d4/esp_hosted_fg/esp/esp_driver/network_adapter/main/spi_slave_api.c#L545-L567

This will also avoid rare condition that explained in : https://github.com/espressif/esp-hosted/issues/308#issuecomment-1873944557 and https://github.com/espressif/esp-hosted/issues/308#issuecomment-1873957008

  1. Have you check the spi_drv.c from feature/esp_as_mcu_host branch? https://github.com/espressif/esp-hosted/blob/835a6670fdab619eb31ad44e1e29f8893e041437/host/drivers/transport/spi/spi_drv.c#L286-L383

Please note that the spi_bus_lock mutex and spi_trans_ready_sem semapore for synchronisation. I remember there was OOOSeq ( Out of order sequence packets) problem in master code, as Tx / SPI_transaction was allowed done through multiple threads. In the new branch we made sure that the transaction can only be trigger from one place, any place the transaction is to be triggered, was simply giving the semaphore.

mantriyogesh commented 9 months ago

We also allowed some extra transactions as dummy in end from either side, just to make sure that all valid buffers either side are drained completely.

https://github.com/espressif/esp-hosted/blob/feature/esp_as_mcu_host/host/drivers/transport/spi/spi_drv.c#L377-L378

I think code quality of https://github.com/espressif/esp-hosted/blob/feature/esp_as_mcu_host/host/drivers/transport/spi/spi_drv.c is better as it is evolved from issues and race conditions. Better to check logical diff for slave and host spi files from master to feature/esp_as_mcu_host to understand the changes.

From experience, I think the issue you face is definitely because of the timings mismatch in between host and slave. But at the same time, I am trying to also point some code changes and fixes we had added in recent branch, if it may probably solve issues.

Evlers commented 9 months ago

First of all, as can be seen from the spi_drv.c file, the transmission of spi is only carried out in one place, and is triggered by semaphore, there is no mutual exclusion problem.

The code I transplanted is in rt-thread_esp-hosted, could you please help me check if there is any problem with the logic

I am using GD32F42, and SPI transmission is waiting for completion:

/* Wait for transmission to complete */
while(!dma_flag_get(spi_device->dma.rx.periph, spi_device->dma.rx.channel, DMA_FLAG_FTF));
while(!dma_flag_get(spi_device->dma.tx.periph, spi_device->dma.tx.channel, DMA_FLAG_FTF));

External interrupts are also timely:

static void gpio_interrupt(void *args)
{
    /* Post semaphore to notify SPI slave is ready for next transaction */
    if (osSemaphore != NULL) {
        rt_sem_release(osSemaphore);
    }
}
Evlers commented 9 months ago

I am now trying to use rt-thread_esp-hosted on the ART-PI(STM32H750) development board.

Let's eliminate the GD32 driver issue!

mantriyogesh commented 9 months ago

@Evlers Definitely we can have look, but overall, to study in detail, will take some time,

I think, to less complicate and have only transport in picture for testing, We can focus on 'Raw throughput Tx and Rx test' , where dummy buffers are just passed to find out transport link throughout.

This way better to avoid networking and first concentrating base transport correctness.

The master doc is : https://github.com/espressif/esp-hosted/blob/master/esp_hosted_fg/docs/Linux_based_host/Raw_TP_Testing.md

For feature/esp_as_mcu_host, slave: https://github.com/espressif/esp-hosted/blob/feature/esp_as_mcu_host/slave/main/Kconfig.projbuild#L327-L350

host: https://github.com/espressif/esp-hosted/blob/835a6670fdab619eb31ad44e1e29f8893e041437/Kconfig#L474-L489

Code: slave: https://github.com/espressif/esp-hosted/blob/feature/esp_as_mcu_host/slave/main/stats.c host: https://github.com/espressif/esp-hosted/blob/feature/esp_as_mcu_host/host/utils/stats.c check for symbol TEST_RAW_TP

mantriyogesh commented 9 months ago

I honestly tried not to complicate and was not yet giving you the intermediate branch patch, as it would then introduce three branches, 1. master, 2. intermediate branch, 3. feature/esp_as_mcu_host branch and hell lot of confusion with all these.

intermediate branch I will provide as commit. Please note following things:

  1. We have tested & verified lwip on this intermediate branch with stm32 (Can check and tell which exact hardwares if needed)
  2. These patches / intermediate branch has no future (at least in near future)
  3. these are better than master but still behind feature/esp_as_mcu_host
  4. This branch is not actively updated / supported. Changes in this will move to feature/esp_as_mcu_host in long future.

Base Hosted master should point to: 4840528810457f393e0e65fe2bb1442dcb6dbc10

On top, apply git patches, using git am ./patches/*.patch patches_over_4840528810457f393e0e65fe2bb1442dcb6dbc10.tgz

This branch is however tested on stm32 and can import fixes from feature/esp_as_mcu_host easily (if need be) That is why shared this.

mantriyogesh commented 9 months ago

However, I still think you should get logic analyzer reading first, to find out anything obvious that you can fix. I still not want you to get confused over three branches confusion.

Evlers commented 9 months ago

Yeah, too many branch let me very confusing!

The master branch has been ported so we can improve it based on this. Of course, I will refer to the patch you sent on the way to repair, thank you!!

next, then I will try other branches to test.

In addition, I don't think I need to delete my lwip code, because the code of the network layer has been tested on Infineon's WiFi

This error has been very clear is the problem of SPI transmission, I try to solve it first, wait for my good news!

mantriyogesh commented 9 months ago

two places of check_and_execute_spi_transaction() is not good ( your branch, similar to master) better to refer feature/esp_as_mcu_host -> spi_drv.c how it was removed.

Evlers commented 9 months ago

Found the bug!

The mutex needs to be adjusted as shown in the figure below: image

I'm guessing because handshake pin was read ahead of time. ESP32 can still transmit data while reading the pin (there is an empty place in the queue), the pin state is low level (ready). The mutual exclusion wait is entered after reading the pin status.

But at this point, the task that gets the mutex is sending data!! After data is sent, the EPS32 queue cache is full and the handshake pin is high. At this point, the task that has completed the data transfer releases the mutex lock The pin state is read early due to the task waiting for the mutex (ready state), leading to the misconception that the ESP32 was in ready state, and initiated the data transfer!!!

mantriyogesh commented 9 months ago

May be..

feature/esp_as_host_mcu has https://github.com/espressif/esp-hosted/blob/835a6670fdab619eb31ad44e1e29f8893e041437/host/drivers/transport/spi/spi_drv.c#L300-L380

Evlers commented 9 months ago

Yes, it looks like master branch is way behind the feature/esp_as_host_mcu branch

Now I have tested the rate through iperf, it is only about 7.5Mbps, is this the only throughput of the master branch? My host spi works at 25MHz.

mantriyogesh commented 9 months ago

The data rate depends on

  1. how much clock alive on the bus (gaps in between two clocks)
  2. Actual clock frequency (not the requested)
Evlers commented 9 months ago

I actually measured the clock signal with the oscilloscope:

SDS1204X_HD_PNG_20 SDS1204X_HD_PNG_21 SDS1204X_HD_PNG_24

Evlers commented 9 months ago

Sometimes even longer, the interval time is 655us, and no data ready pin has a falling edge that does not trigger the host to transmit data

SDS1204X_HD_PNG_25

Evlers commented 9 months ago

Anyway, now the test has been stable, thank you very much for your help! I will continue to test stability and try to port new branches! Thank you again for your patient response!!!

image a9342c25b0c4928d1901f76a05ba182

mantriyogesh commented 8 months ago

@Evlers , does your stm ported solution still exist? Also can I refer your solution to him/her ? They seem to be using STM32 as well.

Evlers commented 8 months ago

That solution uses the master branch, so there are still stability issues. I'm waiting for your new branch to perfect the portability before we do it again porting. For now, we have chosen Infineon's solution. The code is related to the company's project, so it is not open source. Sorry!