espressif / esp-hosted

Hosted Solution (Linux/MCU) with ESP32 (Wi-Fi + BT + BLE)
Other
706 stars 169 forks source link

Features branch - slave resets after initialization #283

Open claydonkey opened 12 months ago

claydonkey commented 12 months ago

From an stm32 host. Initialization for TEST_RAW_TP works as expected but soon after transmission of packets the slave panics: Guru Meditation....

I had to increased the Host packet size to be 32bit aligned which is strange because I didn't notice this alignment issue in previous versions. I am using heap 1_4 for FreeRTOS

#define MAX_SDIO_BUFFER_SIZE              (1600) // (32*50) was 1536
#define MAX_TRANSPORT_BUFFER_SIZE        MAX_SDIO_BUFFER_SIZE
#define MAX_PAYLOAD_SIZE (MAX_TRANSPORT_BUFFER_SIZE-H_ESP_PAYLOAD_HEADER_OFFSET) 1587?
#define MALLOC(x)                        pvPortMalloc(x)

#define CONFIG_ESP_RAW_TP_HOST_TO_ESP_PKT_LEN 1576

https://github.com/espressif/esp-hosted/assets/5519575/28a258ff-98b9-4485-b869-7a58c34dca13

SohKamYung-Espressif commented 12 months ago

I see #define CONFIG_ESP_RAW_TP_HOST_TO_ESP_PKT_LEN 1576. This is bigger than the original MAX_SDIO_PACKET_SIZE, which should not be the case.

By default, 1460 is the size of CONFIG_ESP_RAW_TP_HOST_TO_ESP_PKT_LEN.

mantriyogesh commented 12 months ago

Hello @claydonkey 1. Do you have any changes in header structure? it should be always 12 bytes (till date). it is packed structure BTW.

2. MAX_SDIO_BUFFER_SIZE should always be 512 multiple and (512*3) is chosen as we do not expect any transport frame to exceed this size. Currently, raw ESP->Host allows evading this scenario, which is incorrect, and we will add limit here. But please do not change the above sizes, you might face unexpected issues..

3.

I had to increased the Host packet size to be 32bit aligned which is strange

Can you please explain this little in depth? I expected 512*3 (max) to be aligned. Any smaller packet size also need to be aligned to DMA (either side, according to their DMA alignments) We will check if the slave Tx packets are aligned or not (I expect them to be already although!)

4. Are you using the mempool (host)? Mempool will make sure that the mem to be allocated is always aligned on 4 bytes ( DMA expectation for ESP_as_host, check your DMA alignment expectation). If you do not use mempool, make sure that you allocate on 4bytes aligned start address and allocated size also with 4 bytes aligned size. https://github.com/espressif/esp-hosted/blob/7dc00559a1d7991f06e07cc3ff1ede1a9a427ff3/host/port/os_wrapper.h#L112-L113

claydonkey commented 12 months ago

@mantriyogesh @SohKamYung-Espressif

  1. No Changes ...I made a mistake in the post (13 because I saw 2 bytes but they were :4 each)
  2. I reduced the BUFFER SIZE to 512*3 still the same result.
  3. This relates to 2. I forgot to rename some of the MAX_SPI_BUFFER_SIZE > MAX_SDIO_BUFFER_SIZE
  4. I am using mempool...

but all this may be academic as I was still using an old ported sdio_drv.c. Having updated that:

The problem becomes 2 fold:

a. problems with sdio_drv.c

i. In function sdio_read_task:

"is_transport_in_reset()" only ever reports transport_state == TRANSPORT_INACTIVE
because ii.g_h.funcs->_h_queue_item(from_slave_queue,... is never called and_interface_buffer_handlet item is never queued to the from_slave_queue queue

ii. In function sdio_process_rx_task:

g_h.funcs->_h_dequeue_item(from_slave_queue`... never dequeues as the queue is empty because i.sdio_drv_evt_handler_fp is never executed to set transport_state = TRANSPORT_ACTIVE

transport_state is not set to TRANSPORT_ACTIVE anywhere else in the code from what I can see, so the app is locked at

I Starting SDIO process rx task

b. still crashing

If I remove:

while (true) {
            if (is_transport_in_reset()) {
                ESP_LOGI(TAG, "transport has been reset");
                break;
            }
            vTaskDelay(pdMS_TO_TICKS(100));
        }

The slave bombs as before. see next post

claydonkey commented 12 months ago

@mantriyogesh @SohKamYung-Espressif

The slave Guru Meditation :bomb:

Happens when the host sends:

sdio_generate_slave_intr(ESP_OPEN_DATA_PATH);

_(ESP_OPEN_DATAPATH==0)

uint8_t intr_mask = BIT(intr_no + ESP_SDIO_CONF_OFFSET) == BIT(0+0) == 1; with lock_required = true;

The CMD52 argument:

Hex:0x90011801

Comprised of:

func:1 (SDIO_FUNC_1) reg: 140 (ESP_SLAVE_SCRATCH_REG_7&= ESP_ADDRESS_MASK) data: 1 (int_mask) size: 1 flags: CMD52_WRITE | CMD52_READAFTERWRITE (Hex:0x80000000)

#define SD_ARG_CMD52_FUNC_SHIFT     28
#define SD_ARG_CMD52_FUNC_MASK      0x7
#define SD_ARG_CMD52_REG_SHIFT      9
#define SD_ARG_CMD52_REG_MASK       0x1ffff
#define SD_ARG_CMD52_DATA_SHIFT     0
#define SD_ARG_CMD53_LENGTH_SHIFT   0
#define SD_ARG_CMD53_LENGTH_MASK    0x1ff

arg = (func & SD_ARG_CMD53_FUNC_MASK) << SD_ARG_CMD53_FUNC_SHIFT; arg |= (reg & SD_ARG_CMD53_REG_MASK) << SD_ARG_CMD53_REG_SHIFT; arg |= (size& SD_ARG_CMD53_LENGTH_MASK) << SD_ARG_CMD53_LENGTH_SHIFT; arg |= data << SD_ARG_CMD52_DATA_SHIFT; arg |= flags;

and after this the slave :bomb:

***** Slave: Raw Throughput testing (Report per 1 sec)*****

I (22431) NETWORK_ADAPTER: Initial set up done
I (22431) slave_ctrl: event ESPInit
I (22433) main_task: Returned from app_main()
I (22543) stats: 28-29 sec       Rx: 0.00 Tx: 0.00 kbps
I (22543) stats: slave: sta_rx_in: 0 sta_rx_out: 0 sta_tx_in: 0 sta_tx_out: 0
Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x400d6e90  PS      : 0x00060e30  A0      : 0x800949ec  A1      : 0x3ffd8370
A2      : 0x00000000  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x00000000
A6      : 0x00000001  A7      : 0x00000000  A8      : 0x00000000  A9      : 0x00000000
A10     : 0x00000000  A11     : 0xdeadbeef  A12     : 0x00000000  A13     : 0x00000001
A14     : 0x3ffc355c  A15     : 0x3ffcc05c  SAR     : 0x0000001a  EXCCAUSE: 0x0000001d
EXCVADDR: 0x00000000  LBEG    : 0x4000c28c  LEND    : 0x4000c296  LCOUNT  : 0x00000000

Backtrace: 0x400d6e8d:0x3ffd8370 0x400949e9:0x3ffd83b0

https://github.com/espressif/esp-hosted/assets/5519575/665cf8cf-86dc-426a-88a5-c207b2e5a2bf

SohKamYung-Espressif commented 12 months ago

a. problems with sdio_drv.c i. In function sdio_read_task:

"is_transport_in_reset()" only ever reports transport_state == TRANSPORT_INACTIVE

transport_state is put into reset via transport_drv_reconfigure() https://github.com/espressif/esp-hosted/blob/7dc00559a1d7991f06e07cc3ff1ede1a9a427ff3/host/drivers/transport/transport_drv.c#L112

In our host code, transport_drv_reconfigure() gets called via esp_hosted_wifi_init(), which is currently not ideal.

The initialization sequence is:

  1. On boot-up, transport_state starts out as TRANSPORT_INACTIVE
  2. The host then calls transport_drv_reconfigure() to trigger a slave reset (via GPIO) and puts transport_state into the state TRANSPORT_RESET
  3. sdio_read_task() sees that the transport is in reset, and proceeds to send sdio_generate_slave_intr() to open the data path and put transport_state into TRANSPORT_ACTIVE state.

Calling transport_drv_reconfigure() in your init sequence should resolve this. You could put this at the start of sdio_read_task() before the check, for example.

mantriyogesh commented 12 months ago

@claydonkey

Thanks for the video, it eases the understanding.

Trace in video is little blurry, I couldn't understand, 'Uninitialized sem id 3' or 'Uninitialized sem id 1' but in any of the cases, looks the semaphore/queue is not init correctly.

is the transport_init_internal() called to init from_slave_queue queue (or other smaphores)? Can you please assert and check in os_wrapper.c functions where possible access without init is made?

SohKamYung-Espressif commented 12 months ago

The slave Guru Meditation 💣 Happens when the host sends: sdio_generate_slave_intr(ESP_OPEN_DATA_PATH);

Can you modify the function event_cb() in sdio_slave_api.c to this version, which outputs debug info:

IRAM_ATTR static void event_cb(uint8_t val)
{
    ESP_EARLY_LOGI(TAG, "enter event_cb");
    if (val == ESP_RESET) {
        ESP_EARLY_LOGI(TAG, "got ESP_RESET");
        sdio_reset(&if_handle_g);
        return;
    }

    if (context.event_handler) {
        ESP_EARLY_LOGI(TAG, "call event_handler");
        context.event_handler(val);
    }
}

You should see debug output like this when the host calls sdio_generate_slave_intr(ESP_OPEN_DATA_PATH) :

I (8517) SDIO_SLAVE: enter event_cb I (8517) SDIO_SLAVE: call event_handler I (8517) NETWORK_ADAPTER: Start Data Path

This is to check if the event_cb() is being triggered when you write to the SDIO slave interrupt register.

claydonkey commented 11 months ago

@SohKamYung-Espressif + @mantriyogesh Firstly might I just say that having lived in your code for a while - Nice Job! - It is turning out to be rather lovely and I am learning alot as it is a great example of nice C 🥇

@SohKamYung-Espressif

Problem with transport_drv_reconfigure()

transport_drv_reconfigure() transport_state == TRANSPORT_RESET but the problem is now displaced to: while (!is_transport_up()) {} reset_slave() is executed but there are no interrupts or async code to set transport_state = TRANSPORT_ACTIVE, consequently the loop does not exit.

Alteration to Slave Code

I have added the code to the slave. The output : (as you said) is:

I (834) SDIO_SLAVE: sdio_init: ESP32 SDIO timing: 0

I (1640) stats: 0-1 sec       Rx: 0.00 Tx: 0.00 kbps
I (1838) SDIO_SLAVE: enter event_cb
I (1838) SDIO_SLAVE: call event_handler
I (1838) NETWORK_ADAPTER: Start Data Path
I (1840) stats:

***** Slave: Raw Throughput testing (Report per 1 sec)*****

I (1742) NETWORK_ADAPTER: Initial set up done
I (1742) slave_ctrl: event ESPInit
I (1744) main_task: Returned from app_main()
I (2456) stats: 1-2 sec       Rx: 0.00 Tx: 0.00 kbps
I (3256) stats: 2-3 sec       Rx: 0.00 Tx: 0.00 kbps
I (4056) stats: 3-4 sec       Rx: 0.00 Tx: 0.00 kbps
I (4856) stats: 4-5 sec       Rx: 0.00 Tx: 0.00 kbps
I (5656) stats: 5-6 sec       Rx: 0.00 Tx: 0.00 kbps

Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.

@mantriyogesh Again thanks for the help.

transport_init_internal()

is the transport_init_internal() called to init from_slave_queue queue (or other smaphores)? Can you please assert and check in os_wrapper.c functions where possible access without init is made?

In TEST_RAW_TP transport_init_internal() is executed in esp_hosted_init() :

void app_task(void const *arg)
{
#if !TEST_RAW_TP
    app_main();
#else
    H_ERROR_CHECK(esp_hosted_init(NULL)); //<<here (not trying to teach you how to suck eggs)
#endif
    while (1) {
        hosted_sleep(5U);
    }
}

'E Uninitialized sem id #'

Problems with queue_handle_t *

I thought that was because the slave crashed, and there were no items to dequeue.

Now correct me if I'm wrong but:

I believe as queue_handle_t == QueueDefinition * It is already a pointer and is malloced in xQueueGenericCreate

As a consequence after the code exited hosted_create_queue, the queue_handle_t was destroyed (disappeared from scope) because the pointer was allocated within the scope of hosted_create_queue. (The pointer a pointer to the pointer QueueDefinition )

Queue sizes and other properties of queue_handle_t were garbled upon return to the caller function (sizes of 20 became random large numbers ex 6847268346) So replaced the type queue_handle_t * with queue_handle_t

queue_handle_t q_id = NULL >> queue_handle_t q_id = NULL q_id = (queue_handle_t )queue_handle >> q_id = (queue_handle_t)queue_handle item_added_in_back = xQueueSendToBack(q_id, item, timeout) >> item_added_in_back = xQueueSendToBack(q_id, item, timeout) vQueueDelete(q_id ) >> vQueueDelete(q_id)

etc. I will look at other the function types in hosted_osi_funcs_t and perhaps change their signatures.

The new error now makes sense and pertains to the STM32 not handling SDIOIT interrupts leaving me to poll the interrupt pin

int hosted_sdio_wait_slave_intr(uint32_t ticks_to_wait) {
   SDIO_FAIL_IF_NULL(hsd1->Instance);
   return (g_h.funcs->_h_read_gpio(GPIOC, GPIO_PIN_9));
}

Annoying but previously I just removed the warning and everything worked. I have been communicating with STM about this issue but polling or waiting for the pin seems the only option at the moment

E wait_slave_intr error: 1
E wait_slave_intr error: 1
(repeats)

is hidden by commenting out

static void sdio_read_task(void const *pvParameters){
//...
    if (res != ESP_OK) {
      // ESP_LOGE(TAG, "wait_slave_intr error: %d", res);
      continue;
   }
//...
}

The errors now pertain to the resetting of the slave: These are from the stm32 port sdio_drv.c and the msgs are in my driver. It should speak for itself:

I sending block
E sdmmc_io_write_bytes_dma CMD ERROR. Size=1024
E sdio_write_task: 0: Failed to send data: -2 1472 1472
I sending block
SohKamYung-Espressif commented 11 months ago

I have added the code to the slave.

From the output, the slave interrupt is now being sent from the host to the slave, and transport_state should now be TRANSPORT_ACTIVE.

Some more troubleshooting for the Guru Meditation Error:

Testing Tx on Slave Side

On the slave side, CONFIG_ESP_RAW_THROUGHPUT_TRANSPORT should be enabled for the raw throughput test.

As a test, on the Host, DO NOT enable CONFIG_ESP_RAW_THROUGHPUT_TRANSPORT. This should cause the host to not transmit raw data packets. But it can still receive, so this is to test that the slave Tx is fine.

You should see output like this from the slave:

I (1217) NETWORK_ADAPTER: Initial set up done
I (1217) slave_ctrl: event ESPInit
I (1227) main_task: Returned from app_main()
I (1617) stats: 0-1 sec       Rx: 0.00 Tx: 0.00 kbps
I (2617) stats: 1-2 sec       Rx: 0.00 Tx: 0.00 kbps
I (3617) stats: 2-3 sec       Rx: 0.00 Tx: 0.00 kbps
I (4617) stats: 3-4 sec       Rx: 0.00 Tx: 0.00 kbps
I (5617) stats: 4-5 sec       Rx: 0.00 Tx: 1266.09 kbps
I (6617) stats: 5-6 sec       Rx: 0.00 Tx: 24888.44 kbps
I (7617) stats: 6-7 sec       Rx: 0.00 Tx: 24888.44 kbps
I (8617) stats: 7-8 sec       Rx: 0.00 Tx: 24888.44 kbps
I (9617) stats: 8-9 sec       Rx: 0.00 Tx: 24899.84 kbps
I (10617) stats: 9-10 sec       Rx: 0.00 Tx: 24888.44 kbps
I (11617) stats: 10-11 sec       Rx: 0.00 Tx: 24899.84 kbps
I (12617) stats: 11-12 sec       Rx: 0.00 Tx: 24888.44 kbps

If this is fine, the problem may lie in what the host is sending to the slave.

Testing Rx on Slave Side

First, enable CONFIG_ESP_RAW_THROUGHPUT_TRANSPORT on the host so it will transmit raw data packets.

To verify what the slave is receiving from the host: in slave_sdio_api.c in the function sdio_read():

Replace

    ret = sdio_slave_recv(&(buf_handle->sdio_buf_handle), &(buf_handle->payload),
            &(sdio_read_len), portMAX_DELAY);

with this:

    ESP_LOGI(TAG, "calling sdio_slave_recv");
    ret = sdio_slave_recv(&(buf_handle->sdio_buf_handle), &(buf_handle->payload),
            &(sdio_read_len), portMAX_DELAY);
    ESP_LOGI(TAG, "sdio_read_len: %d", sdio_read_len);

You should see a lot of repeated output like this:

I (6257) SDIO_SLAVE: sdio_read_len: 1472
I (6257) SDIO_SLAVE: calling sdio_slave_recv
I (6257) SDIO_SLAVE: sdio_read_len: 1472
I (6267) SDIO_SLAVE: calling sdio_slave_recv
I (6267) SDIO_SLAVE: sdio_read_len: 1472
I (6277) SDIO_SLAVE: calling sdio_slave_recv

For receiving data, the max buffer size on the slave side is currently set to 1600. If the host is sending more data that this, it might be the cause of the Guru Meditation Error.

mantriyogesh commented 11 months ago

queue_handle_t redundant double pointer

As a consequence after the code exited hosted_create_queue, the queue_handle_t was destroyed (disappeared from scope) because the pointer was allocated within the scope of hosted_create_queue. (The pointer a pointer to the pointer QueueDefinition )

Actually, pointer to pointer concept was intended to make even non-FreeRTOS have their own queue type within opaque pointer. But as I see,

#define queue_handle_t                               QueueHandle_t

defeats the purpose, and outer structure should really not be of any OS dependent struct. Would take care of this. Ideally it should have been something like:

struct queue_handle_t {
QueueHandle_t os_q_handle;
};

For your simplicity, you can even directly use outer pointer directly as xQueueCreate. But you might have to be cautious to even change hosted_queue_item(), hosted_dequeue_item(), etc accordingly. On a different angle, these porting functions should be all the same which you had tested as part of https://github.com/espressif/esp-hosted/issues/270#issuecomment-1803576426. I am not saying they are perfect at the moment (needs correction), but still should have worked as is?

scope of queue handle

hosted_create_queue is used with , g_h.funcs->_h_create_queue() returning the handle of the newly allocated queue, agnostic of its OS. This handle is important and is the sole scope within the sdio_drv.c for that queue.

    to_slave_queue = g_h.funcs->_h_create_queue(TO_SLAVE_QUEUE_SIZE,
            sizeof(interface_buffer_handle_t));
    assert(to_slave_queue);

No interrupt from STM HAL

Yes, if you have to rely on the D1 line raw value, you would need to comment

// ESP_LOGE(TAG, "wait_slave_intr error: %d", res);

as there would not be any other option in your code. We would still keep this line as is to cater any wrong doing for any other host in future. As part of polling in your code, you would need to rely upon code till:

        if (!(NEW_PACKET_INTR & interrupts)) {
            // interrupt is not for a new packet
            ESP_LOGD(TAG, "Slave intr is not for 'new packet'");

            SDIO_DRV_UNLOCK();
            continue;
        }

In next block we get legth to be read:

        ret = sdio_get_len_from_slave(&len_from_slave, ACQUIRE_LOCK);
        if (ret || !len_from_slave) {
            ESP_LOGD(TAG, "invalid ret or len_from_slave: %d %ld", ret, len_from_slave);

            SDIO_DRV_UNLOCK();
            continue;
        }

sdio_get_len_from_slave() takes care of valid length to be populated. But as you are doing polling, for extra safety, try in your testing version,

assert(len_from_slave < MAX_SDIO_BUFFER_SIZE)
mantriyogesh commented 11 months ago

I have yet not responded to

I sending block
E sdmmc_io_write_bytes_dma CMD ERROR. Size=1024
E sdio_write_task: 0: Failed to send data: -2 1472 1472
I sending block

but this would be too dependent upon the bus handling and might need debugging. If you have last working code, semantic diff would be faster than debugging on the bus.

Urgent attention would be on resolving Uninitialized sem id.