espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.46k stars 7.25k forks source link

USB serial jtag occasionally skipping characters if console task and USB_serial_jtag interrupt running on different core (IDFGH-12988) #13939

Closed congyue closed 1 month ago

congyue commented 3 months ago

Answers checklist.

General issue report

Hi Espressif team,

I have searched and read extensively about past issues on USB serial JTAG, (like https://github.com/espressif/esp-idf/issues/12518, https://github.com/espressif/esp-idf/issues/12628, https://github.com/espressif/esp-idf/issues/12605, https://github.com/espressif/esp-idf/issues/9318), most of them have been fixed in 5.2.2 but none of them directly related to this one.

First of all, here is my setup:

I have prepared sample code to reproduce the issue:

#include <stdio.h>
#include <unistd.h>

#include "driver/usb_serial_jtag.h"
#include "esp_check.h"
#include "esp_log.h"
#include "esp_vfs_dev.h"
#include "esp_vfs_usb_serial_jtag.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "sdkconfig.h"

#define BUF_SIZE (512)
#define ECHO_TASK_STACK_SIZE (4096)

static void init_console(void)
{
    /* Minicom, screen, idf_monitor send CR when ENTER key is pressed */
    esp_vfs_dev_usb_serial_jtag_set_rx_line_endings(ESP_LINE_ENDINGS_CR);
    /* Move the caret to the beginning of the next line on '\n' */
    esp_vfs_dev_usb_serial_jtag_set_tx_line_endings(ESP_LINE_ENDINGS_CRLF);

    // Configure USB SERIAL JTAG
    usb_serial_jtag_driver_config_t usb_serial_jtag_config =
        USB_SERIAL_JTAG_DRIVER_CONFIG_DEFAULT();

    /* Install USB-SERIAL-JTAG driver for interrupt-driven reads and writes */
    usb_serial_jtag_driver_install(&usb_serial_jtag_config);

    /* Tell vfs to use usb-serial-jtag driver */
    esp_vfs_usb_serial_jtag_use_driver();

    ESP_LOGI("usb_serial_jtag echo", "USB_SERIAL_JTAG init done");
}

static void echo_task(void *arg)
{
    char *prompt = "# ";
    write(fileno(stdout), prompt, 2);
    fsync(fileno(stdout));

    int i = 0;
    char cmd[BUF_SIZE] = {0};
    while (1)
    {
        // Read and echo input back to console in real time
        char c;
        read(fileno(stdin), &c, 1);
        write(fileno(stdout), (const char *)&c, 1);

        // If newline or buffer full, process the buffer
        if ((c == '\n') || i == BUF_SIZE - 1)
        {
            if (i > 1)
            {
                // If command is valid, print simulated output
                char output[BUF_SIZE + 16] = {0};
                snprintf(output, BUF_SIZE + 16, "Run cmd: [%s] ...\n", cmd);
                write(fileno(stdout), output, strlen(output));
            }

            write(fileno(stdout), prompt, 2);
            fsync(fileno(stdout));

            memset(cmd, 0, BUF_SIZE);
            i = 0;
        }
        // Otherwise add to command buffer
        else
        {
            cmd[i] = c;
            i++;
        }
    }
}
void app_main(void)
{
    init_console();
    xTaskCreatePinnedToCore(&echo_task, "USB SERIAL JTAG_echo_task", ECHO_TASK_STACK_SIZE, NULL, 2,
                            NULL, APP_CPU_NUM);
}

It can be considered as a refactor of the usb_serial_jtag_echo project to use read()/write() VFS interface instead of the driver API.

Expected behavior

The firmware suppose to simulate a shell and taking commands from input, then print a simulated output to the screen, e.g.

# version
Run cmd: [version] ...
# help
Run cmd: [help] ...

Most of the time this is fine and the issue is hard to be noticed when typing manually.

To make the issue easier to reproduce, I'm using this host script to stress testing the USB_serial_jtag input/output

while true; do echo "12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890" > /dev/cu.usbmodem114301; sleep .1; done

Actual behavior

With the help of above script, sometimes I can see the output would randomly skip one byte:

# 12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
Run cmd: [12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890] ...
# 12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
Run cmd: [1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789] ...  // --> This line is missing a '0' as the last digit!
# 12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
Run cmd: [12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890] ...
# 12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
Run cmd: [1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123457890] ... // --> This line is missing a '6' in the last sequence

Findings

I also noticed that below modification would cause the issue disappear:

This symptom is hard to find and it could lead to hard-to-debug issues on factory provisioning and automation test when relying on the USB port. It would be great if it can be root caused and fixed in future version!

@igrr @mythbuster5

SoucheSouche commented 3 months ago

Hi @congyue, thanks for reporting the issue. I will have a look at it as soon as possible.

pggh commented 3 months ago

Because I have other problems with the JTAG output, I scrolled through the driver code and have spotted this:

    if (p_usb_serial_jtag_obj->fifo_status == FIFO_IDLE) {
        portENTER_CRITICAL(&p_usb_serial_jtag_obj->spinlock);
        atomic_store(&p_usb_serial_jtag_obj->fifo_status, FIFO_BUSY);
        sent_data = usb_serial_jtag_write_and_flush(src, size);
        portEXIT_CRITICAL(&p_usb_serial_jtag_obj->spinlock);
    }

checking fifo_status only outside the lock is wrong, as it could be set between the check and acquiring the lock.

johnboiles commented 3 months ago

I'm seeing this issue too without using console if I call usb_serial_jtag_driver_install. If i remove that the issue goes away, but of course I can't read bytes from the serial_jtag anymore.

pggh commented 3 months ago

@johnboiles Are you referring to #14031? I still don't understand the exact cause of that but my guess is that wr_done is set twice, once from the actual writer and once from the driver ISR and the second one somehow discards the data. I wish espressif would extend the driver to support unbuffered write (with busy-waiting) when tx_buffer_size is set to 0.

johnboiles commented 3 months ago

I generally get output but it skips characters. Maybe 5% of characters.

I did notice a couple of times that output stopped altogether until a reboot. I didn't try to figure out the pattern. It might have been when I sent some bytes in the output stopped.

congyue commented 2 months ago

I expect both above spinlock bug and this issue can be fixed by this commit once it lands 5.2.3 release. Let's check it out!

SoucheSouche commented 2 months ago

@congyue, It seems that the issue persists even with the recent changes to the usb serial jtag driver you mentioned. It may not occur as often as it used to, but I can still see that once in a while, a character is skipped. As of now, I haven't been able to find the root cause of this skipping but I am still working on it.

congyue commented 2 months ago

@SoucheSouche Thanks for the update and spending time on it!

Just want to highlight again that the issue only happens if the caller task of printf is on the different core that the usb_serial_jtag ISR registered to. So there is an easier way to repro the issue:

Right now my workaround is to register the ISR to core 1 and make sure all important tasks are also pinned to core 1. But looking forward to know if this can be root caused!

Spritetm commented 2 months ago

FWIW, seems the root cause of the issue in the current master source is in the fsync implementation: it still pokes the hardware directly rather than ask the driver to (wait for a) flush. Should be a fix incoming soon.