adafruit / circuitpython

CircuitPython - a Python implementation for teaching coding with microcontrollers
https://circuitpython.org
Other
4.11k stars 1.22k forks source link

Copying multiple larger files can hang the copy if autoreload is on #3986

Closed dhalbert closed 3 years ago

dhalbert commented 3 years ago

First reported by @theacodes. Considerable discord discussion starting about 7pm ET 2021-1-12 in #circuitpython

Seems to happen fairly often (though not always) on SAMD21. Not sure if it happens on SAMD51. Not tried on other boards.

Example, copying to a Metro M0 Express running near the tip of main. First erase the filesystem. Then, do not enter the REPL, but copy multiple larger files, e.g. like this set:

32278 Aug 29  2018 hit.wav
89134 Aug 29  2018 idle.wav
61164 Aug 29  2018 off.wav
75820 Aug 29  2018 on.wav
43438 Aug 29  2018 swing.wav

For me, one out of two times this caused the copy to hang.

@theacodes reports that a workaround is to copy one file at a time or to enter the REPL, so that auto-reload is not happening, and then copy.

This problem was not evident to @theacodes on 5.3.1.

theacodes commented 3 years ago

Confirming workaround: suspending autoreload through supervisor.disable_autoreload() or entering the REPL allows files to be copied without issue.

This was discovered while programming Big Honking Buttons. The factory setup script copies a set of files to the device - one set is the following WAV files:

drwxr-xr-x  9 stargirl  staff   288B Jul 24 13:37 ./
drwxr-xr-x  9 stargirl  staff   288B Jan 12 21:00 ../
-rw-r--r--  1 stargirl  staff    19K Jul 24 13:37 clap.wav
-rw-r--r--  1 stargirl  staff    63K Jul 24 13:37 dist.wav
-rw-r--r--  1 stargirl  staff    21K Jul 24 13:37 go.wav
-rw-r--r--  1 stargirl  staff    17K Jul 24 13:37 honk.wav
-rw-r--r--  1 stargirl  staff    37K Jul 24 13:37 kick.wav
-rw-r--r--  1 stargirl  staff    38K Jul 24 13:37 reverse.wav
-rw-r--r--  1 stargirl  staff    65K Jul 24 13:37 snare.wav

These in particular trigger the issue and would cause writes to the device to hang. It seems BHB's library code accessing the filesystem at startup may contribute to triggering the issue more often as well.

As a workaround, we are forcing the device into the REPL before copying files:

def force_into_repl(usb_device_id: str):
    """Forces a circuitpython device into the REPL."""
    port = serial_connect(usb_device_id)

    # Interrupt with Ctrl + C
    port.write(b"\x03")
    # Enter repl with enter after slight delay
    time.sleep(0.2)
    port.write(b"\n")
    port.close()

With that workaround, we can program devices consistently without hanging:

========== DEPLOYING CODE ==========
Waiting for CIRCUITPY drive...
Forcing BHB into repl (workaround for CircuitPython issue #3986)
Cleaning temporary files from src directories...
Downloading files to cache...
Using cached winterbloom_voltageio.py.
Copying files...
Copied factory/.cache/winterbloom_voltageio.py to lib
Copied firmware/winterbloom_bhb to lib
Copied samples to .
Copied examples to .
Copied firmware/LICENSE to .
Copied firmware/README.HTM to .
Copied examples/default.py to code.py
Flushing... 🚽
Done copying files, resetting...
Done!
dhalbert commented 3 years ago

I found the copying is much slower if the serial connection is not open.

This test is with code.py just being print("Hello, world").

Not connected:

halbert@salmonx:~/Downloads/sounds$ /bin/rm /media/halbert/CIRCUITPY/*.wav;sync
halbert@salmonx:~/Downloads/sounds$ cp -v *.wav /media/halbert/CIRCUITPY/
'hit.wav' -> '/media/halbert/CIRCUITPY/hit.wav'
'idle.wav' -> '/media/halbert/CIRCUITPY/idle.wav'
    [big delay; NeoPixel pulses green]
'off.wav' -> '/media/halbert/CIRCUITPY/off.wav'
    [big delay; NeoPixel pulses green]
'on.wav' -> '/media/halbert/CIRCUITPY/on.wav'
    [big delay; NeoPixel pulses green]
'swing.wav' -> '/media/halbert/CIRCUITPY/swing.wav'
    [big delay; NeoPixel pulses green]

Connected, in REPL or not (doesn't seem to matter on timing), and repeating the commands above, I see much smaller delays, still a few seconds for copying larger files (sometimes two at a time). I do not see the pulsing green NeoPixel as above.

I am not sure this completely jives with @theacodes' technique of forcing the REPL, and having trouble before that.

theacodes commented 3 years ago

Should I try with just opening a serial connection and not sending anything?

On Tue, Jan 12, 2021, 11:08 PM Dan Halbert notifications@github.com wrote:

I found the copying is much slower if the serial connection is not open.

This test is with code.py just being print("Hello, world").

Not connected:

halbert@salmonx:~/Downloads/sounds$ /bin/rm /media/halbert/CIRCUITPY/.wav;sync halbert@salmonx:~/Downloads/sounds$ cp -v .wav /media/halbert/CIRCUITPY/ 'hit.wav' -> '/media/halbert/CIRCUITPY/hit.wav' 'idle.wav' -> '/media/halbert/CIRCUITPY/idle.wav' [big delay; NeoPixel pulses green] 'off.wav' -> '/media/halbert/CIRCUITPY/off.wav' [big delay; NeoPixel pulses green] 'on.wav' -> '/media/halbert/CIRCUITPY/on.wav' [big delay; NeoPixel pulses green] 'swing.wav' -> '/media/halbert/CIRCUITPY/swing.wav' [big delay; NeoPixel pulses green]

Connected, in REPL or not (doesn't seem to matter on timing), and repeating the commands above, I see much smaller delays, still a few seconds for copying larger files (sometimes two at a time). I do not see the pulsing green NeoPixel as above.

I am not sure this completely jives with @theacodes https://github.com/theacodes' technique of forcing the REPL, and having trouble before that.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/adafruit/circuitpython/issues/3986#issuecomment-759191881, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAB5I445BKEIZVGTJMQE3DLSZUME3ANCNFSM4WADE4LQ .

dhalbert commented 3 years ago

Should I try with just opening a serial connection and not sending anything?

I thought you were doing that initially, and you added the "getting into the REPL" part. But if were not opening the connection before, yes, you could try that.

I think this may have to do with not calling the tinyusb "do tasks" call often enough. It probably gets called more often when a serial connection is open, because there's traffic that generates interrupts. The fix may be to enable tick interrupts all the time if there's any kind of USB connection, including MSC and CDC. I'm going to try that soon.

hathach commented 3 years ago

I am able to reproduce the issue with tip of main by copying all sample files here https://github.com/theacodes/Winterbloom-Big-Honking-Button/tree/master/samples . Bus analyzer show that firmware didn't response to the SCSI status phase, either the stack failed to manage the state and/or could be another buffer overflow, or maybe the tud_task() isn't called e.g cpy doesn't think it needs to call the usb task. Eventually the host have to issue a bus reset -> re-enumerate -> host tries again with on-going files which is the huge pause we see.

I will do more testing with different MCUs, and scenario, will update with more details later on.

image

dhalbert commented 3 years ago

Thanks!! I am going to instrument the calls to tud_task() and the ticks by toggling some pins to see what is going on.

dhalbert commented 3 years ago

@hathach In 5.3.1, we called tud_task() as one of the background tasks that runs on every timer tick, which was 1msec. The calling routine is usb_background().

That was changed in #2879 during 6.0 development. Now tud_task() is only set up to be called once each time the usb_irq_handler() is called: https://github.com/adafruit/circuitpython/blob/409b5ff009811571adef9c519bcbb1f103340127/supervisor/shared/usb/usb.c#L97-L100 (It is not called in the interrupt handler, but it is put on a list of background routines to call (once) when they are given a chance to run.)

I think the idea was that only a USB interrupt would generate a task for TinyUSB to handle, and otherwise TinyUSB would have no work to do. But I don't think that's right, is it? It appears there are other times besides an interrupt when tud_task() has something to do.

As an experiment, I put usb_background() back in supervisor_background_tasks() and took out the special logic described above. That fixes the problem of this issue.

But I am still worried that we need to call tud_task() every msec, no matter what. I thought you told me in the past this was necessary. Currently in 6.x we don't generate ticks every msec unless there's some work going on that depends on that. Doing this reduces power consumption. But if we truly need to do tud_task() every msec, we have to always generate 1 msec ticks, and our tick-saving logic just needs to go away, at least when USB is connected.

2868 may or may not be related to this (fixed an nrf problem), but I am not sure at all. You reviewed that PR extensively.

@jepler Do you have any comment on the above (especially #2879 changes)?

jepler commented 3 years ago

Re #2868 I don't mind at all modifying how we control when usb_background() or tud_task() are and I'm happy to help figure out what needs to be changed. It's not an understatement to say my changes were based on an insufficient understanding of tinyusb as well as just seeing "what worked"...

tannewt commented 3 years ago

Maybe we need to queue up tud_task after we queue a response back into tinyusb. For example, maybe we need to trigger tud_task after a write completes.

hathach commented 3 years ago

@dhalbert thanks for detail explanation, most of the usb task job is queued by ISR, however, there is a few exceptions especially with MSC. E.g when scsi read/write callbback return 0 as busy indicator. The whole transfer is already complete, but since the application is not ready, tinyusb will need to re-schedule the event by itself without ISR (re-submit event to task queue).

https://github.com/hathach/tinyusb/blob/master/src/class/msc/msc_device.c#L366

It is probably our case here if we do return 0 in write callback. If so you only need to make sure to set the isr flag to run background when returning 0 or run background when flash is ready again (once 0 is returned).

However, I haven't narrowed it down to exactly specific cause just yet, since I will be switching to other works requested by @tannewt (from slack) which has a tight timing. I will come back to this one later on.

tannewt commented 3 years ago

Will tud_task_event_ready() return true after tud_task returns in this case? We could schedule another call if that's the case.

dhalbert commented 3 years ago

Will tud_task_event_ready() return true after tud_task returns in this case? We could schedule another call if that's the case.

I think I can test to see if that is true.

dhalbert commented 3 years ago

It is probably our case here if we do return 0 in write callback. If so you only need to make sure to set the isr flag to run background when returning 0 or run background when flash is ready again (once 0 is returned).

Our tud_msc_write10_cb() is synchronous, and never returns 0, so that would not be a time to reschedule.

dhalbert commented 3 years ago

@tannewt @hathach I tested this on a variety of boards with 6.2.0-beta.1. On the problem boards, the writes are severely delayed or never complete. The different behaviors may help in debugging.

Metro M0 Express (SAMD21): problem: finishes, but /var/log/syslog shows USB issues Metro M4 Express (SAMD51): problem: finishes, but /var/log/syslog shows USB issues Feather STM32F405: problem: gets completely stuck and never finishes: /var/log/syslog shows worse problems than SAMD

Feather nRF52840: fine RPi Pico: fine Metro ESP32-S2: fine

hathach commented 3 years ago

thanks @dhalbert , I will take a break from rp2040 and switch to troubleshoot this tomorrow. There are a few of RP2040 usb issues (usbnet, webusb etc..) though those will probably take time since I will need to go through datasheet reading. Will ping you if I could find out anything.

tannewt commented 3 years ago

@hathach Make sure and read the discussion on #4075 too.

hathach commented 3 years ago

update: tracing down the background callback, it does solve the race condition pretty well. However, the issued is caused by run_code.py decide to reload (AUTO_RELOAD) and run background_callback_reset() while background list is still queued up with usb event.

background_callback_run_all: 95:
callback_head = 200029A4
callback_tail = 200029A4
background_callback_run_all: 100: // background_callback_run_all() exit while there is queuing callbacks
run_code_py: 340:  // reload request == true
        usb_irq_handler: 103: // usb isr occurs -> more callback queued up
callback_head = 200029A4
callback_tail = 200029BC
background_callback_add_core: 57:
run_code_py: 305: // cleanup_after_vm() is called which in turn reset all the callbacks including above usb events.
background_callback_reset: 122:
background_callback_run_all: 73:

To sum up the issue is not related to usb stack, it seems to be caused by auto-reload while background task are still queued up. background_callback_run_all() is not guarantee to execute all the background since while executing background, those function can potentially trigger more usb ISR. https://github.com/adafruit/circuitpython/blob/main/supervisor/shared/background_callback.c#L91

We can modify background_callback_run_all() to loop over while callback_head is not NULL. However, there is still race condition where the usb (or other) ISR is triggered during run_code_py: 340: // reload request == true and the cleanup_after_vm(). I actually don't understand the background mechanism enough to propose a sound solution here. @dhalbert let me know if my analysis makes sense to you.

tannewt commented 3 years ago

@hathach That makes sense to me! I think we need probably need to remove VM initiated task only (not USB.) An alternative would be run all tasks while interrupts are off and then we reset before turning them back on.

hathach commented 3 years ago

An alternative would be run all tasks while interrupts are off and then we reset before turning them back on.

This can be a bit tricky to do, since when background_callback_run_all() exit critical section to run task function, the tud_task() will then generate more usb isr to add to background queue. For this example, msc host will continuously bumping data fast enough, that will effectively causes MCU to actually have to complete the whole MSC copying before able to do the reload.

That makes sense to me! I think we need probably need to remove VM initiated task only (not USB.)

This sounds like a better solution. IMHO, there are 2 ways we could do

  1. Modify background_callback_reset() to skip removing usb background if it exists in the queue chain, or just removed the VM background, and leaving all other ISR event as it is.
  2. the tud_task() is cheap enough when executing without event, that we could just add it background_callback_add(&usb_callback, usb_background_do, NULL); unconditionally somewhere after background_callback_reset().

Update: I think (1) to remove all VM background is much better, it will prevent similar issue with other ISR. Let me know which task is VM, I could make an PR to remove those only

hathach commented 3 years ago

I make a quick hack to add usb background after the background_reset() for quick test, it does solve this problems and quickly complete the file copying without issue https://github.com/hathach/circuitpython/pull/1 . There is also a couple of minor code clean up as well.

tannewt commented 3 years ago

I think your 2 idea is the best way to go about it for now. There isn't a single VM task.

We don't have a way to know which tasks to drop because we're stopping and those we can to keep doing. Perhaps we should add a critical flag or something.

hathach commented 3 years ago

I think your 2 idea is the best way to go about it for now. There isn't a single VM task.

We don't have a way to know which tasks to drop because we're stopping and those we can to keep doing. Perhaps we should add a critical flag or something.

Ok, I will make an PR using this method.

Update: https://github.com/adafruit/circuitpython/pull/4122 is submitted.