Closed dhalbert closed 3 years ago
Confirming workaround: suspending autoreload through supervisor.disable_autoreload()
or entering the REPL allows files to be copied without issue.
This was discovered while programming Big Honking Buttons. The factory setup script copies a set of files to the device - one set is the following WAV files:
drwxr-xr-x 9 stargirl staff 288B Jul 24 13:37 ./
drwxr-xr-x 9 stargirl staff 288B Jan 12 21:00 ../
-rw-r--r-- 1 stargirl staff 19K Jul 24 13:37 clap.wav
-rw-r--r-- 1 stargirl staff 63K Jul 24 13:37 dist.wav
-rw-r--r-- 1 stargirl staff 21K Jul 24 13:37 go.wav
-rw-r--r-- 1 stargirl staff 17K Jul 24 13:37 honk.wav
-rw-r--r-- 1 stargirl staff 37K Jul 24 13:37 kick.wav
-rw-r--r-- 1 stargirl staff 38K Jul 24 13:37 reverse.wav
-rw-r--r-- 1 stargirl staff 65K Jul 24 13:37 snare.wav
These in particular trigger the issue and would cause writes to the device to hang. It seems BHB's library code accessing the filesystem at startup may contribute to triggering the issue more often as well.
As a workaround, we are forcing the device into the REPL before copying files:
import time

def force_into_repl(usb_device_id: str):
    """Forces a CircuitPython device into the REPL."""
    # serial_connect() is a helper defined elsewhere in the factory script.
    port = serial_connect(usb_device_id)
    # Interrupt the running program with Ctrl-C.
    port.write(b"\x03")
    # Press Enter after a slight delay to land in the REPL.
    time.sleep(0.2)
    port.write(b"\n")
    port.close()
With that workaround, we can program devices consistently without hanging:
========== DEPLOYING CODE ==========
Waiting for CIRCUITPY drive...
Forcing BHB into repl (workaround for CircuitPython issue #3986)
Cleaning temporary files from src directories...
Downloading files to cache...
Using cached winterbloom_voltageio.py.
Copying files...
Copied factory/.cache/winterbloom_voltageio.py to lib
Copied firmware/winterbloom_bhb to lib
Copied samples to .
Copied examples to .
Copied firmware/LICENSE to .
Copied firmware/README.HTM to .
Copied examples/default.py to code.py
Flushing... 🚽
Done copying files, resetting...
Done!
I found the copying is much slower if the serial connection is not open. This test is with code.py just being print("Hello, world").
Not connected:
halbert@salmonx:~/Downloads/sounds$ /bin/rm /media/halbert/CIRCUITPY/*.wav;sync
halbert@salmonx:~/Downloads/sounds$ cp -v *.wav /media/halbert/CIRCUITPY/
'hit.wav' -> '/media/halbert/CIRCUITPY/hit.wav'
'idle.wav' -> '/media/halbert/CIRCUITPY/idle.wav'
[big delay; NeoPixel pulses green]
'off.wav' -> '/media/halbert/CIRCUITPY/off.wav'
[big delay; NeoPixel pulses green]
'on.wav' -> '/media/halbert/CIRCUITPY/on.wav'
[big delay; NeoPixel pulses green]
'swing.wav' -> '/media/halbert/CIRCUITPY/swing.wav'
[big delay; NeoPixel pulses green]
Connected, in REPL or not (doesn't seem to matter on timing), and repeating the commands above, I see much smaller delays, still a few seconds for copying larger files (sometimes two at a time). I do not see the pulsing green NeoPixel as above.
I am not sure this completely jibes with @theacodes' technique of forcing the REPL, and having trouble before that.
Should I try with just opening a serial connection and not sending anything?
Should I try with just opening a serial connection and not sending anything?
I thought you were doing that initially, and you added the "getting into the REPL" part. But if you were not opening the connection before, yes, you could try that.
I think this may have to do with not calling the tinyusb "do tasks" call often enough. It probably gets called more often when a serial connection is open, because there's traffic that generates interrupts. The fix may be to enable tick interrupts all the time if there's any kind of USB connection, including MSC and CDC. I'm going to try that soon.
I am able to reproduce the issue with the tip of main by copying all the sample files here https://github.com/theacodes/Winterbloom-Big-Honking-Button/tree/master/samples . A bus analyzer shows that the firmware didn't respond to the SCSI status phase; either the stack failed to manage the state, there could be another buffer overflow, or maybe tud_task() isn't called, e.g. CircuitPython doesn't think it needs to call the USB task. Eventually the host has to issue a bus reset -> re-enumerate -> retry the in-flight files, which is the huge pause we see.
I will do more testing with different MCUs and scenarios, and will update with more details later on.
Thanks!! I am going to instrument the calls to tud_task() and the ticks by toggling some pins to see what is going on.
@hathach In 5.3.1, we called tud_task() as one of the background tasks that runs on every timer tick, which was 1 msec. The calling routine is usb_background().
That was changed in #2879 during 6.0 development. Now tud_task() is only set up to be called once each time usb_irq_handler() is called:
https://github.com/adafruit/circuitpython/blob/409b5ff009811571adef9c519bcbb1f103340127/supervisor/shared/usb/usb.c#L97-L100
(It is not called in the interrupt handler, but it is put on a list of background routines to call (once) when they are given a chance to run.)
I think the idea was that only a USB interrupt would generate a task for TinyUSB to handle, and otherwise TinyUSB would have no work to do. But I don't think that's right, is it? It appears there are other times besides an interrupt when tud_task() has something to do.
As an experiment, I put usb_background() back in supervisor_background_tasks() and took out the special logic described above. That fixes the problem of this issue.
But I am still worried that we need to call tud_task() every msec, no matter what. I thought you told me in the past this was necessary. Currently in 6.x we don't generate ticks every msec unless there's some work going on that depends on that. Doing this reduces power consumption. But if we truly need to do tud_task() every msec, we have to always generate 1 msec ticks, and our tick-saving logic just needs to go away, at least when USB is connected.
@jepler Do you have any comment on the above (especially #2879 changes)?
Re #2868, I don't mind at all modifying how we control when usb_background() or tud_task() are called, and I'm happy to help figure out what needs to be changed. It's fair to say my changes were based on an insufficient understanding of TinyUSB, as well as just seeing "what worked"...
Maybe we need to queue up tud_task after we queue a response back into tinyusb. For example, maybe we need to trigger tud_task after a write completes.
@dhalbert thanks for the detailed explanation. Most of the USB task's work is queued by the ISR; however, there are a few exceptions, especially with MSC, e.g. when the SCSI read/write callback returns 0 as a busy indicator. The whole transfer is already complete, but since the application is not ready, TinyUSB needs to re-schedule the event by itself without an ISR (re-submitting the event to the task queue):
https://github.com/hathach/tinyusb/blob/master/src/class/msc/msc_device.c#L366
It is probably our case here if we return 0 in the write callback. If so, you only need to make sure to set the ISR flag to run the background task when returning 0, or run the background task when the flash is ready again (once 0 is returned).
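A toy Python sketch of that busy path (the names mimic, but are not, the real TinyUSB API): a write callback that returns 0 forces the stack to re-submit the event to its own queue, so something must keep running the task even though no new ISR fires:

```python
from collections import deque

task_queue = deque()            # toy stand-in for the background task list
flash_busy_countdown = 3        # flash is "busy" for the first 3 attempts

def write10_cb(data):
    """Toy tud_msc_write10_cb-style callback: 0 means 'busy, retry later'."""
    global flash_busy_countdown
    if flash_busy_countdown > 0:
        flash_busy_countdown -= 1
        return 0                # not ready: the stack must reschedule itself
    return len(data)            # data accepted

def msc_task(data):
    # Toy stack task: on a busy callback, it re-queues the same event
    # with no new interrupt involved.
    if write10_cb(data) == 0:
        task_queue.append(lambda: msc_task(data))
        return False
    return True

# One ISR-driven kick, then keep draining whatever the stack re-queued.
task_queue.append(lambda: msc_task(b"sector"))
runs = 0
while task_queue:
    task_queue.popleft()()
    runs += 1
```

If the scheduler only ever ran once per ISR, the self-re-queued retries would never execute and the transfer would stall, which matches the hang described above.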
However, I haven't narrowed it down to an exact cause just yet, since I will be switching to other work requested by @tannewt (from Slack) which has tight timing. I will come back to this one later on.
Will tud_task_event_ready() return true after tud_task() returns in this case? We could schedule another call if that's the case.
I think I can test to see if that is true.
It is probably our case here if we do return 0 in write callback. If so you only need to make sure to set the isr flag to run background when returning 0 or run background when flash is ready again (once 0 is returned).
Our tud_msc_write10_cb() is synchronous, and never returns 0, so that would not be a time to reschedule.
@tannewt @hathach I tested this on a variety of boards with 6.2.0-beta.1. On the problem boards, the writes are severely delayed or never complete. The different behaviors may help in debugging.
Metro M0 Express (SAMD21): problem: finishes, but /var/log/syslog shows USB issues
Metro M4 Express (SAMD51): problem: finishes, but /var/log/syslog shows USB issues
Feather STM32F405: problem: gets completely stuck and never finishes; /var/log/syslog shows worse problems than SAMD
Feather nRF52840: fine
RPi Pico: fine
Metro ESP32-S2: fine
Thanks @dhalbert, I will take a break from RP2040 and switch to troubleshooting this tomorrow. There are a few RP2040 USB issues (usbnet, webusb, etc.), though those will probably take time since I will need to go through datasheet reading. Will ping you if I find out anything.
@hathach Make sure and read the discussion on #4075 too.
Update: tracing down the background callback, it does solve the race condition pretty well. However, the issue is caused by run_code.py deciding to reload (AUTO_RELOAD) and running background_callback_reset() while the background list still has USB events queued up.
background_callback_run_all: 95:
callback_head = 200029A4
callback_tail = 200029A4
background_callback_run_all: 100: // background_callback_run_all() exit while there is queuing callbacks
run_code_py: 340: // reload request == true
usb_irq_handler: 103: // usb isr occurs -> more callback queued up
callback_head = 200029A4
callback_tail = 200029BC
background_callback_add_core: 57:
run_code_py: 305: // cleanup_after_vm() is called which in turn reset all the callbacks including above usb events.
background_callback_reset: 122:
background_callback_run_all: 73:
To sum up: the issue is not related to the USB stack; it seems to be caused by auto-reload while background tasks are still queued up. background_callback_run_all() is not guaranteed to execute all the background callbacks, since while they are executing, those functions can potentially trigger more USB ISRs:
https://github.com/adafruit/circuitpython/blob/main/supervisor/shared/background_callback.c#L91
We could modify background_callback_run_all() to loop while callback_head is not NULL. However, there is still a race condition where a USB (or other) ISR is triggered between run_code_py: 340 (reload request == true) and cleanup_after_vm(). I don't understand the background mechanism well enough to propose a sound solution here. @dhalbert let me know if my analysis makes sense to you.
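The trace reduces to this toy Python model (the names mirror, but are not, the real CircuitPython functions): the run-all loop only services the callbacks present when it starts, so an event queued by an "ISR" mid-run is still pending when the reload path calls the reset, which silently drops it:

```python
from collections import deque

callbacks = deque()      # toy stand-in for the background callback list
handled = []             # records which events actually ran

def background_callback_add(fn):
    callbacks.append(fn)

def background_callback_run_all():
    # Toy model of the real loop: it runs only the callbacks that were
    # queued at entry, so anything added mid-run stays queued.
    for _ in range(len(callbacks)):
        callbacks.popleft()()

def background_callback_reset():
    callbacks.clear()    # drops *everything*, USB events included

def usb_event():
    handled.append("usb")

def noisy_callback():
    # While this callback runs, a "USB ISR" fires and queues more work.
    background_callback_add(usb_event)

background_callback_add(noisy_callback)
background_callback_run_all()          # exits with usb_event still queued
pending_before_reset = len(callbacks)
background_callback_reset()            # auto-reload cleanup: usb_event is lost
```

After the reset, the queued USB event has vanished without ever running, which is the lost-event scenario in the trace above.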
@hathach That makes sense to me! I think we probably need to remove VM-initiated tasks only (not USB). An alternative would be to run all tasks while interrupts are off and then reset before turning them back on.
An alternative would be run all tasks while interrupts are off and then we reset before turning them back on.
This can be a bit tricky to do, since when background_callback_run_all() exits the critical section to run a task function, tud_task() will then generate more USB ISRs to add to the background queue. In this example, the MSC host will keep pumping data fast enough that the MCU effectively has to complete the whole MSC copy before it is able to do the reload.
That makes sense to me! I think we need probably need to remove VM initiated task only (not USB.)
This sounds like a better solution. IMHO, there are 2 ways we could do it: (1) remove only the VM-initiated background tasks, or (2) call background_callback_add(&usb_callback, usb_background_do, NULL); unconditionally somewhere after background_callback_reset().
Update: I think (1), removing all VM background tasks, is much better; it will prevent similar issues with other ISRs. Let me know which tasks are the VM's, and I could make a PR to remove those only.
I made a quick hack to add the USB background callback after background_callback_reset() for a quick test; it does solve this problem and quickly completes the file copying without issue: https://github.com/hathach/circuitpython/pull/1 . There are also a couple of minor code cleanups as well.
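The gist of the quick hack, as a toy Python model (illustrative names; see the linked PR for the real C change): reload cleanup still clears the queue, but immediately re-adds the USB background callback so USB work keeps being serviced across the reload:

```python
from collections import deque

callbacks = deque()          # toy stand-in for the background callback list
usb_events_serviced = 0

def usb_background():
    global usb_events_serviced
    usb_events_serviced += 1

def background_callback_reset():
    callbacks.clear()

def cleanup_after_vm():
    # Reload cleanup drops all queued callbacks...
    background_callback_reset()
    # ...then unconditionally re-arms the USB callback, so USB traffic
    # is still serviced after the reload (the gist of the hack).
    callbacks.append(usb_background)

callbacks.append(usb_background)   # a USB event was queued before the reload
cleanup_after_vm()
while callbacks:                   # drain the background list
    callbacks.popleft()()
```

Without the re-arm inside cleanup_after_vm(), the drain loop would run nothing and USB servicing would stall until the next interrupt, as in the original bug.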
I think your idea 2 is the best way to go about it for now. There isn't a single VM task. We don't have a way to know which tasks to drop because we're stopping and which we want to keep doing. Perhaps we should add a critical flag or something.
OK, I will make a PR using this method.
Update: https://github.com/adafruit/circuitpython/pull/4122 is submitted.
First reported by @theacodes. Considerable Discord discussion starting about 7pm ET 2021-01-12 in #circuitpython.
Seems to happen fairly often (though not always) on SAMD21. Not sure if it happens on SAMD51. Not tried on other boards.
Example: copying to a Metro M0 Express running near the tip of main. First erase the filesystem. Then, without entering the REPL, copy multiple larger files, e.g. a set like the WAV files listed above. For me, one out of two times this caused the copy to hang.
@theacodes reports that a workaround is to copy one file at a time or to enter the REPL, so that auto-reload is not happening, and then copy.
This problem was not evident to @theacodes on 5.3.1.