adafruit / Adafruit_nRF52_Bootloader

USB-enabled bootloaders for the nRF52 BLE SoC chips
MIT License
438 stars 393 forks source link

intermittent timeouts when flashing with adafruit-nrfutil #195

Closed khpeterson closed 3 years ago

khpeterson commented 3 years ago

About 1/5 attempts to flash new firmware fail with a timeout (show below). I'm running with:

feather_nrf52840_express Adafruit_nRF52_Bootloader: 0.4.1 adafruit-nrfutil: 0.5.3.post13 osx: 10.15.7 python: 3.6 (ecdsa==0.13.2, Click==7.0, pyserial==3.4)

I have a few feather boards on hand and all exhibit the same problem as do different cables.

Upgrading target on /dev/cu.usbmodem14101 with DFU package firmware.zip. Flow control is disabled, Single bank, Touch disabled Opened serial port /dev/cu.usbmodem14101 Starting DFU upgrade of type 4, SoftDevice size: 0, bootloader size: 0, application size: 375820 Sending DFU start packet Sending DFU init packet Sending firmware file ######################################## ######################################## ######################################## ######################################## ###########Timed out waiting for acknowledgement from device.

Failed to upgrade target. Error is: No data received on serial port. Not able to proceed. Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/adafruit_nrfutil-0.5.3.post13-py3.6.egg/nordicsemi/main.py", line 294, in serial dfu.dfu_send_images() File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/adafruit_nrfutil-0.5.3.post13-py3.6.egg/nordicsemi/dfu/dfu.py", line 235, in dfu_send_images self._dfu_send_image(HexType.APPLICATION, self.manifest.application) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/adafruit_nrfutil-0.5.3.post13-py3.6.egg/nordicsemi/dfu/dfu.py", line 206, in _dfu_send_image self.dfu_transport.send_firmware(firmware) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/adafruit_nrfutil-0.5.3.post13-py3.6.egg/nordicsemi/dfu/dfu_transport_serial.py", line 213, in send_firmware self.send_packet(pkt) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/adafruit_nrfutil-0.5.3.post13-py3.6.egg/nordicsemi/dfu/dfu_transport_serial.py", line 243, in send_packet ack = self.get_ack_nr() File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/adafruit_nrfutil-0.5.3.post13-py3.6.egg/nordicsemi/dfu/dfu_transport_serial.py", line 282, in get_ack_nr raise NordicSemiException("No data received on serial port. Not able to proceed.") nordicsemi.exceptions.NordicSemiException: No data received on serial port. Not able to proceed.

Possible causes:

khpeterson commented 3 years ago

I tracked this down to corruption in the packet handled by hci_transport.c/rx_vendor_specific_pkt_type_handle() which resulted in a bad CRC and no call to ack_transmit(). Still working to understand how the corrupted bytes got there / having a hard time believing the USB hardware on my MacBook is failing.

I tried compiling with DEBUG=1 to capture a log from the feather but found the debug image was too big:

LD feather_nrf52840_express_bootloader-0.4.1.out
/usr/local/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/bin/ld: region FLASH overflowed with .data and user data
/usr/local/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/bin/ld: section .bootloaderConfig LMA [00000000000fd800,00000000000fd857] overlaps section .data LMA [00000000000fd228,00000000000fd843]
collect2: error: ld returned 1 exit status
make: *** [_build/build-feather_nrf52840_express/feather_nrf52840_express_bootloader-0.4.1.out] Error 1
Kevins-MacBook-Pro:Adafruit_nRF52_Bootloader khp$ make BOARD=feather_nrf52840_express DEBUG=1
LD feather_nrf52840_express_bootloader-0.4.1.out
/usr/local/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/bin/ld: region FLASH overflowed with .data and user data
/usr/local/gcc-arm-none-eabi-9-2020-q2-update/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/bin/ld: section .bootloaderConfig LMA [00000000000fd800,00000000000fd857] overlaps section .data LMA [00000000000fd228,00000000000fd843]
collect2: error: ld returned 1 exit status
make: *** [_build/build-feather_nrf52840_express/feather_nrf52840_express_bootloader-0.4.1.out] Error 1

Short of adjusting the image start address is there an easy config change I can make to reduce the image size?

hathach commented 3 years ago

thanks for testing and reporting this, please keep us updated on what you found. Indeed the space is rather tight, and didn't have enough room for debug build. Hmm, maybe I will try to update the tinyusb to latest to see if that could help when having more time.

khpeterson commented 3 years ago

I found the corruption occurred on a 64-byte boundary (well, at index 63 in the rx_ff buffer which I believe excludes the leading 0xC0) so on a hunch I tried increasing the size of the endpoint buffers to 1K bytes. With this change I can't reproduce the timeout (tried 25 times where previously I would see a failure in roughly 1/5 attempts). This is not a fix, of course, but shows there's something going around packet chunking.

khpeterson commented 3 years ago

@hathach Is there a simple/safe config change you could recommend that would allow me to run the debug build? It would be super helpful to have debug logging available in the Segger RTT.

khpeterson commented 3 years ago

I captured a trace of the USB traffic with Wireshark and confirmed the bytes sent to the USB endpoint match what nrfutil sent but are wrong in the fifo. And as before the corruption occurs on a 64-byte boundary. Rather than fill this thread with byte streams I'm attaching the files for each point so you can diff for yourself. I'll debug further as I have time.

nrfutil-debug.txt wireshark-usb.txt fifo.txt

hathach commented 3 years ago

@hathach Is there a simple/safe config change you could recommend that would allow me to run the debug build? It would be super helpful to have debug logging available in the Segger RTT.

It has been a while since last time I have to debug it. It probably got overflow recently, I will try to revise the debug, maybe relax some info to save flash.

I captured a trace of the USB traffic with Wireshark and confirmed the bytes sent to the USB endpoint match what nrfutil sent but are wrong in the fifo. And as before the corruption occurs on a 64-byte boundary. Rather than fill this thread with byte streams I'm attaching the files for each point so you can diff for yourself. I'll debug further as I have time.

nrfutil-debug.txt wireshark-usb.txt fifo.txt

Great, this will help a lot with troubleshooting, I will try to update the tinyusb first to see if that helps. Hopefully it is not the hci packet issue. That code from sdk is a bit complicated even though I have dived in and out of it multiple times.

khpeterson commented 3 years ago

Looking at the diffs of the logs I captured it's clear that the the first 20-24 bytes of the next packet are written on top of the first bytes of the corrupted packet. In other words, two consecutive HOST->ENDPOINT transactions issued before the DMA of the first is issued and when the DMA does finally go, it copies corrupted bytes at the beginning of the packet.

One possibility: a race between writing NRF_USBD->SIZE.EPOUT[epnum] = 0 and setting the xfer->data_received flag. If the call chain back to dcd_edpt_xfer() gets there before the interrupt handler has had a chance to set the data_received flag, then xact_out_prepare() gets called and that initiates a HOST->ENDPOINT transfer even though one may already be running.

I'm testing this idea by inserting a spin wait (with a timeout) at the start of dcd_edpt_xfer() that waits for data_received before doing anything. Close but no cigar - my approach seems to close the window a little but I'm still seeing the corruption (though less frequently)

I'm happy to prepare a pull request but clearly something better than a spin wait is needed. Can you point me at a clean way to tell when dcd_edpt_xfer() is called for the first time at the beginning of a sequence of transfers for a packet? I also thought about checking to see if a HOST->ENDPOINT transfer is active but couldn't find a bit in the NRF52840 USBD registers that would indicate this.

khpeterson commented 3 years ago

I'm feeling more confident about my fix now... I just finished a scripted test run and got through 80+ flash trials with no problems. I commented out the spin wait and ran again and it died the first time (and the 2nd time) so I gave up. Timing/race condition bugs are hard so no guarantees here but it's looking good. But same question as before - is there an easy way to know for certain when a HOST-ENDPOINT sequence is starting in order to kick off the sequence cleanly?

hathach commented 3 years ago

@khpeterson tinyusb stack has lots of development in the last year which also include some of the racing condition fix. I have updated the stack to latest in the following PR , would you mind trying it out to see if it helps

https://github.com/adafruit/Adafruit_nRF52_Bootloader/pull/197

If it still doesn't help, can you submit your fix as a PR, it will make it easier for me to understand your point and also help to quickly test it out. Btw, how do you perform testing (command line or arduino IDE) and which application (sketch, binary) that you used to upload. I will try to mimic your set up as much as I could. Sometime the actual payload can also contribute to the factor especially if it is race condition.

khpeterson commented 3 years ago

I just tried your PR and it fails the same way.

Here's a PR with my fix: https://github.com/hathach/tinyusb/pull/723

For scripted testing, I created a dummy app (in mbed-os) that that drops into DFU mode when a certain BLE GATT characteristic is written. Then in the shell on OS X I just loop over "write char and exit; flash with nrfutil". The "write char and exit" is done via another OS X app that I wrote a while ago to read/write generic BLE characteristics.

Attached is the dummy app. I invoke nrfutil like so:

adafruit-nrfutil --verbose dfu serial -pkg dfu1.zip -p /dev/cu.usbmodem14101 -b 115200 --singlebank

dfu1.zip

hathach commented 3 years ago

I have pushed an PR that hopefully fix this issue without blocking wait https://github.com/hathach/tinyusb/pull/733 , would you mind trying it out to see if that works for your setup.

hathach commented 3 years ago

This issue is an bug in tinyusb stack, got the stack updated to fix the issue. There is no other changes required