Open fughilli opened 1 year ago
The failure condition is DAP_Cmd_queue->free_count == 0 && DAP_Cmd_queue->send_count == 4.
To aid with debugging, I've put together a fuzzing script that consistently puts the probe into this state:
import os
import random

import usb.core


def custom_match(dev):
    # Select the probe by USB serial number; devices whose descriptors
    # cannot be read are treated as non-matching.
    try:
        return dev.serial_number == "<your probe unique ID>"
    except Exception:
        return False


def generate_random_byte_string(max_length):
    # Random payload of 0 to max_length bytes.
    length = os.urandom(1)[0] % (max_length + 1)
    return os.urandom(length)


class AverageRateInvoker:
    # Calls target with probability invocation_rate on each maybe_invoke().
    def __init__(self, target, invocation_rate):
        self.target = target
        self.invocation_rate = invocation_rate

    def maybe_invoke(self, *args, **kwargs):
        if random.random() <= self.invocation_rate:
            self.target(*args, **kwargs)


if __name__ == "__main__":
    device = next(usb.core.find(find_all=True, custom_match=custom_match))
    # Take the interface at index 1 and unpack its two endpoints as (OUT, IN).
    out_ep, in_ep = device.get_active_configuration().interfaces()[1].endpoints()
    writes = 0
    reads = 0

    def write():
        global writes
        string = generate_random_byte_string(16)
        out_ep.write(string)
        writes += 1

    def read():
        global reads
        # Never read more responses than commands written.
        if reads >= writes:
            return
        response = in_ep.read(16)
        reads += 1

    write_invoker = AverageRateInvoker(write, 0.5)
    read_invoker = AverageRateInvoker(read, 0.5)
    while True:
        write_invoker.maybe_invoke()
        read_invoker.maybe_invoke()
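The traffic pattern this script produces can be examined without a probe attached. The sketch below replays only the write/read scheduling, using counters in place of USB transfers, and reports the largest number of commands outstanding at the end of any loop iteration. The function name max_outstanding and the fixed seed are hypothetical, added for illustration; nothing here talks to hardware or models the firmware.

```python
import random


def max_outstanding(iterations, rate=0.5, seed=0):
    # Replays the fuzzing loop's scheduling with counters instead of USB I/O.
    rng = random.Random(seed)
    writes = 0
    reads = 0
    worst = 0
    for _ in range(iterations):
        # write_invoker.maybe_invoke(): a write happens with probability `rate`.
        if rng.random() <= rate:
            writes += 1
        # read_invoker.maybe_invoke(): a read is attempted with probability
        # `rate`, and is skipped when no response is owed.
        if rng.random() <= rate and reads < writes:
            reads += 1
        worst = max(worst, writes - reads)
    return worst


print("max outstanding commands:", max_outstanding(10_000))
```

Because writes are not throttled against reads, the number of outstanding commands is not bounded at one, unlike a host that processes transactions one at a time.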
I am using DAPLink on an ATSAM3U2C HIC configured with DAPv2 in a CI setup, with PyOCD as the host interface, and have observed sticky failures of the DAPLink firmware that manifest as USB timeouts, e.g.:
A failure with the above trace would be observed during a CI run when a probe is already open. Subsequent CI runs using the same DAP would observe a different trace, e.g.:
Doing some digging into the DAPLink firmware, it appears that this is happening because the IN endpoint is stuck, i.e., the firmware is not putting any data into the transmit FIFO, so subsequent IN EP polls from the host time out. This is because the DAP_Cmd_queue has filled, and the current bulk implementation gates writing the IN EP data (the DAP command response) on the queue being not full:
source/usb/bulk/usbd_bulk.c:
source/daplink/cmsis-dap/DAP_queue.c:
The same gating is performed in the HID implementation as well.
I'm still not sure what the reproduction criteria are for causing this queue to fill; if PyOCD is always processing transactions one at a time, I would expect the queue never to have more than 1 element in it. Perhaps this has something to do with USB_ResponseIdle. I'm not sure what the semantics of this flag are, but it seems to be managed incorrectly in the bulk EP implementation, since it only ever gets written to 1 by the USBD_BULK_EP_BULKIN_Event implementation and gets immediately cleared to 0 in the calling context from USBD_BULK_EP_BULKOUT_Event. It seems like USBD_BULK_EP_BULKIN_Event can also be invoked from the main event loop when an interrupt is observed on the corresponding endpoint, but the only configured interrupt source for that is TX complete. As such, I'm guessing that we're observing two EP OUT events before the EP IN TX complete, resulting in a missed call to USBD_BULK_EP_BULKIN_Event when USB_ResponseIdle is clear in the second pass through USBD_BULK_EP_BULKOUT_Event. This would result in an off-by-one, where a response is queued in DAP_Cmd_queue that doesn't get egressed until the next command is received. After 3 such bubbles, the queue fills completely and we get stuck forever due to the conditional logic I highlighted above.
To address this bug, I propose unconditionally invoking USBD_BULK_EP_BULKIN_Event from USBD_BULK_EP_BULKOUT_Event. We'll still get chaining of packet egress since USBD_BULK_EP_BULKIN_Event gets invoked in the case of IN EP TX complete, so when the last write finishes, we'll enqueue another write to the FIFO. In the case that the queue is empty, we do nothing in either case.
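To make the suspected mechanism easier to reason about, here is a toy Python model of it. It is a sketch of the hypothesis only, not a port of the firmware: ToyProbe, simulate, and the idle_flag_clear argument are hypothetical names; the queue depth of 4 is inferred from send_count == 4 in the failure condition; the race is injected by hand on every other command; and TX-complete chaining is left out. The "unconditional" mode reflects one reading of the proposal, namely dropping the idle-flag check before the IN event is invoked.

```python
from collections import deque

QUEUE_DEPTH = 4  # assumed from send_count == 4 in the failure condition


class ToyProbe:
    """Hypothetical stand-in for the bulk endpoint handlers; not firmware code."""

    def __init__(self, unconditional_in_event):
        self.unconditional_in_event = unconditional_in_event
        self.queue = deque()  # responses waiting to be written to the IN EP
        self.sent = []        # responses the host would have received

    @property
    def send_count(self):
        return len(self.queue)

    @property
    def free_count(self):
        return QUEUE_DEPTH - len(self.queue)

    def in_event(self):
        # Models USBD_BULK_EP_BULKIN_Event: egress one queued response, if any.
        if self.queue:
            self.sent.append(self.queue.popleft())

    def out_event(self, command, idle_flag_clear=False):
        # Models USBD_BULK_EP_BULKOUT_Event. A full queue rejects the command
        # and writes nothing to the IN EP.
        if self.free_count == 0:
            return False
        self.queue.append("response to " + command)
        # idle_flag_clear simulates the suspected race: the flag reads as 0,
        # so the gated variant skips the IN event for this command.
        if self.unconditional_in_event or not idle_flag_clear:
            self.in_event()
        return True


def simulate(unconditional_in_event, commands=20):
    probe = ToyProbe(unconditional_in_event)
    for i in range(commands):
        # Assumption: every other command hits the suspected race.
        probe.out_event(f"cmd{i}", idle_flag_clear=(i % 2 == 0))
    return probe


for label, mode in (("idle-flag gated", False), ("unconditional", True)):
    probe = simulate(mode)
    print(label, "free_count =", probe.free_count,
          "send_count =", probe.send_count, "sent =", len(probe.sent))
```

In the gated mode each injected race leaves one more response behind, later commands only drain the oldest entry, and the model ends in the free_count == 0, send_count == 4 state, after which every command is rejected. In the unconditional mode the queue never accumulates a backlog. How many races the real firmware needs before it wedges depends on details this model leaves out.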