FIRST-Tech-Challenge / SkyStone

FTC SDK
https://www.firstinspires.org/robotics/ftc/what-is-first-tech-challenge

USB device cannot be re-opened after "stuck USB write" #145

Open Windwoes opened 4 years ago

Windwoes commented 4 years ago

When an Expansion Hub is disconnected, there are, of course, several possible states that the comms link could be in at that moment:

One of these conditions (probably the second) can cause this error to occur:

E/MonitoredUsbDeviceConnection: watchdog: stuck USB write(12 bytes) threadId=345 TID=21672:: serial=DQ1O48DC closing device

If that error occurs, the USB device cannot be re-opened by any normal means. None of the following will cause it to be re-detected:

The ONLY way to recover from this state is to physically unplug and replug the actual OTG adapter. This led me to conclude it might possibly be an Android bug. To add further evidence to that case, I tried using a 3rd party USB device enumerator app while the phone was in the bad state. It too failed to pick up the FTDI. Upon unplugging and replugging the actual OTG adapter, both apps were able to successfully find the device.

I dug into the SDK to find what code produces this error, and I found this comment on line 164 of MonitoredUsbDeviceConnection.java:

// We close the device in the hopes of waking up the call. We return a
// distinguished exception so that folks can in fact KNOW we closed the device. We
// also set up so that a read that detects the closed device will return the same sort
// of 'please try again' error rather than its usual 'i give up'.
//
// Note: closing in this way *does* seem to unstick the call. However, we've rarely/never
// been able to successfully re-open after that, even after waiting huge amounts of
// time (like a second). Rather, we get stuck in trying to set the baud rate, inside of
// a different native call (UsbDeviceConnection.native_control_request).
//

Thus it seems that someone on the Tech Team has seen this before, or at least something similar (notice the "we've rarely/never been able to successfully re-open after that").

I have a feeling this is an Android bug (and thus not much can really be done about it, except maybe on the Control Hub, since that OS could be patched easily), but I think it should still be looked into.

Windwoes commented 4 years ago

Well, this bug normally only happens like 1 in 25 times, but yesterday it disconnected me in a match. Normally my robot literally doesn't care about ESD at all, but if it's a particularly strong zap then the USB comms will be interrupted for usually around 300ms. However, yesterday Android decided that a real match would be a nice time to bite :(

I immediately knew it was this bug when it happened on the field, due to the tell-tale sign of the "problem with ExH portal" message appearing 5 seconds after the incident, and a postmortem analysis confirmed I was correct. (I.e. logs indicated "stuck USB write", and after carrying the bot off the field and replugging the OTG adapter, it picked it up right away)

I may have to take a look at the Android USB code and see if I can find anything suspicious.

cmacfarl commented 4 years ago

IIRC a while back we determined that it was the hardware itself that got in a stuck state, and the only recourse was a power cycle.

Is your venue particularly susceptible to ESD? If so, you might want to talk to the hosts about giving the ESD mitigation whitepaper a read. https://www.firstinspires.org/sites/default/files/uploads/resource_library/ftc/analysis-esd-mitigation-echin.pdf

Windwoes commented 4 years ago

@cmacfarl well that's not exactly the behavior I'm seeing. If I replug the actual OTG adapter it comes back. Replugging the ExH from the OTG adapter does nothing, though.

They do not take any mitigation measures as far as I'm aware, so I may mention that to them :)

Windwoes commented 4 years ago

Two more notes:

1) This bug is not, as far as I know, triggered by ESD. It can be reproduced without ESD as described in my opening post.

2) It seems that native_bulk_request ignores the timeout in this situation. The default timeout the SDK uses is 5000ms, which can result in OpModes getting stuck in a loop (even the default Stop Robot OpMode). I tried decreasing this timeout to 250ms, but it had no effect; it still acted like a 5000ms timeout was set. (See the sketch below for where that timeout normally gets supplied.)
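
For context on where that timeout lives: as far as I can tell, the write path ends at Android's UsbDeviceConnection.bulkTransfer(), which is the Java wrapper around native_bulk_request and takes the timeout as its last argument. A minimal sketch of that call, purely for illustration (the class and variable names are hypothetical, not SDK code):

import android.hardware.usb.UsbDeviceConnection;
import android.hardware.usb.UsbEndpoint;

// Hypothetical helper, not SDK code: shows where the per-write timeout is supplied.
public final class TimedBulkWrite {
    // bulkTransfer() returns the number of bytes transferred (or a negative value on
    // failure) and is documented to give up after timeoutMs. The behavior described
    // above is that in the stuck state it blocks as if ~5000ms were requested,
    // regardless of the timeout actually passed in.
    public static int write(UsbDeviceConnection connection, UsbEndpoint bulkOut,
                            byte[] payload, int timeoutMs) {
        return connection.bulkTransfer(bulkOut, payload, payload.length, timeoutMs);
    }
}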

Windwoes commented 4 years ago

Update: I've been doing some digging. Just for the heck of it, I unbound the Linux kernel driver for the battery management controller (which also happens to control turning on and off the boost converter for 5v out when using USB OTG). This had the result of the phone not supplying any power at all to the Expansion Hub (i.e. no LED instead of flashing orange), but comms were established if the Expansion Hub had 12v power.

However, I was still able to reproduce the issue with the driver for the battery management controller unloaded. (Not that I expected it to change anything, but you never know)

gearsincorg commented 4 years ago

Hi @FROGbots-4634

I have no real idea what you mean when you say: " I unbound the Linux kernel driver for the battery management controller ", but your comment about the phone not powering the Expansion hub had my interest piqued.

At the Tech Team's last conf call, I was asking whether it was expected behavior that when I plug my Moto G phones into an unpowered Expansion Hub for the first time, it does not connect and light the orange LED. The ZTEs always did this, and I thought the Moto G phones did too.

More recently with the Moto G phones, I've noticed what you describe: if I plug into a cold expansion hub, nothing happens until I power its 12V on. Then the app auto-launches and the lights go green. THEN if I power down, I get the expected flashing orange.

How does your discussion about the boost converter impact what I'm seeing, if at all? What determines when the app sees the expansion hub and auto-launches? Could it be something as trivial as the type of OTG cable I'm using?

I'm just following a thread down a rabbit hole, and wondering if changed OS behavior is making our disconnects more brutal.

Windwoes commented 4 years ago

@gearsincorg The Linux kernel allows drivers to be manually bound and unbound from a device without having to remove the whole driver with rmmod:

Screenshot from 2019-12-16 10-44-40

The battery management controller used in my Nexus 5 RC phone is the Texas Instruments BQ24192 and is controlled via an internal I2C bus.

For reference, here is the driver for the BQ24192 in my Nexus 5.

I simply unbound the I2C driver for the battery management controller by doing:

hammerhead:/ # echo '84-006b' > /sys/bus/i2c/drivers/bq24192/unbind

Now, regarding your comments of:

I've noticed what you describe: if I plug into a cold expansion hub, nothing happens until I power its 12V on. Then the app auto-launches and the lights go green. THEN if I power down, I get the expected flashing orange.

How does your discussion about the boost converter impact what I'm seeing, if at all? What determines when the app sees the expansion hub and auto-launches? Could it be something as trivial as the type of OTG cable I'm using?

So, as I previously mentioned, the driver for the battery management controller decides whether or not to enable the boost converter. With that driver unbound, the boost converter will never be enabled. Thus, with the driver unbound, if I remove the 12v power after comms were established, the ExH goes dark again. (Just for clarity's sake: if I leave the driver bound, I get the expected flashing orange when connecting to a cold ExH)

It is possible that the behavior you are seeing is caused by the driver for the battery management controller in your OS. However, it is also possible that it could be the type of OTG cable you are using, as apparently connecting the ID pin of the USB plug directly to GND, or to GND through a 124 kΩ, 68 kΩ, or 36.5 kΩ resistor, causes different behavior.

Windwoes commented 4 years ago

I've been hardcore pursuing this issue. Just thought I'd drop a note to say I captured the kernel log when this happens:

[  995.223705] anx7808 anx7808_cbl_det_isr: detect cable removal
[  995.223795] usb 1-1: USB disconnect, device number 2
[  995.629557] xhci-hcd xhci-hcd: remove, state 4
[  995.629852] usb usb2: USB disconnect, device number 1
[  995.633904] xHCI xhci_drop_endpoint called for root hub
[  995.633939] xHCI xhci_check_bandwidth called for root hub
[  995.645735] xhci-hcd xhci-hcd: USB bus 2 deregistered
[  995.646018] xhci-hcd xhci-hcd: remove, state 1
[  995.646291] usb usb1: USB disconnect, device number 1
[  995.647419] bq24192_external_power_changed: usb host mode = 2
[  995.648151] bq24192_enable_otg: otg enable = 0
[  995.648555] bq24192_enable_otg: FTC: PRINT STATEMENT IN KNOWN LOCATION
[ 1000.232536] xhci-hcd xhci-hcd: xHCI host not responding to stop endpoint command.
[ 1000.232971] xhci-hcd xhci-hcd: Assuming host is dying, halting host.
[ 1000.233463] xhci-hcd xhci-hcd: HC died; cleaning up
[ 1000.243309] hub 1-0:1.0: hub_port_status failed (err = -19)
[ 1000.243558] hub 1-0:1.0: connect-debounce failed, port 1 disabled
[ 1000.245073] xHCI xhci_drop_endpoint called for root hub
[ 1000.245104] xHCI xhci_check_bandwidth called for root hub
[ 1000.247514] xhci-hcd xhci-hcd: USB bus 1 deregistered

gearsincorg commented 4 years ago

In this case is the Expansion Hub the "Host" ?

Windwoes commented 4 years ago

Negative. The "host" is the "DesignWare USB3 DRD Controller" in the phone.

Windwoes commented 4 years ago

I compiled a custom version of the Linux kernel for my phone with a workaround (though maybe "hack" would be a better term) for this issue.

I edited dwc3-msm.c and added an extra file to the debugfs:

    if (!debugfs_create_file("frog_reset", S_IRUGO | S_IWUSR,
                dwc3_debugfs_root, mdwc, &frog_reset_write_fops))
        goto error;

I added this file-ops listener to it:

const struct file_operations frog_reset_write_fops = {
    .write = frog_reset_write,
};

So, when the debugfs file is written to, this function is invoked:

static ssize_t frog_reset_write(struct file *file, const char __user *ubuf,
                size_t count, loff_t *ppos)
{
    char buf[8];

    memset(buf, 0x00, sizeof(buf));

    if (copy_from_user(&buf, ubuf, min_t(size_t, sizeof(buf) - 1, count)))
        return -EFAULT;

    if (!strncmp(buf, "zombie", 6))
    {
        frog_toggle_host_mode();
        printk("FTC: FROG RESET WRITE\n");
    }

    return count;
}

As you can see, if you write "zombie" to the debugfs file, then frog_toggle_host_mode() will be called.

That is a function I added to a different file, dwc3_otg.c:

void frog_toggle_host_mode(void)
{
    printk("FTC: dwc3_otc.c: FROG_TOGGLE_HOST_MODE\n");
    dwc3_otg_start_host(context, 0);
    dwc3_otg_start_host(context, 1);
}

Essentially, it simply forces the controller into peripheral mode, then back to host mode.

(Note, I had to add static struct usb_otg *context; and context = &(dotg->otg); in the dwc3_otg_init function.)

I can trigger the reset simply by doing:

hammerhead:/sys/kernel/debug/msm_dwc3 # echo zombie > frog_reset

Here's a video showing this in action. First, I demonstrate that the RC is in the bad state where it doesn't pick up the ExH even after multiple tries unplugging and replugging the USB cable from the OTG adapter. Then, I leave the ExH plugged in and run the aforementioned shell command from a remote shell on my computer, and the phone picks it up right away (notice the LED change from flashing blue to green).

So, now I just need to set up my OpMode with a special button combo on my gamepad which will cause the RC to locally invoke that shell command. Then I can essentially have a "zombie apocalypse" trigger to recover comms in a real match, should I ever need to use it. (Of course, now that I've done all the work to implement this, watch how I'll never need to use it... sigh.)

gearsincorg commented 4 years ago

Freaking cool..

So, the $60K question... How does this relate to a general FTC solution? I assume it can be built into the CH OS code, but is there any hope for regular Phone users?

Windwoes commented 4 years ago

Sadly, this is not easily portable to other phones. Recompiling a modified kernel for each phone would be required. I'm not sure whether the CH is affected by this issue or not - that would depend on the USB controller it uses as well as how the ports are physically wired to said controller.

Additionally, this is very much a "hack" (for one thing, it crashes the kernel if you run it without an OTG adapter connected). However, this is a good starting point for hunting down the bug.

According to the kernel log I posted earlier, it appears that what is actually happening when the bad state occurs is that the USB controller is turning off host mode while the OTG adapter is still connected. That's why replugging the OTG adapter fixes the issue but replugging the ExH from the adapter does not: when the adapter is reconnected, host mode is activated again. Or so goes my theory.

I can see about trying to upload a diff of the changed files so that you could recompile the kernel for your phones, if you're interested.

gearsincorg commented 4 years ago

I can see about trying to upload a diff of the changed files so that you could recompile the kernel for your phones, if you're interested.

Thanks, but I don't think I'm ready to go down that rabbit hole :)

What I find interesting is that the phone is reporting:

anx7808 anx7808_cbl_det_isr: detect cable removal

So, I wonder how it determines this. Is it the 2 data line levels going to a default state? Or could it be the 5th OTG pin level dropping?

Windwoes commented 4 years ago

So, I wonder how it determines this. Is it the 2 data line levels going to a default state? Or could it be the 5th OTG pin level dropping?

Well, if we look at slimport.c, we can see that it registers the ISR on the rising and falling edges of the USB_ID pin:

    //detect pin would not work if we don't configure and request USB_ID pin (gpio 77)
    ret = request_threaded_irq(anx7808->cab_irq, NULL, anx7808_cbl_det_isr,
                    IRQF_TRIGGER_RISING
                    | IRQF_TRIGGER_FALLING,
                    "anx7808_cabel_det", anx7808);

Here's the ISR itself:

static irqreturn_t anx7808_cbl_det_isr(int irq, void *data)
{
    struct anx7808_data *anx7808 = (struct anx7808_data *)data;
    int status;
    SP_DEV_DBG("%s : gpio_get_value(anx7808->pdata->gpio_cbl_det) = %d\n", __func__,gpio_get_value(anx7808->pdata->gpio_cbl_det));
    if (gpio_get_value(anx7808->pdata->gpio_cbl_det)) {
        wake_lock(&anx7808->slimport_lock);
        SP_DEV_DBG("%s : detect cable insertion\n", __func__);
        queue_delayed_work(anx7808->workqueue, &anx7808->work, 0);
        msm_otg_id_pin_irq_enabled(false);
    } else {
        SP_DEV_DBG("%s : detect cable removal\n", __func__);
        status = cancel_delayed_work_sync(&anx7808->work);
        if(status == 0)
            flush_workqueue(anx7808 ->workqueue);
        wake_unlock(&anx7808->slimport_lock);
        msm_otg_id_pin_irq_enabled(true);
    }
    return IRQ_HANDLED;
}

NB: Slimport is a technology similar to MHL, i.e. for HDMI out over USB. (The Nexus 5 has an ANX7808 Slimport chip).

rischconsulting commented 4 years ago

I was under the impression that 5.4 had some sort of fix for this: "Fixes system responsiveness issue when an Expansion Hub is disconnected"

Indeed, when we get the "stuck USB write" error, we are now getting a different orange error message saying that the expansion hub is detached, and the Driver Station remains responsive. But we still aren't able to restart the robot without pulling the OTG cable. Is anyone aware of what the behavior is supposed to be? Or what we can do to reset the robot without pulling the OTG?

Windwoes commented 4 years ago

@rischconsulting Negative. "Fixes system responsiveness issue when an Expansion Hub is disconnected" means that if the daisy-chained Hub on the 485 bus becomes unresponsive, your entire loop won't become horribly laggy due to continual retransmissions and timeouts when sending commands to the unresponsive Hub.

But we still aren't able to restart the robot without pulling the OTG cable. Is anyone aware of what the behavior is supposed to be? Or what we can do to reset the robot without pulling the OTG?

This IS NOT a bug in the FTC SDK. It is either a hardware bug in the USB transceiver, or a bug in the kernel USB transceiver driver. There is nothing you can do to recover from this without unplugging and replugging the OTG cable, unless you feel like recompiling the Linux kernel for your phone to add a hook in the debugfs that invokes a (very hacky) reset of the USB transceiver like I did.

rischconsulting commented 4 years ago

We just might - we're pulling our hair out. I guess we need to figure out the cause of the error.

It is interesting, though, that the error reported has now changed, which means that somebody did something on this.

Windwoes commented 4 years ago

I guess we need to figure out the cause of the error.

The cause of the error is completely random chance when the USB device disconnects. It can be reproduced without ESD by simply continually unplugging and replugging the OTG adapter. Eventually, on one of the unplugs, you will hit a stuck USB write.

It is interesting, though, that the error reported has now changed, which means that somebody did something on this.

The error changes because the SDK now detects unresponsive hubs as opposed to just outright missing hubs. Since everything disappears permanently off the USB bus, the SDK cannot communicate with the Hub and marks it unresponsive. The changes that were made in v5.4 have absolutely nothing to do with the stuck USB write condition specifically. You can review the diff for yourself if you don't believe me :)

rischconsulting commented 4 years ago

I totally believe you - I already read through the diffs and found nothing, which is why I asked!

The randomness is killer - sometimes we can run for hours with no error, and sometimes we get one every minute. If it's android based, I wonder whether there's a version of Android that doesn't have the problem. Our e4's are on 8.0

Windwoes commented 4 years ago

I don't think it's Android version specific. I think it's caused by either certain USB transceivers or the drivers for said transceivers. What I did was I bound a button on my gamepad to invoke the hook in my kernel to reset the USB transceiver:

if(rstToggle.shouldToggle(gamepad2.right_stick_button))
{
    new RunShellCommand().runAsRoot("echo zombie > /sys/kernel/debug/msm_dwc3/frog_reset");
}

Then if I ever hit the issue in a match, I can be back up and running again in <2s by simply pressing a button on the gamepad.
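
(rstToggle isn't shown above; something like the following rising-edge helper would serve the same purpose, returning true only once per press so the shell command isn't spammed every loop. The class name here is hypothetical, not SDK code:)

// Hypothetical rising-edge helper (not SDK code): shouldToggle() returns true only on
// the loop iteration where the button transitions from released to pressed.
public class ButtonToggle {
    private boolean wasPressed = false;

    public boolean shouldToggle(boolean isPressed) {
        boolean risingEdge = isPressed && !wasPressed;
        wasPressed = isPressed;
        return risingEdge;
    }
}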

1tree commented 4 years ago

At the end of the day, this is likely going to need a fix at the driver level. But... I suspect that the best chance for this is going to be something like what you have done.

Windwoes commented 4 years ago

If it's a hardware bug, then a fix at the driver level would probably not be possible. I don't really know whether it's a driver bug or a hardware bug....

For my patch though, I could probably make it much less hacky (i.e. not able to crash the kernel if run without USB devices connected) by not just outright ignoring the massive state machine in the driver, lol. On the other hand, I need it to serve one purpose, and it does that nicely.

Of course, this is a non-issue with Control Hubs (well unless you connect the second Hub over USB instead of 485) because in the CH the embedded Lynx PCB is connected directly to the Android SBC over UART.

Nit for posterity: When connecting to a Hub over USB, you're actually connecting to an FTDI USB to UART bridge... the Control Hub just removes the middleman FTDI.

Windwoes commented 4 years ago

During a scrimmage yesterday, static on the field was very very bad (I was literally hearing arcs when driving under the skybridge), and during one match I got a stuck USB write and gave my "zombie apocalypse" button its first test when it really mattered - and it brought it back :)

Of course I'd prefer for this never to happen in the first place, but it's comforting to know that static isn't going to knock me out for the remainder of a match.

FrowntownPitt commented 4 years ago

I'd like to add my own observations over the last several events I've FTA'd (most notably from this weekend) with regards to this issue.

I agree that this issue is not necessarily caused by ESD; anecdotally, at some events where this happens no static is noticed during normal field operation (touching robots/the field without any shocks all weekend).

I started to pull logs any time this issue arises, and actually started to comb through them to understand what is happening. Prior to this weekend, it was always a "E MonitoredUsbDeviceConnection: watchdog: stuck USB write(14 bytes) threadId=424 TID=0:: serial=DQ2M0VUL closing device" log at the time the error happened. When the app closes the connection and attempts to open it again, it does an "open" and then a "close" immediately after:

02-15 16:03:42.172  4902 20278 V FtDevice: ^^********************DQ2M0VUL opened ********************^^
02-15 16:03:42.172  4902 20278 V FtDevice: resetting DQ2M0VUL
02-15 16:03:42.205  4902 20291 E BulkPacketInWorker: DQ2M0VUL: bulkTransfer() error: -1
02-15 16:03:42.239  4902 20278 V RobotUsbDeviceFtdi: closing DQ2M0VUL
02-15 16:03:42.239  4902 20278 V FtDevice: vv********************DQ2M0VUL closing********************vv 0x03baef14

This weekend I noticed a different pattern with a different configuration. The typical configuration is Phone -> OTG adapter -> Exp hub. In the following case, it's Phone -> OTG adapter -> USB hub -> Exp hub and camera. What the logs show in this case is that the phone's USB doesn't get stuck, but rather doesn't get acks back. In the typical configuration, a physical unplug/replug of the OTG adapter from the phone resets the connection, but in this case the connection between the USB hub and the Exp hub has to be "replugged".

02-16 13:59:56.924 26658 27861 V LynxModule: retransmitting: mod=2 cmd=0x100f msg#=16 ref#=0 
02-16 13:59:57.061 26658 27861 V LynxModule: retransmitting: mod=2 cmd=0x100f msg#=16 ref#=0 
02-16 13:59:57.097 26658 27861 V RobotCore: timeout: abandoning waiting 250ms for ack: cmd=LynxSetMotorConstantPowerCommand mod=2 msg#=16
02-16 13:59:57.098 26658 27861 V RobotCore: nack rec'd mod=2 msg#=16 ref#=0 reason=ABANDONED_WAITING_FOR_ACK:257
02-16 13:59:57.098 26658 27861 W LynxMotor: LynxSetMotorConstantPowerCommand was abandoned waiting for ack
02-16 13:59:57.276 26658 27861 V LynxModule: retransmitting: mod=2 cmd=0x1009 msg#=17 ref#=0 
02-16 13:59:57.415 26658 27861 V LynxModule: retransmitting: mod=2 cmd=0x1009 msg#=17 ref#=0 
02-16 13:59:57.449 26658 27861 V RobotCore: timeout: abandoning waiting 250ms for response: cmd=LynxGetMotorChannelModeCommand mod=2 msg#=17
02-16 13:59:57.450 26658 27861 V RobotCore: nack rec'd mod=2 msg#=17 ref#=0 reason=ABANDONED_WAITING_FOR_RESPONSE:256
02-16 13:59:57.450 26658 27861 W LynxMotor: LynxGetMotorChannelModeCommand was abandoned waiting for response
02-16 13:59:57.639 26658 27861 V LynxModule: retransmitting: mod=2 cmd=0x1009 msg#=18 ref#=0 

and so on, until something happens: the app crashes, you physically restart the app, replug the Exp hub, or the like.


Now, for some speculation. The first lead is the observation of differing behavior between the hardware configurations (having a USB hub middleman). The phone gets stuck on USB when tied in directly to the Exp hub, versus having messages dropped with the USB hub middleman. The commonality in the connection being reset is the Exp hub, since in the normal configuration it's the phone/Exp hub link, and in the USB hub configuration it's the Exp hub/USB hub link.

The roads are leading to problems with the Exp hub, in one or more ways. The first speculation is that the Exp hub's FTDI firmware/hardware is freezing up (going into a bad state) in a certain circumstance. The other speculation is that the Exp hub is freezing up its host's FTDI driver/firmware/hardware.

This is a wild, but somewhat educated, speculation. I don't know much about how the USB protocol, hardware, or lowest level driver work.

rischconsulting commented 4 years ago

Noting that (to add color to FrowntownPitt's speculation):

  1. We had no disconnects this weekend with 5.4 (after putting in ESD mitigation)
  2. The error changed for us when we went to 5.4
  3. We had the Phone->USB Hub->Camera/Exp Hub already, but still got the USB Stuck error with 5.3

Windwoes commented 4 years ago

In the typical configuration, a physical unplug/replug of the OTG adapter from the phone resets the connection, but in this case the connection between the USB hub and the Exp hub has to be "replugged"

Wait. So you're saying that with the middleman configuration, replugging the OTG adapter from the phone does not fix the issue, but replugging the ExH from the USB hub does fix it? Interesting. I ran with the middleman configuration last year, and I'm pretty sure I got stuck USB writes that could be fixed by replugging the OTG adapter, but I wasn't keeping very close tabs on it then so I can't say that with certainty.

The roads are leading to problems with the Exp hub, in one or more ways. The first speculation is that the Exp hub's FTDI firmware/hardware is freezing up (going into a bad state) in a certain circumstance. The other speculation is that the Exp hub is freezing up its host's FTDI driver/firmware/hardware.

I don't think it's the first one, because as of Lynx firmware v1.8.2 it will pull the reset line on the FTDI if a KA-timeout occurs. Besides, if it was that, a replug of the connection to the OTG adapter would fix it. Of your two, I would think the second is more likely. But here's the thing to remember - I can reproduce this without ESD at all simply by unplugging the OTG adapter while the connection is live. Nearly always that simply produces an "ExH portal missing" error like you'd expect, but occasionally you'll get the stuck USB write. I'm left scratching my head trying to figure out how the FTDI could freeze the host if it's been unplugged....

I think @gearsincorg may be on to something when he said:

cbl_det_isr: detect cable removal

So, I wonder how it determines this. Is it the 2 data line levels going to a default state? Or could it be the 5th OTG pin level dropping?

I posted the ISR that produced that message earlier. It looks like it was simply listening for the rising and falling edges of the USB_ID pin. So this seems to suggest that something very funny is going on electrically, with the state of the USB_ID pin perhaps bouncing and confusing the host driver?

I think that, at the end of the day, the only way to get to the bottom of this is to snoop the USB bus with both a USB packet sniffer and an o-scope while reproducing the problem in a controlled environment without ESD.

Windwoes commented 4 years ago

@rischconsulting

Noting that (to add color to FrowntownPitt's speculation):

1. We had no disconnects this weekend with 5.4 (after putting in ESD mitigation)

2. The error changed for us when we went to 5.4

3. We had the Phone->USB Hub->Camera/Exp Hub already, but still got the USB Stuck error with 5.3

Regarding point 3, when that happened, how could the fault be cleared? Replugging OTG adapter? Or replugging ExH from USB hub?

rischconsulting commented 4 years ago

OTG to phone. We couldn't unplug ExH from USB hub if we wanted to (not at the ExH, anyway)

Windwoes commented 4 years ago

It's just so odd that although ESD can cause a temporary USB disconnection which could induce the issue, ESD itself is not the cause. If it really was the USB_ID pin state bouncing, wouldn't ESD cause it to bounce? On a related note, I got some nice arcs while testing auto last night. The effect of the specific arc in the photo below was a USB disconnection and reconnection in quick succession, which caused my auto to abort itself due to getting all-0 data from sensors. But no stuck USB write.

Screenshot_20200217-211311

FrowntownPitt commented 4 years ago

Wait. So you're saying that with the middleman configuration, replugging the OTG adapter from the phone does not fix the issue, but replugging the ExH from the USB hub does fix it? Interesting. I ran with the middleman configuration last year, and I'm pretty sure I got stuck USB writes that could be fixed by replugging the OTG adapter, but I wasn't keeping very close tabs on it then so I can't say that with certainty.

Yes. It required a replug on the expansion hub link (from the USB hub, since the teams were using the retention mount on the expansion hub). In these particular configurations from this weekend, I will note they used the REV USB hubs. I do not know if they were powered separately or not. I'm going off of only a few scenarios from the limited exposure I have to robots that have problems.

As an aside, there was a single instance where I noticed something interesting, from an event a couple of weeks ago. One of the teams (9872, (In)Formal Logic) dropped in semifinals, and in my typical postmortem I decided to do something a little differently. The wire configuration was OTG adapter (micro to A female), then A-to-mini. I replugged the A adapter side (not the micro from the phone port), and this did not reset the connection. It was only after I replugged the micro port from the phone that it worked. I was never able to reproduce this situation, across 3 more events, as the replug from the A side did end up resetting the connection properly.

So, I wonder how it determines this. Is it the 2 data line levels going to a default state? Or could it be the 5th OTG pin level dropping?

I had this speculation with the 9872 scenario: that it has something to do with the OTG signal being affected by either ESD or the rough conditions of robot play. What I speculated was that this OTG signal was bounced, and it took a full replug to properly reset the connection. After not being able to reproduce it, I figured that path was no longer the way to go - but it may be a lead after all.

I've never had a problem with my little demo robot (4 motors, all encoders, using IMU but no other sensor), and I've never had the chance to attempt reproducing the failures on a more complicated configuration. In the cases that happen at events, they tend to be recurring teams, but it is not consistent across matches.

That picture is heckin' cool. I may play around with intentionally zapping individual connections to see what happens.

If it really was the USB_ID pin state bouncing, wouldn't ESD cause it to bounce?

ESD, or rough play. I never counted ESD out, but it isn't necessarily the cause. After all, "this only happens at competitions"... Rough conditions may cause individual leads (or all the contacts of a plug) to become momentarily disconnected, which would affect how the logic signals are transferred and interpreted. Again, speculation.

Windwoes commented 4 years ago

As an aside, there was a single instance where I noticed something interesting, from an event a couple of weeks ago. One of the teams (9872, (In)Formal Logic) dropped in semifinals, and in my typical postmortem I decided to do something a little differently. The wire configuration was OTG adapter (micro to A female), then A-to-mini. I replugged the A adapter side (not the micro from the phone port), and this did not reset the connection. It was only after I replugged the micro port from the phone that it worked. I was never able to reproduce this situation, across 3 more events, as the replug from the A side did end up resetting the connection properly.

Does this mean that you've seen stuck USB writes where the fault can be cleared without replugging the OTG adapter on the micro-B side? If so, that completely goes against all of my findings.

FrowntownPitt commented 4 years ago

Does this mean that you've seen stuck USB writes where the fault can be cleared without replugging the OTG adapter on the micro-B side?

In the cases where the fault was cleared by unplugging the A side, I did not look at the logs. It may have not been the "stuck USB write" issue, or there may be different circumstances where this fault occurs. I'll keep a look out for it next time.

jmaff commented 4 years ago

I've been experiencing this same issue during our last few competitions as well as occasionally in practice. My findings are similar to what's been discussed here in that there's always the stuck USB write error thrown and the only solution is physically unplugging and replugging the OTG to phone connector. We've begun taking more precautions in terms of preventing ESD or cable wobble, but still find that there will be the occasional unlucky short loss in USB comms that happens to be during a write and so gets the controller stuck in this bad state. Has any more progress been made on this issue?

I use Moto G4's and so the solution of editing and recompiling a modified kernel seems to be more difficult not to mention I don't have much experience with linux source to really know what I'm doing. Is this really the only "hacky" solution that can be done or is there possibly something that's less extreme than having to directly modify phone firmware in order to force the controller back in Host mode? (maybe something like figuring out how to turn power off/on to the USB port programmatically?)

It's just frustrating that a robot can go dead in a match with a problem that should have a software solution but is so low level that a general solution is hard to roll out and teams just have to accept being dead in a match despite their connections still being plugged in fine.

Windwoes commented 4 years ago

I use Moto G4's and so the solution of editing and recompiling a modified kernel seems to be more difficult not to mention I don't have much experience with linux source to really know what I'm doing. Is this really the only "hacky" solution that can be done or is there possibly something that's less extreme than having to directly modify phone firmware in order to force the controller back in Host mode? (maybe something like figuring out how to turn power off/on to the USB port programmatically?)

Believe me, I tried many things to be able to reset the USB host controller without a kernel driver modification. I was able to reset the client portion of the USB stack without doing so (e.g. resulting in the phone disappearing from the computer it was connected to and then re-appearing), but it didn't work to recover from the stuck USB write.

It's just frustrating that a robot can go dead in a match with a problem that should have a software solution but is so low level that a general solution is hard to roll out and teams just have to accept being dead in a match despite their connections still being plugged in fine.

This is the rationale for the Control Hub. For starters, the connection to the internal Lynx PCB does not go over USB at all, but rather through a direct UART link. This means not only better reliability, but also reduced latency when issuing hardware commands. Furthermore, since it's a single device for which REV / the Tech Team has complete control software-wise, they could easily release a fix for a low-level issue and have it be available for all teams.

That being said, there are still things I don't like about the Control Hub: no power source besides the main 12v battery, no screen (which makes debugging hard), you're forced to use a webcam, etc.

Windwoes commented 4 years ago

One thing I will say is that just because I couldn't find a way to reset the USB host controller on my specific phone without a kernel modification does not mean that holds true for all phones. Unlock the bootloader on that bad boy, flash TWRP recovery, install LineageOS and the root ZIP, and start poking around :)