daynix / UsbDk

Usb Drivers Development Kit for Windows
Apache License 2.0
535 stars 142 forks source link

State corruption during fast USB reconnections #32

Closed karalabe closed 7 years ago

karalabe commented 7 years ago

Hello,

It took me a while to figure out approximately what happens, but I've no idea exactly where things go wrong inside the driver. I'm trying to use a crypto wallet (Ledger Nano S) through UsbDK -> libusb -> gousb. The way this wallet works is that whenever an application running on the USB dongle starts or stops, it resets the USB device, meaning an extremely fast disconnect-reconnect sequence.

Now the problem is that this fast reconnection corrupts some state within the driver, because after some number of these reconnects (may be the first, may be the 10th, usually blows up quite fast), the opening the USB through UsbDK errors out with:

Driver file operation error. DeviceIoControl failed (Cannot create a file when that file already exists. Error code = 183)

After this error appears, it doesn't matter if the device is completely unplugged and plugged back in arbitrarily later, the driver will just get stuck displaying this error message for every open operation. The entire process needs to be restarted to successfully open the device again.

My hunch is that the fast reconnect triggers some race condition within the driver, which leaks out a file handle, causing all subsequent opens to fail forever. I've tried to debug this but unfortunately I'm way over my head here, so I'd really appreciate if you could take a peek as to what might be wrong.

Thanks, Peter

karalabe commented 7 years ago

Here's a trace I captured beginning from plugging in the device, until it managed to break the internal state of the driver: https://gist.github.com/anonymous/f4bde693a4599926f989bbea783beddf

karalabe commented 7 years ago

I think I managed to figure out where the issue comes from (from a high level point of view).

And here's issue. Since the driver apparently does reference counting based on the bus/address ID of the device, it merges together the references of multiple disjoint connections that happen to reuse the same ID. After that all internal invariants break, since there's no more meaningful life cycle management.

Btw, the way I discovered this is that I changed my code to first try to ping all devices I am currently attached to, if they don't respond call the close operation on them; and only afterwards do the listing for new devices. As this seems to work, it's more or less apparent that device listing actually bumps the reference count of a disconnected device.

karalabe commented 7 years ago

Btw, even now the fast disconnects/reconnects sometimes corrupt some internal state. Device discovery finds an available USB device, but trying to connect to it sometimes hangs for 2 minutes (some 120 second timeout I saw in your code) then results in:

Driver file operation error. DeviceIoControl failed (The system cannot find the file specified. Error code = 2)

This however can be worked around with a disconnect/reconnect cycle without retaining the invalid state given my reference count workaround above.

dmitryfleytman commented 7 years ago

@karalabe, I'm looking into the log you posted, and see there are 2 entries for your device with different serial numbers (0.1 and 3):

[8296] [5]2480.2308::01/29/2017-23:03:51.439 [UsbDk]CUsbDkRedirection::Dump Redirect: DevID: USB\VID_2C97&PID_0001, InstanceID: 0.1 [8296] [8296] [5]2480.2308::01/29/2017-23:03:51.439 [UsbDk]CUsbDkRedirection::Dump Redirect: DevID: USB\VID_2C97&PID_0001, InstanceID: 3

Are there 2 separate devices indeed?

karalabe commented 7 years ago

No, it's only one physical device.

karalabe commented 7 years ago

Not sure what the device reports though when starting or stopping these "apps" on it.

dmitryfleytman commented 7 years ago

Could you check this with UsbView or device manager? Pay attention, that In case this device changes it's serial number, UsbDk treats it as multiple unrelated devices.

Last part of the log hints that client tries to redirect a device that is already redirected:

[8296] [8296] [5]2480.26A0::01/29/2017-23:04:02.860 > [UsbDk]CUsbDkControlDevice::AddRedirectionToSet failed. Device already redirected. [8296] [8296] [7]0004.2128::01/29/2017-23:04:02.860 [UsbDk]CUsbDkFilterDevice::ContextCleanup Entry [8296] [8296] [6]2480.0FDC::01/29/2017-23:04:03.865 [UsbDk]CUsbDkControlDevice::AddRedirectionToSet failed. Device already redirected. [8296] [9344] Driver file operation error. DeviceIoControl failed (Cannot create a file when that file already exists. Error code = 183) [8296] [3]2480.2020::01/29/2017-23:04:04.870 [UsbDk]CUsbDkControlDevice::AddRedirectionToSet failed. Device already redirected. [8296] [9344] Driver file operation error. DeviceIoControl failed (Cannot create a file when that file already exists. Error code = 183) [8296] [1]2480.13F4::01/29/2017-23:04:05.900 [UsbDk]CUsbDkControlDevice::AddRedirectionToSet failed. Device already redirected. [8296] [8296] [1]2480.26A0::01/29/2017-23:04:06.909 [UsbDk]CUsbDkControlDevice::AddRedirectionToSet failed. Device already redirected. [8296]

etc.

This means that client did not close its device handle because redirection is removed automatically when handle gets closed. The fact that UsbDk functionality normalizes when client process exits is also an evidence for that. Windows closes all handles automatically on process termination.

Please verify that your application always closes its device handle.

dmitryfleytman commented 7 years ago

@karalabe, are you using UsbDk version compiled by yourself? If so, could you please apply the patch attached and collect new logs?

0001-ControlDevice-Always-log-full-device-ID-including-se.txt

karalabe commented 7 years ago

Here are the two logs from UsbView:

The iSerialNumber field does swap between two values depending on whether I'm in the menu of the USB or running an app on it. However, some peculiarity I still see that the Device Address field from UsbView does get incremented every time I start/stop apps on the thing, but inside my own code the address is reported static and non changing (it does increment properly under Linux).

Here's a log with your patch applied running on my code before I reorged it.

karalabe commented 7 years ago

Here's another interesting log running with my current code that explicitly closes connections first before even looking at potentially new devices' arrival. The reason this seems interesting is because looking at the device purely through UsbView, it swaps its iSerialNumber between "0.1" (inside the app) and none set outside the app. I've tried for quite a number of cycles and the serial number stays consistently the same for the app and the same for the menu.

However, looking at the logs I linked, you can see that originally I'm in the menu ID: 3, and then start an app, switching to ID: 0.1. Then however when I leave the app, I have

[1772] [5]0004.208C::01/30/2017-18:23:34.452 [UsbDk]CWdmList<CUsbDkChildDevice,CWdmSpinLock,CCountingObject,CScalarDeleter<CUsbDkChildDevice> Deleting child object:
[1772] [5]0004.208C::01/30/2017-18:23:34.452 [UsbDk]CUsbDkChildDevice::Dump Child device 0xFFFFB90446A856D0:
[1772] [5]0004.208C::01/30/2017-18:23:34.452 [UsbDk]CRegText::Dump ID: USB\VID_2C97&PID_0001
[1772] [5]0004.208C::01/30/2017-18:23:34.452 [UsbDk]CRegText::Dump ID: 0.1

followed directly by

[1772] [1]0004.2740::01/30/2017-18:23:34.642 [UsbDk]CUsbDkHubFilterStrategy::RegisterNewChild Registering new child (PDO: FFFFB9044755A7E0):
[1772] [1]0004.2740::01/30/2017-18:23:34.642 [UsbDk]CRegText::Dump ID: USB\VID_2C97&PID_0001
[1772] [1]0004.2740::01/30/2017-18:23:34.642 [UsbDk]CRegText::Dump ID: 0.1

The ID remains the same, whereas according to UsbView, it never happened that the same iSerialNumber was retained while exiting or entering the app; it always swapped. Right after the above happens, a few logs later the driver reports an error:

Driver file operation error. DeviceIoControl failed (The system cannot find the file specified. Error code = 2)

codido commented 7 years ago

Perhaps related to that, Android's AOA (https://source.android.com/devices/accessories/aoa.html) seems to break usbdk as well, sometimes resulting in the error message above (until usbdk.sys is uninstalled or the system is rebooted).

Following the initial AOA handshake, Android re-enumerates as a different device (VID/PID/bcdDevice) but obviously still connected to the same port. Trying to close the old device results in a stall, and the port is never cycled (as I understand redirection is kept) even after the host app exits.

dmitryfleytman commented 7 years ago

@codido, does this happen with the latest USBDK version (1.0.15)? A very similar issue was fixed by commit bf79d6e9425b1587aaa542c31912f5f6a574cd64.

codido commented 7 years ago

@dmitryfleytman, afraid so, yes.

dmitryfleytman commented 7 years ago

Hi @karalabe, I added more logs into Add/Remove redirect flows of UsbDk (see patch attached), could you please apply this patch in addition to my previous patch and collect logs of the original problem (before your code modifications)?

In addition, could you please reproduce the issue and collect log produced by UsbDk when the process is being closed?

Thanks, Dmitry

0001-ControlDevice-Add-more-traces-for-redirection-set-mo.txt

dmitryfleytman commented 7 years ago

@codido, this looks like another issue. Could you please collect UsbDk log as described here: https://github.com/daynix/UsbDk/blob/master/Documentation/Tracing.txt#L52 and share it with us? It would be better if you open another issue because it really looks like a different problem.

BTW, after the problem is reproduced and redirection is stuck, does re-plug of the device fixes the problem?

codido commented 7 years ago

@dmitryfleytman, sure thing: https://github.com/daynix/UsbDk/issues/34

karalabe commented 7 years ago

So, I've played a bit more with the issue and also tried to make a standalone repro. Helped a lot understanding what happens under the hood.

I'm more or less convinced the original error (Driver file operation error. DeviceIoControl failed (Cannot create a file when that file already exists. Error code = 183) was my fault. My discovery mechanism was implemented first in Linux, then built for Windows. Whenever it detected a new device, it tried to connect to it first and only then disconnect older instances. This is fine in Linux because that OS appoints a new USB address for every reconnect, so the "newly detected device" and "old disconnected device" are from all intents and purposes different devices. On Windows with UsbDk this broke since we do not get a new address, so the new and old devices are mapped to the same thing, resulting in a double open.

Although this might seem an obvious bug from my part, my confusion originated from the fact that this Ledger device cycles between two serial numbers, so if my discovery algorithm ran when the old and new versions had different serials, it worked correctly (opening the new one is fine), whereas if they had the same, it broke (since the new mapped to the old's structures). I did have a hunch that open/close might be the issue so I did make a hacked version of my discovery to close first, then open, alas only now did I realize that that one missed a close path, so depending on the order of events, it sometimes dropped a handle without closing.

Long story short, my original issue is now clear and it is not a bug in UsbDk!


With that out of the way, I'm still investigating the Driver file operation error. DeviceIoControl failed (The system cannot find the file specified. Error code = 2) errors (I have a clearer hunch where these might come from now, but I have to think through my code properly for it).

Still, one thing that kind of puzzles me still is that when quickly doing these power cycles, opening a device sometimes hangs for exactly 2 minutes. I've seen in the code that this 2 minute is hard coded at:

Could you detail what these timeouts do and do you have perhaps an idea what might cause them to be triggered? 2 minutes seems quite a steep time for a hard coded timeout.

dmitryfleytman commented 7 years ago

These timeouts are related to device reset (port cycle) sequence. UsbDk does port cycle on device open and close to remove and restore original device driver. Port cycle is an asynchronous operation, UsbDk sends a port cycle request and waits for a some special callbacks that indicate reset completion.

In case port cycle did not happen UsbDk will timeout in 2 minutes. In general failure of port cycle means hardware malfunction and should never happen, therefore timeout of 2 minutes is not an issue.

Another possible reason for port cycle failure might be that UsbDk sends port cycle command too late (when device disconnected already). This would of course mean that there is a bug in UsbDk.

In order to understand what really happens, could you please post trace of UsbDk with both patches attached to my previous comments?

Now this issue really looks like #34.

Thanks, Dmitry

dmitryfleytman commented 7 years ago

Most likely fixed by changes done for #34. Please re-test with the latest master.

karalabe commented 7 years ago

In the mean time my code evolved quite a lot, so I can't say I have the same version available that I started to report this issue with. However I will take the current release and master too to a spin and see how they behave and if I see something odd.

dmitryfleytman commented 7 years ago

Closing. Please reopen in case problem persists.