Yubico / yubihsm-connector

https://developers.yubico.com/yubihsm-connector/
Apache License 2.0
30 stars 13 forks source link

Does not recover from USB reconnect #20

Closed joernheissler closed 3 years ago

joernheissler commented 3 years ago

Hi, when I disconnect and reconnect the USB device, I get this in the logs for the next request:

time="2020-12-26T10:37:21Z" level=debug msg="usb device already open" Correlation-ID=502ce9d6-660b-40c9-9680-c7a2c45e27b8
time="2020-12-26T10:37:21Z" level=debug msg="usb endpoint write" Correlation-ID=502ce9d6-660b-40c9-9680-c7a2c45e27b8 buf="[1 0 12 72 69 65 76 84 72 95 67 72 69 67 75]" err="libusb: no device [code -4]" len=15 n=0
time="2020-12-26T10:37:21Z" level=debug msg="reopening usb context" Correlation-ID=502ce9d6-660b-40c9-9680-c7a2c45e27b8 why="libusb: no device [code -4]"
time="2020-12-26T10:37:21Z" level=error msg="failed usb proxy" X-Request-ID=502ce9d6-660b-40c9-9680-c7a2c45e27b8 error="device not found"
time="2020-12-26T10:37:21Z" level=error msg="error in handling request" Content-Length=15 Content-Type=application/x-www-form-urlencoded Method=POST RemoteAddr="172.17.0.1:51964" StatusCode=500 URI=/connector/api User-Agent=curl/7.72.0 X-Real-IP=172.17.0.1 X-Request-ID=502ce9d6-660b-40c9-9680-c7a2c45e27b8 latency="437.273µs"

And on subsequent requests:

time="2020-12-26T10:37:29Z" level=error msg="failed usb proxy" X-Request-ID=c6a9606e-c938-4d2a-bb7e-2147fbc464e9 error="device not found"
time="2020-12-26T10:37:29Z" level=error msg="error in handling request" Content-Length=15 Content-Type=application/x-www-form-urlencoded Method=POST RemoteAddr="172.17.0.1:51966" StatusCode=500 URI=/connector/api User-Agent=curl/7.72.0 X-Real-IP=172.17.0.1 X-Request-ID=c6a9606e-c938-4d2a-bb7e-2147fbc464e9 latency="211.029µs"

To make things work again, I need to restart the connector process.

Expected behaviour would be to either terminate the process (so it will be restarted by docker) or to open the USB device again.

qpernil commented 3 years ago

It does try to reopen the usb device when an error occurs, but it is simply not finding any device with the right vendor and product code in this case. We have seen 'disappearing usb devices' when running under a hypervisor, but usually they come back after some seconds. I don't know what virtualization software is being used in this case. We have also seen varying behavior depending on the guest OS. Perhaps you could check in the hypervisor console (or similar) if the usb device is redirected to the correct VM ? Does it need to be manually reattached ? In ESXi 6.7 and 7.0 for example the device will automatically be reattached for a while, but if it is unconnected for enough time it will be unattached from that VM until manually reattached.

joernheissler commented 3 years ago

I don't know what virtualization software is being used in this case.

yubihsm-connector is running inside a docker container running on a Linux VM inside VMware. The HSM is connected via network and some proprietary kernel "usb over network" driver, so the VMware part shouldn't matter.

It does try to reopen the usb device when an error occurs, but it is simply not finding any device with the right vendor and product code in this case.

The device is there, but with a new name, e.g. /dev/bus/usb/001/023 instead of /dev/bus/usb/001/022. "docker restart" fixes the problem.

I ran strace. Here's an excerpt with the interesting parts:

12:22:42.829151 ioctl(14, USBDEVFS_REAPURBNDELAY, 0x7fc77ab4bc90) = -1 ENODEV (No such device)
12:23:02.179343 read(7, "POST /connector/api HTTP/1.1\r\nHost: 127.0.0.1:27748\r\nUser-Agent: curl/7.64.0\r\nAccept: */*\r\nContent-Length: 15\r\nContent-Type: application/x-www-form-urlencoded\r\n\r\n\1\0\fHEALTH_CHECK", 4096) = 177
12:23:02.179636 ioctl(14, USBDEVFS_SUBMITURB, 0x7fc75c005560) = -1 ENODEV (No such device)
12:23:02.179713 ioctl(14, USBDEVFS_RELEASEINTERFACE, 0x7fc77a34adac) = -1 ENODEV (No such device)
12:23:02.180018 close(14)               = 0
12:23:02.180607 write(2, "time=\"2020-12-27T12:23:02Z\" level=error msg=\"failed usb proxy\" X-Request-ID=164e4cd3-fdc5-4240-beef-db00ded4b057 error=\"device not found\"\n", 138) = 138
12:24:02.271779 read(7, "POST /connector/api HTTP/1.1\r\nHost: 127.0.0.1:27748\r\nUser-Agent: curl/7.64.0\r\nAccept: */*\r\nContent-Length: 15\r\nContent-Type: application/x-www-form-urlencoded\r\n\r\n\1\0\fHEALTH_CHECK", 4096) = 177
12:24:02.272236 write(2, "time=\"2020-12-27T12:24:02Z\" level=error msg=\"failed usb proxy\" X-Request-ID=c73091a7-7fd3-4b32-ab39-906fe39587c6 error=\"device not found\"\n", 138) = 138

The error from USBDEVFS_REAPURBNDELAY seems to be ignored. When a request comes in and USB fails, the USB device is simply closed and that's it. There is no try to reopen it. There isn't even a try to get a new list of USB devices or anything USB related at all.

joernheissler commented 3 years ago

Btw, #19 is merged. There were some changes related.

qpernil commented 3 years ago

I see something is posting to the api endpoint, that is not advisable. The contents will be proxied to the usb device and may cause some of the trouble. If you want to check the health of the connector do a GET on /connector/status. (There are some commands that can be posted unencrypted to the HSM, and a random command could also terminate an ongoing encrypted session with the HSM).

The logs you shared from the connector seem to indicate that it is reopening the device, but I agree it doesn't look like it in the strace logs. The connector speaks to the usb device through gousb, a go library that talks to the usb stack using libusb. I guess it's conceivable that enumerating usb devices might not show up through strace. Once it has closed handle 14 it is looking for a usb device with the vendor and product codes of the HSM, 0x1050 and 0x0030, and it is not finding any such device. Perhaps it would be possible to run lsusb in that docker image ?

joernheissler commented 3 years ago

I see something is posting to the api endpoint, that is not advisable. The contents will be proxied to the usb device and may cause some of the trouble.

What kind of trouble could sending an ECHO to the USB device cause? I would have thought this to be completely harmless.

If you want to check the health of the connector do a GET on /connector/status.

Guess I can do that. It reports status=NO_DEVICE when I disconnect the device, so that's fine.

I guess it's conceivable that enumerating usb devices might not show up through strace.

Initial scan shows up, so it should also show up in a rescan. What I can see in strace for the initial scan are some failures to talk to udev.

Perhaps it would be possible to run lsusb in that docker image ?

Running lsusb -d 1050: inside the container before, while and after reattaching:

Bus 001 Device 027: ID 1050:0030 Yubico.com ` Bus 001 Device 029: ID 1050:0030 Yubico.com`

I also confirmed that restarting the yubihsm-connector process helps, i.e. without restarting the docker container.

joernheissler commented 3 years ago

I did some more debugging. On rescans, state.ctx.OpenDevices doesn't run the lambda function for the old nor the new device. So could be a bug in gousb or how it's used. I'll continue digging.

joernheissler commented 3 years ago

I wrote a C program to run libusb_get_device_list repeatedly. Inside the container it keeps returning the same list. Outside it updates the list. So I'm now pretty sure it's not a bug in yubihsm-connector. Thanks!

I'll add the solution once I find it.

joernheissler commented 3 years ago

Seems to be the same as https://github.com/moby/moby/issues/35359 Best fix I could find is to use alpine instead of debian.

qpernil commented 3 years ago

Sorry, I missed the echo command at the beginning of what you POST, so that should be all OK. I really appreciate your sharing of your debugging efforts, thx !