Yubico / yubihsm-connector

https://developers.yubico.com/yubihsm-connector/
Apache License 2.0
30 stars 13 forks source link

Issue with running the yubihsm connector on Linux VMs on VMWare ESXI #7

Closed phaus closed 3 years ago

phaus commented 5 years ago

I am within a team that creates an Appliance VM that needs a Yubikey HSM for Security.

During the setup, we had some struggles with stabilizing the operation of the yubihsm-connector within this VM.

It might just be an issue with the underlying libusb version or the Linux Kernel itself, but nevertheless I would like to pointout the issue and mentioning the workaround, we did implement to have this issue fixed for the moment

Environment

Some Details about the environment:

The Host System is VMWare ESXI6.7.0 Update 1 (Build 11675023). The HW is a NUC NUC7i5BNH, we will switch to real server gear later this month, so it might also be a misbehaviour of the underlying HW :-).

The Guest VM is a CentOS Linux release 7.6.1810 (Core) - Kernel Version 3.10.0-957.12.1.el7.x86_64.

The Yubikey HSM is setup as a passtrough device to the Linux VM:

$ lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 004: ID 1050:0030 Yubico.com
Bus 002 Device 003: ID 0e0f:0002 VMware, Inc. Virtual USB Hub
Bus 002 Device 002: ID 0e0f:0003 VMware, Inc. Virtual Mouse
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub

What is the error?

In 50% of our tests, the setup just works as expected. The rest of the tests fail with the Yubikey Connetor stating:

$ curl localhost:12345/connector/status                                                                                                                                                                                                                                                                                            
status=NO_DEVICE
serial=*
version=2.0.0
pid=6677
address=localhost
port=12345

The Debug Log of the Connector shows then:

DEBU[0000] preflight complete                            cert= config= key= pid=6658 seccomp=false serial= syslog=false version=2.0.0
DEBU[0000] takeoff                                       TLS=false listen="localhost:12345" pid=6658
DEBU[0003] reopening usb context                         Correlation-ID=d3a27afe-67b7-69ad-4b43-852b71978459 why="status request"
DEBU[0003] usb context not yet open                      Correlation-ID=d3a27afe-67b7-69ad-4b43-852b71978459
WARN[0005] status failed to open usb device              X-Request-ID=d3a27afe-67b7-69ad-4b43-852b71978459 error="device not found"
INFO[0005] handled request                               Content-Length=0 Content-Type= Method=GET RemoteAddr="127.0.0.1:45628" StatusCode=200 URI=/connector/status User-Agent=curl/7.29.0 X-Real-IP=127.0.0.1 X-Request-ID=d3a27afe-67b7-69ad-4b43-852b71978459 latency=1.629251502s
DEBU[0007] reopening usb context                         Correlation-ID=acaa699f-bd22-25ae-4cfe-6d1030559047 why="status request"

Workaround

Our first attempt was to disable the USB autosuspend via Kernel Parameters that did not work as expected, although the error description was similar to ours.

A working solution was finally to have a cron script running, that checks the connector status and resets all USB Devices if the status is not equals "OK"

#!/bin/bash

 STATUS=$(curl -s localhost:12345/connector/status | grep status=)

 if [ $STATUS != "status=OK" ]; then
     for i in /sys/bus/pci/drivers/[uoex]hci_hcd/*:*; do
        [ -e "$i" ] || continue
        echo "${i##*/}" > "${i%/*}/unbind"
        echo "${i##*/}" > "${i%/*}/bind"
     done
 fi

Maybe an update of go-libusb (as mentioned in #3) will solve the issue, but otherwise it might be nice to have the connector itself deal with these kind of issues (although resetting USB Device need root access rights :-/)

trebortech commented 5 years ago

I had a similar issue with my VMWare ESXi 6 server and a CentOS 7 build.

Have you installed the vmware tools? yum install open-vm-tools

NOTE: My ESXi and CentOS are older versions from what you have.

phaus commented 5 years ago

Yes, we did install the vmware tools. There was no change in the issue…

trebortech commented 4 years ago

@phaus Can you check the boot options for the VM and compare to the host BIOS? I'm troubleshooting another ESXi deployment and have seen issues with the host BIOS being set to UEFI and the VM being set to BIOS.

phaus commented 4 years ago

hmmm. That is actually a good catch. Unfortunetally, I don't know anymore. I guess the standard for a new VM should be UEFI, right? The Host was definetally an UEFI based system.

SimplyStaking commented 4 years ago

We've Also encountered the same issues on Vmware vsphere 6.7 latest updates. did anyone manage to get this to work?

phaus commented 4 years ago

@SimplyVC we just went with the work-around above. The project is over now. It worked as intended. I am not sure, if it is a bug in libusb or VMWare :-).

SimplyStaking commented 4 years ago

@phaus Thanks for your reply, I'm not sure it's a VMware / libusb bug, we had no issues using a ledger s, the issue only happened with the yubihsm

mrauter commented 4 years ago

We have the same problem on Linux (CentOS 7) but also on Windows Server 2016 guest. Workaround is not a solution

VMware ESXi, 6.5.0

evilmog commented 4 years ago

This issue hasn't been updated for a while, I need to deploy a pair of YubuHSM's to vmware. We will be holding off purchasing until we get confirmation that this has been resolved.

Can I get a status update on this bug?

phaus commented 4 years ago

You can use our workaround. It seems to work fine for now. We did not test against any updated connector version.

qpernil commented 3 years ago

I realize this is 'a bit' late, but there is a new PR on the yubihsm-connector that you could try to see if it helps with this. I had the issue that each /connector/status request took 1 minute 6 seconds to execute on my Ubuntu VMs under ESXi 7 without this change. With it everything works great for me. It also provides a performance improvement on a physical Ubuntu machine, but there it goes from 250ms to a a few ms. You can try building from the no-reset-device branch if you'd like to give it a try. I had the same issue before I upgraded my ESXi from 6.7 to 7.0 so I don't think its specific to my ESXi version. See https://github.com/Yubico/yubihsm-connector/pull/19

qpernil commented 3 years ago

PR https://github.com/Yubico/yubihsm-connector/pull/19 has now been updated to also avoid reopening the usb device, which seems to help with some guest OS:es under ESXi.

mrauter commented 3 years ago

Just tested the new patch. Unfortunately no change in our environment (VMware ESXi, 6.5.0, Cent OS 7 guest)


DEBU[0000] takeoff                                       TLS=false listen="localhost:12345" pid=116839

-- first 3 requests failed (NO_DEVICE)
DEBU[0027] reopening usb context                         Correlation-ID=02fadd76-6097-4905-b152-362d08f9857b why="status request"
DEBU[0027] usb context not yet open                      Correlation-ID=02fadd76-6097-4905-b152-362d08f9857b
DEBU[0029] Couldn't read serial number from device       Correlation-ID=02fadd76-6097-4905-b152-362d08f9857b Device="vid=1050,pid=0030,bus=2,addr=5"
WARN[0029] status failed to open usb device              X-Request-ID=02fadd76-6097-4905-b152-362d08f9857b error="device not found"
INFO[0029] handled request                               Content-Length=0 Content-Type= Method=GET RemoteAddr="127.0.0.1:50764" StatusCode=200 URI=/connector/status User-Agent=curl/7.29.0 X-Real-IP=127.0.0.1 X-Request-ID=02fadd76-6097-4905-b152-362d08f9857b latency=1.532779563s
DEBU[0036] reopening usb context                         Correlation-ID=18b031a6-86c0-42f9-ade3-97fbd8ef40b5 why="status request"
DEBU[0038] Couldn't read serial number from device       Correlation-ID=18b031a6-86c0-42f9-ade3-97fbd8ef40b5 Device="vid=1050,pid=0030,bus=2,addr=5"
WARN[0038] status failed to open usb device              X-Request-ID=18b031a6-86c0-42f9-ade3-97fbd8ef40b5 error="device not found"
INFO[0038] handled request                               Content-Length=0 Content-Type= Method=GET RemoteAddr="127.0.0.1:52052" StatusCode=200 URI=/connector/status User-Agent=curl/7.29.0 X-Real-IP=127.0.0.1 X-Request-ID=18b031a6-86c0-42f9-ade3-97fbd8ef40b5 latency=1.557908082s
DEBU[0040] reopening usb context                         Correlation-ID=31b1b57a-f905-4ad4-bdb5-d3d278f99d00 why="status request"
DEBU[0041] Couldn't read serial number from device       Correlation-ID=31b1b57a-f905-4ad4-bdb5-d3d278f99d00 Device="vid=1050,pid=0030,bus=2,addr=5"
WARN[0041] status failed to open usb device              X-Request-ID=31b1b57a-f905-4ad4-bdb5-d3d278f99d00 error="device not found"
INFO[0041] handled request                               Content-Length=0 Content-Type= Method=GET RemoteAddr="127.0.0.1:52978" StatusCode=200 URI=/connector/status User-Agent=curl/7.29.0 X-Real-IP=127.0.0.1 X-Request-ID=31b1b57a-f905-4ad4-bdb5-d3d278f99d00 latency=1.350497683s

>-- 4th request successfull (OK)
DEBU[0043] reopening usb context                         Correlation-ID=b68618c6-ff97-4e18-8b60-000e80ecc3e4 why="status request"
DEBU[0043] Returning a matched device                    Correlation-ID=b68618c6-ff97-4e18-8b60-000e80ecc3e4 Device-Serial=0013200583 Wanted-Serial=
DEBU[0045] usb endpoint read                             Correlation-ID=b68618c6-ff97-4e18-8b60-000e80ecc3e4 buf="[]" err="transfer was cancelled" len=0 n=0
INFO[0045] handled request                               Content-Length=0 Content-Type= Method=GET RemoteAddr="127.0.0.1:53228" StatusCode=200 URI=/connector/status User-Agent=curl/7.29.0 X-Real-IP=127.0.0.1 X-Request-ID=b68618c6-ff97-4e18-8b60-000e80ecc3e4 latency=1.731764218s

-- 5th requests failed again
DEBU[0051] reopening usb context                         Correlation-ID=97bfaf09-3ef1-43d3-857b-b83cb3181af2 why="status request"
DEBU[0053] Couldn't read serial number from device       Correlation-ID=97bfaf09-3ef1-43d3-857b-b83cb3181af2 Device="vid=1050,pid=0030,bus=2,addr=5"
WARN[0053] status failed to open usb device              X-Request-ID=97bfaf09-3ef1-43d3-857b-b83cb3181af2 error="device not found"
INFO[0053] handled request                               Content-Length=0 Content-Type= Method=GET RemoteAddr="127.0.0.1:54322" StatusCode=200 URI=/connector/status User-Agent=curl/7.29.0 X-Real-IP=127.0.0.1 X-Request-ID=97bfaf09-3ef1-43d3-857b-b83cb3181af2 latency=1.534587666s`

-- Same via pkcs11-tool

# yhsm2-tool --list-slots
Available slots:
Slot 0 (0x0): YubiHSM Connector localhost
  token label        : YubiHSM
  token manufacturer : Yubico (www.yubico.com)
  token model        : YubiHSM
  token flags        : login required, rng, token initialized, PIN initialized
  hardware version   : 2.103
  firmware version   : 2.103
  pin min/max        : 12/68
Logging in to "YubiHSM".

-- 2nd try:

# yhsm2-tool --list-slots
Available slots:
Slot 0 (0x0): YubiHSM Connector localhost
  (empty)
error: PKCS11 function C_OpenSession failed: rv = CKR_TOKEN_NOT_PRESENT (0xe0)
Aborting.
qpernil commented 3 years ago

We have discovered that the virtual USB controller in the VM must be a USB 3.0 controller for this to work well under many guest OS:es. Could you verify what you have ? For many guest OS:es the virtual USB controller will be USB 2 by default, and there seems to be some bug in ESXi with that. We are seeing intermittent functionality with USB2 controller. (Even though the YubiHSM is in fact a USB2 device)

mrauter commented 3 years ago

Thanks for the hint. We have used USB 2.

After change to USB 3 it seems stable (no connection issues in the logs) BUT it is very very slow now. /connector/status or list-slots request takes ~45 seconds (like you described before your fix).

I double checked that we are using PR #19


# connector debug
DEBU[0666] reopening usb context                         Correlation-ID=32b364ca-fcc8-427a-8fa1-c1c123ca27ca why="status request"
DEBU[0666] Returning a matched device                    Correlation-ID=32b364ca-fcc8-427a-8fa1-c1c123ca27ca Device-Serial=0013200583 Wanted-Serial=
DEBU[0712] usb endpoint read                             Correlation-ID=32b364ca-fcc8-427a-8fa1-c1c123ca27ca buf="[]" err="transfer was cancelled" len=0 n=0
INFO[0712] handled request                               Content-Length=0 Content-Type= Method=GET RemoteAddr="127.0.0.1:48254" StatusCode=200 URI=/connector/status User-Agent=curl/7.29.0 X-Real-IP=127.0.0.1 X-Request-ID=32b364ca-fcc8-427a-8fa1-c1c123ca27ca latency=45.635963886s```

# dmesg during request (which worked at the end)
[ 1152.691348] usb 1-3.1: device descriptor read/64, error -110
qpernil commented 3 years ago

We have not seen slowness with the code in PR #19, but we do see that with the code in master. In my case, Ubuntu Xenial up to Groovy, I see a time of 66s for the status request with the code in master. A couple of ms with the code in PR#19. What guest OS are you using ? I will see if I can reproduce.

mrauter commented 3 years ago

My bad! Now we really use PR #19 - and it works 👍

To summarize: In other environment (VMware ESXi, 6.5.0, Cent OS 7 guest) to solution was to use USB 3 controller and this fix.

Thank you very much. Looking forward to see this in official release.

qpernil commented 3 years ago

Great news, thx for testing !

qpernil commented 3 years ago

The libusb code has been merged to master, and will be in the next release. Winusb code is also being worked on along the same lines. See https://github.com/Yubico/yubihsm-connector/pull/21

qpernil commented 3 years ago

Closing this issue now, if there are more problems please re-open a new issue.