crc-org / crc

CRC is a tool to help you run containers. It manages a local OpenShift 4.x cluster, Microshift or a Podman VM optimized for testing and development purposes
https://crc.dev
Apache License 2.0
1.23k stars 233 forks source link

[NESTED VIRTUALIZATION] crc fail to start properly "Failed to connect to the CRC VM with SSH" #1853

Closed hunter86bg closed 3 years ago

hunter86bg commented 3 years ago

General information

CRC version

CodeReady Containers version: 1.20.0+ef3f80d
OpenShift version: 4.6.6 (embedded in executable)

CRC status

DEBU CodeReady Containers version: 1.20.0+ef3f80d 
DEBU OpenShift version: 4.6.6 (embedded in executable) 
Checking file: /home/snikolov/.crc/machines/crc/.crc-exist
DEBU About to run SSH command:
df -B1 --output=size,used,target /sysroot | tail -1 
DEBU SSH command results: err: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none], no supported methods remain, output:  
Cannot get root partition usage: ssh command error:
command : df -B1 --output=size,used,target /sysroot | tail -1
err     : ssh: handshake failed: ssh: unable to authenticate, attempted methods [none], no supported methods remain
output  : 

CRC config

<NO OUTPUT>

Host Operating System

NAME="Springdale Open Enterprise Linux"
VERSION="8.3 (Modena)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.3"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Springdale Open Enterprise Linux 8.3 (Modena)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:springdale:enterprise_linux:8.3:GA"
HOME_URL="https://springdale.princeton.edu/"
BUG_REPORT_URL="https://springdale.princeton.edu/bugzilla"

REDHAT_BUGZILLA_PRODUCT="Springdale Open Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.3
REDHAT_SUPPORT_PRODUCT="Springdale Open Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.3"

Steps to reproduce

  1. crc setup
  2. crc start

Expected

crc starts properly

Actual

crc fails to start and connect

Logs

[snikolov@openshift1 ~]$ crc start --log-level debug
DEBU CodeReady Containers version: 1.20.0+ef3f80d 
DEBU OpenShift version: 4.6.6 (embedded in executable) 
DEBU Total memory of system is 10256244736 bytes  
DEBU No new version available. The latest version is 1.20.0 
DEBU Couldn't set key PLATFORM_ID, no corresponding struct field found 
DEBU Couldn't set key CPE_NAME, no corresponding struct field found 
DEBU Couldn't set key , no corresponding struct field found 
DEBU Couldn't set key REDHAT_BUGZILLA_PRODUCT, no corresponding struct field found 
DEBU Couldn't set key REDHAT_BUGZILLA_PRODUCT_VERSION, no corresponding struct field found 
DEBU Couldn't set key REDHAT_SUPPORT_PRODUCT, no corresponding struct field found 
DEBU Couldn't set key REDHAT_SUPPORT_PRODUCT_VERSION, no corresponding struct field found 
INFO Checking if podman remote executable is cached 
DEBU Currently podman remote is not supported     
INFO Checking if goodhosts executable is cached   
DEBU goodhost executable already cached           
INFO Checking minimum RAM requirements            
DEBU Total memory of system is 10256244736 bytes 
INFO Checking if running as non-root              
INFO Checking if Virtualization is enabled        
DEBU Checking if the vmx/svm flags are present in /proc/cpuinfo 
DEBU CPU virtualization flags are good            
INFO Checking if KVM is enabled                   
DEBU Checking if /dev/kvm exists                  
DEBU /dev/kvm was found                           
INFO Checking if libvirt is installed             
DEBU Checking if 'virsh' is available             
DEBU 'virsh' was found in /usr/bin/virsh          
DEBU Checking 'virsh capabilities' for libvirtd/qemu availability 
DEBU Running 'virsh capabilities'                 
DEBU Found x86_64 hypervisor with 'hvm' capabilities 
INFO Checking if user is part of libvirt group    
DEBU Checking if current user is part of the libvirt group 
DEBU Running '/usr/bin/groups snikolov'           
DEBU Current user is already in the libvirt group 
INFO Checking if libvirt daemon is running        
DEBU Checking if libvirtd service is running      
DEBU Running 'systemctl status virtqemud.socket'  
DEBU Command failed: exit status 3                
DEBU stdout: * virtqemud.socket - Libvirt qemu local socket
   Loaded: loaded (/usr/lib/systemd/system/virtqemud.socket; disabled; vendor preset: disabled)
   Active: inactive (dead)
   Listen: /run/libvirt/virtqemud-sock (Stream) 
DEBU stderr:                                      
DEBU virtqemud.socket is neither running nor listening 
DEBU Running 'systemctl status libvirtd.socket'   
DEBU libvirtd.socket is running                   
INFO Checking if a supported libvirt version is installed 
DEBU Checking if libvirt version is >=3.4.0       
DEBU Running 'virsh -v'                           
INFO Checking if crc-driver-libvirt is installed  
DEBU Checking if crc-driver-libvirt is installed  
DEBU Running '/home/snikolov/.crc/bin/crc-driver-libvirt version' 
DEBU Found crc-driver-libvirt version 0.12.13     
DEBU crc-driver-libvirt is already installed      
INFO Checking if systemd-networkd is running      
DEBU Checking if systemd-networkd.service is running 
DEBU Running 'systemctl status systemd-networkd.service' 
DEBU Command failed: exit status 4                
DEBU stdout:                                      
DEBU stderr: Unit systemd-networkd.service could not be found. 
DEBU systemd-networkd.service is not running      
INFO Checking if NetworkManager is installed      
DEBU Checking if 'nmcli' is available             
DEBU 'nmcli' was found in /usr/bin/nmcli          
INFO Checking if NetworkManager service is running 
DEBU Checking if NetworkManager.service is running 
DEBU Running 'systemctl status NetworkManager.service' 
DEBU NetworkManager.service is already running    
INFO Checking if /etc/NetworkManager/conf.d/crc-nm-dnsmasq.conf exists 
DEBU Checking NetworkManager configuration        
DEBU NetworkManager configuration is good         
INFO Checking if /etc/NetworkManager/dnsmasq.d/crc.conf exists 
DEBU Checking dnsmasq configuration               
DEBU dnsmasq configuration is good                
INFO Checking if libvirt 'crc' network is available 
DEBU Checking if libvirt 'crc' network exists     
DEBU Running 'virsh --connect qemu:///system net-info crc' 
DEBU Checking if libvirt 'crc' definition is up to date 
DEBU Running 'virsh --connect qemu:///system net-dumpxml --inactive crc' 
DEBU libvirt 'crc' network has the expected value 
INFO Checking if libvirt 'crc' network is active  
DEBU Checking if libvirt 'crc' network is active  
DEBU Running 'virsh --connect qemu:///system net-info crc' 
DEBU libvirt 'crc' network is already active      
Checking file: /home/snikolov/.crc/machines/crc/.crc-exist
? Image pull secret [? for help] **************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************

INFO Loading bundle: crc_libvirt_4.6.6.crcbundle ... 
INFO Verifying bundle crc_libvirt_4.6.6.crcbundle ... 
INFO Creating CodeReady Containers VM for OpenShift 4.6.6... 
Found binary path at /home/snikolov/.crc/bin/crc-driver-libvirt
Launching plugin server for driver libvirt
Plugin server listening at address 127.0.0.1:36075
() Calling .GetVersion
Using API Version  1
() Calling .SetConfigRaw
() Calling .GetMachineName
(crc) Calling .GetMachineName
(crc) Calling .DriverName
(crc) Calling .PreCreateCheck
Running pre-create checks...
(crc) DBG | About to check libvirt version
(crc) DBG | Validating network
(crc) DBG | Validating storage pool
(crc) DBG | Activating pool 'crc'
(crc) Calling .GetConfigRaw
(crc) Calling .Create
Creating machine...
(crc) DBG | Preparing /home/snikolov/.crc/machines/crc/crc.qcow2 for machine use
(crc) DBG | image creation took 569.38813ms
(crc) DBG | Verifying executable bit set on /home/snikolov/.crc/machines/crc
(crc) DBG | Verifying executable bit set on /home/snikolov/.crc/machines
(crc) DBG | Verifying executable bit set on /home/snikolov/.crc
(crc) DBG | Verifying executable bit set on /home/snikolov
(crc) DBG | Verifying executable bit set on /home
(crc) DBG | Defining VM...
(crc) DBG | Found x86_64 hypervisor with 'hvm' capabilities
(crc) DBG | Found q35 machine type
(crc) DBG | Adding the file: /home/snikolov/.crc/machines/crc/.crc-exist
(crc) DBG | disk image capacity is already 33285996544 bytes
(crc) DBG | Starting VM crc
(crc) DBG | Validating network
(crc) DBG | Validating storage pool
(crc) DBG | GetIP called for crc
(crc) DBG | Getting current state...
(crc) DBG | Waiting for machine to come up 0/60
(crc) DBG | GetIP called for crc
(crc) DBG | Getting current state...
(crc) DBG | Waiting for machine to come up 1/60
(crc) DBG | GetIP called for crc
(crc) DBG | Getting current state...
(crc) DBG | Waiting for machine to come up 2/60
(crc) DBG | GetIP called for crc
(crc) DBG | Getting current state...

<SNIP>

DEBU retry loop: attempt 288                      
DEBU About to run SSH command:
exit 0             
Cannot read private ssh key /home/snikolov/.crc/machines/crc/id_rsa
DEBU SSH command results: err: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain, output:  
DEBU error: Temporary error: ssh command error:
command : exit 0
err     : ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
output  :  - sleeping 1s 
DEBU RetryAfter timeout after 289 tries           
Making call to close driver server
(crc) Calling .Close
Successfully made call to close driver server
Making call to close connection to plugin binary
Failed to connect to the CRC VM with SSH -- host might be unreachable: Temporary error: ssh command error:
command : exit 0
err     : ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
output  :  (x289)

Please consider posting the output of crc start --log-level debug on http://gist.github.com/ and post the link in the issue.

praveenkumar commented 3 years ago

@hunter86bg Can you also put the logs what is in the <SNIP> because first we use the key which is part of bundle for authentication and then generate the new key which is part of .crc/machine/crc/id_rsa .

hunter86bg commented 3 years ago

I've run:

[snikolov@openshift1 ~]$ crc delete -f
[snikolov@openshift1 ~]$ crc cleanup
[snikolov@openshift1 ~]$ crc setup
[snikolov@openshift1 ~]$ crc config set pull-secret-file /home/snikolov/secret
[snikolov@openshift1 ~]$ crc config view
- pull-secret-file                      : /home/snikolov/secret
[snikolov@openshift1 ~]$ crc start --log-level debug &> debuglog 

debuglog.gz

hunter86bg commented 3 years ago

For me there is a problem with the VM, as it couldn't detect the IP.

hunter86bg commented 3 years ago

Any idea what is going on ?

cfergeau commented 3 years ago

It looks like the VM is not starting for some reason. Can you check whether it's running or not with virsh -c qemu:///system list --all? If it's not running, I'd look at the logs in /var/log/libvirt/qemu/crc.log. If it's running, you should be able to connect to it with virt-viewer -c qemu:///system crc to see its display.

hunter86bg commented 3 years ago

Actually you are right, the VM is frozen . I will try to debug it further.

hunter86bg commented 3 years ago

Something in the VM definition is not OK with the libvirt/qemu on EL8 and it fails to properly boot. I have removed the VM's disks, created new and reinstalled vanilla CentOS7 and it seems to be working:

[snikolov@openshift1 ~]$ crc start
INFO Checking if running as non-root              
INFO Checking if podman remote executable is cached 
INFO Checking if admin-helper executable is cached 
INFO Checking minimum RAM requirements            
INFO Checking if Virtualization is enabled        
INFO Checking if KVM is enabled                   
INFO Checking if libvirt is installed             
INFO Checking if user is part of libvirt group    
INFO Checking if libvirt daemon is running        
INFO Checking if a supported libvirt version is installed 
INFO Checking if crc-driver-libvirt is installed  
INFO Checking if systemd-networkd is running      
INFO Checking if NetworkManager is installed      
INFO Checking if NetworkManager service is running 
INFO Checking if /etc/NetworkManager/conf.d/crc-nm-dnsmasq.conf exists 
INFO Checking if /etc/NetworkManager/dnsmasq.d/crc.conf exists 
INFO Checking if libvirt 'crc' network is available 
INFO Checking if libvirt 'crc' network is active  
? Image pull secret [? for help] ********************************** 

INFO Loading bundle: crc_libvirt_4.6.9.crcbundle ... 
INFO Verifying bundle crc_libvirt_4.6.9.crcbundle ... 
INFO Creating CodeReady Containers VM for OpenShift 4.6.9... 
INFO CodeReady Containers VM is running           
INFO Generating new SSH Key pair ...              
INFO Updating authorized keys ...                 
INFO Copying kubeconfig file to instance dir ...  
INFO Starting network time synchronization in CodeReady Containers VM 
INFO Check internal and public DNS query ...      
INFO Check DNS query from host ...                
INFO Adding user's pull secret to instance disk... 
INFO Verifying validity of the kubelet certificates ... 
INFO Starting OpenShift kubelet service           
INFO Adding user's pull secret to the cluster ... 
INFO Updating cluster ID ...                      
INFO Starting OpenShift cluster ... [waiting 3m]  
INFO Updating kubeconfig                          
WARN Skipping the kubeconfig update. Cluster operator authentication still not ready after 2min. 
WARN The cluster might report a degraded or error state. This is expected since several operators have been disabled to lower the resource usage. For more information, please consult the documentation 

Started the OpenShift cluster

To access the cluster, first set up your environment by following the instructions returned by executing 'crc oc-env'.
Then you can access your cluster by running 'oc login -u developer -p developer https://api.crc.testing:6443'.
To login as a cluster admin, run 'oc login -u kubeadmin -p D4tTL-hJYdy-qe36Y-okTwh https://api.crc.testing:6443'.

You can also run 'crc console' and use the above credentials to access the OpenShift web console.
The console will open in your default browser.
cfergeau commented 3 years ago

Something in the VM definition is not OK with the libvirt/qemu on EL8 and it fails to properly boot.

The libvirtd logs/VM logs would have been useful to try to figure this out :) I have crc running fine on EL8 on my daily test machine.

hunter86bg commented 3 years ago

My setup is quite specific:

As CentOS 7 works fine (same VM), the VM hosting the 'crc' is OK. I checked the 'crc' VM is using host-passthrough , so the cpu type is OK.

I will spin a RHEL 8.3 to check if it's reproducable there.

hunter86bg commented 3 years ago

It's definately not a distro issue. Same behaviour on RHEl 8.3.

After I touched the CPU settings, now the Vm stops at 7th second of the boot process (previously it was 0.3s).

cfergeau commented 3 years ago

Would be useful to see what's going on in the VM, can you try https://github.com/code-ready/crc/issues/1853#issuecomment-764805370 or https://github.com/code-ready/crc/issues/1853#issuecomment-766761975 ?

hunter86bg commented 3 years ago

It just stucks. With default xml on 0.3s of the boot and with reduced cpu count it stucks on 7th second

hunter86bg commented 3 years ago

I will try to provide more details once I get any progress

cfergeau commented 3 years ago

It just stucks. With default xml on 0.3s of the boot and with reduced cpu count it stucks on 7th second

This doesn't tell much about which step exactly it's stuck on ;) By any chance, is this happening with a ryzen CPU? And your latest tests were on non-nested setups?

hunter86bg commented 3 years ago

I can provide screenshots when I'm back. Still nested as I don't have spare hardware and the hosts are too small for oVirt/RHV based deployment of Openshift. VM is using host-passthrough (FX-8350) and has all necessary flags for Virtualization.

cfergeau commented 3 years ago

On AMD, I'd recommend making sure you are using the latest bios version, this has been causing issues on some systems.

hunter86bg commented 3 years ago

On EL7 the following is generated and works without any issues:

<domain type='kvm' id='1'>
  <name>crc</name>
  <uuid>95b9eb62-8d28-4ef8-99c9-7160773e2eaf</uuid>
  <memory unit='KiB'>9000960</memory>
  <currentMemory unit='KiB'>9000960</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-rhel7.0.0'>hvm</type>
    <boot dev='hd'/>
    <bootmenu enable='no'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <feature policy='disable' name='rdrand'/>
  </cpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' io='threads'/>
      <source file='/home/snikolov/.crc/machines/crc/crc.qcow2'/>
      <backingStore type='file' index='1'>
        <format type='qcow2'/>
        <source file='/home/snikolov/.crc/cache/crc_libvirt_4.6.9/crc.qcow2'/>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='piix3-uhci'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <interface type='network'>
      <mac address='52:fd:fc:07:21:82'/>
      <source network='crc' bridge='crc'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/1'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/1'>
      <source path='/dev/pts/1'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <graphics type='vnc' port='5900' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>
    <video>
      <model type='cirrus' vram='16384' heads='1' primary='yes'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='none'/>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <alias name='rng0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </rng>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c348,c364</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c348,c364</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
</domain>
hunter86bg commented 3 years ago

Dropping the count of vCPUs of the 'crc' VM to 1 allows me to reach: image

hunter86bg commented 3 years ago

Original config stucks at: image

hunter86bg commented 3 years ago

As you can see it can't reach the switchroot.

cfergeau commented 3 years ago

On EL7 the following is generated and works without any issues:

So EL7 + this xml + nested virt works fine, but EL8 + default xml + nested virt fails? Have you tried the el7 xml on the EL8 machine? kernel/qemu will be different between EL7 and EL8, this makes it hard to know if it works because of the XML, or because of the different kernel/qemu.

hunter86bg commented 3 years ago

I already tried theEL7 default config on nested EL8 -> doesn't work.

cfergeau commented 3 years ago

This looks like a kernel/qemu/hypervisor/cpu related nested virtualization issue, I'm afraid we won't be able to do much for it in crc :-( Double check https://github.com/code-ready/crc/issues/1853#issuecomment-775252444 in case it magically solves it.

hunter86bg commented 3 years ago

I will try to deploy another VM in that ovirt virtual Hypervisor and double check if it's crc related or nested virt issue.

hunter86bg commented 3 years ago

Sadly you are right, nested virtualization is not working. According to https://access.redhat.com/solutions/58204 Nested Virtualization for AMD CPUs is in tech preview and requires the Bare metal host to be also working on EL 8

hunter86bg commented 3 years ago

I guess we can leave this issue open for other users.

cfergeau commented 3 years ago

For what it's worth, I've been having success with RHEL8 host + nested virt on a ryzen 3600 CPU, so it's tech preview, but it can work well :) Check again for bios updates if you haven't done so yet, this has been helping some other people trying crc with nested virt.

hunter86bg commented 3 years ago

Sadly the hardware is too old and has no bios updates . I believe that it's due to the fact that the Hypervisor is EL7 with netsted Hypervisor is EL 8 ...

gbraad commented 3 years ago

Closing, as this is not an issue with CRC.