Installer exits with error message: ERROR Error: virError(Code=38, Domain=7, Message='unable to connect to server at '192.168.122.1:16509': Connection timed out')

anderotxoa commented 4 years ago

Hi

I'm trying to install snc on a POWER8 box with 8 cores and 128GB RAM. I managed to install it with no issues on an x86 box, but on the P8 box the installer stops with the following message:

DEBUG Initializing the backend...
DEBUG
DEBUG Initializing provider plugins...
DEBUG
DEBUG Terraform has been successfully initialized!
DEBUG
DEBUG You may now begin working with Terraform. Try running "terraform plan" to see
DEBUG any changes that are required for your infrastructure. All Terraform commands
DEBUG should now work.
DEBUG
DEBUG If you ever set or change modules or backend configuration for Terraform,
DEBUG rerun this command to reinitialize your working directory. If you forget, other
DEBUG commands will detect it and remind you to do so if necessary.

./oc get etcds cluster
sleep 3
./oc get etcds cluster
sleep 3
./oc get etcds cluster
ERROR
ERROR Error: virError(Code=38, Domain=7, Message='unable to connect to server at '192.168.122.1:16509': Connection timed out')
ERROR
ERROR on ../../tmp/openshift-install-554439829/main.tf line 1, in provider "libvirt":
ERROR 1: provider "libvirt" {
ERROR
ERROR
ERROR Failed to read tfstate: open /tmp/openshift-install-554439829/terraform.tfstate: no such file or directory
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
echo 'failed to create the cluster, but that is expected. We will block on a successful cluster via a future wait-for.'
failed to create the cluster, but that is expected. We will block on a successful cluster via a future wait-for.
renew_certificates
++ ./oc adm release -a /root/snc/secret.txt info quay.io/openshift-release-dev/ocp-release@sha256:2ad9a328235f9c116e1c4319df2f0610253b40646b45afe846798ed4ffe6bb7d --image-for=cli
cli_image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cb95ba34205fd6d3bd84b5fb6d3c8fb44333781fabc519838094bc30ee252679
./yq write kubelet-bootstrap-cred-manager-ds.yaml.in 'spec.template.spec.containers[0].image' quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cb95ba34205fd6d3bd84b5fb6d3c8fb44333781fabc519838094bc30ee252679
./oc apply -f kubelet-bootstrap-cred-manager-ds.yaml
error: unable to recognize "kubelet-bootstrap-cred-manager-ds.yaml": Get https://api.crc.testing:6443/api?timeout=32s: dial tcp: lookup api.crc.testing on 127.0.0.1:53: read udp 127.0.0.1:35600->127.0.0.1:53: i/o timeout
++ jobs -p
kill -9 10686
[root@snc snc]#

From the command line I can see no KVM virtual machine has been created:

virsh # list --all
 Id Name State

virsh #

This looks like the primary error to me, but I am not sure how to continue the investigation...

praveenkumar commented 4 years ago

Did you perform the steps in https://github.com/openshift/installer/blob/master/docs/dev/libvirt/README.md#one-time-setup ? That looks like it might be the issue, since libvirt is not accepting the TCP connections.

mtarsel commented 4 years ago

Also make sure you have the default libvirt network created as well. When running snc again, remove any previously created crc- libvirt networks too.

Out of curiosity, what OS are you using? @anderotxoa

anderotxoa commented 4 years ago

Did you perform the steps in https://github.com/openshift/installer/blob/master/docs/dev/libvirt/README.md#one-time-setup ? That looks like it might be the issue, since libvirt is not accepting the TCP connections.

Hi, yes I followed the instructions carefully.

[root@snc snc]# cat /etc/libvirt/libvirtd.conf |grep -v "#"
listen_tls = 0
listen_tcp = 1
tcp_port = "16509"
auth_tcp = "none"
[root@snc snc]#

Also, the right port appears to be listening:

[root@snc snc]# netstat -tna|grep LISTEN
tcp 0 0 0.0.0.0:16509 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN
tcp6 0 0 :::16509 :::* LISTEN
tcp6 0 0 :::111 :::* LISTEN
tcp6 0 0 :::22 :::* LISTEN
tcp6 0 0 ::1:25 :::* LISTEN
[root@snc snc]#

BTW, I just disabled both selinux & firewalld

anderotxoa commented 4 years ago

Also make sure you have the default libvirt network created as well. When running snc again, remove any previously created crc- libvirt networks too.

Out of curiosity, what OS are you using? @anderotxoa

Which network should this one be? When I used it in the x86 box I did not take care of anything, the script created it all.

Regarding the OS, it is RH7.8:

[root@snc snc]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.8 (Maipo)

anderotxoa commented 4 years ago

I had a problem with an already installed YQ. It has been fixed, but the same error still appears.

I will include also the whole log in case anyone has the time to check it and provide any insight. snc.log

I can also see that in my x86 box I have the virtual interfaces created, but my guess is that they are created in later stages, so it makes sense that they are still not present in the Power8 box. Currently installing again in the x86 box to see any difference...

Power8 --> RH7.8 (latest available)
x86 --> CentOS 8.x (latest available)

mtarsel commented 4 years ago

@anderotxoa thanks the log is helpful. So it's a P8 rhel 7.8 machine using release 4.5.7 from the latest mirror. Are you trying to install SNC inside of a VM or are you installing snc as a bare metal install?

What is the output of virsh net-list --all? Make sure the default network is created like this:

[root@snc snc]# virsh net-list --all
 Name                 State      Autostart     Persistent
----------------------------------------------------------
default            active     yes           yes

Also I mentioned

remove any previously created crc- libvirt networks too.

Try something like this:

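CONNECT="${CONNECT:=qemu:///system}"

# Destroy and undefine every libvirt network whose name contains "crc".
# Note: "run" is not defined in this snippet; drop the prefix if your shell has no such helper.
for NET in $(virsh -c "${CONNECT}" net-list --all --name | grep crc); do
    run virsh -c "${CONNECT}" net-destroy "${NET}"
    run virsh -c "${CONNECT}" net-undefine "${NET}"
done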

anderotxoa commented 4 years ago

@anderotxoa thanks the log is helpful. So it's a P8 rhel 7.8 machine using release 4.5.7 from the latest mirror. Are you trying to install SNC inside of a VM or are you installing snc as a bare metal install?

What is the output of virsh net-list --all? Make sure the default network is created like this:
[root@snc snc] virsh net-list --all
 Name                 State      Autostart     Persistent
----------------------------------------------------------
default            active     yes           yes
Also I mentioned

remove any previously created crc- libvirt networks too.

Try something like this:
CONNECT="${CONNECT:=qemu:///system}"

for NET in $(virsh -c "${CONNECT}" net-list --all --name | grep crc); do
    run virsh -c "${CONNECT}" net-destroy "${NET}"
    run virsh -c "${CONNECT}" net-undefine "${NET}"
done
Hi @mtarsel

I'm afraid it does not show any network-related entries; even in the network config I can only see eth0 and lo

[root@snc ~]# virsh net-list --all
 Name                 State      Autostart     Persistent
----------------------------------------------------------

[root@snc ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether aa:3f:e8:2e:e1:02 brd ff:ff:ff:ff:ff:ff
    inet 10.1.3.15/16 brd 10.1.255.255 scope global noprefixroute eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a83f:e8ff:fe2e:e102/64 scope link
       valid_lft forever preferred_lft forever
[root@snc ~]#

anderotxoa commented 4 years ago

UPDATE: It looks like the default libvirt network was not properly created (not sure why). After creating it by following these instructions: https://blog.programster.org/kvm-missing-default-network the installer gets past the step where it previously failed (it is still installing). In the meantime I would suggest two things:

snc.sh should check during its initial stages that the default libvirt network is created and started.
I also added the snc path to $PATH, since some errors were shown in the initial stages of the script. Probably unnecessary, but the output looks better without those messages.

I will keep you posted.
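
The first suggestion (checking that the default libvirt network exists and is started) could be sketched roughly as follows. This is not part of snc.sh; the function name and the VIRSH, CONNECT and DEFAULT_NET_XML variables are illustrative, and the XML path is an assumption about where the distribution ships the stock network definition.

```shell
#!/usr/bin/env bash
# Sketch only: VIRSH, CONNECT and DEFAULT_NET_XML are illustrative names, not snc settings.
VIRSH="${VIRSH:-virsh}"
CONNECT="${CONNECT:-qemu:///system}"
# Assumed location of the stock network definition; adjust for your distribution.
DEFAULT_NET_XML="${DEFAULT_NET_XML:-/usr/share/libvirt/networks/default.xml}"

ensure_default_network() {
    # Define the network if libvirt does not know about it at all.
    if ! "${VIRSH}" -c "${CONNECT}" net-list --all --name | grep -qx 'default'; then
        "${VIRSH}" -c "${CONNECT}" net-define "${DEFAULT_NET_XML}" || return 1
    fi
    # Start it if it is defined but inactive (net-list without --all shows active networks only).
    if ! "${VIRSH}" -c "${CONNECT}" net-list --name | grep -qx 'default'; then
        "${VIRSH}" -c "${CONNECT}" net-start default || return 1
    fi
    # Mark it to be started automatically together with libvirtd.
    "${VIRSH}" -c "${CONNECT}" net-autostart default
}
```

Calling ensure_default_network as root before running snc.sh should leave virsh net-list --all showing the default network as active with autostart enabled.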

anderotxoa commented 4 years ago

UPDATE2

Now it is timing out while creating the bootstrap node. I guess this could be due to slow disks (I only have access to regular HDDs). I may have to move to a Minsky server with NVMe, what do you think?

I also have plenty of RAM so I could use a RAM disk of ... lets say 64GB to accelerate the process but not sure where I could mount it, any suggestion?

Log:

DEBUG Unable to connect to the server: dial tcp 192.168.126.11:6443: i/o timeout
DEBUG Unable to connect to the server: dial tcp 192.168.126.11:6443: connect: no route to host
DEBUG The connection to the server api-int.crc.testing:6443 was refused - did you specify the right host or port?
DEBUG The connection to the server api-int.crc.testing:6443 was refused - did you specify the right host or port?
DEBUG The connection to the server api-int.crc.testing:6443 was refused - did you specify the right host or port?
DEBUG Unable to connect to the server: dial tcp 192.168.126.11:6443: i/o timeout
DEBUG The connection to the server api-int.crc.testing:6443 was refused - did you specify the right host or port?
DEBUG Unable to connect to the server: dial tcp 192.168.126.11:6443: connect: no route to host
DEBUG The connection to the server api-int.crc.testing:6443 was refused - did you specify the right host or port?
DEBUG Unable to connect to the server: dial tcp 192.168.126.11:6443: connect: no route to host
DEBUG Unable to connect to the server: dial tcp 192.168.126.11:6443: connect: no route to host
DEBUG Unable to connect to the server: dial tcp 192.168.126.11:6443: connect: no route to host
DEBUG The connection to the server api-int.crc.testing:6443 was refused - did you specify the right host or port?
DEBUG The connection to the server api-int.crc.testing:6443 was refused - did you specify the right host or port?
DEBUG The connection to the server api-int.crc.testing:6443 was refused - did you specify the right host or port?
DEBUG Gather remote logs
DEBUG Collecting info from crc-fgkhh-master-0.crc.testing
DEBUG lost connection
DEBUG ssh: connect to host crc-fgkhh-master-0.crc.testing port 22: No route to host
DEBUG Log bundle written to /var/home/core/log-bundle-20200901130126.tar.gz
INFO Bootstrap gather logs captured here "/root/snc/crc-tmp-install-data/log-bundle-20200901130126.tar.gz"
FATAL Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition

echo 'failed to create the cluster, but that is expected. We will block on a successful cluster via a future wait-for.'
failed to create the cluster, but that is expected. We will block on a successful cluster via a future wait-for.
renew_certificates
++ ./oc adm release -a /root/snc/secret.txt info quay.io/openshift-release-dev/ocp-release@sha256:2ad9a328235f9c116e1c4319df2f0610253b40646b45afe846798ed4ffe6bb7d --image-for=cli
cli_image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cb95ba34205fd6d3bd84b5fb6d3c8fb44333781fabc519838094bc30ee252679
./yq write kubelet-bootstrap-cred-manager-ds.yaml.in 'spec.template.spec.containers[0].image' quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cb95ba34205fd6d3bd84b5fb6d3c8fb44333781fabc519838094bc30ee252679
./oc apply -f kubelet-bootstrap-cred-manager-ds.yaml
error: unable to recognize "kubelet-bootstrap-cred-manager-ds.yaml": Get https://api.crc.testing:6443/api?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused
You have new mail in /var/spool/mail/root
[root@snc snc]#
[root@snc snc]#

cfergeau commented 4 years ago

I also have plenty of RAM so I could use a RAM disk of ... lets say 64GB to accelerate the process but not sure where I could mount it, any suggestion?

The VM images are created in /var/lib/libvirt/openshift-images. Hard to tell what went wrong in the install process with just these logs..

anderotxoa commented 4 years ago

I also have plenty of RAM so I could use a RAM disk of ... lets say 64GB to accelerate the process but not sure where I could mount it, any suggestion?

The VM images are created in /var/lib/libvirt/openshift-images. Hard to tell what went wrong in the install process with just these logs..

I just created a 64GB ramdisk in this place and I'm testing it now. Anyway, do you think it is possible (or where) to increase the 40m timeout for the bootstrap completion?

cfergeau commented 4 years ago

I just created a 64GB ramdisk in this place and I'm testing it now.

You need to mount your ramdisk at this location, not clear if this is what you did.
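
For illustration, mounting a tmpfs ramdisk over that directory could look like the sketch below. The function name and the IMAGES_DIR and RAMDISK_SIZE variables are hypothetical; only the directory path and the 64GB size come from this thread. Keep in mind that tmpfs contents disappear on unmount or reboot.

```shell
#!/usr/bin/env bash
# Sketch only: IMAGES_DIR and RAMDISK_SIZE are illustrative names.
IMAGES_DIR="${IMAGES_DIR:-/var/lib/libvirt/openshift-images}"
RAMDISK_SIZE="${RAMDISK_SIZE:-64G}"

mount_ramdisk() {
    # Make sure the mount point exists.
    mkdir -p "${IMAGES_DIR}" || return 1
    # Do nothing if a filesystem is already mounted there.
    if mountpoint -q "${IMAGES_DIR}"; then
        echo "${IMAGES_DIR} is already a mount point"
        return 0
    fi
    # Mount a RAM-backed tmpfs of the requested size (needs root).
    mount -t tmpfs -o "size=${RAMDISK_SIZE}" tmpfs "${IMAGES_DIR}"
}
```

After running mount_ramdisk as root, findmnt /var/lib/libvirt/openshift-images should report a tmpfs filesystem if the mount is in place.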

Anyway, do you think it is possible (or where) to increase the 40m timeout for the bootstrap completion?

I believe this was discussed before and rejected by the installer team.

anderotxoa commented 4 years ago

@cfergeau yes, I tested it both in /var/lib/libvirt/openshift-images and in /var/lib/libvirt/images but it made no difference. It still fails in the bootstrap creation.

I also changed the SMT level from 8 to 1 to give more power to the installation threads (it usually uses only two).

I will attach both the on-screen logs and the general log in case someone has time to help, because I cannot find a reason (I can only see that it still complains about the network, in this case the SDN...)

snc.log --> Look for ERROR to find the only strange thing I saw
log-bundle-20200902092624.tar.gz

mtarsel commented 4 years ago

So it's a P8 rhel 7.8 machine using release 4.5.7 from the latest mirror.

You have to upgrade to RHEL 8. The rhcos images for 4.3 and onward are not compatible with rhel 7 on ppc64le for snc at this time. Please upgrade to RHEL 8 and this should work.

anderotxoa commented 4 years ago

So it's a P8 rhel 7.8 machine using release 4.5.7 from the latest mirror.

You have to upgrade to RHEL 8. The rhcos images for 4.3 and onward are not compatible with rhel 7 on ppc64le for snc at this time. Please upgrade to RHEL 8 and this should work.

Hi @mtarsel This makes sense, I will reinstall with RH8 and report back if I'm successful. Thanks for the tip!

anderotxoa commented 4 years ago

Hi @mtarsel I reinstalled with RH8, but the same error still appears.

I will attach the log in case anyone can figure out something, also I will attach the command walkthrough I followed to install it in case something is wrong.

OCP SNC install walkthrough.txt log-bundle-20200904142155.tar.gz

cfergeau commented 4 years ago

yum install libvirt-devel libvirt-daemon-kvm libvirt-client -y
systemctl enable --now libvirtd

Just to be sure, is qemu-kvm installed as well? Should be a dependency of libvirt-daemon-kvm, but one never knows ^^

# Add IP and hostname to /etc/hosts
echo "10.1.3.15 snc" >> /etc/hosts

Not sure this is needed

vi /etc/libvirt/libvirtd.conf
# listen_tls = 0
# listen_tcp = 1
# auth_tcp = "none"
# tcp_port = "16509"

They should not be commented out (no # in front of these lines).
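
A quick way to confirm that the four settings are really active (not commented out) is sketched below. The function name is hypothetical; the keys and the default file path are the ones quoted in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: check_libvirtd_conf is an illustrative helper name.
check_libvirtd_conf() {
    local conf="${1:-/etc/libvirt/libvirtd.conf}"
    local missing=0
    local key
    for key in listen_tls listen_tcp auth_tcp tcp_port; do
        # Match only lines where the key starts the line, i.e. is not preceded by '#'.
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "${conf}"; then
            echo "not active: ${key}"
            missing=1
        fi
    done
    # Non-zero exit status if any setting is missing or commented out.
    return "${missing}"
}
```

Running check_libvirtd_conf with no argument inspects /etc/libvirt/libvirtd.conf and prints one line per setting that is missing or still commented out.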

iptables -I INPUT -p tcp -s 192.168.126.0/24 -d 192.168.122.1 --dport 16509 -j ACCEPT -m comment --comment "Allow insecure libvirt clients"

No need for that if you are using firewalld. After the firewalld changes, I would check that virsh -c qemu+tcp://192.168.122.1/system list --all returns with no errors.

echo server=/tt.testing/192.168.126.1 | sudo tee /etc/NetworkManager/dnsmasq.d/openshift.conf

This particular change is not needed, though hopefully it won't cause issues. Just delete /etc/NetworkManager/dnsmasq.d/openshift.conf and restart NetworkManager: sudo systemctl restart NetworkManager

anderotxoa commented 4 years ago

Hi @cfergeau Thanks for the comments.
Yes, those libvirtd.conf lines are uncommented in the actual file; the '#' characters are there just as a reminder.
Yes, QEMU is installed.
The output of your command shows this:

[root@snc ~]# virsh -c qemu+tcp://192.168.122.1/system list --all
setlocale: No such file or directory
 Id Name State
 1 crc-nlqwb-bootstrap running
 - crc-nlqwb-master-0 shut off

[root@snc ~]#

I will delete the file you mention and start all over again

anderotxoa commented 4 years ago

Hi @cfergeau

No matter what I do, I always get the same problem. I have upgraded to RH8, moved to SMT1 and even used a ramdisk for /var/lib/libvirt/openshift-images. I have seen that it wastes a huge amount of time after the first three lines of this log, and then (after maybe 30 minutes) it shows the following errors:

echo 'API server is up, applying etcd hack'
API server is up, applying etcd hack
oc patch etcd cluster '-p={"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}' --type=merge
etcd.operator.openshift.io/cluster patched

(HUGE DELAY HERE)

E0908 13:22:41.480062 276012 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch v1.ConfigMap: Get https://api.crc.testing:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=3897&timeoutSeconds=502&watch=true: dial tcp 192.168.126.10:6443: connect: connection refused
E0908 13:22:44.608542 276012 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch v1.ConfigMap: Get https://api.crc.testing:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=3897&timeoutSeconds=595&watch=true: dial tcp 192.168.126.11:6443: connect: no route to host
E0908 13:22:47.726212 276012 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list v1.ConfigMap: Get https://api.crc.testing:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: dial tcp 192.168.126.10:6443: connect: connection refused
E0908 13:22:50.868144 276012 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list v1.ConfigMap: Get https://api.crc.testing:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: dial tcp 192.168.126.11:6443: connect: no route to host
E0908 13:22:53.972520 276012 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list v1.ConfigMap: Get https://api.crc.testing:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: dial tcp 192.168.126.10:6443: connect: connection refused
I0908 13:23:07.108441 276012 trace.go:116] Trace[1462331844]: "Reflector ListAndWatch" name:k8s.io/client-go/tools/watch/informerwatcher.go:146 (started: 2020-09-08 13:22:54.972654663 +0200 CEST m=+1541.174443011) (total time: 12.13573127s):
Trace[1462331844]: [12.13573127s] [12.13573127s] END
E0908 13:23:07.108472 276012 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list v1.ConfigMap: Get https://api.crc.testing:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0: net/http: TLS handshake timeout

anderotxoa commented 4 years ago

Problem found. While both KVM virtual machines appear as running, connecting to the console I can see they started booting but never finished. They are stuck somewhere.

This is the output I could see:

[root@snc snc]# virsh console crc-p6rcr-master-0
setlocale: No such file or directory
Connected to domain crc-p6rcr-master-0
Escape character is ^]
[ 186.020523] Processor 2 is stuck.
[ 186.046023] systemd-udevd[581]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
[ 186.047537] systemd-udevd[524]: seq 1749 '/devices/pci0000:00/0000:00:01.0/virtio0' is taking a long time
[ 186.047713] systemd-udevd[524]: seq 1792 '/devices/system/cpu/cpu5' is taking a long time
[ 186.047840] systemd-udevd[524]: seq 1791 '/devices/system/cpu/cpu4' is taking a long time
[ 186.047976] systemd-udevd[524]: seq 1790 '/devices/system/cpu/cpu3' is taking a long time
[ 186.048102] systemd-udevd[524]: seq 1758 '/devices/pci0000:00/0000:00:03.0/virtio1' is taking a long time
[ 186.048257] systemd-udevd[524]: seq 1789 '/devices/system/cpu/cpu2' is taking a long time
[ 186.048382] systemd-udevd[524]: seq 1788 '/devices/system/cpu/cpu1' is taking a long time
[ 186.048535] systemd-udevd[524]: seq 1787 '/devices/system/cpu/cpu0' is taking a long time
[ 186.049720] systemd[1]: Starting udev Wait for Complete Device Initialization...
Starting udev Wait for Complete Device Initialization...
[ 186.298686] crypto_register_alg 'xts(aes)' = 0

mtarsel commented 4 years ago

The problem in the comment above doesn't seem to be related to snc. I think this issue should be closed, but I'll open a separate issue about the default network prerequisite for snc on ppc64le.

anderotxoa commented 4 years ago

Hi @mtarsel, yes I agree. In the end I did not find the cause of the error, but it is clear to me that it happens when trying to run the VMs inside an LPAR. They hang. Bare metal must be used instead.

gbraad commented 4 years ago

an issue related to nested virtualization?

anderotxoa commented 4 years ago

an issue related to nested virtualization?

Exactly

crc-org / snc

Installer exits with error message: ERROR Error: virError(Code=38, Domain=7, Message='unable to connect to server at '192.168.122.1:16509': Connection timed out') #227