Error: unhealthy cluster - https://localhost:2379 #3

hernad commented 4 years ago

Hi, my bootstrap node reports this error:

[root@okd4-snc-bootstrap ~]# journalctl -b -f -u bootkube.service

Jul 01 16:27:29 okd4-snc-bootstrap.snc.test[674]: {"level":"warn","ts":"2020-07-01T16:27:29.381Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-9f885c4d-609f-444a-beab-393ba59f3c08/localhost:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Jul 01 16:27:29 okd4-snc-bootstrap.snc.test[674]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Jul 01 16:27:29 okd4-snc-bootstrap.snc.test[674]: Error: unhealthy cluster
hernad commented 4 years ago

DNS setup:

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> axfr @ snc.test
; (1 server found)
;; global options: +cmd
snc.test.               10800   IN      SOA 18 28800 7200 604800 3600
snc.test.               900     IN      NS      ns.snc.test.
api.okd4-snc.snc.test.  900     IN      A
api.okd4-snc.snc.test.  900     IN      A
_etcd-server-ssl._tcp.okd4-snc.snc.test. 900 IN SRV 0 10 2380 etcd-0.okd4-snc.snc.test.
okd4-snc-host.snc.test. 900     IN      A
api-int.okd4-snc.snc.test. 900  IN      A
api-int.okd4-snc.snc.test. 900  IN      A
*.apps.okd4-snc.snc.test. 900   IN      CNAME   okd4-snc-master.snc.test.
okd4-snc-bootstrap.snc.test. 900 IN     A
ns.snc.test.            900     IN      A
etcd-0.okd4-snc.snc.test. 900   IN      A
okd4-snc-master.snc.test. 900   IN      A
snc.test.               10800   IN      SOA 18 28800 7200 604800 3600

hernad commented 4 years ago

[root@hp-144 okd4-snc]# cat ~/bin/

export SNC_DOMAIN=snc.test
export SNC_HOST=
export INSTALL_ROOT=/usr/share/nginx/html/install
export INSTALL_URL=http://${SNC_HOST}/install
export OKD4_SNC_PATH=/root/okd4-snc
export OKD_RELEASE=4.4.0-0.okd-2020-05-23-055148-beta5
hernad commented 4 years ago

[root@hp-144 okd4-snc]# cat install-config-snc.yaml

apiVersion: v1
baseDomain: snc.test
  name: okd4-snc
  networkType: OpenShiftSDN
  - cidr:
    hostPrefix: 23
- name: worker
  replicas: 0
  name: master
  replicas: 1
  none: {}
pullSecret: '{"auths":{"fake":{"auth": "bar"}}}'
sshKey: ssh-rsa AAAABetc etc etc root@hp-144
hernad commented 4 years ago

FCOS defined in ~/bin/

hernad commented 4 years ago

nginx on host is ok:

curl http://okd4-snc-host.snc.test/install/fcos/ignition/bootstrap.ign

100 279k 100 279k 0 0 18.2M 0 --:--:-- --:--:-- --:--:-- 19.4M

cgruver commented 4 years ago

The FCOS version is old because recent versions of FCOS broke my install. I'm working on an alternative that works with the live ISO. The replacement of the isolinux.cfg that I'm doing in the deployment script no longer works with more recent versions of FCOS... I don't know why yet.

The error that you reported above is normal while the bootstrap is starting up. It can take a few minutes before it's up and listening on port 2379.
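If you would rather not watch the journal by hand, a wait loop like the following can tell you when something starts listening. This is only a sketch: the function name, timeout, and interval are my own choices, not part of the deployment scripts, and a successful TCP connect only shows that the port is open, not that etcd is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Return True once a TCP connect to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # "connection refused" (or a timeout) just means nothing is listening yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# On the bootstrap node, for example:
#   wait_for_port("localhost", 2379)
```

If this returns False after a generous timeout, the "connection refused" messages are probably not just startup noise.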

How long did you wait?

hernad commented 4 years ago

Just to add, I have also tried installation with newer FCOS images and OKD 4.5.x, with no success. Specifically, with these FCOS images:




and this OKD

#export OKD_RELEASE=4.5.0-0.okd-2020-06-29-110348-beta6
cgruver commented 4 years ago

It's possible that during the install of the bootstrap node it is upgrading to FCOS 32 which we are having some issues with.

See: and

cgruver commented 4 years ago

I've got a long weekend coming up with the holiday here in the States. I hope to get some work done on this. Recent versions of FCOS seem to have broken it.

hernad commented 4 years ago

Thanks for your feedback, @cgruver.

How long did you wait?

As long as I am writing this :). About 30 minutes. Still the same...

hernad commented 4 years ago

This is master side virsh console:

[root@hp-144 okd4-snc]# virsh console okd4-snc-master

Connected to domain okd4-snc-master
[ ***  ] A start job is running for Ignition (fetch) (50min 37s / no limit)
[ 3040.240220] ignition[527]: GET https://api-int.okd4-snc.snc.test:22623/config/master: attempt #606
[ 3040.247110] ignition[527]: GET error: Get https://api-int.okd4-snc.snc.test:22623/config/master: dial tcp 192.168.168[     *] A start job is running for Ignition (fetch) (50min 42s / no limit)
[ 3045.248142] ignition[527]: GET https://api-int.okd4-snc.snc.test:22623/config/master: attempt #607
[ 3045.256863] ignition[527]: GET error: Get https://api-int.okd4-snc.snc.test:22623/config/master: dial tcp 192.168.168[ ***  ] A start job is running for Ignition (fetch) (50min 47s / no limit)
[ 3050.257876] ignition[527]: GET https://api-int.okd4-snc.snc.test:22623/config/master: attempt #608
[ 3050.266642] ignition[527]: GET error: Get https://api-int.okd4-snc.snc.test:22623/config/master: dial tcp 192.168.168[     *] A start job is running for Ignition (fetch) (50min 49s / no limit)
hernad commented 4 years ago

The master is obviously stuck at the Ignition phase.

hernad commented 4 years ago

It's possible that during the install of the bootstrap node it is upgrading to FCOS 32 which we are having some issues with.

You are right. After logging in to the bootstrap node over SSH, it reports Fedora CoreOS 31.20200521.20.0, so it was upgraded from 31.20200505.2.0.

cgruver commented 4 years ago

Yes, what you are seeing is similar to the problem that I am having now building a full cluster. The master nodes cannot pull the ignition from the bootstrap node. I think this is related to the issues I listed above.


curl -v --insecure https://api-int.okd4-snc.snc.test:22623/config/master

See if you get a 500 error. That's what I am seeing. The bootstrap node is failing to serve up the ignition files.

cgruver commented 4 years ago

Track progress here:

cgruver commented 4 years ago

Digging deeper, I'm not sure you are seeing the issue that we have with FCOS 32 and OKD 4.5...

What is the output of: curl -v --insecure https://api-int.okd4-snc.snc.test:22623/config/master

Run it several times to make sure that DNS round-robin is working. It should hit your bootstrap node.
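Instead of repeating curl and hoping round-robin eventually hits each record, you could probe every address the name resolves to. A rough sketch, assuming plain TCP reachability is enough for a first check (it does not look at the HTTP status, so it will not distinguish a 500 from a 200); the helper names are made up for illustration:

```python
import socket
from typing import Dict, List


def resolve_all(host: str, port: int) -> List[str]:
    """Return every distinct address the resolver hands back for host, sorted."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def can_connect(address: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to address:port is accepted."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(host: str, port: int) -> Dict[str, bool]:
    """Map each resolved address to whether it accepts connections on port."""
    return {addr: can_connect(addr, port) for addr in resolve_all(host, port)}


# Example, using the names from this thread:
#   probe("api-int.okd4-snc.snc.test", 22623)
```

During bootstrap you would expect the bootstrap node's address to show True; the master's address will refuse connections until it has its own config.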

hernad commented 4 years ago

curl -v --insecure https://api-int.okd4-snc.snc.test:22623/config/master

* About to connect() to api-int.okd4-snc.snc.test port 22623 (#0)
*   Trying
* Connection refused
*   Trying
* Connection refused
* Failed connect to api-int.okd4-snc.snc.test:22623; Connection refused
* Closing connection 0
curl: (7) Failed connect to api-int.okd4-snc.snc.test:22623; Connection refused

There is no service on port 22623?!

hernad commented 4 years ago

Active services on bootstrap:

[root@okd4-snc-bootstrap ~]# netstat -tlnp

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0*               LISTEN      805/crio            
tcp        0      0*               LISTEN      894/kubelet         
tcp        0      0   *               LISTEN      1/systemd           
tcp        0      0    *               LISTEN      723/sshd            
tcp        0      0 *               LISTEN      798/rpc.statd       
tcp6       0      0 :::6080                 :::*                    LISTEN      3705/kube-etcd-sign 
tcp6       0      0 :::10250                :::*                    LISTEN      894/kubelet         
tcp6       0      0 :::6443                 :::*                    LISTEN      3705/kube-etcd-sign 
tcp6       0      0 :::10255                :::*                    LISTEN      894/kubelet         
tcp6       0      0 :::111                  :::*                    LISTEN      1/systemd           
tcp6       0      0 :::40593                :::*                    LISTEN      798/rpc.statd       
tcp6       0      0 :::22                   :::*                    LISTEN      723/sshd
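
To check a listing like this for a specific port without reading it by eye, something along these lines could work. It is a sketch that assumes the usual `netstat -tln` column order and silently skips rows that do not match it; the function name is made up for illustration.

```python
from typing import Set


def listening_tcp_ports(netstat_output: str) -> Set[int]:
    """Collect local ports from the LISTEN rows of `netstat -tln` style output."""
    ports = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected row: proto recv-q send-q local-address foreign-address state [pid/program]
        if len(fields) < 6 or not fields[0].startswith("tcp") or fields[5] != "LISTEN":
            continue
        # The port is whatever follows the last colon of the local address.
        port = fields[3].rsplit(":", 1)[-1]
        if port.isdigit():
            ports.add(int(port))
    return ports


# Example:
#   22623 in listening_tcp_ports(output_of_netstat)
```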
hernad commented 4 years ago

Digging deeper, I'm not sure you are seeing the issue that we have with FCOS 32 and OKD 4.5...

I have noticed that.

cgruver commented 4 years ago

Try tearing it down, and running everything again. 

Double check your DNS config against the files that I provided. This entry may be incorrect:

_etcd-server-ssl._tcp.okd4-snc.snc.test. 900 IN SRV 0 10 2380 etcd-0.okd4-snc.snc.test.

I believe that there should not be a . after _etcd-server-ssl._tcp.okd4-snc.snc.test

Also note that after the bootstrap process completes, you will have to remove the A records for api and api-int that refer to the bootstrap node IP. That is why I include the remove-after-bootstrap in my example zone file.
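
The cleanup step could be scripted roughly like this. It is only a sketch: it assumes the temporary records carry the text remove-after-bootstrap on the same line (for example as a trailing comment), and the addresses below are placeholder documentation IPs, not values from this thread.

```python
from typing import List

# Assumed to appear on the same line as each temporary record.
MARKER = "remove-after-bootstrap"


def strip_bootstrap_records(zone_lines: List[str], marker: str = MARKER) -> List[str]:
    """Return the zone lines with every marker-tagged record dropped."""
    return [line for line in zone_lines if marker not in line]


# Hypothetical zone fragment; 192.0.2.x addresses are placeholders.
zone = [
    "api.okd4-snc.snc.test.      900 IN A 192.0.2.10 ; remove-after-bootstrap",
    "api.okd4-snc.snc.test.      900 IN A 192.0.2.11",
    "api-int.okd4-snc.snc.test.  900 IN A 192.0.2.10 ; remove-after-bootstrap",
    "api-int.okd4-snc.snc.test.  900 IN A 192.0.2.11",
]
cleaned = strip_bootstrap_records(zone)
```

After editing a zone file this way you would normally also bump the SOA serial and reload the name server.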

cgruver commented 4 years ago

I just pushed an update that works with FCOS 32 and OKD 4 Beta 6

It was also tested with Beta 5.

hernad commented 4 years ago

@cgruver, great work !

Yesterday I finally got a working cluster using this configuration:

It is mostly based on your work. The difference is that the Ignition file is loaded via the QEMU firmware configuration option. The advantage of this configuration is that the nginx HTTP server is not needed. I had success with the latest FCOS 32 test image and OKD 4.5.
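
For readers who have not used that option: Fedora CoreOS documents passing Ignition to QEMU through the fw_cfg key opt/com.coreos/config. A minimal sketch of building those arguments follows; the helper name and the example path are hypothetical, and this is not necessarily how the linked configuration does it.

```python
from typing import List

# Key that Fedora CoreOS documents for handing Ignition to a QEMU guest.
FW_CFG_KEY = "opt/com.coreos/config"


def ignition_fw_cfg_args(ignition_path: str) -> List[str]:
    """Build the extra QEMU arguments that expose an Ignition file via fw_cfg."""
    if "," in ignition_path:
        # QEMU splits option values on commas, so such a path would need escaping.
        raise ValueError("ignition path must not contain a comma")
    return ["-fw_cfg", f"name={FW_CFG_KEY},file={ignition_path}"]


# Example with a hypothetical path:
#   ignition_fw_cfg_args("/tmp/bootstrap.ign")
```

With libvirt, the same string can be handed to virt-install through its --qemu-commandline option.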

I just pushed an update that works with FCOS 32 and OKD 4 Beta 6

I will try this after current investigation of my first working cluster :)

Again, thanks for your work and support.

hernad commented 4 years ago

I believe that there should not be a . after _etcd-server-ssl._tcp.okd4-snc.snc.test

For your information, the dot at the end is OK. It is standard in DNS zone configuration to end a name with a dot to say "this is a fully qualified name - STOP".

I have seen similar examples in the OKD documentation where the FQDN ends with a dot.
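
To make the rule concrete, here is a small sketch of how a zone file expands owner names (per the RFC 1035 master file format): a name with a trailing dot is taken as-is, anything else gets the zone origin appended. The function name is made up for illustration.

```python
def qualify(name: str, origin: str) -> str:
    """Expand a zone-file name the way a master file does, relative to origin."""
    if not origin.endswith("."):
        origin += "."
    if name == "@":
        # "@" stands for the origin itself.
        return origin
    if name.endswith("."):
        # Trailing dot: already fully qualified, nothing is appended.
        return name
    return f"{name}.{origin}"


# With the dot the name is left alone; without it the origin is appended:
#   qualify("etcd-0.okd4-snc.snc.test.", "snc.test")
#   qualify("etcd-0.okd4-snc.snc.test", "snc.test")
```

So dropping the dot from a name that is already written out in full would produce a doubled suffix.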

cgruver commented 4 years ago


I will take a look at your config. Eliminating the Nginx server will simplify the deployment for folks.