Closed hernad closed 4 years ago
DNS setup:
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> axfr @192.168.168.10 snc.test
; (1 server found)
;; global options: +cmd
snc.test. 10800 IN SOA dc1.sa.out.ba. root.sa.out.ba. 18 28800 7200 604800 3600
snc.test. 900 IN NS ns.snc.test.
api.okd4-snc.snc.test. 900 IN A 192.168.168.164
api.okd4-snc.snc.test. 900 IN A 192.168.168.165
_etcd-server-ssl._tcp.okd4-snc.snc.test. 900 IN SRV 0 10 2380 etcd-0.okd4-snc.snc.test.
okd4-snc-host.snc.test. 900 IN A 192.168.168.160
api-int.okd4-snc.snc.test. 900 IN A 192.168.168.164
api-int.okd4-snc.snc.test. 900 IN A 192.168.168.165
*.apps.okd4-snc.snc.test. 900 IN CNAME okd4-snc-master.snc.test.
okd4-snc-bootstrap.snc.test. 900 IN A 192.168.168.164
ns.snc.test. 900 IN A 192.168.168.10
etcd-0.okd4-snc.snc.test. 900 IN A 192.168.168.165
okd4-snc-master.snc.test. 900 IN A 192.168.168.165
snc.test. 10800 IN SOA dc1.sa.out.ba. root.sa.out.ba. 18 28800 7200 604800 3600
[root@hp-144 okd4-snc]# cat ~/bin/setSncEnv.sh
export SNC_DOMAIN=snc.test
export SNC_HOST=192.168.168.160
#export SNC_NAMESERVER=${SNC_HOST}
export SNC_NAMESERVER=192.168.168.10
export SNC_NETMASK=255.255.255.0
export SNC_GATEWAY=192.168.168.254
export INSTALL_HOST_IP=${SNC_HOST}
export INSTALL_ROOT=/usr/share/nginx/html/install
export INSTALL_URL=http://${SNC_HOST}/install
export OKD4_SNC_PATH=/root/okd4-snc
export OKD_REGISTRY=quay.io/openshift/okd
export OKD_RELEASE=4.4.0-0.okd-2020-05-23-055148-beta5
[root@hp-144 okd4-snc]# cat install-config-snc.yaml
apiVersion: v1
baseDomain: snc.test
metadata:
  name: okd4-snc
networking:
  networkType: OpenShiftSDN
  clusterNetwork:
  - cidr: 10.100.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
compute:
- name: worker
  replicas: 0
controlPlane:
  name: master
  replicas: 1
platform:
  none: {}
pullSecret: '{"auths":{"fake":{"auth": "bar"}}}'
sshKey: ssh-rsa AAAABetc etc etc root@hp-144
FCOS defined in ~/bin/DeployOkdSnc.sh
CPU="4"
MEMORY="16384"
DISK="200"
FCOS_VER=31.20200505.2.0
FCOS_STREAM=testing
nginx on host is ok:
curl http://okd4-snc-host.snc.test/install/fcos/ignition/bootstrap.ign
100 279k 100 279k 0 0 18.2M 0 --:--:-- --:--:-- --:--:-- 19.4M
The FCOS version is old because recent versions of FCOS broke my install. I'm working on an alternative that works with the live ISO. The replacement of the isolinux.cfg that I'm doing in the deployment script no longer works with more recent versions of FCOS... I don't know why yet.
The error that you reported above is normal while the bootstrap is starting up. It can take a few minutes before it's up and listening on port 2379.
How long did you wait?
Just to add, I have also tried the installation with newer FCOS images and OKD 4.5.x, with no success. Specifically, with these FCOS versions:
#FCOS_VER=32.20200615.3.0
#FCOS_STREAM=stable
#FCOS_VER=32.20200629.2.0
#FCOS_STREAM=testing
#FCOS_VER=31.20200517.3.0
#FCOS_STREAM=stable
and this OKD
#export OKD_RELEASE=4.5.0-0.okd-2020-06-29-110348-beta6
It's possible that during the install of the bootstrap node it is upgrading to FCOS 32 which we are having some issues with.
See: https://github.com/openshift/okd/issues/229 and https://github.com/openshift/okd/issues/238
I've got a long weekend coming up with the holiday here in the States. I hope to get some work done on this. Recent versions of FCOS seem to have broken it.
Thanks for your feedback @cgruver.
How long did you wait?
As long as I am writing this :). About 30 minutes. Still the same...
This is master side virsh console:
[root@hp-144 okd4-snc]# virsh console okd4-snc-master
Connected to domain okd4-snc-master
[ *** ] A start job is running for Ignition (fetch) (50min 37s / no limit)
[ 3040.240220] ignition[527]: GET https://api-int.okd4-snc.snc.test:22623/config/master: attempt #606
[ 3040.247110] ignition[527]: GET error: Get https://api-int.okd4-snc.snc.test:22623/config/master: dial tcp 192.168.168[ *] A start job is running for Ignition (fetch) (50min 42s / no limit)
[ 3045.248142] ignition[527]: GET https://api-int.okd4-snc.snc.test:22623/config/master: attempt #607
[ 3045.256863] ignition[527]: GET error: Get https://api-int.okd4-snc.snc.test:22623/config/master: dial tcp 192.168.168[ *** ] A start job is running for Ignition (fetch) (50min 47s / no limit)
[ 3050.257876] ignition[527]: GET https://api-int.okd4-snc.snc.test:22623/config/master: attempt #608
[ 3050.266642] ignition[527]: GET error: Get https://api-int.okd4-snc.snc.test:22623/config/master: dial tcp 192.168.168[ *] A start job is running for Ignition (fetch) (50min 49s / no limit)
...
The master is obviously stuck in the Ignition phase.
It's possible that during the install of the bootstrap node it is upgrading to FCOS 32 which we are having some issues with.
You are right. An SSH login to the bootstrap node shows Fedora CoreOS 31.20200521.20.0. There was an upgrade from 31.20200505.2.0.
Yes, what you are seeing is similar to the problem that I am having now building a full cluster. The master nodes cannot pull the ignition from the bootstrap node. I think this is related to the issues I listed above.
try:
curl -v --insecure https://api-int.okd4-snc.snc.test:22623/config/master
See if you get a 500 error. That's what I am seeing. The bootstrap node is failing to serve up the ignition files.
Track progress here: https://github.com/openshift/okd/issues/239
Digging deeper, I'm not sure you are seeing the issue that we have with FCOS 32 and OKD 4.5...
What is the output of: curl -v --insecure https://api-int.okd4-snc.snc.test:22623/config/master
Run it several times to make sure that DNS round-robin is working. It should hit your bootstrap node.
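A minimal sketch of that repeated check, assuming a curl build that supports the %{remote_ip} and %{http_code} write-out variables. The function name probe_mcs and the default of 6 attempts are hypothetical and not part of the okd4-snc scripts.

```shell
#!/bin/sh
# probe_mcs URL [ATTEMPTS]
# Hypothetical helper: requests URL several times and prints, for each
# attempt, the IP address curl connected to and the HTTP status it got.
# A status of 000 means no HTTP response arrived (for example, connection
# refused); the IP field may be empty in that case.
probe_mcs() {
    url=$1
    attempts=${2:-6}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # --insecure: the host does not trust the certificate on port 22623
        curl --silent --insecure --output /dev/null --max-time 5 \
             --write-out '%{remote_ip} %{http_code}\n' "$url" || true
        i=$((i + 1))
    done
}

# Usage (not run here):
#   probe_mcs https://api-int.okd4-snc.snc.test:22623/config/master 6 | sort | uniq -c
```

With round-robin DNS working, both the bootstrap IP and the master IP should show up across the attempts.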
curl -v --insecure https://api-int.okd4-snc.snc.test:22623/config/master
* About to connect() to api-int.okd4-snc.snc.test port 22623 (#0)
* Trying 192.168.168.165...
* Connection refused
* Trying 192.168.168.164...
* Connection refused
* Failed connect to api-int.okd4-snc.snc.test:22623; Connection refused
* Closing connection 0
curl: (7) Failed connect to api-int.okd4-snc.snc.test:22623; Connection refused
There is no service on port 22623?!
active services on bootstrap:
[root@okd4-snc-bootstrap ~]# netstat -tlnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:34083 0.0.0.0:* LISTEN 805/crio
tcp 0 0 127.0.0.1:10248 0.0.0.0:* LISTEN 894/kubelet
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1/systemd
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 723/sshd
tcp 0 0 0.0.0.0:49241 0.0.0.0:* LISTEN 798/rpc.statd
tcp6 0 0 :::6080 :::* LISTEN 3705/kube-etcd-sign
tcp6 0 0 :::10250 :::* LISTEN 894/kubelet
tcp6 0 0 :::6443 :::* LISTEN 3705/kube-etcd-sign
tcp6 0 0 :::10255 :::* LISTEN 894/kubelet
tcp6 0 0 :::111 :::* LISTEN 1/systemd
tcp6 0 0 :::40593 :::* LISTEN 798/rpc.statd
tcp6 0 0 :::22 :::* LISTEN 723/sshd
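The listing above can also be checked mechanically. This is a small sketch, assuming the column layout that netstat -tlnp prints here (local address in column 4, state in column 6); the function name port_listening is hypothetical.

```shell
#!/bin/sh
# port_listening PORT
# Hypothetical helper: reads `netstat -tlnp` output on stdin and succeeds
# if some socket is in LISTEN state on PORT (any address, IPv4 or IPv6).
port_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            # Local address looks like 0.0.0.0:22 or :::6443; the port is
            # whatever follows the last colon.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on the bootstrap node (not run here):
#   netstat -tlnp | port_listening 22623 && echo "listening" || echo "nothing on 22623"
```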
Digging deeper, I'm not sure you are seeing the issue that we have with FCOS 32 and OKD 4.5...
I have noticed that.
Try tearing it down, and running everything again.
DestroyBootstrap.sh
UnDeploySncNode.sh
Double check your DNS config against the files that I provided. This entry may be incorrect:
_etcd-server-ssl._tcp.okd4-snc.snc.test. 900 IN SRV 0 10 2380 etcd-0.okd4-snc.snc.test.
I believe that there should not be a . after _etcd-server-ssl._tcp.okd4-snc.snc.test
Also note that after the bootstrap process completes, you will have to remove the A records for api and api-int that refer to the bootstrap node IP. That is why I include the remove-after-bootstrap marker in my example zone file.
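To preview which records that cleanup affects, here is a rough sketch that filters a zone dump in the "name ttl IN type value" layout printed by dig. It only filters text; applying the change on the DNS server is a separate step. The function name drop_bootstrap_api_records is hypothetical.

```shell
#!/bin/sh
# drop_bootstrap_api_records BOOTSTRAP_IP
# Hypothetical helper: reads zone records on stdin (one per line) and
# prints them back without the api / api-int A records that point at
# BOOTSTRAP_IP. All other records, including the bootstrap host's own
# A record, pass through unchanged.
drop_bootstrap_api_records() {
    awk -v ip="$1" '
        $4 == "A" && $5 == ip && $1 ~ /^api(-int)?\./ { next }
        { print }
    '
}

# Usage (not run here):
#   dig axfr @192.168.168.10 snc.test | drop_bootstrap_api_records 192.168.168.164
```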
I just pushed an update that works with FCOS 32 and OKD 4 Beta 6.
It is also tested with Beta 5.
@cgruver, great work !
Yesterday I finally got a working cluster using this configuration: https://github.com/hernad/okd4-snc-qemu
It is mostly based on your work. The difference is that the Ignition file is loaded via a QEMU firmware option. The upside of this configuration is that the nginx HTTP server is not needed. I had success with the latest FCOS 32 testing image and OKD 4.5.
I just pushed an update that works with FCOS 32 and OKD 4 Beta 6
I will try this after current investigation of my first working cluster :)
Again, thanks for your work and support.
I believe that there should not be a . after _etcd-server-ssl._tcp.okd4-snc.snc.test
For your information, the dot at the end is OK. It is standard in DNS zone configuration: it says "this is a fully qualified name - STOP", so the zone origin is not appended.
I have seen similar examples in the OKD documentation where the FQDN ends with a dot.
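A tiny sketch of the zone-file rule in question: a name with a trailing dot is taken as-is, while a name without one is relative and gets the origin appended. The function name qualify is hypothetical and only mimics what a name server does when loading a zone.

```shell
#!/bin/sh
# qualify NAME ORIGIN
# Hypothetical helper mimicking zone-file name handling: a name ending in
# "." is already fully qualified and is returned unchanged; any other name
# is relative and gets the zone origin appended.
qualify() {
    case $1 in
        *.) printf '%s\n' "$1" ;;
        *)  printf '%s.%s\n' "$1" "$2" ;;
    esac
}

# Usage:
#   qualify etcd-0.okd4-snc.snc.test. snc.test.   # unchanged
#   qualify etcd-0.okd4-snc.snc.test  snc.test.   # origin appended
```

Dropping the dot inside a zone file would therefore turn the SRV target into etcd-0.okd4-snc.snc.test.snc.test., which is not what is wanted.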
Excellent!
I will take a look at your config. Eliminating the Nginx server will simplify the deployment for folks.
Hi, my bootstrap node reports this error:
[root@okd4-snc-bootstrap ~]# journalctl -b -f -u bootkube.service