
[BUG] Microshift fails to start: `subjectAltNames must not contain localhost, 127.0.0.1` #3757

Closed gbraad closed 1 year ago

gbraad commented 1 year ago

This is the current latest version of CRC, installed on a clean install of Fedora 38. When starting CRC / MicroShift:

$ ./crc version                                          
CRC version: 2.23.0+ddcfe8
OpenShift version: 4.13.3
Podman version: 4.4.4
$ ./crc start --log-level debug
...
INFO Starting Microshift service... [takes around 1min] 
DEBU Cannot load secret from configuration: empty path 
DEBU Using secret from keyring                    
DEBU Creating /etc/crio/openshift-pull-secret with permissions 0600 in the CRC VM 
DEBU Running SSH command: <hidden>                
DEBU SSH command succeeded                        
DEBU Using root access: Starting microshift service 
DEBU Running SSH command: sudo systemctl start microshift 
DEBU SSH command results: err: Process exited with status 1, output:  
DEBU Making call to close driver server           
DEBU (crc) Calling .Close                         
DEBU Successfully made call to close driver server 
DEBU Making call to close connection to plugin binary 
DEBU (crc) DBG | time="2023-07-18T15:45:36+08:00" level=debug msg="Closing plugin on server side" 
ssh command error:
command : sudo systemctl start microshift
err     : Process exited with status 1
...
$ ssh -i ~/.crc/machines/crc/id_ecdsa core@192.168.130.11
Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Boot Status is GREEN - Health Check SUCCESS
[core@api ~]$ systemctl status
● api.crc.testing
    State: degraded
...
$ systemctl list-units --failed
  UNIT                     LOAD   ACTIVE SUB    DESCRIPTION     
● microshift.service       loaded failed failed MicroShift
● qemu-guest-agent.service loaded failed failed QEMU Guest Agent
...
$ journalctl -u microshift
-- Boot b0a7d417f9dc47008439741d9b913d9c --
Jul 18 03:45:35 api.crc.testing systemd[1]: Starting MicroShift...
Jul 18 03:45:35 api.crc.testing microshift[2332]: ??? F0718 03:45:35.747087    2332 run.go:46] Error in reading or validating configuration: subjectAltNames must not contain localhost, 127.0.0.1
Jul 18 03:45:35 api.crc.testing systemd[1]: microshift.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 18 03:45:35 api.crc.testing systemd[1]: microshift.service: Failed with result 'exit-code'.
Jul 18 03:45:35 api.crc.testing systemd[1]: Failed to start MicroShift.
Jul 18 03:45:35 api.crc.testing systemd[1]: microshift.service: Scheduled restart job, restart counter is at 1.
Jul 18 03:45:35 api.crc.testing systemd[1]: Stopped MicroShift.
Jul 18 03:45:36 api.crc.testing systemd[1]: Starting MicroShift...
Jul 18 03:45:36 api.crc.testing microshift[2424]: ??? F0718 03:45:36.480606    2424 run.go:46] Error in reading or validating configuration: subjectAltNames must not contain localhost, 127.0.0.1
Jul 18 03:45:36 api.crc.testing systemd[1]: microshift.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 18 03:45:36 api.crc.testing systemd[1]: microshift.service: Failed with result 'exit-code'.
Jul 18 03:45:36 api.crc.testing systemd[1]: Failed to start MicroShift.
Jul 18 03:45:36 api.crc.testing systemd[1]: microshift.service: Scheduled restart job, restart counter is at 2.
Jul 18 03:45:36 api.crc.testing systemd[1]: Stopped MicroShift.
Jul 18 03:45:37 api.crc.testing systemd[1]: Starting MicroShift...
...

... this continues until the restart counter reaches 5.
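
That matches systemd's default start limit (StartLimitBurst defaults to 5). The unit's actual limits can be checked with something like this (a sketch, not run here):

[core@api ~]$ systemctl show microshift -p Restart -p StartLimitBurst -p StartLimitIntervalUSec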

praveenkumar commented 1 year ago

Does it happen every single time? I just tried it on F-38 and was not able to reproduce this.

gbraad commented 1 year ago

Every single time with this image... delete and start result in the same.

praveenkumar commented 1 year ago

Also, is this F-38 a VM or on bare metal? I want to try the same setup to identify the issue.

gbraad commented 1 year ago

Any idea what `Error in reading or validating configuration: subjectAltNames must not contain localhost, 127.0.0.1` could refer to?

https://access.redhat.com/documentation/en-us/red_hat_build_of_microshift/4.13/html/configuring/microshift-using-config-tools#microshift-yaml-default_microshift-configuring

[core@api ~]$ cat /etc/microshift/config.yaml
dns:
  # Base domain of the cluster. All managed DNS records will be sub-domains of this base.
  baseDomain: crc.testing

network:
  clusterNetwork:
  # IP range for use by the cluster
  #- cidr: 10.42.0.0/16

  serviceNetwork:
  # IP range for services in the cluster
  #- 10.43.0.0/16

  # Node ports allowed for services
  #serviceNodePortRange: 30000-32767

node:
  # If non-empty, use this string to identify the node instead of the hostname
  #hostnameOverride: ''

  # IP address of the node, passed to the kubelet.
  # If not specified, kubelet will use the node's default IP address.
  #nodeIP: ''

apiServer:
  # The Subject Alternative Names for the external certificates in API server (defaults to hostname -A)
  #subjectAltNames: []

debugging:
  # Log verbosity ('Normal', 'Debug', 'Trace', 'TraceAll'):
  #logLevel: 'Normal'

etcd:
  # Memory limit for etcd, in Megabytes: 0 is no limit.
  #memoryLimitMB: 0

So that means it falls back to the default:

  # The Subject Alternative Names for the external certificates in API server (defaults to hostname -A)
  #subjectAltNames: []
[core@api ~]$ hostname -A
bogon api 
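
If the derived names are the problem, a possible (untested) workaround would be to set the SANs explicitly in /etc/microshift/config.yaml instead of relying on the hostname -A default, e.g.:

apiServer:
  # Pin the SANs instead of deriving them from `hostname -A`; the value below
  # is an assumption based on this VM's hostname (sketch only, not verified here)
  subjectAltNames:
    - api.crc.testing

followed by sudo systemctl restart microshift.
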
gbraad commented 1 year ago

I run this on my ThinkCentre Tiny; I reinstalled this machine yesterday. Other VMs on this machine are operational... so I just tried to install crc on it, and this happened :-s

praveenkumar commented 1 year ago

Any idea what `Error in reading or validating configuration: subjectAltNames must not contain localhost, 127.0.0.1` could refer to?

As the error suggests, subjectAltNames shouldn't contain localhost/127.0.0.1, which we don't have as part of the MicroShift configuration. Have you tried rerunning the service after SSHing into the VM?

gbraad commented 1 year ago

Restarting from inside the VM results in the same issue... it can't resolve the hostname to a valid value. After adding this to /etc/hosts:

192.168.130.11 bogon api

it starts.
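
For reference, the same workaround can be applied from the host; a rough sketch using the same key and IP as above:

$ ssh -i ~/.crc/machines/crc/id_ecdsa core@192.168.130.11 \
    "echo '192.168.130.11 bogon api' | sudo tee -a /etc/hosts"
$ ssh -i ~/.crc/machines/crc/id_ecdsa core@192.168.130.11 "sudo systemctl restart microshift"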

praveenkumar commented 1 year ago

Can you check if the crc-dnsmasq service is running inside the VM?

gbraad commented 1 year ago

Yes, it is. As I also posted earlier, only qemu-guest-agent and microshift failed:

[root@api ~]# systemctl status crc-dnsmasq
● crc-dnsmasq.service - Podman container-59a21a926ac5a3dc9a4ce468813397f136e630ee790c9fb665f02871f1cd48bf.service
     Loaded: loaded (/etc/systemd/system/crc-dnsmasq.service; enabled; preset: disabled)
     Active: active (running) since Tue 2023-07-18 03:45:22 EDT; 1h 17min ago
[root@api ~]# podman exec -it crc-dnsmasq bash
[root@59a21a926ac5 /]# cat /etc/dnsmasq.conf 
user=root
port= 53
bind-interfaces
expand-hosts
log-queries
local=/crc.testing/
domain=crc.testing
address=/apps-crc.testing/192.168.130.11
address=/api.crc.testing/192.168.130.11
address=/api-int.crc.testing/192.168.130.11
address=/crc.crc.testing/192.168.122.147
[root@59a21a926ac5 /]# exit
exit
[root@api ~]# cat /etc/resolv.conf 
# Generated by NetworkManager
search crc.testing
nameserver 192.168.130.1
[root@api ~]# 
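
A quick sanity check that those dnsmasq entries actually resolve from inside the VM could look like this (a sketch; output omitted):

[root@api ~]# dig +short api.crc.testing @192.168.130.1
[root@api ~]# dig +short foo.apps-crc.testing @192.168.130.1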
praveenkumar commented 1 year ago

I am not sure then why you need to add api to /etc/hosts, because the following works for me:

$ hostname
api.crc.testing
[core@api ~]$ ping $(hostname)
PING api.crc.testing (192.168.130.11) 56(84) bytes of data.
64 bytes from api (192.168.130.11): icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from api (192.168.130.11): icmp_seq=2 ttl=64 time=0.123 ms
64 bytes from api (192.168.130.11): icmp_seq=3 ttl=64 time=0.133 ms
^C
gbraad commented 1 year ago
[root@api ~]# dig api @192.168.130.1

; <<>> DiG 9.16.23-RH <<>> api @192.168.130.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39304
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;api.               IN  A

;; ANSWER SECTION:
api.            0   IN  A   192.168.130.11

;; Query time: 2 msec
;; SERVER: 192.168.130.1#53(192.168.130.1)
;; WHEN: Tue Jul 18 05:06:54 EDT 2023
;; MSG SIZE  rcvd: 48

[root@api ~]# 

Maybe it is not api, but bogon:

[root@api ~]# dig bogon @192.168.130.1

; <<>> DiG 9.16.23-RH <<>> bogon @192.168.130.1
;; global options: +cmd
;; connection timed out; no servers could be reached

[root@api ~]# hostname
api.crc.testing
[root@api ~]# 

because hostname -A returns

[root@api ~]# hostname -A
bogon bogon api.crc.testing api.crc.testing 

For you?
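
(For context: hostname -A resolves each configured interface address back to a name, so the bogon answer should be reproducible with a reverse lookup. A sketch, with the resolver address an assumption:)

[root@api ~]# dig -x 192.168.130.11 @192.168.130.1
[root@api ~]# dig -x 192.168.122.147 @192.168.130.1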

praveenkumar commented 1 year ago
[core@api ~]$ hostname -A
api.crc.testing api api.crc.testing api.crc.testing

Where does this bogon come from?

praveenkumar commented 1 year ago

So it looks like bogon stands for a bogus IP address (https://apple.stackexchange.com/a/394640), but I am not sure why it is showing up for you.

gbraad commented 1 year ago

Yep... so something in the bring-up fails to work properly.

Did this again just now, and it started:

INFO Creating CRC VM for MicroShift 4.13.3...     
INFO Generating new SSH key pair...               
INFO Starting CRC VM for microshift 4.13.3...     
INFO CRC instance is running with IP 192.168.130.11 
INFO CRC VM is running                            
INFO Updating authorized keys...                  
INFO Configuring shared directories               
INFO Check internal and public DNS query...       
INFO Check DNS query from host...                 
INFO Starting Microshift service... [takes around 1min] 
INFO Waiting for kube-apiserver availability... [takes around 2min] 
INFO Adding microshift context to kubeconfig...   
Started the MicroShift cluster.

Use the 'oc' command line interface:
  $ eval $(crc oc-env)
  $ oc COMMAND

Note: this went wrong 5 times in a row...

praveenkumar commented 1 year ago

This might be an issue with your network-side configuration; this is the first time I am seeing the bogon issue and I am not sure how to fix it on our end (crc/snc side).

gbraad commented 1 year ago

It also seems the VM loses connectivity with Podman over time. It was working with crc podman-env, but after a while it just stops responding.

I think it is related to something with Podman inside the VM:

[core@api ~]$ podman ps
ERRO[0000] invalid internal status, try resetting the pause process with "podman system migrate": could not find any running process: no such process 

which might explain why the name does not resolve.

It shows really weird behaviour at times. Redoing a crc podman-env makes it work again, but after a while it stops again.
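
(The error message itself points at the pause process; something along these lines inside the VM might be worth a try, though I have not verified it here:)

[core@api ~]$ podman system migrate
[core@api ~]$ podman ps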


It seems the containers I start get stopped:

$ podman start tailscale  
tailscale
$ podman ps -a          
CONTAINER ID  IMAGE                                             COMMAND     CREATED         STATUS        PORTS       NAMES
fb0bb4b683d7  ghcr.io/spotsnel/tailscale-systemd/fedora:latest              37 minutes ago  Up 3 seconds              tailscale
$ podman ps -a
CONTAINER ID  IMAGE                                             COMMAND     CREATED         STATUS      PORTS       NAMES
fb0bb4b683d7  ghcr.io/spotsnel/tailscale-systemd/fedora:latest              50 minutes ago  Created                 tailscale
$ podman start tailscale
tailscale
$ podman ps

... and now just hangs again

These containers run in any other environment for days/months without an issue, but inside the VM they get stopped. Podman becomes unresponsive. After a while it 'works' again, but the containers have all stopped.


$ eval $(./crc podman-env) 
$ podman ps               
Cannot connect to Podman. Please verify your connection to the Linux system using `podman system connection list`, or try `podman machine init` and `podman machine start` to manage a new Linux VM
Error: unable to connect to Podman socket: Get "http://d/v4.5.1/libpod/_ping": ssh: rejected: connect failed (open failed)
$ podman ps 
CONTAINER ID  IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES
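
As the error itself suggests, the remote connection can also be inspected from the host side; a sketch:

$ podman system connection list
$ podman --remote info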
gbraad commented 1 year ago

This might be an issue with your network-side configuration,

The VM does not lose connectivity at all. I do a ping (and flood) test and there is no loss.
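
(The check was along these lines; the exact counts are just an example:)

$ ping -c 60 192.168.130.11
$ sudo ping -f -c 1000 192.168.130.11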


Must have been a glitch... Startups work now. It is Podman that is failing badly now. Containers that use systemd do not remain active.

gbraad commented 1 year ago
$ podman ps -a
CONTAINER ID  IMAGE                                             COMMAND     CREATED      STATUS        PORTS       NAMES
fb0bb4b683d7  ghcr.io/spotsnel/tailscale-systemd/fedora:latest              2 hours ago  Up 5 minutes              tailscale
$ podman ps -a
Cannot connect to Podman. Please verify your connection to the Linux system using `podman system connection list`, or try `podman machine init` and `podman machine start` to manage a new Linux VM
Error: unable to connect to Podman socket: Get "http://d/v4.5.1/libpod/_ping": ssh: rejected: connect failed (open failed)
$ podman ps -a
Cannot connect to Podman. Please verify your connection to the Linux system using `podman system connection list`, or try `podman machine init` and `podman machine start` to manage a new Linux VM
Error: unable to connect to Podman socket: Get "http://d/v4.5.1/libpod/_ping": ssh: rejected: connect failed (open failed)
$ ssh -i ~/.crc/machines/crc/id_ecdsa core@192.168.130.11
Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Boot Status is GREEN - Health Check SUCCESS
Last login: Tue Jul 18 07:40:25 2023 from 192.168.130.1
[core@api ~]$ podman ps
CONTAINER ID  IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES
[core@api ~]$ podman ps -a
CONTAINER ID  IMAGE                                             COMMAND     CREATED      STATUS      PORTS       NAMES
fb0bb4b683d7  ghcr.io/spotsnel/tailscale-systemd/fedora:latest              2 hours ago  Stopping                tailscale
[core@api ~]$ 

Are the containers killed because of lingering? podman-env uses an SSH connection to start/stop the containers, which most likely behaves differently when a process/container is started over the socket.

For as long as the container is in the Stopping state, it is not possible to do anything with it externally.
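
If lingering is the cause, something like this inside the VM might keep the user session (and rootless containers) alive after the SSH connection closes; a sketch, not verified here:

[core@api ~]$ loginctl show-user core | grep -i linger
[core@api ~]$ sudo loginctl enable-linger core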

gbraad commented 1 year ago

The network issue must have been some weird glitch. Filing a new issue for the lingering, as it seems this is the cause.