GoogleCloudPlatform / cloud-ops-sandbox

Cloud Operations Sandbox is an open source collection of tools that helps practitioners to learn O11y and R9y practices from Google and apply them using Cloud Operations suite of tools.
Apache License 2.0

Broken cartservice - busybox `nslookup` misreturns 1 instead of 0 #995

Closed palladius closed 1 year ago

palladius commented 1 year ago

Currently the latest version of https://cloud-ops-sandbox.dev/ is broken.

A fresh install fails on the cartservice.

I've done a long investigation with @alml, written up in [1] and shared with leoy@.

A quick/cheap fix would be good enough.

[1] https://docs.google.com/document/d/1RTEKaDlP9PwoNfKvpAFxjbQYCZ0o5kA9Pj_V26kqu3Y/edit#

palladius commented 1 year ago

From Leonid:

The investigation is still in progress. So far I can confirm that the cause of the problem is the init container in the cartservice pod, which fails. The workaround is to delete the cartservice deployment, ensure that the redis-cart deployment and service are in the ready state, delete the initContainers section from cartservice.yaml (in the kubernetes-manifests/ folder), and re-deploy the cartservice.
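As a rough sketch, the workaround could be scripted like this. It assumes `kubectl` already points at the sandbox cluster, that the resources live in the current namespace, and that it is run from the repository root. The function name `redeploy_cartservice` and the 120s timeout are made up for illustration and are not part of the repo. Removing the initContainers section is still a manual edit; the sketch only refuses to continue if the section is still present.

```shell
#!/bin/sh
# Hypothetical helper, not part of cloud-ops-sandbox.
# Optional argument: path to the cartservice manifest (assumed default below).
redeploy_cartservice() {
  manifest="${1:-kubernetes-manifests/cartservice.yaml}"

  # 1. Delete the cartservice deployment whose init container keeps failing.
  kubectl delete deployment cartservice --ignore-not-found || return 1

  # 2. Check that the redis-cart deployment has rolled out and the service exists.
  kubectl rollout status deployment/redis-cart --timeout=120s || return 1
  kubectl get service redis-cart || return 1

  # 3. Stop here if the initContainers section has not been removed yet.
  if grep -q 'initContainers:' "$manifest"; then
    echo "remove the initContainers section from $manifest first" >&2
    return 1
  fi

  # 4. Re-deploy the cartservice from the edited manifest.
  kubectl apply -f "$manifest"
}
```

After editing the manifest, calling `redeploy_cartservice` with no arguments runs the four steps in order and stops at the first one that fails.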

palladius commented 1 year ago

The part which needs to be removed is:

      initContainers:
      - command:
        - bin/sh
        - -c
        - until nslookup redis-cart; do echo waiting for redis; sleep 2; done;
        image: busybox
        imagePullPolicy: Always
        name: init-redis-ready

palladius commented 1 year ago

Alex and I noticed that this command exits with the correct status in the main container but not in the init container. Here it is in the main container:

Server:     10.28.0.10
Address:    10.28.0.10:53

Non-authoritative answer:
Name:   redis-cart.default.svc.cluster.local
Address: 10.28.2.181

** server can't find redis-cart.svc.cluster.local: NXDOMAIN

** server can't find redis-cart.cluster.local: NXDOMAIN

** server can't find redis-cart.cluster.local: NXDOMAIN

** server can't find redis-cart.svc.cluster.local: NXDOMAIN

** server can't find redis-cart.google.internal: NXDOMAIN

** server can't find redis-cart.google.internal: NXDOMAIN

** server can't find redis-cart.c.cloud-ops-sandbox-2646743255.internal: NXDOMAIN

** server can't find redis-cart.c.cloud-ops-sandbox-2646743255.internal: NXDOMAIN

/app # echo $?
0

It would incorrectly return 1 in the init container (where the shell environment was slightly different; maybe a different version of busybox?). Leonid suggests it might be a bug in busybox, and I agree.
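To compare the two containers side by side, something like the following could help. It is only a sketch: `nslookup_status` is a hypothetical helper, the pod and container names have to be filled in by hand, and it assumes the init container is still looping so that `kubectl exec` can attach to it.

```shell
# Hypothetical helper: print the exit status of `nslookup redis-cart`
# as seen from one container of a pod.
# Usage: nslookup_status <pod> <container>
nslookup_status() {
  pod="$1"
  container="$2"
  # Discard the lookup output and print only the exit status.
  kubectl exec "$pod" -c "$container" -- \
    sh -c 'nslookup redis-cart >/dev/null 2>&1; echo "$?"'
}
```

Running it once with `init-redis-ready` and once with the main container's name should show the 1 versus 0 difference described above.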

palladius commented 1 year ago

I can confirm this change works:

      initContainers:
        - name: init-redis-ready-riccardo
          # There is a bug in busybox that prevents us from returning 0 when redis is available and multiple addresses are in /etc/resolv.conf :/
          image: busybox
          command: ['bin/sh', '-c', 'until nslookup redis-cart|grep Address: ; do echo Waiting for redis BUG in busybox; sleep 2; done;']
          #command: ['bin/sh', '-c', 'echo OK Ric04 just ok']
      containers:
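A likely reason the grep variant works: the exit status of a shell pipeline is that of its last command, so the loop now depends on whether grep saw an `Address:` line and no longer on the status nslookup itself returns. A minimal sketch, using a stand-in function (`fake_nslookup`, hypothetical) that prints an answer but exits with 1, as the init container's nslookup reportedly did:

```shell
# Stand-in for the misbehaving nslookup: prints a valid answer, exits with 1.
fake_nslookup() {
  printf 'Server:     10.28.0.10\n'
  printf 'Address:    10.28.0.10:53\n\n'
  printf 'Name:   redis-cart.default.svc.cluster.local\n'
  printf 'Address: 10.28.2.181\n'
  return 1
}

# Status of the command on its own.
plain_status=0
fake_nslookup >/dev/null || plain_status=$?

# Status of the pipeline, which is the status of grep.
piped_status=0
fake_nslookup | grep -q 'Address:' || piped_status=$?

echo "plain=$plain_status piped=$piped_status"
# prints: plain=1 piped=0
```

One caveat: in the output pasted earlier, `Address:` also appears in the DNS server header, so this pattern may match even when redis-cart itself does not resolve. Matching on `Name:` might be a stricter check, but that is a guess and was not tested here.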

palladius commented 1 year ago

I'll now also try the 1.28 version, as described here: https://www.linkedin.com/pulse/busybox-nslookup-bug-gary-tay/

palladius commented 1 year ago

YES! Pinning the image to busybox:1.28 also works, with the plain nslookup loop and no grep:

    - name: init-redis-ready-riccardo128
      # There is a bug in busybox that prevents us from returning 0 when redis is available and multiple addresses are in /etc/resolv.conf
      image: busybox:1.28
      #command: ['bin/sh', '-c', 'until nslookup redis-cart|grep Address: ; do echo Waiting for redis BUG in busybox; sleep 2; done;']
      command: ['bin/sh', '-c', 'until nslookup redis-cart ; do echo Waiting for redis BUG in busybox; sleep 2; done;']