Readiness probe failure on cluster deployment

IBM / operator-for-redis-cluster

IBM Operator for Redis Cluster

https://ibm.github.io/operator-for-redis-cluster

MIT License


Readiness probe failure on cluster deployment #61

Open RamiAwar opened 2 years ago

RamiAwar commented 2 years ago

I'm getting liveness and readiness probe failures on the cluster node when trying to deploy the cluster. How can I go about debugging that?

I haven't made any changes to the charts or values and am just trying to deploy it as is, which is why I think it's a bug.

The only thing I added to the deployment was a namespace other than default. I built the docker images and pushed them to my cluster as well, and the operator chart installed smoothly. The node, however, is failing:

28m         Normal    Created             pod/rediscluster-cluster-node-for-redis-ntf9p   Created container redis-node
28m         Normal    Started             pod/rediscluster-cluster-node-for-redis-ntf9p   Started container redis-node
3m52s       Warning   Unhealthy           pod/rediscluster-cluster-node-for-redis-ntf9p   Liveness probe failed: HTTP probe failed with statuscode: 503
27m         Warning   Unhealthy           pod/rediscluster-cluster-node-for-redis-ntf9p   Readiness probe failed: HTTP probe failed with statuscode: 503

cin commented 2 years ago

Interesting. I have never seen this happen. Can you share the node's logs and maybe the operator's logs? It seems like there's a communication issue or the HTTP service isn't up for some reason. Do you have network policies installed in your cluster that may be impacting things?
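For anyone following along, here is a minimal sketch of how the requested node logs and probe events could be gathered with kubectl. The helper names collect_diagnostics and probe_failures are made up for illustration, and the namespace in the usage comment is a placeholder, since the thread does not say which namespace was used.

```shell
# Hypothetical helper: dump the description, logs, and Unhealthy events
# for one pod. Arguments: namespace, pod name.
collect_diagnostics() {
  ns="$1"
  pod="$2"
  kubectl -n "$ns" describe pod "$pod"
  kubectl -n "$ns" logs "$pod" --all-containers
  kubectl -n "$ns" get events --field-selector reason=Unhealthy
}

# Hypothetical helper: keep only probe-failure lines from event output
# read on stdin (for example the output of `kubectl get events`).
probe_failures() {
  grep -E '(Liveness|Readiness) probe failed'
}

# Usage (pod name from the events above; namespace is a placeholder):
#   collect_diagnostics my-namespace rediscluster-cluster-node-for-redis-ntf9p
#   kubectl -n my-namespace get events | probe_failures
```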

RamiAwar commented 2 years ago

Good news! After pairing up with a coworker on this, she pointed out that the error logs looked like an M1 build issue (it was some exec error related to starting up the Redis server).

Building the image on a Linux server and pushing the updated image fixed things. We also had to use a different tag because of the image pull policy (re-pushing everything as latest wasn't pulling the latest image).
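One way to sidestep the tag problem is to give every build its own tag. With an imagePullPolicy of IfNotPresent, a node that already has a tag cached does not pull it again; whether that was the policy in effect here is an assumption, since the thread only mentions "the image pull policy". The sketch below uses two hypothetical helpers, image_ref and unique_tag, and the registry path from the commands in this thread.

```shell
# Build a full image reference from a registry prefix, image name, and tag.
image_ref() {
  printf '%s/%s:%s\n' "$1" "$2" "$3"
}

# Pick a tag that changes with every build: the short git commit when run
# inside a git checkout, otherwise a UTC timestamp.
unique_tag() {
  git rev-parse --short HEAD 2>/dev/null || date -u +%Y%m%d%H%M%S
}

# Usage:
#   IMAGE="$(image_ref gcr.io/ourproject node-for-redis "$(unique_tag)")"
#   docker build -t "$IMAGE" -f Dockerfile.node .
#   docker push "$IMAGE"
```

The new tag then has to be referenced wherever the chart values specify the image tag.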

Maybe worth documenting somewhere that building on M1 might need a platform build flag or something.

Feel free to close this whenever!

cin commented 2 years ago

Oh, wow. Glad you got it working! Just curious, how were you building and running things? I have an M1 mac that I work on and haven't run into this issue. But I'm also probably running in a different config than you. I use podman (bc we can't use Docker desktop anymore) and kind for local testing. I build the local images with make container PREFIX= TAG=cin and then load them into kind with kind load docker-image.

RamiAwar commented 2 years ago

Oh I see. Yeah I used docker desktop (we're still a small-ish company) and I directly tested on our hosted kubernetes cluster on GCP.

Commands I used:

docker build -t node-for-redis:latest -f Dockerfile.node .
docker tag node-for-redis gcr.io/ourproject/node-for-redis
docker push gcr.io/ourproject/node-for-redis

RamiAwar commented 2 years ago

I found that equivalent to the make container command. I also used make on the linux server with a prefix and tag and that worked fine too with docker push.

RamiAwar commented 2 years ago

Would it be helpful for me to test whether building for an alternative platform (e.g. amd64) as part of the docker build step fixes things?

e.g. docker build --platform=linux/amd64 -t node-for-redis:amd64 -f Dockerfile.node .

cin commented 2 years ago

Oddly, I've never had to use that option. I'll take a look into it, but it wouldn't hurt to test it out if you have time. I wouldn't think it'd be needed, however, because the build environment is already set up in the Dockerfiles. We include the appropriate GOOS and GOARCH settings in the Dockerfile as well. There's definitely something I'm not understanding here.

RamiAwar commented 2 years ago

I think it's a docker runtime thing, idk. So I tried it with --platform=linux/amd64, and the image built on an M1 Mac worked fine.
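A quick way to confirm this kind of mismatch before pushing is to compare the platform recorded in the local image with the architecture the cluster nodes report. This is a rough sketch: image_platform and check_platform are hypothetical helper names, and linux/amd64 as the target is an assumption based on the amd64 build working in this thread.

```shell
# Print the OS/architecture recorded in a local image, e.g. linux/arm64.
image_platform() {
  docker image inspect --format '{{.Os}}/{{.Architecture}}' "$1"
}

# Compare an actual platform ($1) with the expected one ($2).
# Prints "ok: ..." on a match; warns on stderr and returns 1 otherwise.
check_platform() {
  if [ "$1" = "$2" ]; then
    echo "ok: $1"
  else
    echo "mismatch: image is $1, expected $2" >&2
    return 1
  fi
}

# Usage (image name from this thread; the target platform is an assumption):
#   check_platform "$(image_platform node-for-redis:latest)" linux/amd64
# The node architecture can be read from the cluster with:
#   kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.architecture}'
```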

RamiAwar commented 2 years ago

This is where I got the solution from: https://stackoverflow.com/questions/66920645/exec-format-error-when-running-containers-build-with-apple-m1-chip-arm-based. I didn't use buildx, however, just docker build.

cin commented 2 years ago

I think we got a bit lucky in this situation as we build our published images through GitHub Actions. I wonder if it makes sense to add --platform=linux/amd64 to our Makefile commands though. Of course then you may have issues running the image locally, which I'm not sure why you'd do other than for testing image changes. I'll test on my M1 and see.