Open RamiAwar opened 2 years ago
Interesting. I have never seen this happen. Can you share the node's logs and maybe the operator's logs? It seems like there's a communication issue or the HTTP service isn't up for some reason. Do you have network policies installed in your cluster that may be impacting things?
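For anyone landing here with the same probe failure, a rough sketch of how the requested logs could be gathered. The namespace and pod names are placeholders (assumptions, not names from this project's charts), and the helper only prints the commands so they can be reviewed before running.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the kubectl commands commonly used
# to inspect a pod whose liveness probe is failing.
# $1 = namespace, $2 = pod name (both placeholders supplied by the caller).
probe_debug_commands() {
  ns="$1"
  pod="$2"
  echo "kubectl -n $ns describe pod $pod"          # probe config and recent events
  echo "kubectl -n $ns logs $pod"                  # current container logs
  echo "kubectl -n $ns logs $pod --previous"       # logs from the last crashed container
  echo "kubectl -n $ns get events --sort-by=.lastTimestamp"
  echo "kubectl -n $ns get networkpolicy"          # check for policies blocking traffic
}
```

Calling `probe_debug_commands my-namespace my-node-pod` lists the commands; piping that output to `sh` would run them against the current kubectl context.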
Good news! After pairing up with a coworker on this, she pointed out that the error logs looked like an M1 build issue. (It was some exec error related to starting up the Redis server.)
After building it on a linux server and pushing the updated image, that fixed things. We also had to use another tag due to the image pull policy (tagging everything as latest wasn't pulling the latest image).
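A minimal sketch of the tagging workaround, assuming the pods use a pull policy that skips the pull when the tag is already cached on the node (for example IfNotPresent). I have not checked what the charts actually set. The helper name and the timestamp scheme are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: build an image reference with a unique tag so that
# every push produces a name the cluster has not cached yet.
# $1 = repository, $2 = optional tag (defaults to a timestamp).
unique_image_ref() {
  repo="$1"
  tag="${2:-$(date +%Y%m%d%H%M%S)}"
  echo "${repo}:${tag}"
}
```

Typical use would be `ref="$(unique_image_ref gcr.io/ourproject/node-for-redis)"`, followed by `docker tag node-for-redis:latest "$ref"` and `docker push "$ref"`, then pointing the deployment at that exact reference.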
Maybe worth documenting somewhere that building on M1 might need a platform build flag or something.
Feel free to close this whenever!
Oh, wow. Glad you got it working! Just curious, how were you building and running things? I have an M1 mac that I work on and haven't run into this issue. But I'm also probably running in a different config than you. I use podman (bc we can't use Docker desktop anymore) and kind for local testing. I build the local images with make container PREFIX= TAG=cin and then load them into kind with kind load docker-image.
Oh I see. Yeah I used docker desktop (we're still a small-ish company) and I directly tested on our hosted kubernetes cluster on GCP.
Commands I used:
docker build -t node-for-redis:latest -f Dockerfile.node .
docker tag node-for-redis gcr.io/ourproject/node-for-redis
docker push gcr.io/ourproject/node-for-redis
I found that equivalent to the make container command. I also used make on the linux server with a prefix and tag and that worked fine too with docker push.
Would it be helpful for me to test whether building for an alternative platform (e.g. amd64) as part of the docker build step fixes things?
e.g. docker build --platform=linux/amd64 -t node-for-redis:amd64 -f Dockerfile.node .
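One way to check what such a build actually produces is to inspect the image's OS and architecture afterwards. The sketch below assumes the cluster nodes are amd64, which the thread suggests but does not confirm. The function names are made up for illustration, and nothing runs until `build_for_cluster` is called.

```shell
#!/bin/sh
# Hypothetical helper: translate a `uname -m` value into a Docker platform string.
platform_for_arch() {
  case "$1" in
    x86_64|amd64)  echo "linux/amd64" ;;
    arm64|aarch64) echo "linux/arm64" ;;
    *)             echo "unknown" ;;
  esac
}

# Build the node image for the target platform and print what was built.
# $1 = target platform; defaults to linux/amd64 (assumed cluster architecture).
build_for_cluster() {
  target="${1:-linux/amd64}"
  host="$(platform_for_arch "$(uname -m)")"
  if [ "$host" != "$target" ]; then
    echo "host is $host, cross-building for $target" >&2
  fi
  docker build --platform="$target" -t node-for-redis:amd64 -f Dockerfile.node . || return 1
  # Prints the OS/architecture recorded in the image metadata.
  docker image inspect --format '{{.Os}}/{{.Architecture}}' node-for-redis:amd64
}
```

If the printed value does not match the architecture of the nodes the pod lands on, an exec error at container start is a likely result.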
Oddly, I've never had to use that option. Will take a look into it but it wouldn't hurt to test out if you have time. I wouldn't think it'd be needed however bc the build environment is already setup in the Dockerfiles. We include the appropriate GOOS and GOARCH settings in the Dockerfile as well. There's definitely something I'm not understanding here.
I think it's a docker runtime thing, idk. So I tried it with --platform=linux/amd64, and building it on an M1 mac that way worked fine.
This is where I got the solution from: https://stackoverflow.com/questions/66920645/exec-format-error-when-running-containers-build-with-apple-m1-chip-arm-based. I didn't use buildx though, just build.
I think we got a bit lucky in this situation as we build our published images through GitHub Actions. I wonder if it makes sense to add --platform=linux/amd64 to our Makefile commands though. Of course then you may have issues running the image locally, which I'm not sure why you'd do other than for testing image changes. I'll test on my M1 and see.
I'm getting a cluster liveness probe failure when trying to deploy the cluster. How can I go about debugging that?
Haven't made any changes to the charts / values, just trying to deploy it as is, that's why I think it's a bug.
The only thing I added to the deployment was a namespace other than default. I built the docker images and pushed them to my cluster as well, and the operator chart installed smoothly. The node however is failing: