garden-io / garden

Automation for Kubernetes development and testing. Spin up production-like environments for development, testing, and CI on demand. Use the same configuration and workflows at every step of the process. Speed up your builds and test runs via shared result caching
https://garden.io
Mozilla Public License 2.0

In-cluster building doesn't work on DigitalOcean (doks) #877

Closed edvald closed 4 years ago

edvald commented 5 years ago

Bug

Current Behavior

Currently, in-cluster building doesn't work with DigitalOcean doks clusters.

The garden-system services all deploy as normal, but the cluster is unable to pull from the in-cluster registry. The proximate cause appears to be that hostPort pods (through DaemonSets) can't be reached (connection refused), even for other services. I've tried all manner of things and am stuck on fixing this.
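To tell a refused connection apart from a dropped one, a small probe run on a node can help. This is only a sketch for diagnosis and not part of Garden; the function name `probe_tcp` is hypothetical.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        # create_connection completes the TCP handshake; the socket is closed on exit.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: nothing accepts connections on this port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout, which usually means packets are being dropped.
        return "timeout"
    except OSError:
        # Any other socket-level failure (unreachable network, bad address, ...).
        return "error"
```

Calling `probe_tcp("127.0.0.1", 5000)` on an affected node would be expected to return `"refused"` if the hostPort is not actually bound, assuming the registry proxy is meant to listen on port 5000 of each node.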

Expected behavior

For in-cluster building to work, same as on GKE, AKS etc.

Reproducible example

Try configuring a doks cluster environment in the demo-project and deploying the project. It will eventually fail with ImagePullBackOff because the cluster is unable to reach the in-cluster registry.

Workaround

Use the default local-docker build mode when deploying to doks clusters.

Suggested solution(s)

We need to reach out to DO to figure out why hostPort services refuse connections.

Your environment

Latest master.

clems71 commented 5 years ago

Hey there!

Same thing on my side with a kops-provisioned cluster on AWS, using default kops settings and no fancy networking. I'm getting basically the same error. Everything deployed properly, and cluster-init ran fine as well. It only fails when I try to deploy or dev:

Error deploying backend: ImagePullBackOff - Back-off pulling image "127.0.0.1:5000/demo-project/backend:v-5b72e91a7c"

clems71 commented 5 years ago

After digging further into the logs of the registry-proxy DaemonSet, I found that requests are being filtered out because of the range option passed to socat. Here is a sample log line:

2019/06/28 08:42:51 socat[8] W refusing connection from AF=2 100.96.3.1:40244 due to range option

Initial command in registry-proxy:

socat -d TCP-LISTEN:5000,fork,range=10.0.0.0/8 TCP:garden-docker-registry.garden-system.svc.cluster.local:5000

I updated it for the moment to make it work by removing the range param (I'm sure there are some security implications I'm not aware of, but at least it makes things work):

socat -d TCP-LISTEN:5000,fork TCP:garden-docker-registry.garden-system.svc.cluster.local:5000

HTH, Cheers
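The refusal in the log line above follows directly from the CIDR arithmetic: the source address 100.96.3.1 is outside 10.0.0.0/8, so socat's range check rejects it. A minimal sketch of that check, using Python's standard `ipaddress` module (the helper name `allowed_by_range` is made up for illustration):

```python
import ipaddress


def allowed_by_range(source_ip: str, allowed_range: str) -> bool:
    """Return True if source_ip falls inside the CIDR block allowed_range."""
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(allowed_range)


# Source address and range taken from the log line and socat command in this thread.
print(allowed_by_range("100.96.3.1", "10.0.0.0/8"))  # False

# 10.4.0.7 is a made-up address, shown only for contrast.
print(allowed_by_range("10.4.0.7", "10.0.0.0/8"))  # True
```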

valerauko commented 5 years ago

I'm facing the same issue. Sadly, removing the range option didn't resolve it (or maybe I removed it from the wrong spec).

eddiezane commented 5 years ago

:wave: Eddie from the DigitalOcean DevRel team here.

TLDR: This should be fixed with a new DOKS image shipping this week.

This is due to the currently used version of our CNI (Cilium) not supporting hostPort out of the box. A newer version adds a flag that makes enabling it easy. A new version of DOKS should be shipping this week that enables this.

I've been told this can be used as a workaround for the time being: https://github.com/snormore/cilium-portmap.

edvald commented 5 years ago

@clems71 Ah, we probably need to dynamically work out the correct address range. Key thing was to make sure we're not allowing outside traffic accidentally, as a side-effect of our little trickery to get the in-cluster registry going. I'll dig into this, see how we might best solve this across the board.
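One way to picture "working out the address range dynamically" is to keep the range option but fill it in per cluster instead of hard-coding 10.0.0.0/8. The sketch below only illustrates that idea; it is not how Garden or #930 implements it. The function name `build_proxy_command` is hypothetical, and how the cluster's range would be discovered is left out entirely.

```python
import ipaddress
from typing import List

# Target taken from the socat command quoted earlier in this thread.
REGISTRY_TARGET = "garden-docker-registry.garden-system.svc.cluster.local:5000"


def build_proxy_command(allowed_range: str) -> List[str]:
    """Build the registry-proxy socat argv with a caller-supplied source range."""
    # Validate the CIDR first; ip_network raises ValueError on malformed input.
    network = ipaddress.ip_network(allowed_range)
    return [
        "socat",
        "-d",
        f"TCP-LISTEN:5000,fork,range={network.with_prefixlen}",
        f"TCP:{REGISTRY_TARGET}",
    ]


# 100.96.0.0/11 is an assumed example value that happens to cover 100.96.3.1.
print(" ".join(build_proxy_command("100.96.0.0/11")))
```

Keeping a range at all, rather than dropping the option, preserves the original intent of not accepting traffic from outside the cluster.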

@eddiezane Thanks for the quick response! Once it's released, I expect I need to update my existing cluster(s)?

eddiezane commented 5 years ago

@edvald you should be able to upgrade your cluster to the latest patch version once it lands via the console or the automatic maintenance window. Fix should be baked into all images/minor versions.

edvald commented 5 years ago

@eddiezane That's awesome, thanks again for the fast response!

@clems71 your issue is something we need to fix on our side, we'll figure it out for our next patch release this week.

clems71 commented 5 years ago

Awesome thanks!

edvald commented 5 years ago

@clems71 I believe #930 solves your issue. Just tried it myself on a kops cluster and seems to do the trick. It'll be in v0.10.1, so you can get rid of the workaround then .)

clems71 commented 5 years ago

Ok duly noted! Will check ASAP. Thanks for the feedback.

edvald commented 5 years ago

@eddiezane I just checked with the latest version (1.14.3-do.0) and still have the same issue. Do you have an issue filed that we could track?

eddiezane commented 5 years ago

@edvald ack. Just pinged the team again.

timoreimann commented 5 years ago

@edvald we recently created https://github.com/digitalocean/DOKS to allow DOKS users to create issues and generally get in touch with our team. Feel free to file a bug report so that we can keep you posted on any updates. (I know some of my colleagues are already looking into the issue.)

Thanks!

solomonope commented 4 years ago

Hello,

I tested this with a DigitalOcean K8s cluster, version 1.16.2-do.0. I was able to build and deploy.

edvald commented 4 years ago

Cool! Could you take a quick look at #995 as well @solomonope?