kapp-controller fails to come up on a less powerful k3s cluster

aaronshurley commented 2 years ago

What steps did you take: Extracting this specific issue from a previous comment in a different issue from @alvarosanchez so that we can better understand this:

We are facing this same issue when installing kapp-controller (v0.29) in a K3s cluster we use for tests. On our local machines, it fails to come up approximately 20% of the time. In CI (a less powerful environment) it fails 100% of the time.

However, I'm not convinced that it is a matter of bumping the timeout, since when it works, it only takes 5-10 seconds.

Additional context from the requester:

I have tested it several more times, and I've seen the APIService v1alpha1.data.packaging.carvel.dev take 2-3 minutes to become available.

Could this timeout be parameterised so that we can customise it in resource-constrained environments?
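For anyone who wants to measure how long the APIService takes in their own environment, here is a minimal sketch of an external wait with a configurable timeout. It is not kapp-controller's own logic or configuration: the environment variable name `APISERVICE_WAIT_TIMEOUT_SECONDS` and all function names are hypothetical, and it assumes `kubectl` is on the PATH and pointed at the cluster.

```python
import os
import subprocess
import time

# Hypothetical variable name, invented for this sketch only.
TIMEOUT_ENV = "APISERVICE_WAIT_TIMEOUT_SECONDS"
# APIService name taken from the report above.
APISERVICE = "v1alpha1.data.packaging.carvel.dev"


def wait_until(check, timeout, interval=2.0, clock=time.monotonic, sleep=time.sleep):
    """Call check() repeatedly until it returns True or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def apiservice_available(name=APISERVICE):
    """Ask kubectl whether the APIService reports the Available condition as True."""
    result = subprocess.run(
        ["kubectl", "get", "apiservice", name, "-o",
         'jsonpath={.status.conditions[?(@.type=="Available")].status}'],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "True"


def configured_timeout(default=60.0):
    """Read the timeout in seconds from the environment, else use `default`."""
    return float(os.environ.get(TIMEOUT_ENV, default))


def main():
    """Wait for the APIService; return a shell-style exit code."""
    started = time.monotonic()
    ok = wait_until(apiservice_available, configured_timeout())
    print(f"available={ok} after {time.monotonic() - started:.1f}s")
    return 0 if ok else 1
```

Calling `main()` with the variable set to, say, `300` in CI would show whether the APIService eventually becomes available given more time, or never does, which helps separate a slow environment from a broken one.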

What happened: kapp-controller fails some or all of the time

What did you expect: kapp-controller to succeed

Anything else you would like to add: [Additional information that will assist in solving the issue.]

Environment: [Additional information that will assist in solving the issue.]

Vote on this request

This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.

👍 "I would like to see this addressed as soon as possible" 👎 "There are other more important things to focus on right now"

We are also happy to receive and review Pull Requests if you want to help work on this issue.

aaronshurley commented 2 years ago

@alvarosanchez Thanks for bringing this up! We're hoping to get a couple more pieces of information from you:

From this comment:

Could you verify if the apiPort used by kapp-controller is in the allowed list? In our use case the issue was a combination of etcd latency (intermittent failures) and kapp-controller being scheduled on a node that did not have the apiPort in the allowed list.

Alternatively you can use a high port in the node port range to verify if this is a networking issue or a resource issue.
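As a rough way to tell a networking problem from a resource problem, a plain TCP connection test against the node and port in question can help. The sketch below is an assumption-laden illustration, not part of kapp-controller: the function name is hypothetical, and the host and port you pass in are whatever applies to your cluster (the default Kubernetes NodePort range is 30000-32767).

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from the machine or container that needs to reach the port, for example `port_reachable("<node-ip>", 30443)` with your own node address and port. A `False` result only shows that a TCP connection could not be opened from where the script runs; it does not by itself identify which firewall rule or interface is responsible.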

And from a separate private slack thread:

Can you provide full logs from the kapp-controller pod (kapp logs -a kc --lines=10000000) and the output of a deploy (kapp deploy output)?

With this information, we can try to reproduce the issue.

joe-kimmel-vmw commented 2 years ago

@alvarosanchez can you provide the information requested above?

danielhelfand commented 2 years ago

I was unable to reproduce this issue using k3d. kapp-controller installed without any friction. I was able to run it/delete it about 10 times and never hit this problem.

joe-kimmel-vmw commented 2 years ago

@alvarosanchez - any interest in following up on this issue with more information?

benmoss commented 2 years ago

I tried this out with a 2 CPU / 4g memory minikube cluster and things were fine. I think we'll need more information to reproduce since I don't think resource constraints are the only issue.

alvarosanchez commented 2 years ago

@joe-kimmel-vmw sorry, I had notifications disabled and missed your comments.

As of now, I no longer work for VMware, however, @beltran-rubo should be able to handle this.

Again, apologies for missing your comments. Álvaro.

beltran-rubo commented 2 years ago

Hi,

We found networking issues in our CI containers, so that is probably the root of the issue. It was introducing connectivity issues randomly depending on the interface. If we find similar issues in the future I will reopen this ticket.

aaronshurley commented 2 years ago

That sounds good. Thanks for following up @beltran-rubo.

carvel-dev / kapp-controller

kapp-controller fails to come up on a less powerful k3s cluster #444