envoyproxy / gateway

Manages Envoy Proxy as a Standalone or Kubernetes-based Application Gateway
https://gateway.envoyproxy.io
Apache License 2.0

Allow setting envoy proxy service spec.loadbalancerClass #1793

Closed wondersd closed 1 year ago

wondersd commented 1 year ago

Description:

For EKS clusters in AWS, it's recommended to use the aws-load-balancer-controller over the in-tree LoadBalancer provider.

However, as of aws-load-balancer-controller 2.5, it is no longer compatible with envoy-gateway.

The reason is an added mutating webhook that sets spec.loadBalancerClass: service.k8s.aws/nlb. Once this has been set, the envoy-gateway controller can no longer converge the Service object and is "stuck":

2023-08-16T17:18:54.170Z    ERROR   infrastructure  runner/runner.go:70 failed to create new infra  {"runner": "infrastructure", "error": "failed to create or update service ... : for Update: Service \"...\" is invalid: spec.loadBalancerClass: Invalid value: \"null\": may not change once set", "errorVerbose": "Service \"...\" is invalid: spec.loadBalancerClass: Invalid value: \"null\": may not change once set....
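For illustration, the webhook leaves the EG-managed Service looking roughly like this (a sketch; the name is a placeholder, EG generates the real one):

apiVersion: v1
kind: Service
metadata:
  name: envoy-example                        # placeholder name
  namespace: envoy-gateway-system
spec:
  type: LoadBalancer
  loadBalancerClass: service.k8s.aws/nlb     # injected by the webhook; immutable once set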

For now it seems it's possible to disable the mutating webhook in the aws-load-balancer-controller and rely on annotations to defer control to the aws-load-balancer-controller.
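A sketch of that workaround via the EnvoyProxy resource (annotation names come from the aws-load-balancer-controller docs; verify them against your controller version):

apiVersion: config.gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: envoy-gateway-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: LoadBalancer
        annotations:
          # defer this Service to the aws-load-balancer-controller
          # instead of the in-tree provider
          service.beta.kubernetes.io/aws-load-balancer-type: external
          service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip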

Though I think it would be a bit nicer to be able to configure this aspect of the generated Service object.

Relevant Links:

arkodg commented 1 year ago
wondersd commented 1 year ago

Using latest EG, 0.5.0

A mutating webhook that mutates an object not owned by the AWS LB controller (EG owns the managed Service here) is something that doesn't seem right; it feels like this needs to be disabled.

The mutating webhook is a mechanism to ease the transition off of the in-tree load balancer provider for EKS/AWS, as a means to change the default at a per-cluster level. I would very much like to keep it on, and would prefer to use spec.loadBalancerClass to choose LB implementations.

I think it's also important to note that while EG owns the Service, that API is an abstraction whose implementation depends on the target cluster, and it's the cluster's operators who need to be able to configure those implementation details to suit their needs. EG can't presume anything about that implementation, or whether it needs configuring at all, leaving its best course of action to simply pass such implementation configuration through to the end user.

@qicz ( @wondersd can you share the EG version) can you help investigate why EG's infra layer is throwing an error? It should just update the service again (in this case in an infinite loop with the mutating webhook controller) without throwing any error logs.

spec.loadBalancerClass is an immutable property of the Service object. When EG goes to reconcile the Service after its initial creation, it sets that field to null (probably just the default value of the model object), which is not the mutated value. The control plane API refuses its attempts to set the field back to null. This is why it's not an infinite loop but rather "stuck", meaning in this state EG loses all control over the object. That is how I noticed something was wrong: changes to listeners were not applying.
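One common way a controller avoids this failure mode is to carry server-set immutable fields over from the live object into the desired spec before issuing the update. A minimal sketch of that idea (plain structs, not EG's actual code or the real Kubernetes types):

```go
package main

import "fmt"

// ServiceSpec is a stand-in for the relevant slice of a Kubernetes
// Service spec; LoadBalancerClass is immutable once set server-side.
type ServiceSpec struct {
	Type              string
	LoadBalancerClass *string
}

// preserveImmutable copies immutable fields from the live (in-cluster)
// spec into the desired spec when the desired spec leaves them unset,
// so the update is not rejected with "may not change once set".
func preserveImmutable(live, desired ServiceSpec) ServiceSpec {
	if desired.LoadBalancerClass == nil {
		desired.LoadBalancerClass = live.LoadBalancerClass
	}
	return desired
}

func main() {
	class := "service.k8s.aws/nlb" // value injected by the mutating webhook
	live := ServiceSpec{Type: "LoadBalancer", LoadBalancerClass: &class}
	desired := ServiceSpec{Type: "LoadBalancer"} // controller's rendered spec: field unset

	merged := preserveImmutable(live, desired)
	fmt.Println(*merged.LoadBalancerClass) // prints "service.k8s.aws/nlb"
}
```

The trade-off is exactly the "model 3" fragility discussed below: the controller must know which fields it is not allowed to reconcile.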

arkodg commented 1 year ago

thanks for root causing the issue @wondersd .

We discussed issues similar to this in the community meeting a few weeks ago and the current consensus is

  1. EG owns and reconciles all the fields of the resources it creates
  2. EG will support a model where users can manage the resource and link it to EG https://github.com/envoyproxy/gateway/issues/1379

The 3rd possible model, where EG owns some fields in the resource and other actors can edit the rest, is something we don't want to support at this point. This makes the system fragile and hard to maintain: EG would have to keep continuous track of which fields to ignore when reconciling the resource.

wondersd commented 1 year ago

I would very much not like to ever use option 2. This, in my opinion, cuts out all the magic, and quite frankly the usefulness, of EG as a Gateway API implementation.

Take my current example of Gateway listeners being propagated to the envoy proxy Service and Pod resources as exposed container ports and backend mappings. With a "linked" fleet, such a change to a Gateway would now have to be twofold: first update the fleet structure, then the Gateway resource. Otherwise one would have to support option 3, in that EG would be able to modify certain fields/structural elements of the linked fleet, which I'm inclined to agree is not something anyone should want to support.

I think option 1 is the best, and it is not in conflict with the ask here. I would like to expand what I can customize on the resources under management, and passing through such configuration, especially when there is no possible conflict, seems pretty straightforward. Conversely, attributes that can be in conflict would require more consideration, e.g. user-defined ports on the Service object may conflict with EG's, but perhaps that's not necessarily bad, like overriding appProtocol to match the expected traffic, or nodePort in environments where port allocation on nodes requires consideration.

This notion exists within EG already: one is allowed to swap the Service type from ClusterIP to NodePort to LoadBalancer. What's missing is being allowed to modify the specific configuration for said implementations. That seems like an awkward middle ground: it's clear why a user would want to change the service type, but not clear why they wouldn't then want to change the attributes of that chosen type. The same reasoning applies to being able to modify attributes of the envoy proxy pods, like nodeSelectors, topologySpreadConstraints, or priorityClass. None of these matter to EG functionally, but they are very important to define with respect to the target cluster. Do I have a cluster with certain failure domains that I want proxy pods to align to? Do I have a cluster where pods accessible from a public load balancer must be hosted on a certain node group or network segment? Is the envoy proxy a critical service that I want to take priority over other pods? These are clearly not important to the functionality of EG, but are absolutely important to ultimately deploying to a cluster.
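For example, the kind of pod-level pass-through I have in mind would look something like this (field names under envoyDeployment.pod are illustrative and should be checked against the EnvoyProxy API for your EG version):

apiVersion: config.gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: envoy-gateway-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyDeployment:
        pod:
          # hypothetical scheduling knobs passed straight through to the pods
          nodeSelector:
            node-group: public-ingress
          tolerations:
            - key: public-ingress
              operator: Exists
              effect: NoSchedule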

arkodg commented 1 year ago

this was discussed in the community meeting today, and no one opposed adding this field, so adding this issue to our v0.6.0 milestone
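Sketching what the requested knob could look like on the EnvoyProxy resource (field placement is a guess at the time of this comment; check the v0.6.0 API for the final shape):

apiVersion: config.gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: envoy-gateway-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: LoadBalancer
        loadBalancerClass: service.k8s.aws/nlb   # proposed pass-through field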

borederman commented 1 year ago

I encountered a similar issue when using the CLB product of Alibaba Cloud.

apiVersion: config.gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: envoy-gateway-system
spec:
  provider:
    kubernetes:
      envoyService:
        annotations:
          service.beta.kubernetes.io/alibaba-cloud-loadbalancer-master-zoneid: cn-hongkong-b
          service.beta.kubernetes.io/alibaba-cloud-loadbalancer-name: gw
          service.beta.kubernetes.io/alibaba-cloud-loadbalancer-vswitch-id: vsw-xxxx
        type: LoadBalancer
    type: Kubernetes

ERROR log

2023-08-25T02:56:53.192Z        ERROR    infrastructure  runner/runner.go:70     failed to create new infra      {"runner": "infrastructure", "error": "failed to create or update service envoy-gateway-system/envoy-envoy-gateway-system-xxxx: for Update: Service \"envoy-envoy-gateway-system-xxxx\" is invalid: metadata.resourceVersion: Invalid value: \"\": must be specified for an update", "errorVerbose": "Service \"envoy-envoy-gateway-system-xxxx\" is invalid: metadata.resourceVersion: Invalid value: \"\": must be specified for an update\nfor Update\ngithub.com/envoyproxy/gateway/internal/infrastructure/kubernetes.(*InfraClient).CreateOrUpdate.func1\n\t/home/runner/work/gateway/gateway/internal/infrastructure/kubernetes/infra_client.go:43\nk8s.io/client-go/util/retry.OnError.func1\n\t/home/runner/go/pkg/mod/k8s.io/client-go@v0.27.4/util/retry/util.go:51\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.27.4/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.27.4/pkg/util/wait/backoff.go:461\nk8s.io/client-go/util/retry.OnError\n\t/home/runner/go/pkg/mod/k8s.io/client-go@v0.27.4/util/retry/util.go:50\nk8s.io/client-go/util/retry.RetryOnConflict\n\t/home/runner/go/pkg/mod/k8s.io/client-go@v0.27.4/util/retry/util.go:104\ngithub.com/envoyproxy/gateway/internal/infrastructure/kubernetes.(*InfraClient).CreateOrUpdate\n\t/home/runner/work/gateway/gateway/internal/infrastructure/kubernetes/infra_client.go:29\ngithub.com/envoyproxy/gateway/internal/infrastructure/kubernetes.(*Infra).createOrUpdateService\n\t/home/runner/work/gateway/gateway/internal/infrastructure/kubernetes/infra_resource.go:96\ngithub.com/envoyproxy/gateway/internal/infrastructure/kubernetes.(*Infra).createOrUpdate\n\t/home/runner/work/gateway/gateway/internal/infrastructure/kubernetes/infra.go:67\ngithub.com/envoyproxy/gateway/internal/infrastructure/kubernetes.(*
Infra).CreateOrUpdateProxyInfra\n\t/home/runner/work/gateway/gateway/internal/infrastructure/kubernetes/proxy_infra.go:27\ngithub.com/envoyproxy/gateway/internal/infrastructure/runner.(*Runner).subscribeToProxyInfraIR.func1\n\t/home/runner/work/gateway/gateway/internal/infrastructure/runner/runner.go:69\ngithub.com/envoyproxy/gateway/internal/message.HandleSubscription[...]\n\t/home/runner/work/gateway/gateway/internal/message/watchutil.go:36\ngithub.com/envoyproxy/gateway/internal/infrastructure/runner.(*Runner).subscribeToProxyInfraIR\n\t/home/runner/work/gateway/gateway/internal/infrastructure/runner/runner.go:58\nruntime.goexit\n\t/opt/hostedtoolcache/go/1.20.6/x64/src/runtime/asm_amd64.s:1598\nfailed to create or update service envoy-gateway-system/envoy-envoy-gateway-system-xxxx\ngithub.com/envoyproxy/gateway/internal/infrastructure/kubernetes.(*Infra).createOrUpdate\n\t/home/runner/work/gateway/gateway/internal/infrastructure/kubernetes/infra.go:68\ngithub.com/envoyproxy/gateway/internal/infrastructure/kubernetes.(*Infra).CreateOrUpdateProxyInfra\n\t/home/runner/work/gateway/gateway/internal/infrastructure/kubernetes/proxy_infra.go:27\ngithub.com/envoyproxy/gateway/internal/infrastructure/runner.(*Runner).subscribeToProxyInfraIR.func1\n\t/home/runner/work/gateway/gateway/internal/infrastructure/runner/runner.go:69\ngithub.com/envoyproxy/gateway/internal/message.HandleSubscription[...]\n\t/home/runner/work/gateway/gateway/internal/message/watchutil.go:36\ngithub.com/envoyproxy/gateway/internal/infrastructure/runner.(*Runner).subscribeToProxyInfraIR\n\t/home/runner/work/gateway/gateway/internal/infrastructure/runner/runner.go:58\nruntime.goexit\n\t/opt/hostedtoolcache/go/1.20.6/x64/src/runtime/asm_amd64.s:1598"}