aws / aws-application-networking-k8s

A Kubernetes controller for Amazon VPC Lattice
https://www.gateway-api-controller.eks.aws.dev/
Apache License 2.0

No metrics to indicate reconciliation failures #601

Closed shulin-sq closed 6 months ago

shulin-sq commented 9 months ago

Hello,

I'm using the metrics from :8080 and can't seem to find a good one that indicates reconciliation failures.

For example, runtime_reconcile_errors is 0 despite there being reconciliation errors in my controller. [Screenshot 2024-02-14 at 1 21 13 PM]

I also curled the port to confirm that this isn't an issue with the Datadog agent that pulls the metrics:

☁  ~  curl -s localhost:8080/metrics | grep error
# HELP certwatcher_read_certificate_errors_total Total number of certificate read errors
# TYPE certwatcher_read_certificate_errors_total counter
certwatcher_read_certificate_errors_total 0
# HELP controller_runtime_reconcile_errors_total Total number of reconciliation errors per controller
# TYPE controller_runtime_reconcile_errors_total counter
controller_runtime_reconcile_errors_total{controller="accesslogpolicy"} 0
controller_runtime_reconcile_errors_total{controller="gateway"} 0
controller_runtime_reconcile_errors_total{controller="gatewayclass"} 0
controller_runtime_reconcile_errors_total{controller="grpcroute"} 0
controller_runtime_reconcile_errors_total{controller="httproute"} 0
controller_runtime_reconcile_errors_total{controller="iamauthpolicy"} 0
controller_runtime_reconcile_errors_total{controller="pod"} 0
controller_runtime_reconcile_errors_total{controller="service"} 0
controller_runtime_reconcile_errors_total{controller="serviceexport"} 0
controller_runtime_reconcile_errors_total{controller="serviceimport"} 0
controller_runtime_reconcile_errors_total{controller="targetgrouppolicy"} 0
controller_runtime_reconcile_errors_total{controller="vpcassociationpolicy"} 0
controller_runtime_reconcile_total{controller="accesslogpolicy",result="error"} 0
controller_runtime_reconcile_total{controller="gateway",result="error"} 0
controller_runtime_reconcile_total{controller="gatewayclass",result="error"} 0
controller_runtime_reconcile_total{controller="grpcroute",result="error"} 0
controller_runtime_reconcile_total{controller="httproute",result="error"} 0
controller_runtime_reconcile_total{controller="iamauthpolicy",result="error"} 0
controller_runtime_reconcile_total{controller="pod",result="error"} 0
controller_runtime_reconcile_total{controller="service",result="error"} 0
controller_runtime_reconcile_total{controller="serviceexport",result="error"} 0
controller_runtime_reconcile_total{controller="serviceimport",result="error"} 0
controller_runtime_reconcile_total{controller="targetgrouppolicy",result="error"} 0
controller_runtime_reconcile_total{controller="vpcassociationpolicy",result="error"} 0

A search through our logs shows that servicemanager.upsert is failing at 5 rps for services that were created before we added a custom domain name to the gateway. These services always fail to reconcile because they need to be recreated.

zijun726911 commented 7 months ago

I suspect that we swallow the error here and don't return it to controller-runtime:

    retryErr := NewRetryError()
    if errors.As(err, &retryErr) {
        return ctrl.Result{RequeueAfter: time.Second * 20}, nil
    }

https://github.com/aws/aws-application-networking-k8s/blob/ac2bdb52a7081ac1fc5b5a98646ad21d3d9a46ec/pkg/runtime/reconcile.go#L19

liwenwu-amazon commented 7 months ago

Here is what a reconcile loop looks like: [image: eks-internal]

The current logic always bubbles up LATTICE_RETRY whenever the reconciler encounters any issue, which triggers the controller to reconcile the same event every 20 seconds. It does not distinguish between waiting for a job to finish (e.g. creating a target group before programming the listener rules) and an incomplete configuration (e.g. the custom domain name issue you ran into).

liwenwu-amazon commented 7 months ago

I am thinking of a POC that emits metrics similar to the AWS Load Balancer Controller's: metrics on all Lattice API calls, with labels indicating the status code and error code returned by each call.
In your particular example (the custom domain name issue), a metric with that specific error code would be incremented.

liwenwu-amazon commented 7 months ago

Here is the POC output:

curl -s localhost:8080/metrics | grep aws_api_requests_total
# HELP aws_api_requests_total Total number of HTTP requests that the SDK made
# TYPE aws_api_requests_total counter
aws_api_requests_total{error_code="",operation="CreateListener",service="VPC Lattice",status_code="201"} 1
aws_api_requests_total{error_code="",operation="CreateRule",service="VPC Lattice",status_code="201"} 1
aws_api_requests_total{error_code="",operation="CreateService",service="VPC Lattice",status_code="201"} 1
aws_api_requests_total{error_code="",operation="CreateServiceNetworkServiceAssociation",service="VPC Lattice",status_code="200"} 1
aws_api_requests_total{error_code="",operation="CreateTargetGroup",service="VPC Lattice",status_code="201"} 1
aws_api_requests_total{error_code="",operation="GetResources",service="Resource Groups Tagging API",status_code="200"} 24
aws_api_requests_total{error_code="",operation="GetRule",service="VPC Lattice",status_code="200"} 5
aws_api_requests_total{error_code="",operation="GetTargetGroup",service="VPC Lattice",status_code="200"} 5
aws_api_requests_total{error_code="",operation="ListListeners",service="VPC Lattice",status_code="200"} 6
aws_api_requests_total{error_code="",operation="ListRules",service="VPC Lattice",status_code="200"} 6
aws_api_requests_total{error_code="",operation="ListServiceNetworkServiceAssociations",service="VPC Lattice",status_code="200"} 4
aws_api_requests_total{error_code="",operation="ListServiceNetworkVpcAssociations",service="VPC Lattice",status_code="200"} 1
aws_api_requests_total{error_code="",operation="ListServiceNetworks",service="VPC Lattice",status_code="200"} 4
aws_api_requests_total{error_code="",operation="ListServices",service="VPC Lattice",status_code="200"} 8
aws_api_requests_total{error_code="",operation="ListTagsForResource",service="VPC Lattice",status_code="200"} 4
aws_api_requests_total{error_code="",operation="ListTargetGroups",service="VPC Lattice",status_code="200"} 18
aws_api_requests_total{error_code="",operation="ListTargets",service="VPC Lattice",status_code="200"} 9
aws_api_requests_total{error_code="",operation="RegisterTargets",service="VPC Lattice",status_code="200"} 5
aws_api_requests_total{error_code="ConflictException",operation="CreateServiceNetworkServiceAssociation",service="VPC Lattice",status_code="409"} 1
aws_api_requests_total{error_code="ConflictException",operation="RegisterTargets",service="VPC Lattice",status_code="409"} 1

liwenwu-amazon commented 6 months ago

I am closing this issue with PR #628. Please re-open it if you feel it has not addressed your concerns. Thanks.