Closed shulin-sq closed 6 months ago
I suspect that here we swallow the error instead of returning it to controller-runtime:
retryErr := NewRetryError()
if errors.As(err, &retryErr) {
return ctrl.Result{RequeueAfter: time.Second * 20}, nil
}
Here is what the reconcile loop does today: the current logic always bubbles up LATTICE_RETRY whenever the reconciler encounters any issue, which makes the controller reconcile the same event every 20 seconds. It does not distinguish between waiting for a job to finish (e.g. creating a target group before programming the listener rules) and an incomplete configuration (e.g. the custom domain name issue you ran into).
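To make the distinction concrete, here is a minimal sketch of how the two cases could be separated. It uses only the standard library, so `Result` is a stand-in for `ctrl.Result`, and `RetryError`, `ConfigError`, and `handleReconcileError` are hypothetical names for illustration, not the controller's actual types. The idea is that only transient conditions are requeued silently, while everything else is returned, so that controller-runtime can count it as a reconcile error and apply its own backoff.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Result is a stand-in for controller-runtime's ctrl.Result so that this
// sketch compiles with the standard library only.
type Result struct {
	RequeueAfter time.Duration
}

// RetryError (hypothetical) marks a transient condition, for example a
// target group that is still being created.
type RetryError struct{ Reason string }

func (e *RetryError) Error() string { return "retry: " + e.Reason }

// ConfigError (hypothetical) marks a condition that retrying cannot fix,
// for example a custom domain name problem.
type ConfigError struct{ Reason string }

func (e *ConfigError) Error() string { return "config: " + e.Reason }

// handleReconcileError maps an error from the reconcile body to the values
// that would be returned to the controller runtime.
func handleReconcileError(err error) (Result, error) {
	if err == nil {
		return Result{}, nil
	}
	var retryErr *RetryError
	if errors.As(err, &retryErr) {
		// Transient: requeue after a fixed delay, do not report a failure.
		return Result{RequeueAfter: 20 * time.Second}, nil
	}
	// Anything else is returned instead of swallowed.
	return Result{}, err
}

func main() {
	wrapped := fmt.Errorf("upsert service: %w", &RetryError{Reason: "target group not ready"})
	res, err := handleReconcileError(wrapped)
	fmt.Println(res.RequeueAfter, err)

	res, err = handleReconcileError(&ConfigError{Reason: "custom domain name"})
	fmt.Println(res.RequeueAfter, err)
}
```

Note that `errors.As` needs a typed target to match against; this sketch assumes the retry error is a pointer type, which may not match how the controller defines it.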
I am thinking of a POC that emits metrics similar to the AWS load balancer: a counter covering all Lattice API calls, with labels for the status code and error code returned by each call.
In your particular example (the custom domain name issue), the counter for that specific error code would keep incrementing.
Here is the POC output:
curl -s localhost:8080/metrics | grep aws_api_requests_total
# HELP aws_api_requests_total Total number of HTTP requests that the SDK made
# TYPE aws_api_requests_total counter
aws_api_requests_total{error_code="",operation="CreateListener",service="VPC Lattice",status_code="201"} 1
aws_api_requests_total{error_code="",operation="CreateRule",service="VPC Lattice",status_code="201"} 1
aws_api_requests_total{error_code="",operation="CreateService",service="VPC Lattice",status_code="201"} 1
aws_api_requests_total{error_code="",operation="CreateServiceNetworkServiceAssociation",service="VPC Lattice",status_code="200"} 1
aws_api_requests_total{error_code="",operation="CreateTargetGroup",service="VPC Lattice",status_code="201"} 1
aws_api_requests_total{error_code="",operation="GetResources",service="Resource Groups Tagging API",status_code="200"} 24
aws_api_requests_total{error_code="",operation="GetRule",service="VPC Lattice",status_code="200"} 5
aws_api_requests_total{error_code="",operation="GetTargetGroup",service="VPC Lattice",status_code="200"} 5
aws_api_requests_total{error_code="",operation="ListListeners",service="VPC Lattice",status_code="200"} 6
aws_api_requests_total{error_code="",operation="ListRules",service="VPC Lattice",status_code="200"} 6
aws_api_requests_total{error_code="",operation="ListServiceNetworkServiceAssociations",service="VPC Lattice",status_code="200"} 4
aws_api_requests_total{error_code="",operation="ListServiceNetworkVpcAssociations",service="VPC Lattice",status_code="200"} 1
aws_api_requests_total{error_code="",operation="ListServiceNetworks",service="VPC Lattice",status_code="200"} 4
aws_api_requests_total{error_code="",operation="ListServices",service="VPC Lattice",status_code="200"} 8
aws_api_requests_total{error_code="",operation="ListTagsForResource",service="VPC Lattice",status_code="200"} 4
aws_api_requests_total{error_code="",operation="ListTargetGroups",service="VPC Lattice",status_code="200"} 18
aws_api_requests_total{error_code="",operation="ListTargets",service="VPC Lattice",status_code="200"} 9
aws_api_requests_total{error_code="",operation="RegisterTargets",service="VPC Lattice",status_code="200"} 5
aws_api_requests_total{error_code="ConflictException",operation="CreateServiceNetworkServiceAssociation",service="VPC Lattice",status_code="409"} 1
aws_api_requests_total{error_code="ConflictException",operation="RegisterTargets",service="VPC Lattice",status_code="409"} 1
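For anyone curious how a counter with these labels behaves, below is a rough standard-library sketch. It is not the POC code: `APICounter`, `Observe`, and `Render` are hypothetical names, and a real implementation would presumably use a Prometheus client library and hook into the AWS SDK request handlers rather than being called by hand.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// apiCallKey holds the label values that identify one counter series.
type apiCallKey struct {
	Service, Operation, StatusCode, ErrorCode string
}

// APICounter is a minimal stand-in for a labelled Prometheus counter.
type APICounter struct {
	mu     sync.Mutex
	counts map[apiCallKey]int
}

func NewAPICounter() *APICounter {
	return &APICounter{counts: make(map[apiCallKey]int)}
}

// Observe records one completed API call. errorCode is empty on success.
func (c *APICounter) Observe(service, operation string, statusCode int, errorCode string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := apiCallKey{
		Service:    service,
		Operation:  operation,
		StatusCode: fmt.Sprint(statusCode),
		ErrorCode:  errorCode,
	}
	c.counts[key]++
}

// Render returns one line per series in the Prometheus text format,
// sorted so the output is stable.
func (c *APICounter) Render() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	lines := make([]string, 0, len(c.counts))
	for k, n := range c.counts {
		lines = append(lines, fmt.Sprintf(
			"aws_api_requests_total{error_code=%q,operation=%q,service=%q,status_code=%q} %d",
			k.ErrorCode, k.Operation, k.Service, k.StatusCode, n))
	}
	sort.Strings(lines)
	return strings.Join(lines, "\n")
}

func main() {
	c := NewAPICounter()
	c.Observe("VPC Lattice", "CreateService", 201, "")
	c.Observe("VPC Lattice", "RegisterTargets", 409, "ConflictException")
	c.Observe("VPC Lattice", "RegisterTargets", 409, "ConflictException")
	fmt.Println(c.Render())
}
```

Because the error code is a label, a call that keeps failing with the same error shows up as one series that keeps growing, which makes it easy to alert on.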
I am closing this issue with PR #628. Please re-open it if you feel it has not addressed your concerns. Thanks!
Hello,
I'm using the metrics from :8080 and can't find a good one that indicates reconciliation failures.
For example, runtime_reconcile_errors is 0 despite there being reconciliation errors in my controller.
I curled the port to confirm that this isn't an issue with the Datadog agent that pulls the metrics.
A search through our logs shows that servicemanager.upsert is failing at 5 rps for services that were created before we added a custom domain name to the gateway (these always fail to reconcile because the service needs to be recreated).
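With a counter like the aws_api_requests_total series shown earlier in this thread, one way to spot this kind of failure from the scraped text is to sum the samples whose error_code label is non-empty. The sketch below is only an illustration: `countAPIErrors` is a hypothetical helper, and it assumes each sample line ends with the value (no trailing timestamp).

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// errorCodeRe extracts the value of the error_code label from a sample line.
var errorCodeRe = regexp.MustCompile(`error_code="([^"]*)"`)

// countAPIErrors sums aws_api_requests_total samples that carry a non-empty
// error_code label, grouped by error code.
func countAPIErrors(metricsText string) map[string]float64 {
	totals := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(metricsText))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "aws_api_requests_total{") {
			continue // skips HELP/TYPE comments and other metrics
		}
		m := errorCodeRe.FindStringSubmatch(line)
		if m == nil || m[1] == "" {
			continue // successful calls have an empty error_code
		}
		// Assumption: the sample value is the last whitespace-separated field.
		fields := strings.Fields(line)
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		totals[m[1]] += v
	}
	return totals
}

func main() {
	sample := strings.Join([]string{
		`# TYPE aws_api_requests_total counter`,
		`aws_api_requests_total{error_code="",operation="ListServices",service="VPC Lattice",status_code="200"} 8`,
		`aws_api_requests_total{error_code="ConflictException",operation="RegisterTargets",service="VPC Lattice",status_code="409"} 1`,
	}, "\n")
	fmt.Println(countAPIErrors(sample))
}
```

In practice the same filter is usually expressed in the monitoring system's query language instead of in code; the sketch just shows which series matter.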