aws / aws-application-networking-k8s

A Kubernetes controller for Amazon VPC Lattice
https://www.gateway-api-controller.eks.aws.dev/
Apache License 2.0
175 stars 50 forks source link

In service_manager, if tags of an association cannot be fetched, skip instead of error #600

Closed shulin-sq closed 8 months ago

shulin-sq commented 9 months ago

What type of PR is this? bug

Which issue does this PR fix: In the scenario where

Account A can be in a situation where does not have permissions to read the tags of that association

Therefore this results in an error during route reconciliation.

error during service synthesis failed ServiceManager.Upsert xxx due to NotFoundException: Resource was not found
    status code: 404, request id: a400c824-64e3-4694-94ce-1b5698e1ee35

What does this PR do / Why do we need it:

if the get tags of a service network association request fails, instead of erroring, skip deleting the association and log.

If an issue # is not available please add repro steps and logs from aws-gateway-controller showing the issue:

see above scenario

Testing done on this change:

tested the fix in a staging environment (on top of most recent master) but open to suggestions on how test it via unit tests (looks like current tests don't touch on this scenario).

sample error message:

{"level":"warn","ts":"2024-02-26T01:47:25.568Z","logger":"controller.route","caller":"lattice/service_manager.go:232","msg":"skippin
g update associations  service: xxx, association:xxx, error: NotFoundException: Resource was not found\n\tstatus code: 404, request id: 086c
ba48-4421-4eda-bcae-c543acd81a85"}

Automation added to e2e:

Will this PR introduce any new dependencies?: no

Will this break upgrades or downgrades. Has updating a running cluster been tested?: no, yes

Does this PR introduce any user-facing change?:

* fix reconciling lattice services that have service network associations created in a different account (via RAM)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

shulin-sq commented 9 months ago

@solmonk interesting. have not run into the same issue with vpc association but seems like the same issue. I will update the PR.

shulin-sq commented 8 months ago

@mikhail-aws @zijun726911 it seems the presubmit hook is stuck, any advice on how to fix it?

zijun726911 commented 8 months ago

presubmit passed, will merge it

zijun726911 commented 8 months ago

@shulin-sq Do you need our side to do a new controller version release( new helm chart) to include this fix? Or you build your image by yourself in your private image repo?

shulin-sq commented 8 months ago

@zijun726911 We already created an internal build that includes this change. No rush on a release version.

shulin-sq commented 8 months ago

I had to redact some fields from a comment I left earlier. Here is the comment:

some notes:

I was unable to reproduce the VPC problem. I do have my VPC association set up in another account, so I'm unsure why I'm not running into the same issue. Specifically my scenario is: account A owns the sn, account A shares sn to account B, account B does the vpc association. account A does not have access to the vpc association. The controller has iam permissions for account A. I tried checking for the error code but unfortunately it looks like getTags returns 404 NotFoundException which is different from ResourceNotFoundException which is already a static string in the vpclattice go sdk. For now I kept the err check and added a todo message. Alternatives would be to check against the status code 404 (but that's not exposed as part of awserr.Error) or to check for NotFoundException but temporarily define the string somewhere else. Please let me know what your preference is. I was able to test the service association in my staging environment and saw the following error messages:

before while debugging (when inspecting the awserr.Error)

{"level":"error","ts":"2024-02-26T01:31:06.420Z","logger":"controller.route","caller":"lattice/service_manager.go:227","msg":"shulin _was_here NotFoundException, %!w(awserr.requestError=&{0xc01324f640 404 5cd0afde-496b-4343-abaa-e239df73eea7 []}), %!w(awserr.requ estError=&{0xc01324f640 404 5cd0afde-496b-4343-abaa-e239df73eea7 []})"} after

{"level":"warn","ts":"2024-02-26T01:47:25.568Z","logger":"controller.route","caller":"lattice/service_manager.go:232","msg":"skippin g update associations service: REDACTED, association: REDACTED, error: NotFoundException: Resource was not found\n\tstatus code: 404, request id: 086c ba48-4421-4eda-bcae-c543acd81a85"}