aztfmod / rover

The rover is a docker container in charge of the deployment of the Terraform platform engineering for Azure
MIT License

Intermittent fault login_as_launchpad #216

Closed JvHd-vw closed 2 years ago

JvHd-vw commented 2 years ago

I have encountered an intermittent fault when running the rover on GitHub Actions with a service principal. The error I get in the workflow is:

Getting launchpad coordinates from subscription: ***
 - keyvault_name: null
ERROR: AKV10032: Invalid issuer. Expected one of,,,,, found***/.
Error on or near line 354: Not authorized to manage landingzones. User must be member of the security group to access the launchpad and deploy a landing zone; exiting with status 102

This error happens after the workflow has already successfully deployed a level0 landing zone and is trying to deploy a higher-level landing zone. After some debugging I found that it is related to the fact that our workflow uses a matrix and hence runs a separate job for each landing zone. If a follow-up job's request is routed to the same region as level0, e.g. WESTUS2, the workflow succeeds. When it is routed to a different region, e.g. EASTUS2, the keyvault used in login_as_launchpad is not present in the output of the az keyvault list command.

I have created a Support Request (ID: 2110150050000485) on Azure to investigate/mitigate this issue. For now, Azure Support suggested replacing az keyvault list with az graph query, as the latter should not have the issue.
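A minimal sketch of what such a lookup through Azure Resource Graph could look like. The helper names `build_keyvault_query` and `find_keyvault_name`, and the tag name and value in the usage note, are assumptions made for illustration and are not taken from the rover source or the pull request. It also assumes the Azure CLI `resource-graph` extension is installed and that the extension returns its rows under a `data` key (older extension versions may return a bare array).

```shell
#!/usr/bin/env bash
# Sketch only: helper names and tag values are hypothetical, not the rover's actual code.

# Build a Resource Graph (KQL) query that finds Key Vaults carrying a given tag.
build_keyvault_query() {
  local tag_name="$1" tag_value="$2"
  printf "Resources | where type =~ 'microsoft.keyvault/vaults' | where tags['%s'] =~ '%s' | project name" \
    "${tag_name}" "${tag_value}"
}

# Look up the vault name through Azure Resource Graph instead of `az keyvault list`.
# Needs an authenticated az session and the `resource-graph` extension.
# Assumes the result rows are under `data`; adjust the JMESPath for older extension versions.
find_keyvault_name() {
  local query
  query="$(build_keyvault_query "$1" "$2")"
  az graph query -q "${query}" --query "data[0].name" -o tsv
}
```

For example, `find_keyvault_name tfstate level0` would print the name of the first vault tagged `tfstate=level0`, if the launchpad vault is tagged that way in your environment.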

I created a pull request to implement this.

brk3 commented 2 years ago

If a follow-up job's request is routed to the same region as level0, e.g. WESTUS2, the workflow succeeds. When it is routed to a different region, e.g. EASTUS2, the keyvault used in login_as_launchpad is not present in the output of the az keyvault list command.

Hi, just curious as to why this would happen. My understanding is az cli requests are not region specific?

JvHd-vw commented 2 years ago

The request is not, but depending on where the response comes from, it did or did not have the keyvault in the response. I ran az keyvault list --debug and received a different 'Response content' depending on the x-ms-routing-request-id. So depending on the region in the request ID, I received a different result.
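One way to repeat this check is to pull the region out of the x-ms-routing-request-id header in the debug output. The sketch below assumes the header value starts with a region name followed by a colon (as in `WESTUS2:...`) and that the debug log prints the header name and value on one line. The helper name `routing_region` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: `routing_region` is a hypothetical helper, and the exact debug-log
# layout can differ between Azure CLI versions.

# Read log lines on stdin and print the region prefix of every
# x-ms-routing-request-id value found (the text before the first colon).
routing_region() {
  sed -nE "s/.*x-ms-routing-request-id[^A-Za-z0-9]+([A-Za-z0-9]+):.*/\1/p"
}
```

Usage: `az keyvault list --debug 2>&1 | routing_region` prints the region that served each request, which can then be compared against whether the keyvault appeared in that response.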

brk3 commented 2 years ago

That sounds like really odd behavior. I would expect that if this happened consistently, a lot of people would be having problems. Are you behind some form of proxy that may be caching responses? Why does it only happen when using a service principal?

The PR looks good to me, I just feel like there's more to this puzzle. Perhaps maintainers from MS can share more insight.

JvHd-vw commented 2 years ago

I experienced it when running on GitHub Actions, but also locally.

arnaudlh commented 2 years ago

Hi @brk3 @JvHd-vw, working on a repro now, will keep you guys posted.