Open avm-team-linter[bot] opened 4 days ago
[!IMPORTANT] @Azure/avm-res-documentdb-databaseaccount-module-owners-bicep, the workflow for the
avm/res/document-db/database-account
module has failed. Please investigate the failed workflow run. If you are not able to do so, please inform the AVM core team to take over.
I'll take a look. This may be due to a paired region.
@bryansan-msft:
I was looking through https://azure.github.io/Azure-Verified-Modules and I couldn't find where the enforcedLocation
pattern was defined or specified. Does you have a link to that?
My idea is to update the multiRegion test so that it no longer uses a dependency file to "locate" a paired region. Instead, I want to use two enforcedLocation
parameters and hard-code regions that I know will work until I eventually can get better guidance from the product group about this.
CC: @AlexanderSehr
In my PR, I have all greens on the testing in my local environment. I made the following changes:
I just got word from the product group that we don't have a way to measure, via the REST API, capacity for an Azure region. I also heard back that zone-redunandancy in Azure Cosmos DB may be difficult to spin-up in our testing subscriptions based on capacity.
If, down the road, we have an endpoint to measure capacity/quota via the REST API, I will update the e2e tests to use the API to find a region with capacity.
I just got word from the product group that we don't have a way to measure, via the REST API, capacity for an Azure region. I also heard back that zone-redunandancy in Azure Cosmos DB may be difficult to spin-up in our testing subscriptions based on capacity.
If, down the road, we have an endpoint to measure capacity/quota via the REST API, I will update the e2e tests to use the API to find a region with capacity.
Hey @jtracey93, do we have any guidance regarding the WAF-reliablity pillar? I guess what we could do is to configure the WAF-reliable default in the module, but disable it via the test. However, if we do this has 2 drawbacks we'd need to be 'ok' with:
It's unfortunately also neither the first time we had an issue like this, nor will it be the last. It's just that other than in other cases, for CosmosDB it 'can' work - but be difficult (like @seesharprun mentioned above). So I'd expect it to fail most of the time.
I'll leave the PR and this item open while everyone discusses the pros-cons of enabling/disabling zone redundancy (zonal) for Azure Cosmos DB.
CC: @markjbrown from the product group for any additional context.
There are restrictions for zonal deployments for Cosmos for EA/PAYG. CXP Capacity PBI has the details. Error messages returned from ARM will direct customers to open a ticket so customers should be able to sort out for themselves. But given the restrictions, likely customers will not get happy path on first try.
For Internal, non-prod subscriptions, these are heavily restricted for zonal Cosmos deployments. The only regions without restrictions at this time in public cloud are Australia East, UK South, Sweden Central, France Central. This is subject to change at any time.
Hey @jtracey93, do we have any guidance regarding the WAF-reliablity pillar? I guess what we could do is to configure the WAF-reliable default in the module, but disable it via the test. However, if we do this has 2 drawbacks we'd need to be 'ok' with:
- The Usage Examples would show the zone redundancy being disabled (also in the WAF Usage Example)
- We'd need to disable the zone redundancy testing for the resource in the PSRule configuration as it is what forces module owners today to configure the zone redundancy
It's unfortunately also neither the first time we had an issue like this, nor will it be the last. It's just that other than in other cases, for CosmosDB it 'can' work - but be difficult (like @seesharprun mentioned above). So I'd expect it to fail most of the time.
In the related PR, I quoted the [..], where possible and appropriate [...] line from that WAF-aligned standard. I would argue that the path of least resistance is to change the default for zone availability to false
and then set it to true
for the WAF-aligned template. With that default, you're less likely to see customers filing support tickets because they can't deploy this template due to a capacity issue.
But, I'm flexible and will pivot to whatever this group decides.
In WAF alignment for resource modules we are not saying multi-region has to be enforced by default, only zone redundancy.
Zone Redundancy is mandatory for resource modules by default as per the specifications and a non-negotiable.
Maybe in these use cases we need to add some logic to the CI that if in the group of deployment/e2e tests some fail with expected errors, like capacity issues etc., but one or 2 of the others do pass, we continue with publishing. This way the defaults are kept "correct" and aligned to specs and messaging of AVM value "secure and resilient out of the box" but also allows us to continue progressing with some test failing, make our CI more tollerant.
I'll update the associated PR to leave zone redundant as the default and wait for your team to work on logic to handle scenarios where zone redundant deployment of Azure Cosmos DB fails (as expected in our environments).
FWIW let me also add here relevant AVM docs references: Waf alignment FAQ and the spec Availability Zone
@seesharprun do you think the above covers the guidance? Please let us know so that the docs can be improved if needed.
I'd suggest not to wait for any CI change short term as these require discussion and proper testing if agreed upon. For the time being can we find a location where this is working?
Per @markjbrown
There are restrictions for zonal deployments for Cosmos for EA/PAYG. CXP Capacity PBI has the details. Error messages returned from ARM will direct customers to open a ticket so customers should be able to sort out for themselves. But given the restrictions, likely customers will not get happy path on first try.
For Internal, non-prod subscriptions, these are heavily restricted for zonal Cosmos deployments. The only regions without restrictions at this time in public cloud are Australia East, UK South, Sweden Central, France Central. This is subject to change at any time.
I'll do two things:
zoneRedundant
property set to true
by defaultWe just have to accept that the pipeline will likely fail again in the future when the regions with or without restrictions change or capacity changes. We also have to accept that some internal downstream dependencies or external customers will have issues running the tests if they select the wrong region.
I'm okay with accepting that just to get the pipeline running again.
Per @markjbrown
There are restrictions for zonal deployments for Cosmos for EA/PAYG. CXP Capacity PBI has the details. Error messages returned from ARM will direct customers to open a ticket so customers should be able to sort out for themselves. But given the restrictions, likely customers will not get happy path on first try. For Internal, non-prod subscriptions, these are heavily restricted for zonal Cosmos deployments. The only regions without restrictions at this time in public cloud are Australia East, UK South, Sweden Central, France Central. This is subject to change at any time.
I'll do two things:
- I'll hardcode the pipeline to use two of those four for the current multi-region test
- I'll leave the
zoneRedundant
property set totrue
by defaultWe just have to accept that the pipeline will likely fail again in the future when the regions with or without restrictions change or capacity changes. We also have to accept that some internal downstream dependencies or external customers will have issues running the tests if they select the wrong region.
I'm okay with accepting that just to get the pipeline running again.
Thank you all for the discussion. @seesharprun what you describe seems like a reasonably good solution in the short term. The CI change seems a bit wacky and I'm not convinced we'd be able to pull it of in the way it is imagined (as it comes with a lot of dependencies on external factors), but am happy to brainstorm it.
Mind you, if bad comes to worse we could also remove the multi-region test. The only mandatory ones are the defaults
& waf-aligned
and having those being fully WAF compatible (be it via defaults, configuration or explicit values). Every single other test's purpose is just to test a scenario the module owner wanted to either show, or use to validate that the module works in the intended way. This, in turn, also means we could very well disable zone redundancy in the multi-region main.test.bicep
file- if that helps (which I understand it would).
Failed run: https://github.com/Azure/bicep-registry-modules/actions/runs/11390119682