[Failed pipeline] avm.res.document-db.database-account

avm-team-linter[bot] commented 4 days ago

Failed run: https://github.com/Azure/bicep-registry-modules/actions/runs/11390119682

avm-team-linter[bot] commented 4 days ago

[!IMPORTANT] @Azure/avm-res-documentdb-databaseaccount-module-owners-bicep, the workflow for the avm/res/document-db/database-account module has failed. Please investigate the failed workflow run. If you are not able to do so, please inform the AVM core team to take over.

seesharprun commented 4 days ago

I'll take a look. This may be due to a paired region.

seesharprun commented 1 day ago

@bryansan-msft:

I was looking through https://azure.github.io/Azure-Verified-Modules and I couldn't find where the enforcedLocation pattern was defined or specified. Does you have a link to that?

My idea is to update the multiRegion test so that it no longer uses a dependency file to "locate" a paired region. Instead, I want to use two enforcedLocation parameters and hard-code regions that I know will work until I eventually can get better guidance from the product group about this.

CC: @AlexanderSehr

seesharprun commented 1 day ago

In my PR, I have all greens on the testing in my local environment. I made the following changes:

Set zone redundancy to false by default (instead of true)
Use East US 2 as the default region for most tests and West US 3 as the second region for the multi-region test

seesharprun commented 1 day ago

I just got word from the product group that we don't have a way to measure, via the REST API, capacity for an Azure region. I also heard back that zone-redunandancy in Azure Cosmos DB may be difficult to spin-up in our testing subscriptions based on capacity.

If, down the road, we have an endpoint to measure capacity/quota via the REST API, I will update the e2e tests to use the API to find a region with capacity.

avm-team-linter[bot] commented 16 hours ago

Failed run: https://github.com/Azure/bicep-registry-modules/actions/runs/11436472872

AlexanderSehr commented 13 hours ago

I just got word from the product group that we don't have a way to measure, via the REST API, capacity for an Azure region. I also heard back that zone-redunandancy in Azure Cosmos DB may be difficult to spin-up in our testing subscriptions based on capacity.

If, down the road, we have an endpoint to measure capacity/quota via the REST API, I will update the e2e tests to use the API to find a region with capacity.

Hey @jtracey93, do we have any guidance regarding the WAF-reliablity pillar? I guess what we could do is to configure the WAF-reliable default in the module, but disable it via the test. However, if we do this has 2 drawbacks we'd need to be 'ok' with:

The Usage Examples would show the zone redundancy being disabled (also in the WAF Usage Example)
We'd need to disable the zone redundancy testing for the resource in the PSRule configuration as it is what forces module owners today to configure the zone redundancy

It's unfortunately also neither the first time we had an issue like this, nor will it be the last. It's just that other than in other cases, for CosmosDB it 'can' work - but be difficult (like @seesharprun mentioned above). So I'd expect it to fail most of the time.

seesharprun commented 9 hours ago

I'll leave the PR and this item open while everyone discusses the pros-cons of enabling/disabling zone redundancy (zonal) for Azure Cosmos DB.

CC: @markjbrown from the product group for any additional context.

markjbrown commented 9 hours ago

There are restrictions for zonal deployments for Cosmos for EA/PAYG. CXP Capacity PBI has the details. Error messages returned from ARM will direct customers to open a ticket so customers should be able to sort out for themselves. But given the restrictions, likely customers will not get happy path on first try.

For Internal, non-prod subscriptions, these are heavily restricted for zonal Cosmos deployments. The only regions without restrictions at this time in public cloud are Australia East, UK South, Sweden Central, France Central. This is subject to change at any time.

seesharprun commented 8 hours ago

Hey @jtracey93, do we have any guidance regarding the WAF-reliablity pillar? I guess what we could do is to configure the WAF-reliable default in the module, but disable it via the test. However, if we do this has 2 drawbacks we'd need to be 'ok' with:

The Usage Examples would show the zone redundancy being disabled (also in the WAF Usage Example)

We'd need to disable the zone redundancy testing for the resource in the PSRule configuration as it is what forces module owners today to configure the zone redundancy

It's unfortunately also neither the first time we had an issue like this, nor will it be the last. It's just that other than in other cases, for CosmosDB it 'can' work - but be difficult (like @seesharprun mentioned above). So I'd expect it to fail most of the time.

In the related PR, I quoted the [..], where possible and appropriate [...] line from that WAF-aligned standard. I would argue that the path of least resistance is to change the default for zone availability to false and then set it to true for the WAF-aligned template. With that default, you're less likely to see customers filing support tickets because they can't deploy this template due to a capacity issue.

But, I'm flexible and will pivot to whatever this group decides.

jtracey93 commented 5 hours ago

In WAF alignment for resource modules we are not saying multi-region has to be enforced by default, only zone redundancy.

Zone Redundancy is mandatory for resource modules by default as per the specifications and a non-negotiable.

Maybe in these use cases we need to add some logic to the CI that if in the group of deployment/e2e tests some fail with expected errors, like capacity issues etc., but one or 2 of the others do pass, we continue with publishing. This way the defaults are kept "correct" and aligned to specs and messaging of AVM value "secure and resilient out of the box" but also allows us to continue progressing with some test failing, make our CI more tollerant.

seesharprun commented 5 hours ago

I'll update the associated PR to leave zone redundant as the default and wait for your team to work on logic to handle scenarios where zone redundant deployment of Azure Cosmos DB fails (as expected in our environments).

eriqua commented 5 hours ago

FWIW let me also add here relevant AVM docs references: ⁠Waf alignment FAQ and the spec ⁠Availability Zone

@seesharprun do you think the above covers the guidance? Please let us know so that the docs can be improved if needed.

I'd suggest not to wait for any CI change short term as these require discussion and proper testing if agreed upon. For the time being can we find a location where this is working?

seesharprun commented 5 hours ago

Per @markjbrown

There are restrictions for zonal deployments for Cosmos for EA/PAYG. CXP Capacity PBI has the details. Error messages returned from ARM will direct customers to open a ticket so customers should be able to sort out for themselves. But given the restrictions, likely customers will not get happy path on first try.

For Internal, non-prod subscriptions, these are heavily restricted for zonal Cosmos deployments. The only regions without restrictions at this time in public cloud are Australia East, UK South, Sweden Central, France Central. This is subject to change at any time.

I'll do two things:

I'll hardcode the pipeline to use two of those four for the current multi-region test
I'll leave the zoneRedundant property set to true by default

We just have to accept that the pipeline will likely fail again in the future when the regions with or without restrictions change or capacity changes. We also have to accept that some internal downstream dependencies or external customers will have issues running the tests if they select the wrong region.

I'm okay with accepting that just to get the pipeline running again.

AlexanderSehr commented 1 hour ago

Per @markjbrown

There are restrictions for zonal deployments for Cosmos for EA/PAYG. CXP Capacity PBI has the details. Error messages returned from ARM will direct customers to open a ticket so customers should be able to sort out for themselves. But given the restrictions, likely customers will not get happy path on first try. For Internal, non-prod subscriptions, these are heavily restricted for zonal Cosmos deployments. The only regions without restrictions at this time in public cloud are Australia East, UK South, Sweden Central, France Central. This is subject to change at any time.

I'll do two things:

I'll hardcode the pipeline to use two of those four for the current multi-region test

I'll leave the zoneRedundant property set to true by default

We just have to accept that the pipeline will likely fail again in the future when the regions with or without restrictions change or capacity changes. We also have to accept that some internal downstream dependencies or external customers will have issues running the tests if they select the wrong region.

I'm okay with accepting that just to get the pipeline running again.

Thank you all for the discussion. @seesharprun what you describe seems like a reasonably good solution in the short term. The CI change seems a bit wacky and I'm not convinced we'd be able to pull it of in the way it is imagined (as it comes with a lot of dependencies on external factors), but am happy to brainstorm it. Mind you, if bad comes to worse we could also remove the multi-region test. The only mandatory ones are the defaults & waf-aligned and having those being fully WAF compatible (be it via defaults, configuration or explicit values). Every single other test's purpose is just to test a scenario the module owner wanted to either show, or use to validate that the module works in the intended way. This, in turn, also means we could very well disable zone redundancy in the multi-region main.test.bicep file- if that helps (which I understand it would).

Azure / bicep-registry-modules

[Failed pipeline] avm.res.document-db.database-account #3588