Azure / Azure-Proactive-Resiliency-Library-v2

Azure Proactive Resiliency Library v2 (APRL) - Source for Azure WAF reliability guidance and associated ARG queries
https://azure.github.io/Azure-Proactive-Resiliency-Library-v2/
MIT License

💡 Feature Request - Databricks/workspaces -- Zone Resiliency #246

Closed: FallenHoot closed this issue 2 months ago

FallenHoot commented 3 months ago

Describe the solution you'd like

Azure Databricks doesn't support availability zone resiliency at this time. By default it is deployed to a single AZ, and the only resiliency available for this service is at the Azure region level (protection against a region outage). This should be documented or noted on the APRL page: https://azure.github.io/Azure-Proactive-Resiliency-Library-v2/azure-resources/Databricks/workspaces/

Describe alternatives you've considered

Azure Databricks doesn't support availability zone resiliency at this time, and there are no alternatives because zone placement is out of the user's control.

Additional context

End users can’t control the Azure Databricks control plane, which handles compute cluster creation. It is unknown whether Azure defaults to “Auto-AZ”; we would need to ask the PG/PM how this is configured in the backend by default. The theory is that if an AZ fails, the cluster becomes unhealthy and, if Auto-AZ is the default, automatically fails over to the next healthy AZ that has the selected compute available. At this time you can't select which AZ is used, which suggests that Auto-AZ is the default.

oZakari commented 3 months ago

Hey @FallenHoot, I do agree that this is valid information for the customer, but the recommendations in APRL need to be actionable by the customer, so this shouldn't be a dedicated recommendation.

We could probably add a disclaimer to this recommendation. Thoughts?

It would probably also be beneficial to have this information added to the product documentation directly, if you want to create a request there as well.

FallenHoot commented 3 months ago

Hi @oZakari, I think it is good to state explicitly that it doesn't support zonal failover. A disclaimer on APRL and also on https://learn.microsoft.com/en-us/azure/databricks/admin/disaster-recovery (is this what you meant by product doc?) would make it more transparent. I can create a PR for this as well.

oZakari commented 2 months ago

Agreed @FallenHoot, I am submitting a PR for APRL if you want to add one for the MS Learn link you pasted.