Spike - IaaS-based Compute

heoelri commented 2 years ago

This documents replacing AKS with IaaS as the compute platform used for Azure Mission-Critical. Some specific scenarios might require the use of IaaS VMs instead of PaaS services. Potential reasons are:

Lack of knowledge and skills
Legacy workloads that require OS-level access or specific drivers and configurations
Performance requirements that cannot be fulfilled in containers or PaaS services
Lack of support for 3rd-party workloads

Changes required compared to Mission-Critical-Online:

Removed AKS and replaced with VMSS
- Requires a replacement for ingress e.g. AppGW (or FD?) - AppGw might make sense here - potentially with a PLS in front to expose it via AFD Premium
- Requires different rollout process for the workload
- Two VMSS: one for the frontend (exposed via AppGw) and one for the backend (not exposed, hosting the backend processing)
Removed ACR
Added shared image gallery (as global service for now) to store images

Scenarios to address:

Scalable / stateless workloads -> Virtual Machine Scale Sets
Static / stateful workloads -> Virtual Machines in an AV-Set

Open questions / findings:

Boot diagnostics storage for VMSS does not support ZRS.
Should the shared image gallery be a global service or deployed per stamp?
Can stateful workloads be hosted in VMSS in a meaningful way?
What is the recommended (and most reliable) way to roll out software to (Windows) VMs?
Where should application/workload components be stored (the counterpart to ACR in a more cloud-native scenario)? Storage accounts?
How should dependencies such as AD DS and WSFC be handled?
Are database backends (on VMs) in or out of scope?

Recommendations:

Security
- Disable username / password authentication when using Linux
- Store VMSS credentials in Azure KeyVault
Compute
- The same zone considerations apply: spread across zones if possible, or consolidate into fewer than three zones if proximity is required or latency is a concern
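
As a rough illustration of the Security and Compute recommendations above, here is a minimal sketch of a zone-spread Linux scale set that only allows SSH key login. It assumes the Azure CLI; the function name, admin user, image alias choice and instance count are hypothetical and not taken from the spike.

```shell
# Sketch only: create a Linux VMSS spread across three zones with
# password authentication disabled (SSH keys only).
# Arguments (both hypothetical placeholders): resource group, scale set name.
create_frontend_vmss() {
  local resource_group="$1" vmss_name="$2"

  az vmss create \
    --resource-group "$resource_group" \
    --name "$vmss_name" \
    --image Ubuntu2204 \
    --admin-username azureuser \
    --authentication-type ssh \
    --generate-ssh-keys \
    --zones 1 2 3 \
    --instance-count 3
}
```

If proximity matters more than zone redundancy, the `--zones` list could be reduced to a single zone or two zones. Where the generated private key ends up (for example Key Vault) is left out of this sketch.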

heoelri commented 2 years ago

Using VMs instead of containers or PaaS services like AppSvc (with or without Containers) requires us to develop and implement a new application build, packaging and installation process to bring our application code (I'd stick to the sample catalogservice application we already have for now) onto the frontend and backend virtual machines.

Downloading the source and building the application on demand when starting a VM(SS) instance is, from my POV, not a viable option, as this would take too long, is potentially error-prone and could lead to varying results.

My idea is to replace the existing container build/push (to ACR) task with a VM specific one. This process could build (dotnet publish) the application code, for example self-contained and singlefile for a certain architecture, for example linux, package it into a tar.gz file (for linux) and push it to a storage account. This SA would act as a (private) repository.

- task: AzureCLI@2
  displayName: 'Build and package ${{ parameters.componentName }}'
  retryCountOnTaskFailure: 1
  inputs:
    workingDirectory: ${{ parameters.workingDirectory }}
    azureSubscription: $(azureServiceConnection)
    scriptType: pscore
    scriptLocation: inlineScript
    inlineScript: |

      dotnet publish ${{ parameters.componentName }} `
        -r ${{ parameters.targetPlatform }} `
        -p:PublishSingleFile=true `
        --self-contained:true `
        -o output

      tar -czf ${{ parameters.componentName }}-$(Build.BuildId)-${{ parameters.targetPlatform }}.tar.gz output

      az storage blob upload -f ${{ parameters.componentName }}-$(Build.BuildId)-${{ parameters.targetPlatform }}.tar.gz `
          --container-name applications `
          --name ${{ parameters.componentName }}-$(Build.BuildId)-${{ parameters.targetPlatform }}.tar.gz `
          --account-name $(global_storage_account_name) `
          --auth-mode login

This builds (dotnet publish) the application code, archives it into a *.tar.gz file and uploads it to a global storage account.

From there we can pull a specific version onto the VM(SS) instances, for example via a custom script extension or via cloud-init.
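
A minimal sketch of what such a pull step might look like on a Linux instance, whether it is triggered from a custom script extension or from cloud-init. It assumes the Azure CLI is present in the image and that the instance has a managed identity with read access to the blobs (for example the Storage Blob Data Reader role); the function names and the install directory are hypothetical.

```shell
# Rebuild the blob name used by the pipeline:
# <component>-<buildId>-<platform>.tar.gz
artifact_name() {
  local component="$1" build_id="$2" platform="$3"
  printf '%s-%s-%s.tar.gz\n' "$component" "$build_id" "$platform"
}

# Download one pinned version from the "applications" container.
# Assumption: the VM(SS) instance signs in with its managed identity.
download_artifact() {
  local account="$1" blob="$2" target="$3"
  az login --identity --output none || return 1
  az storage blob download \
    --account-name "$account" \
    --container-name applications \
    --name "$blob" \
    --file "$target" \
    --auth-mode login \
    --output none
}

# Unpack the archive. The pipeline archives a top-level "output" folder,
# so that leading path component is stripped here.
unpack_artifact() {
  local archive="$1" install_dir="$2"
  mkdir -p "$install_dir" || return 1
  tar -xzf "$archive" -C "$install_dir" --strip-components=1
}
```

For example, `artifact_name catalogservice 1234 linux-x64` prints `catalogservice-1234-linux-x64.tar.gz`, which can then be passed to `download_artifact` and `unpack_artifact`. Pinning the build ID explicitly keeps every instance of a stamp on the same version.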

CC: @sebader; @msimecek for feedback.

sebader commented 2 years ago

This looks all pretty good to me already! One thing I would like to throw in: Using VMSS for horizontal scaling is almost(...) going towards a more cloud-native approach. Which is great when you can use it. But IaaS workloads in my experience often contain some workload that does not work with such an approach with dynamic scale out etc. Often enough customers need to use VMs because the workload they need to run needs some actual installation process on one or more VMs which cannot be scaled in or out dynamically. So how about the following: We use the VMSS-based approach for either one, frontend or backend. For the other we use VMs (maybe in an Availability Set?) and try to mimic some kind of installation process during deployment. This way we can show both approaches. Thoughts?

heoelri commented 2 years ago

Yes, I think that's a good idea. And I agree that customers who have to stick to VMs due to legacy workloads probably struggle to use VMSS. Using VMSS for FE and VMs for BE (or vice versa) would allow us to address both scenarios.
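
To make the non-VMSS half of this idea concrete, here is a minimal sketch, assuming the Azure CLI, of a fixed set of backend VMs in an Availability Set with a separate installation step run after creation. All names, the VM naming scheme and the fault/update domain counts are hypothetical, and an Availability Set is used here instead of zones.

```shell
# Sketch only: create an availability set and a fixed number of backend VMs in it.
# Arguments (hypothetical placeholders): resource group, availability set name, VM count.
create_backend_vms() {
  local resource_group="$1" avset_name="$2" vm_count="$3"

  az vm availability-set create \
    --resource-group "$resource_group" \
    --name "$avset_name" \
    --platform-fault-domain-count 2 \
    --platform-update-domain-count 5 || return 1

  local i
  for i in $(seq 1 "$vm_count"); do
    az vm create \
      --resource-group "$resource_group" \
      --name "vm-backend-$i" \
      --availability-set "$avset_name" \
      --image Ubuntu2204 \
      --admin-username azureuser \
      --authentication-type ssh \
      --generate-ssh-keys || return 1
  done
}

# Mimic a classic installation step by running a local script on one VM.
run_install_step() {
  local resource_group="$1" vm_name="$2" script_path="$3"
  az vm run-command invoke \
    --resource-group "$resource_group" \
    --name "$vm_name" \
    --command-id RunShellScript \
    --scripts @"$script_path"
}
```

Unlike the scale set, the number of VMs is fixed at deployment time, which is closer to how workloads with a manual installation process are typically operated.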

Azure / Mission-Critical-Online

Spike - IaaS-based Compute #593