dotnet / dnceng

.NET Engineering Services

Ubuntu.2004.ArmArch exists in different regions between HelixImages and HelixPRImages #4156

Open · chcosta opened 1 month ago

chcosta commented 1 month ago

The ubuntu.2004.armarch image is in westus2 in the 'HelixImages' Azure Compute Gallery, but in westus in 'HelixPRImages'. We likely got into this state because the compute hash, as it currently exists, is calculated over such a narrow set of definition values that most deployments are skipped during staging. ubuntu.2004.armarch needs to be in westus2 for both galleries. Currently, if you accidentally deploy ubuntu.2004.armarch during a staging CI job (by changing one of the deployment values defined in definitions/shared/linux.yaml that feed into the hash), you'll encounter an error like this (a sketch of the hash gating follows the error output):

                     ##[error]D:\a\_work\1\s\DeployQueues.dll(,): error : Failed to delete existing VM in pr-ubuntu.2004.armarch.open-dev-chcosta-upgradepol-a-scaleset: "The gallery image /subscriptions/84a65c9a-787d-45da-b10a-3a1cefce8060/resourceGroups/HelixPRImages/providers/Microsoft.Compute/galleries/HelixPRImages/images/ubuntu.2004.armarch/versions/2024.0917.232437 is not available in westus2 region. Please contact image owner to replicate to this region, or change your requested region."
                     Status: 404
                     ErrorCode: GalleryImageNotFound
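To illustrate the gating described above, here is a minimal sketch. It assumes made-up field names and a hypothetical helper rather than the actual helix-machines code: because only a narrow subset of definition values feeds the hash, and Region is not among them, moving an image between regions never changes the hash, so staging skips the redeployment that would have exposed the mismatch.

```csharp
// Minimal sketch (not the actual helix-machines code): the deployment hash is
// computed over only a narrow, hard-coded subset of definition values, so a
// Region change never alters the hash and staging skips the redeployment.
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static class DeploymentHash
{
    // Illustrative subset; the real hashed fields live in the helix-machines code.
    private static readonly string[] HashedKeys = { "Image", "Sku", "VmSize" };

    public static string Compute(IReadOnlyDictionary<string, string> definition)
    {
        var builder = new StringBuilder();
        foreach (string key in HashedKeys)
        {
            if (definition.TryGetValue(key, out var value))
            {
                builder.Append(key).Append('=').Append(value).Append(';');
            }
        }

        // "Region" is never appended, so moving ubuntu.2004.armarch from westus
        // to westus2 produces an identical hash and no redeployment in staging.
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(builder.ToString()));
        return Convert.ToHexString(hash);
    }
}
```

Folding Region (or all definition values) into the hashed set would make a region change force a staging redeploy, along the lines of the hash change discussed below.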


dougbu commented 1 month ago

the Region: westus2 property in the ubuntu.2004.armarch definition YAML should control the deployment region regardless of the environment (PR, staging, prod). where is that being overridden for deployments from PR builds❓ that is, how does this image get created in westus at all❓

separately, I agree including the region in the hash might be useful. I'm not sure that would actually move the image between regions as you expect, however. is this 🤞

dougbu commented 1 month ago

is this definitely an Ops issue @chcosta and @ilyas1974❓ just wondering if it needs triage

ilyas1974 commented 4 weeks ago

I think we have two issues here. The first is to correct the problem of images being in different regions (ops); the second is prevention/mitigation, i.e., understanding how this happened and how to keep it from happening again. I think that separate issue is something that can be discussed in triage.

dougbu commented 4 weeks ago

broke this into #4324 and #4325. marked second as Needs triage

chcosta commented 2 weeks ago

Images appear to be created in the correct region and are consistent. Image definitions are in different regions.

Here are the queues where image definitions are in different regions...


| BuildPool | HelixPRImages definition location | HelixImages definition location | helix-machines source code value (under definitions folder) |
| -- | -- | -- | -- |
| Build.Windows.10.Amd64.ES.VS2017.Open | westus | westus2 | not present |
| Build.Windows.Amd64.VS2019.Pre.ES.Open | westus | westus2 | not present |
| ubuntu.1804.armarch | westus | westus2 | westus2 |
| ubuntu.2004.armarch | westus | westus2 | westus2 |
| windows.11.arm64 | westus | westus2 | westus2 |
| Windows.Server.Amd64.VS2017 | westus | westus2 | westus |
| windows.vs2017.amd64.es.open | westus | westus2 | not present |
| windows.vs2022.amd64 | westus | westus2 | westus |
| windows.vs2022preview.amd64.open | westus | westus2 | westus |
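In case it helps anyone repeat this audit, here is a rough sketch of how the table above could be generated with the Azure.ResourceManager.Compute SDK. It assumes each gallery sits in a resource group of the same name (true for HelixPRImages per the resource ID in the error output); the credential choice and output format are illustrative assumptions, not the tooling dnceng actually uses.

```csharp
// Illustrative audit (not dnceng tooling): print each image definition's location
// in a gallery so the HelixImages and HelixPRImages results can be compared.
using System;
using Azure.Identity;
using Azure.ResourceManager;
using Azure.ResourceManager.Compute;
using Azure.ResourceManager.Resources;

class GalleryAudit
{
    static void Main()
    {
        var client = new ArmClient(new DefaultAzureCredential());
        SubscriptionResource subscription = client.GetDefaultSubscription();

        foreach (string name in new[] { "HelixImages", "HelixPRImages" })
        {
            // Assumes the gallery lives in a resource group with the same name,
            // matching the resource ID shown in the GalleryImageNotFound error.
            ResourceGroupResource rg = subscription.GetResourceGroups().Get(name).Value;
            GalleryResource gallery = rg.GetGalleries().Get(name).Value;

            foreach (GalleryImageResource image in gallery.GetGalleryImages().GetAll())
            {
                Console.WriteLine($"{name}: {image.Data.Name} -> {image.Data.Location}");
            }
        }
    }
}
```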

chcosta commented 2 weeks ago

I'm now unable to repro this failure locally, in dotnet-helix-machines-pr, or in dotnet-helix-machines-ci. Closing this issue until it surfaces again or we figure out how to get a repro (I don't know what I did differently the first time to encounter this failure).

helix-machines-ci - https://dev.azure.com/dnceng/internal/_build/results?buildId=2572700&view=results; the failure in this run is related to resource issues from manually running the pipeline, it's not the failure I was hoping to see.

helix-machines-pr - https://dev.azure.com/dnceng/internal/_build/results?buildId=2572102&view=results

dougbu commented 2 weeks ago

> Images appear to be created in the correct region and are consistent. Image definitions are in different regions.
>
> Here are the queues where image definitions are in different regions...
>
> | BuildPool | HelixPRImages definition location | HelixImages definition location | definitions value |
> | -- | -- | -- | -- |
> | Build.Windows.10.Amd64.ES.VS2017.Open | westus | westus2 | not present |
> | Build.Windows.Amd64.VS2019.Pre.ES.Open | westus | westus2 | not present |
> | ubuntu.1804.armarch | westus | westus2 | westus2 |
> | ubuntu.2004.armarch | westus | westus2 | westus2 |
> | windows.11.arm64 | westus | westus2 | westus2 |
> | Windows.Server.Amd64.VS2017 | westus | westus2 | westus |
> | windows.vs2017.amd64.es.open | westus | westus2 | not present |
> | windows.vs2022.amd64 | westus | westus2 | westus |
> | windows.vs2022preview.amd64.open | westus | westus2 | westus |

I'm not quite sure what this table means. could you clarify the column titles @chcosta❓

in case it matters, CreateCustomImages gets --region westus in both -pr and -ci builds. it looks hard-coded but is in fact overridden for all ARM64 architecture images, which use westus2. but this controls only where the initial image is created.

DeployQueues and Deploy1ESHostedPools decide where images are copied for use in our scale sets and build pools. that should always match the Region specified in the definitions/ YAML files; there's no hidden override. a few definitions (all ARM64) explicitly override the Region: westus default.
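To make that split concrete, here is a small sketch of the two decisions as described above, using made-up method and property names rather than the actual CreateCustomImages/DeployQueues code:

```csharp
// Illustrative only (hypothetical names): where the initial image is built vs.
// where DeployQueues / Deploy1ESHostedPools replicate it for scale sets and pools.
using System;
using System.Collections.Generic;

static class RegionSelection
{
    // CreateCustomImages receives --region westus in both -pr and -ci builds,
    // but ARM64 images are overridden to build in westus2.
    public static string InitialImageRegion(string cliRegionArg, string architecture) =>
        architecture.Equals("arm64", StringComparison.OrdinalIgnoreCase)
            ? "westus2"
            : cliRegionArg; // "westus" today

    // The Region property in the definitions/ YAML (default westus, explicitly
    // set to westus2 for the ARM64 definitions) decides where the image is
    // copied for scale sets and build pools; there is no hidden override.
    public static string DeploymentRegion(IReadOnlyDictionary<string, string> definition) =>
        definition.TryGetValue("Region", out var region) ? region : "westus";
}
```

If the region recorded in the gallery and the Region requested at deployment time disagree, the GalleryImageNotFound error quoted in the issue description is the likely result.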

separately, Region controls the name of some resource groups but not the actual RG. I haven't looked closely enough to determine exactly what the Region property in the definitions/ YAML explicitly controls. nor do I understand why locations would be chosen differently between PR, staging, and production builds. one possibility for those differences may be timing: PR resources are created from scratch, always using the latest code, but that's not the case for most hosted pools, scale sets, and their linked staging or production resources.

lastly, I suspect this remains a problem given the failures in !44466. should we reopen this issue❓

chcosta commented 2 weeks ago

If we have a repro, then yes, we should reopen this

dougbu commented 2 weeks ago

given !44466 passed on retry, I'm beginning to think the issue is related to Azure quota. might be worth looking for earlier warnings in builds failing w/ this symptom.