dotnet / arcade

Tools that provide common build infrastructure for multiple .NET Foundation projects.
MIT License
666 stars 341 forks source link

Staging - [Alerting] Autoscale: Minutes to scale-up from zero machine alert #9925

Closed dotnet-eng-status-staging[bot] closed 2 years ago

dotnet-eng-status-staging[bot] commented 2 years ago

:broken_heart: Metric state changed to alerting

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes, there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change Grafana-Automated-Alert-Id-54aa0d7e647e46ff9e880bf6ae532b99
andriipatsula commented 2 years ago

from https://dnceng.visualstudio.com/internal/_wiki/wikis/DNCEng%20Services%20Wiki/885/-Alerts-Autoscale-Minutes-to-scale-up-from-zero-machine-alert

Note: Only MattGal has access to this Kusto instance, so if you want to run this, you'll need to touch base with him or ask for permissions.

@MattGal, could you please assist me?

andriipatsula commented 2 years ago

we have one more alarm: https://github.com/dotnet/arcade/issues/9922 "Helix AutoScaler Service Stopped Running"

andriipatsula commented 2 years ago

image there was a spike with logging; I used query that was introduced in https://github.com/dotnet/arcade/issues/9856

riarenas commented 2 years ago

@ilyas1974 and @andriipatsula what was the status of this one? It seems like the alert is still active but I don't see any hand off notes to know what you tried and looked at.

andriipatsula commented 2 years ago

I thought, this was related to https://github.com/dotnet/arcade/issues/9856 .

riarenas commented 2 years ago

Were there some logs in the auto scaler that suggested excessive logging was causing machines to not come up? I'd appreciate any notes you made during your investigation so that we can keep looking at it.

andriipatsula commented 2 years ago

I wasn't able to invest much in autoscaler issues (this one and https://github.com/dotnet/arcade/issues/9922). As on Friday there were few more issues with apple devices. As I am new in this area, I asked @dkurepa to assist. Our short investigation brought us to testing of AzureDevOpsTestAggregation_v2 and I switched to other alarms.

riarenas commented 2 years ago

Our short investigation brought us to testing of AzureDevOpsTestAggregation_v2 and I switched to other alarms.

Was it some traces you found? was it just the fact that the issue was happening at the same time?

missymessa commented 2 years ago

@ilyas1974 started this conversation with us last week about #9922. Since these services are in different SF clusters, I'm confused how AzureDevOpsTestAggregation_v2 is affecting Autoscaler: https://teams.microsoft.com/l/message/19:23c287844dbd4f2c818607ab799024a9@thread.skype/1657298430656?tenantId=72f988bf-86f1-41af-91ab-2d7cd011db47&groupId=147df318-61de-4f04-8f7b-ecd328c256bb&parentMessageId=1657298430656&teamName=.NET%20Eng%20Services&channelName=V-Team%20-%20Dev%20WF%20(Kitchen)&createdTime=1657298430656

riarenas commented 2 years ago

I see there are a lot of exceptions of this type in app insights:

Microsoft.ServiceFabric.Actors.ActorConcurrencyLockTimeoutException at ProcessAutoScaleService.ProcessAutoScaleService+<ProcessAutoScale>d__10.MoveNext

But it seems we get those from time to time so it's not easy for me to tell if this is the culprit.

I also see that there haven't been any "scaling up" traces since 7/7 at 5 PM, which pretty much matches when this alert would've fired.

riarenas commented 2 years ago

https://github.com/dotnet/arcade/issues/9922 happened around the same time. The alert is inactive now, but perhaps when it came back it came up messed up. After the current meeting I'll follow the instructions in https://dnceng.visualstudio.com/internal/_wiki/wikis/DNCEng%20Services%20Wiki/648/AutoScalerInvestigateIssues to troubleshoot

riarenas commented 2 years ago

The staging cluster shows that the customAutoScale application is totally in error:

image

I don't think this is at all related to any logging.

riarenas commented 2 years ago

I believe this was caused by this change: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/commit/1686b410450a6a8ed221100f5800a6dff0bf729d

That change was first deployed with this build: https://dev.azure.com/dnceng/internal/_build/results?buildId=1866653&view=results which matches when the alert was opened.

@ChadNedzlek I'm going to prepare a rollback for that change. I don't have any updates on how to figure out which changes broke the auto scaler, but I think unblocking the build should come first.

riarenas commented 2 years ago

Revert PR: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/24052

tkapin commented 2 years ago

@riarenas / @ChadNedzlek - the reverted change changes both how dependencies are consumed and some code (e.g., AzureTableReleaseJournalStore.cs). Do you think it'd be possible to get clarity about which exact part of the change caused the problem? Thanks

riarenas commented 2 years ago

It's not clear which part of the change caused the problem. I don't think we should have that understood before rolling back as per our policy of rolling back culprits when found.

ChadNedzlek commented 2 years ago

I wasn't able to find any errors. I asked @ulisesh for assistance tracking it down, because he's the expert on the auto scaler, but I didn't hear back. That change is important, so someone on FR will need to pick it back up.

riarenas commented 2 years ago

Now that we've confirmed we're not trying to rush a production rollout I'm going to see if I can debug this locally today. If I can't make any progress by EoD we can revert and then be more diligent with our testing when bringing it back.

I think adopting central package management was maybe too big of a task for FR so I'd rather avoid having to go through that again if we are able to figure out what caused the failure here.

ChadNedzlek commented 2 years ago

Without the central package thing, we'll need to add a reference to NewtonSoft.Json to basically every project in the entire repository, and that's a really painful (and hard to remove later) change. Since identifying dead package references is a pain... I'm not aware of any tooling that will help, so you have to remove a package, try building the solution...

riarenas commented 2 years ago

I think it's definitely a change worth making, I just question whether it fits the nature of FR issues, and instead we should've dedicated someone with more bandwidth to just focus on the work and all the pitfalls we've seen with the feature.

riarenas commented 2 years ago

Running the scaler locally showed no issues. Made basically no progress on finding out the issues today. The revert PR is also still failing due to the mishmash of the new project with the old versioning scheme bringing in bad versions of newtonsoft.json. We made quite a mess here.

ChadNedzlek commented 2 years ago

Yeah, any "new" references since the break will need to get versions slapped on them.

ChadNedzlek commented 2 years ago

Oooooor, we could revert more stuff. That's a bigger hammer, but it might be necessary.

riarenas commented 2 years ago

Yep my next attempt tomorrow if I can't slap the right versions to appease everything will be to revert more things.

MattGal commented 2 years ago

I think I am able to reliably reproduce this by making my local debugger try to run all scaling not just one queue.

Basically, an ObjectDisposedException is thrown on SslStream inside src\ServiceFabric\AutoScaleActorService\ServiceBusClient.cs "GetQueueAsync", as the inner exception of a TaskCancelledException, then ignored. This prevents the autoscale service from getting queue depth and rapidly things grind to a halt.

Inner exception's stack trace:

   at System.Net.Security.SslStream.ReadAsyncInternal[TReadAdapter](TReadAdapter adapter, Memory`1 buffer)
   at System.Net.Http.HttpConnection.FillAsync()
   at System.Net.Http.HttpConnection.ReadNextResponseHeaderLineAsync(Boolean foldedHeadersAllowed)
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)

From previous @ChadNedzlek work in helix-service, this requires an updated version of the messaging client, and in this case an update to a totally different library, to work around. We should proceed with the rollback efforts, then update away from Microsoft.Azure.Management.Compute at our earliest convenience.

riarenas commented 2 years ago

Thanks for your investigation @mattgal. I'll push through the rollback first thing tomorrow morning.

dotnet-eng-status-staging[bot] commented 2 years ago

:green_heart: Metric state changed to ok

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes, there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Go to rule

MattGal commented 2 years ago

(this going green was just me running the scaler in a debugger locally, it will be red again soon)

ChadNedzlek commented 2 years ago

Gonna ask for an RCA on this one. Three of us have been trying to track down the problem for days, and have been unsuccessful... that's not really acceptable.

dotnet-eng-status-staging[bot] commented 2 years ago

:broken_heart: Metric state changed to alerting

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes, there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Go to rule

riarenas commented 2 years ago

The revert PR has been merged.

riarenas commented 2 years ago

The revert went through and I'm seeing traces of the auto scaler attempting to scale up and down. The validation build still failed due to timeouts in the jobs, and this alert is still active so there's something else going on.

##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.28.open: Job 17ba832a-867b-4804-b44c-6ed0b4a03c86 has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.31.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Validation job timeouted and an attempt to get queue info failed. For PR queues, it is expected that queue info won't be available -- the timeout means validation failed though and you should check your change.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.31.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Job 3ef8e58a-4a96-4a60-a733-05f8e94a0bbf has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.open.arcade@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Validation job timeouted and an attempt to get queue info failed. For PR queues, it is expected that queue info won't be available -- the timeout means validation failed though and you should check your change.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.open.arcade@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Job e7788fdc-c53a-436c-a51a-e2dbf7dc8b80 has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.31.open: Job c30375f0-bf18-4cf9-af42-57fbf3b2e8af has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.30.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Validation job timeouted and an attempt to get queue info failed. For PR queues, it is expected that queue info won't be available -- the timeout means validation failed though and you should check your change.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.30.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Job 1a6f90cb-2194-4afc-b129-bd619172e74f has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.open.arcade: Job d36f7013-0e82-463c-bccc-56c9156ee100 has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.27.open: Job 34822ade-ef90-4715-8462-a1a998990849 has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.27.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Validation job timeouted and an attempt to get queue info failed. For PR queues, it is expected that queue info won't be available -- the timeout means validation failed though and you should check your change.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.27.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Job b615fff9-0a8d-4b79-bc49-541c30f2ea6f has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.2004.amd64.open.svc: Job 61851d2d-e21b-41cc-9e27-957d00293e84 has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.29.open: Job 09c7faef-6eaa-42dc-b806-6ea049f4c216 has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.32.open: Job 6f573426-1a7e-49d5-9295-a12787677185 has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.32.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Validation job timeouted and an attempt to get queue info failed. For PR queues, it is expected that queue info won't be available -- the timeout means validation failed though and you should check your change.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.32.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Job 0f8e5c45-1bf5-408c-883c-5b982902066e has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.28.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Validation job timeouted and an attempt to get queue info failed. For PR queues, it is expected that queue info won't be available -- the timeout means validation failed though and you should check your change.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.28.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Job d2843bcc-bfcb-456f-bc61-f3f335060be8 has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.30.open: Job 17c0e4dc-cfbd-4991-8c6c-e548230ceaaa has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.29.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Validation job timeouted and an attempt to get queue info failed. For PR queues, it is expected that queue info won't be available -- the timeout means validation failed though and you should check your change.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.android.29.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Job b154b63a-b0dd-4b6b-97ed-5c540c347a9d has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : sles.15.amd64: Job 1720272c-e5e6-45ae-bb81-b67ec6cf957f has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.svc@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Validation job timeouted and an attempt to get queue info failed. For PR queues, it is expected that queue info won't be available -- the timeout means validation failed though and you should check your change.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.svc@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19: Job dd3cbde6-2e78-4e24-a2ee-0df9c5d46a0f has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.2204.amd64: Job de5b9f4e-a9a5-4607-bc9e-82c64391ec22 has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : windows.10.amd64: Job 08ce2296-a135-40f4-9def-de0ce7b0f0dd has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : redhat.7.amd64: Job 64b6a127-ddc4-4dcf-a8d3-f40efa472129 has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : centos.7.amd64: Job 783e1c1d-3118-43a6-9bab-2a879f971fe9 has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.1804.amd64.svc: Job 6febd060-173d-48b1-931c-f2a9337d965b has not finished in 45 minutes.
##[error]D:\a\_work\1\s\ValidateHelix.dll(,): error : ubuntu.2004.amd64: Job f6ccf1c5-ea35-4d0a-ba7a-4d97790bc260 has not finished in 45 minutes.
MattGal commented 2 years ago

I think I know what this is (leftover offline machines that the scaler "forgot" about ) : I'll scale these sets back to 0 and restart the validation.

dotnet-eng-status-staging[bot] commented 2 years ago

:green_heart: Metric state changed to ok

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes, there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Go to rule

MattGal commented 2 years ago

It was that issue (fallout from me running the scaler locally and leaving live, offlined machines); can probably close this?

riarenas commented 2 years ago

Sure. I'll make sure to fill out the rca.