Closed: cremor closed this issue 1 year ago
Hello @cremor, we will take a look. However, we cannot promise to start the MSSQLLocalDB service after Microsoft-hosted runner allocation, since it can affect the performance of other customers.
@cremor, you can check the event log by running PowerShell, for example:

```yaml
- task: PowerShell@2
  condition: always()
  inputs:
    targetType: 'inline'
    script: |
      Get-EventLog -LogName System | fl
      Get-EventLog -LogName Application | fl
```
> we cannot promise to start the MSSQLLocalDB service after Microsoft-hosted runner allocation, since it can affect the performance of other customers.
I don't know how the agent images work, but can't you start the LocalDB instance before you create the image, so that it is already running when a new agent is started from that image? If it doesn't work that way, then maybe the LocalDB instance could at least be created in the image? Then the agent would only have to start it (this is assuming that creating the instance is the problem and not starting it - I haven't verified this).
Of course, those are just some ideas. If there is a way to "simply" fix the creation/startup performance then none of this is needed.
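For illustration, a minimal sketch of what such a warmup could look like in an image-provisioning script, assuming provisioning can run arbitrary PowerShell (the `-s` flag of `sqllocaldb create` starts the instance right after creating it; whether a running instance survives image capture is exactly the open question above):

```powershell
# Hypothetical image-generation snippet (not the actual runner-images provisioning code):
# create the automatic LocalDB instance and start it, so agents cloned from the image
# at least find the instance already created.
sqllocaldb create MSSQLLocalDB -s
sqllocaldb info MSSQLLocalDB   # print version/state to the provisioning log
```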
@cremor, did you have a chance to collect the Windows Application event log?
No, I've only seen that LocalDB creation error twice over a timespan of about a month. I'd have to add this to all my pipelines and hope that it happens again, which I don't really want to do.
OK, most probably it will be something like "the mentioned service was not started for ... seconds", but let's add it and see.
From my side, I'll try running your step `sqllocaldb start MSSQLLocalDB` in ADO and see how much time it takes.
@cremor, I created a sample workflow:

```yaml
trigger:
- main

pool:
  vmImage: windows-latest

steps:
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: 'sqllocaldb start MSSQLLocalDB'
```
and ran it 50 times
14s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=279
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=280
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=281
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=282
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=283
14s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=284
60s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=285
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=286
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=287
34s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=288
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=289
20s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=290
17s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=291
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=292
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=293
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=294
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=295
39s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=296
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=297
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=298
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=299
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=300
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=301
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=302
50s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=303
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=304
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=305
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=306
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=307
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=308
35s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=309
40s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=310
11s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=311
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=312
20s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=313
21s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=314
44s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=315
33s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=316
23s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=317
31s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=318
34s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=319
21s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=320
18s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=321
36s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=322
30s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=323
30s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=324
23s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=325
14s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=326
23s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=327
23s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=328
All 50 times LocalDB started; I was not able to reproduce your case where the service did not start. Could it be a matter of system utilization? Do you start LocalDB at the very beginning?
My actual pipeline calls the LocalDB start script after a .NET build step. But even in my test pipeline with that single LocalDB start step, the step alone sometimes took over a minute.
I didn't catch extremely long (multiple minute) durations with the simple test pipeline, but that might just have been bad (good? 😁 ) luck. As I wrote in my initial report, the time is very inconsistent. Sometimes it's fast, sometimes it's slow.
The error has only happened twice for me as far as I know, so it is very rare. And I suspect that the error is just some timeout, because when it happened the step duration was around 5 minutes. So fixing the performance might also fix the error.
Btw, I'm in the Azure DevOps region "West Europe" in case the region affects performance.
I'd say the bottleneck is the VM size. Standard runners are Standard DS2 v2 (2 vCPUs, 7 GiB memory), and it is quite possible for a heavy build to use all 7 GiB of RAM.
I'd suggest moving the LocalDB start to the very beginning. I do not believe "West Europe" is the reason. Well, if you can collect, say, 100 builds in West Europe and 100 builds in another region, we can compare the numbers. But it is unlikely to be the root cause.
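For illustration, a minimal sketch of that reordering, with the LocalDB start as the very first step so nothing else competes for CPU and disk yet (the build step here is only a placeholder):

```yaml
pool:
  vmImage: windows-latest

steps:
- script: 'sqllocaldb start MSSQLLocalDB'
  displayName: 'Start SQL Server LocalDB'   # first step, before any build work
- task: DotNetCoreCLI@2                      # placeholder for the actual .NET build
  inputs:
    command: 'build'
    projects: '**/*.csproj'
```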
But if the same project (same build step) still results in LocalDB startup times that vary between 20s and 5m, then the build can't be the factor that slows the LocalDB startup time down, right?
Also, I've seen the 5 minute LocalDB startup time for very simple Web API projects (not much more than a single controller and a bit of EF Core infrastructure code).
But I'll try a few things and will get back to you.
I do not observe 5m. I suspect that you have some other component which slows things down. On a clean build I observe startup times between 12s and 60s.
Anyway, please provide detailed repro steps for observing the 5-minute startup time, so we can investigate.
With the repro steps provided so far, I see 12s-60s.
I just tested it again with this simple pipeline:
```yaml
trigger:
- main

pool:
  vmImage: windows-latest

steps:
- script: 'sqllocaldb create MSSQLLocalDB'
  displayName: 'Create SQL Server LocalDB'
- script: 'sqllocaldb start MSSQLLocalDB'
  displayName: 'Start SQL Server LocalDB'
```
I split the "create" and "start" steps here to figure out what exactly is slow. Turns out it can be both 😁 So maybe I'd better continue to use just a "start" step.
First try: (Missing display names for the steps. I added those later.)
Second try:
@cremor, I modified my workflow as:

```yaml
trigger:
- main

pool:
  vmImage: windows-latest

steps:
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: 'sqllocaldb create MSSQLLocalDB'
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: 'sqllocaldb start MSSQLLocalDB'
```
I really wonder whether THAT difference is because of the "script" vs. "task" semantics, but my numbers are ...
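For reference, the two forms being compared; the `script` shorthand is a `CmdLine` step (cmd.exe on Windows agents) while `PowerShell@2` starts a PowerShell host, so a small constant per-step overhead difference is plausible, though it should not account for minutes:

```yaml
steps:
# "script" shorthand: equivalent to a CmdLine step, runs via cmd.exe on Windows agents
- script: 'sqllocaldb start MSSQLLocalDB'
  displayName: 'Start LocalDB (script shorthand)'

# explicit task: runs the same command through a PowerShell host
- task: PowerShell@2
  displayName: 'Start LocalDB (PowerShell task)'
  inputs:
    targetType: 'inline'
    script: 'sqllocaldb start MSSQLLocalDB'
```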
Well, I cannot reproduce your numbers... thus I cannot make measurements on my side. I'd suggest adding WPT code around `sqllocaldb create MSSQLLocalDB`, something like:
```yaml
trigger:
- main

pool:
  vmImage: windows-latest

steps:
- task: CmdLine@2
  inputs:
    script: |
      choco install windows-performance-toolkit
      "C:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\xperf.exe" -start -on LOADER+PROC_THREAD+DISK_IO+HARD_FAULTS+DPC+INTERRUPT+CSWITCH+PERF_COUNTER+FILE_IO_INIT+REGISTRY
- task: CmdLine@2
  inputs:
    script: |
      sqllocaldb create MSSQLLocalDB
- task: CmdLine@2
  inputs:
    script: |
      sqllocaldb start MSSQLLocalDB
- task: CmdLine@2
  inputs:
    script: |
      "c:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\xperf.exe" -d $(Build.ArtifactStagingDirectory)\tmp.etl
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)\tmp.etl'
    ArtifactName: 'etl'
    publishLocation: 'Container'
```
This should collect WPT telemetry in tmp.etl and publish it as an artifact.
I'll provide the WPT telemetry file once I manage to reproduce the slow startup in my test pipeline again. Right now the LocalDB startup always takes less than 1 minute. It's really annoying that this is so inconsistent (maybe dependent on outside factors?).
I'm not familiar with LocalDB. How does `Create SQL Server LocalDB` work? Does it look at the files around it? Maybe we have different filesets in our repos (I have no files in mine).
`sqllocaldb` should be completely independent of any source files in the repo. See also https://learn.microsoft.com/en-us/sql/tools/sqllocaldb-utility
ok, I'll try to run 50 builds with createdb + telemetry
Good, good. I made a small mistake in the YAML; glad you were able to fix it.
Interesting. Defender is relatively short. What could RuntimeBroker be? Is it part of SQL?
I see that SQL itself takes all the time (and no other process).
Well, in theory we can invoke `sqllocaldb create MSSQLLocalDB` during image preparation. What is `MSSQLLocalDB`, is it a database name? We cannot "warm up" just any database; we would have to name a specific one.
Well, it seems the winner is svchost.
No idea what it is doing, but most probably "docker" is not related to the LocalDB activity (and we can do something about it after we figure out what it is).
> in theory we can invoke `sqllocaldb create MSSQLLocalDB` during image preparation
That would be great. It seems like most of the time the "start" only takes a few seconds when "create" was executed before. Only rarely does "start" alone take long too. (If I catch it, I will provide separate trace files.)
> What is `MSSQLLocalDB`, is it a database name? We cannot "warm up" just any database; we would have to name a specific one.
`MSSQLLocalDB` is the default name of the automatic instance. See
https://learn.microsoft.com/en-us/sql/database-engine/configure-windows/sql-server-express-localdb?view=sql-server-ver16#automatic-and-named-instances
and
https://learn.microsoft.com/en-us/sql/database-engine/configure-windows/sql-server-express-localdb?view=sql-server-ver16#connect-to-the-automatic-instance
@cremor, I made some tests; unfortunately, "sqllocaldb create MSSQLLocalDB" will take a bit more effort from our side.
Good news: it can be run many times, so if I add it to image generation you can still run it from your pipeline and no error will be emitted. Bad news: it stores the database in the user profile. The user used for image generation is deleted after image generation (the profile is deleted as well), so we need to move the generated database to "C:\Users\Default" (which is used as the template for new users).
Also, a few thoughts:

1. "sqllocaldb create MSSQLLocalDB" creates the database in the user profile, which is stored on the "C:" drive. Drive "C:" is relatively slow, only 500 IOPS for everything. We have a faster "D:" drive (it is exposed via environment variables as the Agent Temp directory); a pipeline-side workaround using it is sketched after this list.
2. In the collected WPT trace I see "Windows Cortana" activity. I suspect it used a huge amount of the 500 IOPS on the "C:" drive. Cortana is enabled on purpose; for real-life e2e tests it is invaluable to run in an environment very similar to real life. Unfortunately, we do not have a good solution for customers who do not need those extra services. Of course, a customer can disable it from the pipeline, but it takes a lot of effort to find out what to disable.
- "sqllocaldb create MSSQLLocalDB" creates database in user profile, which is stored on "C:" drive. drive "C:" is relatively slow, only 500 IOPS for everything. we have faster "D:" drive
Is this also true for other databases that are preinstalled on the images? Or do the data files of those other databases use the D: drive?
> In the collected WPT trace I see "Windows Cortana" activity. I suspect it used a huge amount of the 500 IOPS on the "C:" drive.
If this (or some other background process) is indeed the problem, then maybe the correct solution would be to fix that resource usage instead of doing something special for LocalDB? But I'm wondering why the effect of those background processes is so inconsistent. Shouldn't they always behave the same, since the same image is used to start the agent?
Just wanted to let you know... thank god we are not the only ones with this issue. It has been bugging us on and off for a few months. But recently (as in, last week and the week before that) it has gotten significantly worse. We weren't using the workaround @cremor implemented, so our unit tests keep failing now and then. So at least that is something we can implement on our end to make the builds stay green.
Yeah, Microsoft-hosted agent performance is all over the place recently. I have seen steps (not LocalDB, something completely different) that usually take less than 5 seconds and now sometimes take nearly 2 minutes 😞
@cremor, sorry for the long delay.
While I'm still looking at how to properly "warm up" things, it seems that I've found the reason for the occasional high disk IO.
We hadn't disabled the following scheduled task:
We'll disable it, that's the easy part. Just curious, does your observation of slowness correlate with that schedule?
I don't think so. Even though I don't know which time zone this would be, I've seen slow LocalDB startup at all times throughout a (work) day.
But maybe the trigger is executed as soon as the agent starts because its last scheduled time was missed? If that is the case, then it could indeed be the cause of all the performance problems of the agent.
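One way to check that correlation directly from a pipeline (a sketch; the task path is the standard Windows Defender location, and the property selection is an assumption) is to dump the scheduled task's last/next run times and missed-run count as a first step:

```yaml
steps:
- task: PowerShell@2
  displayName: 'Inspect Defender scheduled scan tasks'
  inputs:
    targetType: 'inline'
    script: |
      # Shows whether a Defender scan fired (or was missed) around agent start
      Get-ScheduledTask -TaskPath '\Microsoft\Windows\Windows Defender\' |
        Get-ScheduledTaskInfo |
        Format-List TaskName, LastRunTime, LastTaskResult, NextRunTime, NumberOfMissedRuns
```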
@cremor, I've figured out that deleting the Defender tasks does not help; it recreates them silently.
We are going to add `Set-MpPreference -ScanScheduleDay 8` to the images; it tells Defender to delete the scanning task.
The new image rollout will take a week.
If you have spare time, can you verify whether adding `Set-MpPreference -ScanScheduleDay 8` to your pipeline increases stability (or not)?
I've now run a few more tests:
So I'd say I can't see an improvement with that command 😉
I ran the command as early as possible (even before checkout). Here is my full test pipeline:
```yaml
pool:
  vmImage: 'windows-latest'

steps:
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: 'Set-MpPreference -ScanScheduleDay 8'
- checkout: self
- task: CmdLine@2
  inputs:
    script: |
      choco install windows-performance-toolkit
      "C:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\xperf.exe" -start -on LOADER+PROC_THREAD+DISK_IO+HARD_FAULTS+DPC+INTERRUPT+CSWITCH+PERF_COUNTER+FILE_IO_INIT+REGISTRY
- script: 'sqllocaldb start MSSQLLocalDB'
  displayName: 'Start SQL Server LocalDB'
- task: CmdLine@2
  inputs:
    script: |
      "c:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\xperf.exe" -d $(Build.ArtifactStagingDirectory)\tmp.etl
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)\tmp.etl'
    ArtifactName: 'etl'
    publishLocation: 'Container'
```
I'm back with some good news and some bad news.
Bad news: LocalDB wants the database to be placed in %USERPROFILE%; neither https://github.com/actions/runner-images/pull/8435 nor https://github.com/ilia-shipitsin/runner-images/commit/25d45e460761e799a9d1fca833660cd727a348f4 convinced LocalDB to use the moved database. It is created from scratch and placed in %USERPROFILE%.
Good news: during the investigation we improved several things.
Taking into account that we cannot warm up LocalDB due to its nature (it uses %USERPROFILE%), we hope that the other improvements will help to reduce disk IO (especially disabling StorSvc).
Thanks for the information. How can I find out which image version contains those fixes? I'd like to test again when the new image version is available on hosted agents.
My measurements show that StorSvc was the most significant disk IO consumer. It is deployed already (PR: https://github.com/actions/runner-images/pull/8388).
The last change (related to the PowerShell first-run warmup) is not merged yet, but it is rather cosmetic; it warms up the first invocation. If you add a dummy first step with "something in PowerShell", that step will do the warmup for you.
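For illustration, such a dummy warmup step could be as simple as the sketch below; the script content is irrelevant, it just pays the PowerShell first-run cost before any timed step:

```yaml
steps:
- task: PowerShell@2
  displayName: 'Warm up PowerShell host'   # dummy step, absorbs first-run overhead
  inputs:
    targetType: 'inline'
    script: 'Write-Host "warmup"'
- script: 'sqllocaldb start MSSQLLocalDB'
  displayName: 'Start SQL Server LocalDB'
```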
Description
I’m running database integration tests in Azure DevOps Pipelines with the Microsoft-hosted Windows agent. The tests use MS SQL Server Express LocalDB. The time it takes to start the LocalDB instance is very inconsistent. Sometimes it takes just 20 seconds (which is still slow compared to my local dev machine), but sometimes it takes more than 5 minutes!
For that reason I can’t just use LocalDB in my tests and rely on automatic startup of the DB like I’m used to from my local dev machine because then the tests time out. As a workaround I now have an explicit step in my pipeline:
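A minimal version of that workaround step, matching the start command used throughout this thread:

```yaml
steps:
- script: 'sqllocaldb start MSSQLLocalDB'
  displayName: 'Start SQL Server LocalDB'
```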
That works most of the time, but as mentioned, just this single command can take multiple minutes.
Rarely the LocalDB startup even fails. It then prints the following to the log:
(I can't check the Windows Application event log for details because this is a Microsoft-hosted agent. Enabling system diagnostics for the build also doesn't provide any more useful logs.)
Please fix this so that the LocalDB startup is consistently fast. Maybe it would be possible to already start it when the image is created so that each new build run already gets a started LocalDB instance?
Platforms affected
Runner images affected
Image version and build link
Image: windows-2022
Version: 20230804.1.0
Is it regression?
unknown
Expected behavior
MS SQL Server Express LocalDB creation/startup time should be fast and consistent. Ideally I could rely on the auto-start feature and not have to manually start the LocalDB with a script step.
Actual behavior
MS SQL Server Express LocalDB creation/startup time varies between a few seconds and multiple minutes.
Repro steps