dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.

Application created from Worker Service template violates CPU usage limit and consumes all available swap space #48780

Open vsfeedback opened 1 year ago

vsfeedback commented 1 year ago

This issue has been moved from a ticket on Developer Community.


Using Visual Studio Professional 2022 for Mac Version 17.5.1 I created a blank Worker Service using the project template of that name. I then built the project without modification, and ran it as a launchd global daemon. After letting it run for less than 24 hours, macOS killed the process with the following report:

23:19:21.591511 process Worker-04-10-202[54918] thread 16118204 caught burning CPU! It used more than 50% CPU over 180 seconds
23:19:21.593692 com.apple.symptomsd Received CPU usage trigger: Worker-04-10-2023[54918] () used 90.00s of CPU over 169.64 seconds (averaging 53%), violating a CPU usage limit of 90.00s over 180 seconds.
23:19:21.596745 com.apple.symptomsd RESOURCE_NOTIFY trigger for Worker-04-10-2023 [54918] (90000066750 nanoseconds of CPU usage over 169.00s seconds, violating limit of 90000000000 nanoseconds of CPU usage over 180.00s seconds)
23:19:22.032062 Saved cpu_resource.diag report for Worker-04-10-2023 version ??? to Worker-04-10-2023_2023-04-10-231921_Marcels-MacBook-Pro-2.cpu_resource.diag
05:36:33.248530 low swap: killing largest compressed process with pid 54918 (Worker-04-10-202) and size 39587 MB

This same behavior has been plaguing our full-scale application for over six months. The only workaround has been to set the daemon to auto-restart. I have tried Release builds vs. Debug builds, and x64 builds vs. arm64 builds, and tried running on x64 hardware and M1 hardware, all with the same result; the only difference is the amount of time before the application "blows up" and is killed by the OS.


Original Comments

Feedback Bot on 4/13/2023, 07:43 PM:

(private comment, text removed)


Original Solutions

(no solutions)

amcasey commented 1 year ago

This seems very strange since the default worker service doesn't do anything besides print a time-stamped log message once per second. There's a tiny allocation for each one, but it seems unlikely that the allocations would be allowed to build up for so long that it would take 90 seconds of CPU time to GC them.

amcasey commented 1 year ago

I tried this over the weekend with dotnet 7.0.302 on macOS 12.6.4 (Intel). After an initial working set increase (over the first minute or so), things stabilized and stayed flat for the remainder of the run. Same on Windows.

Disclaimer: I didn't use a daemon in either repro attempt and it's possible that that matters.

From the internal ticket, the original OS was probably 13.2, which might matter. I should be able to test with launchd, but getting a mac with an arbitrary OS is trickier.

amcasey commented 1 year ago

Since it took me a few tries to cobble it together, here's my /Library/LaunchDaemons/gh48548.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
    <dict>
        <key>Label</key>
        <string>gh48548</string>
        <key>Program</key>
        <string>/Users/asplab/repros/gh48548/bin/Debug/net7.0/gh48548</string>
        <key>RunAtLoad</key>
        <false/>
        <key>EnvironmentVariables</key>
        <dict>
            <key>DOTNET_ROOT</key>
            <string>/Users/asplab/dotnet/</string>
            <key>DOTNET_DbgEnableMiniDump</key>
            <string>1</string>
            <key>DOTNET_EnableCrashReport</key>
            <string>1</string>
        </dict>
        <key>StandardOutPath</key>
        <string>/tmp/gh48548.stdout</string>
        <key>StandardErrorPath</key>
        <string>/tmp/gh48548.stderr</string>
    </dict>
</plist>
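
For completeness, loading and starting the daemon (RunAtLoad is false) is just the usual launchctl sequence; roughly:

sudo launchctl load /Library/LaunchDaemons/gh48548.plist
sudo launchctl start gh48548
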
parnasm commented 1 year ago

Hey, thanks for looking at this. I believe it is necessary to be running as a global daemon in order to observe the problem. Also, it is unpredictable how long it needs to run before the problem occurs: sometimes it happens within a day, other times it can run for up to a week. I have not tried running with your DOTNET_ flags; perhaps I will do that, but my impression is that when this occurs we get no warning or opportunity to take crash dumps.

amcasey commented 1 year ago

@parnasm Thanks for the update! I tried again with a global daemon, but it was still using a reasonable amount of memory and CPU a day later. You said you've seen this on both ARM and x64?

Also, I notice that the original error message mentioned a .cpu_resource.diag file. Did you have a chance to review that?

parnasm commented 1 year ago

Worker-04-20-2023_2023-06-02-111037_Marcels-MacBook-Pro-2.cpu_resource.diag.txt

@amcasey I no longer have the one from that run in April, but I have the most recent one which is similar. The thing that I believe is constant in these reports is the presence of an FSEvents callback. And yes, for the most part, the sample will run fine. Then something in the OS seems to trigger. And arm64 or x64 makes no appreciable difference.

Notice in the attached case, no action was taken. It appears sometimes the app is able to weather the storm!

amcasey commented 1 year ago

Yeah, sorry about that delay - your report got misrouted. 😞

Do I understand correctly that the attached file does not show a failure?

parnasm commented 1 year ago

It shows a sustained period of above average cpu usage, but it was not enough to trigger killing the app. In cases where the app has been killed, it is usually (always?) due to low swap space.

amcasey commented 1 year ago

It's just so unclear what it could be doing with CPU cycles, given that all it does is log once per second...

parnasm commented 1 year ago

My original post on MS Developer Community included more detail than is shown here, including the .plist and the project folder which I no longer see attached to that post or this.

Completely agree, it makes no sense that this app itself is doing anything with CPU cycles. My intuition is it is something deep in the dotnet runtime that is watching for file system changes (for hot-swaps perhaps?) and somehow runs into an infinite loop or self-generated feedback loop.

amcasey commented 1 year ago

It looks like maybe it was trying to start a thread? So (even more tenuously) maybe the thread pool is getting exhausted? Somehow?

amcasey commented 1 year ago

My original post on MS Developer Community included more detail than is shown here, including the .plist and the project folder which I no longer see attached to that post or this.

That's frustrating. I did find the zip file of your project and some console output, but not the plist, which would have been very helpful. I think attachments expire for privacy (or other) reasons.

Completely agree, it makes no sense that this app itself is doing anything with CPU cycles. My intuition is it is something deep in the dotnet runtime that is watching for file system changes (for hot-swaps perhaps?) and somehow runs into an infinite loop or self-generated feedback loop.

That's going to be extremely difficult to debug without a trace or something. I'm sure we could rig up a way to collect such a trace, but it does tend to change the timing of the running program, which seems likely to be important here.

Did you say you'd tried this with a global agent (i.e. running as you, rather than root) and it hadn't repro'd?

parnasm commented 1 year ago

I don't see where it is trying to start a thread; the main thread has already started and is processing a file system event, in the course of which it is making all kinds of system-level calls into libsystem_kernel.dylib, etc.

In the past I have been able to attach a debugger to the daemon process just before it was killed, and was able to see a very deep callstack, again involving the file system event callback. I will try to either dig that up or catch it in the act again.

Yes, I have tried running as the logged-on user, and while I will claim to have seen the same thing happen, it was a long time ago at this point, many revs of Visual Studio and macOS ago. Running as root is the surest way to see this happen.

amcasey commented 1 year ago

I don't see where it is trying to start a thread, the main thread has already started, and is processing a file system event, in the course of which it is doing all kinds of system level calls into libsystem_kernel.dylib, etc.

My mistake - I was reading the stack upside down. πŸ˜†

amcasey commented 1 year ago

This is on .net 7, correct? Can you try experimentally setting the environment variable COMPlus_GCName=clrgc.dll? That'll revert back to an older version of the GC which will help us determine whether changes in that area are affecting your scenario.
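
Since you're running under launchd, the easiest place to set that is probably the EnvironmentVariables dict in your plist. One caveat: clrgc.dll is the Windows file name; on macOS I'd expect the standalone GC library to be libclrgc.dylib, though I haven't verified that. So, roughly:

            <key>COMPlus_GCName</key>
            <string>libclrgc.dylib</string>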

parnasm commented 1 year ago

I can try that but as I mentioned in my original post, we've been experiencing this problem for over a year, going back to when we were still on .NET Core 3.1. We subsequently moved to .NET 6.0 and then to 7.0, with no improvement.

amcasey commented 1 year ago

Ah, don't bother then. Out of curiosity, given that it works with such a simple toy app, have you tried an even simpler app? e.g. a console app that just logs in a loop? That would eliminate a lot of moving parts.

parnasm commented 1 year ago

I set up a console app yesterday and left it running side by side with our production app. The console app is fine so far, but I just managed to catch the production app going CPU nuts, and I took a sample and a screenshot of the Activity Monitor details. Note the 117% CPU usage and the 2.5GB real memory size, which are way out of proportion to normal usage (usually <1% and ~100MB when idle). The call graphs in the sample are insane.

Sample of Phd.txt

Screenshot 2023-06-09 at 10 30 30

parnasm commented 1 year ago

I also managed to attach the Visual Studio debugger and break into the Phd process. Attached is a screenshot showing our main thread doing nothing and none of our other threads doing anything, but note that the .NET File Watcher thread is calling the system LStat() function, and also that the Console log processing thread is in TryDequeue(). I have observed this before, and again just intuitively, it seems to me there may be an interaction between the File Watcher and the Logger that is causing this raucous behavior.

Screenshot 2023-06-09 at 10 58 27

parnasm commented 1 year ago

And just for fun, here's Activity Monitor showing how our Phd process is about to get axed due to consuming all available swap space.

Screenshot 2023-06-09 at 11 11 51

parnasm commented 1 year ago

Hey @amcasey, don't want to pester you, just want to make sure you get these updates. Have a good weekend!

amcasey commented 1 year ago

@parnasm I did - thanks! I'm slammed today, so I'll have to dig in next week. Enjoy your weekend!

amcasey commented 1 year ago

I'm not specifically familiar with the function of the .NET File Watcher thread, but my guess would be that it's just polling and basically any dump you could grab would show it in lstat. My intuition is that that's benign.

I'm more interested in why the stacks are so deep. I don't suppose you can share managed stacks corresponding to the ones in your sample? Feel free to anonymize the frames from your app.

amcasey commented 1 year ago

Oh, and did the console app eventually break?

parnasm commented 1 year ago

Hey, no, the console app is still running fine. I still have our app trapped in the VS debugger, but I cannot seem to get any stack traces except for our main thread and two worker threads, all of which are simply waiting for something:

311188 0x0 in System.Threading.ManualResetEventSlim.Wait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) // timeout is infinite

311211 .NET ThreadPool Worker 0x0 in System.Threading.WaitHandle.WaitMultiple(System.ReadOnlySpan waitHandles, bool waitAll, int millisecondsTimeout) // timeout is infinite

311215 .NET Long Running Task 0x0 in System.Threading.WaitHandle.WaitMultiple(System.ReadOnlySpan waitHandles, bool waitAll, int millisecondsTimeout) // timeout is 150 seconds

This is typical of our app in its quiescent state, where it's basically waiting for a network connection.

All other threads appear to be in .NET code, also generally waiting for something, but if I double click any of them, the Stack window is empty and the source window says "Source not available."

Screenshot 2023-06-12 at 13 22 28

The only other interesting thread is:

311208 Console logger queue processing thread 0x0 in Microsoft.Extensions.Logging.Console.ConsoleLoggerProcessor.TryDequeue(out Microsoft.Extensions.Logging.Console.LogMessageEntry item)

Any advice how to dig deeper?

amcasey commented 1 year ago

If you really wanted, you could enable Microsoft Symbol Servers and SourceLink support to get a look at those source files, but it's probably not that interesting at the moment.

Does VS for Mac have the Parallel Stacks window? Sometimes, that gives a clearer overview of what's going on. It may also offer Parallel Tasks, which would show what's on hold.

If that's not an option, you may have better luck inspecting the stack outside VS. You can try saving a dump and analyzing it with dotnet-dump. (Caveat: I'm not 100% sure they use compatible dump formats.) It's a lot less convenient than VS, but sometimes it exposes different information (i.e. they're complementary).
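
Roughly, with the pid being that of your daemon process (I'm sketching from memory, so double-check the exact options against the docs):

sudo dotnet-dump collect -p <pid> --type Heap
dotnet-dump analyze <dump-file>

At the analyze prompt, clrstack -all, parallelstacks (alias pstacks), and dumpheap -stat are usually the most useful starting points.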

parnasm commented 1 year ago

I tried to run dotnet-dump --type Mini or Heap and it said:

Writing dump with heap to /Users/marcel/Work/Host4Mac/Phd/core_20230612_144117 Process 2444 not running compatible .NET runtime.

So I ran it "--Full" and 10 minutes later I had a 36GB file, but I had to detach the VS Debugger in order for it to do its thing. I will not attempt to attach a 36GB file here. I was able to run dotnet-dump analyze and get a parallelstacks display (VS for Mac does not have this ability):

parallelstacks.txt

Unfortunately, the parallelstacks display does not line up with what I first saw when I took an Activity Monitor Sample of this process and looked at the same thread (i.e. 311188 or 0x4bf94). That showed this crazy looping:

Call graph:
    1422 Thread_311188   DispatchQueue_1: com.apple.main-thread  (serial)

Basically, by the time I had detached VS debugger and taken the dump, the process was in a strange state (Activity Monitor showed memory usage fluctuating but no CPU usage) and it was very shortly terminated by the OS.

I still have the dump file, and can try to look at whatever you suggest. Meanwhile, I have stopped the Console app and restarted the WorkerTest, which is the barebones app that demonstrates the bad behavior. If I can catch that in the act, I will try to dotnet-dump it without first attaching the VS debugger.

amcasey commented 1 year ago

Huh, I wonder why Heap didn't work - that's usually the best choice. Maybe a macOS restriction? Sorry about the 36 GB file. πŸ˜†

I noticed that long sequence of repeated frames too, but my concern was that it might just be the native code that triggers a dotnet method call, in which case it's not very interesting on its own.

This thread looked kind of interesting:

                   4 Grpc.Net.Client.Internal.GrpcCall<System.__Canon, System.__Canon>.IsCancellationOrDeadlineException(Exception)
                   4 Grpc.Net.Client.Internal.GrpcCall<System.__Canon, System.__Canon>+<RunCall>d__73.MoveNext()

Maybe something is timing out?

Some other random thoughts:

  1. Are you logging to a temp directory? If so, does that directory periodically get wiped? (Sorry, I never found your plist file from the original report.)
  2. I'd be interested to know what !dumpheap stat produces, but there's a pretty good chance it will choke on 36 GB. πŸ˜† Knowing what it's doing with all that space might help us narrow down what's going wrong.
parnasm commented 1 year ago

"This thread looked kind of interesting: 4 Grpc.Net.Client.Internal.GrpcCall<System.Canon, System.Canon>.IsCancellationOrDeadlineException(Exception) 4 Grpc.Net.Client.Internal.GrpcCall<System.Canon, System.Canon>+d__73.MoveNext() Maybe something is timing out?"

Our process is normally in communication with a user agent running on the desktop via Grpc. That channel has certainly timed out after the phd process sat in the debugger over the weekend ;) In any case, Grpc is not involved in the little Workertest sample app.

"Are you logging to a temp directory? If so, does that directory periodically get wiped? (Sorry, I never found your plist file from the original report.)"

No, we log to /Library/Logs.

"I'd be interested to know what !dumpheap stat produces, but there's a pretty good chance it will choke on 36 GB. πŸ˜† Knowing what it's doing with all that space might help us narrow down what's going wrong."

Surprisingly, I was able to run that command and snipped off the statistics at the end, and I've attached the result here:

DumpHeap Stat.txt

The bottom line is: 15767346 System.Strings taking up 31628923302 bytes(!)

parnasm commented 1 year ago

Here's a representative System.String which has many, many siblings of the exact same size, all with similar contents:

dumpobj 00000002808a9828
Name:         System.String
MethodTable:  0000000107fd30d8
EEClass:      0000000108003cd8
Tracked Type: false
Size:         2028(0x7ec) bytes
File:         /Applications/Phd.app/Contents/osx-arm64/System.Private.CoreLib.dll
String:       /Users/marcel/Library/Trial/NamespaceDescriptors/v2/factorPackSets/756326b5-143a-4bcd-a25e-00a9c3da94d6/factorPacks/SIRI_UNDERSTANDING_ASR_ASSISTANT/_refs/revlinks/link-80CAE5EA-AB40-40A3-B2F3-DC2CB602CCE1/_refs/revlinks/link-69DA150A-5B36-4D21-8AE9-2FA5B0845761/_refs/revlinks/link-80CAE5EA-AB40-40A3-B2F3-DC2CB602CCE1/_refs/revlinks/link-80CAE5EA-AB40-40A3-B2F3-DC2CB602CCE1/_refs/revlinks/link-80CAE5EA-AB40-40A3-B2F3-DC2CB602CCE1/_refs/revlinks/link-69DA150A-5B36-4D21-8AE9-2FA5B0845761/_refs/revlinks/link-69DA150A-5B36-4D21-8AE9-2FA5B0845761/_refs/revlinks/link-69DA150A-5B36-4D21-8AE9-2FA5B0845761/_refs/revlinks/link-BA0CB0FE-ADC5-4F5F-B949-B79E3D579645/_refs/revlinks/link-80CAE5EA-AB40-40A3-B2F3-DC2CB602CCE1/_refs/revlinks/link-80CAE5EA-AB40-40A3-B2F3-DC2CB602CCE1/_refs/revlinks/link-80CAE5EA-AB40-40A3-B2F3-DC2CB602CCE1/_refs/revlinks/link-80CAE5EA-AB40-40A3-B2F3-DC2CB602CCE1/_refs/revlinks/link-80CAE5EA-AB40-40A3-B2F3-DC2CB602CCE1/_refs/revlinks/link-69DA150A-5B36-4D21-8AE9-2FA5B0845761
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
0000000107184488  40002e5        8         System.Int32  1 instance             1003 _stringLength
000000010718bf58  40002e6        c          System.Char  1 instance               2f _firstChar
0000000107fd30d8  40002e4       d8        System.String  0   static 0000000280004240 Empty

And here's a representative of a slightly different size class:

dumpobj 0000000902732cb0
Name:         System.String
MethodTable:  0000000107fd30d8
EEClass:      0000000108003cd8
Tracked Type: false
Size:         2016(0x7e0) bytes
File:         /Applications/Phd.app/Contents/osx-arm64/System.Private.CoreLib.dll
String:       /Users/marcel/Library/Trial/NamespaceDescriptors/v2/factorPackSets/756326b5-143a-4bcd-a25e-00a9c3da94d6/factorPacks/SIRI_UNDERSTANDING_MORPHUN/_refs/revlinks/link-E453A5FF-2EA3-4CC2-BAE9-86FEC4E8F5D9/_refs/revlinks/link-78408A84-6F58-44B0-BF2A-BEAD7302F401/_refs/revlinks/link-78408A84-6F58-44B0-BF2A-BEAD7302F401/_refs/revlinks/link-E453A5FF-2EA3-4CC2-BAE9-86FEC4E8F5D9/_refs/revlinks/link-78408A84-6F58-44B0-BF2A-BEAD7302F401/_refs/revlinks/link-78408A84-6F58-44B0-BF2A-BEAD7302F401/_refs/revlinks/link-78408A84-6F58-44B0-BF2A-BEAD7302F401/_refs/revlinks/link-E453A5FF-2EA3-4CC2-BAE9-86FEC4E8F5D9/_refs/revlinks/link-E453A5FF-2EA3-4CC2-BAE9-86FEC4E8F5D9/_refs/revlinks/link-E453A5FF-2EA3-4CC2-BAE9-86FEC4E8F5D9/_refs/revlinks/link-04A39E51-B6F8-48EB-93CC-8A8052DD9AC0/_refs/revlinks/link-78408A84-6F58-44B0-BF2A-BEAD7302F401/_refs/revlinks/link-E453A5FF-2EA3-4CC2-BAE9-86FEC4E8F5D9/_refs/revlinks/link-5F8C1EEC-A042-4FB7-B703-36B55F018682/_refs/revlinks/link-5F8C1EEC-A042-4FB7-B703-36B55F018682
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
0000000107184488  40002e5        8         System.Int32  1 instance              997 _stringLength
000000010718bf58  40002e6        c          System.Char  1 instance               2f _firstChar
0000000107fd30d8  40002e4       d8        System.String  0   static 0000000280004240 Empty

amcasey commented 1 year ago

"This thread looked kind of interesting: 4 Grpc.Net.Client.Internal.GrpcCall<System.Canon, System.Canon>.IsCancellationOrDeadlineException(Exception) 4 Grpc.Net.Client.Internal.GrpcCall<System.Canon, System.Canon>+d__73.MoveNext() Maybe something is timing out?"

Our process is normally in communication with a user agent running on the desktop via Grpc. That channel has certainly timed out after the phd process sat in the debugger over the weekend ;) In any case, Grpc is not involved in the little Workertest sample app.

Makes sense. Thanks for confirming!

"Are you logging to a temp directory? If so, does that directory periodically get wiped? (Sorry, I never found your plist file from the original report.)"

No, we log to /Library/Logs.

Does it just grow indefinitely? It must use a circular buffer or something?

"I'd be interested to know what !dumpheap stat produces, but there's a pretty good chance it will choke on 36 GB. πŸ˜† Knowing what it's doing with all that space might help us narrow down what's going wrong."

Surprisingly, I was able to run that command and snipped off the statistics at the end, and I've attached the result here:

DumpHeap Stat.txt

The bottom line is: 15767346 System.Strings taking up 31628923302 bytes(!)

That is a lot of strings.

Here's a representative System.String which has many, many siblings of the exact same size, all with similar contents:

I'm looking in my local ~/Library/Trial and it looks like it's for keeping track of which experimental flights you're in. Most seem to relate to Siri - do you have a way to check whether any mention "dotnet"? (I realize grepping 30 GB of strings is non-trivial.)

I notice that those paths appear to contain cycles (of symlinks, presumably). It definitely looks like someone is trying to index that directory and isn't doing cycle detection correctly.

I assume your app would have no reason to be watching the user's library folder? I can't think of a reason dotnet would be, but I'll ask some experts.

Edit: Of course it's not your app - it repros with a toy app. Still trying to figure out whether either dotnet or aspnetcore would be looking there.

amcasey commented 1 year ago

I don't know how reliable this source is, but the report is interesting: https://eclecticlight.co/2022/03/31/what-is-triald-and-why-is-it-taking-so-much-disk-space/

If that folder sometimes explodes in size and your process is indexing it for some reason, it seems like you'd see symptoms very much like what you're seeing now.

parnasm commented 1 year ago

That report regarding Trial is indeed very interesting! Combine Apple periodically pushing out weird cyclical file system entries with a bug somewhere down in libcoreclr and you have a plausible scenario for the behavior we are seeing.

And yes, I managed to catch the toy app doing exactly the same thing this morning. This time the dotnet-dump file amounted to 41GB, the bulk of which corresponds to 32851502 System.Strings taking 36309111130 bytes. I've attached the parallelstacks report here:

parallelstacks.txt

Again, the only interesting thread is the System.IO.FileSystemWatcher. A random sample of the strings in the dotnet-dump file again shows references like this:

dumpobj 00000005d932dc30
Name:         System.String
MethodTable:  00000001064d30d8
EEClass:      0000000106503cd8
Tracked Type: false
Size:         1116(0x45c) bytes
File:         /usr/local/share/dotnet/shared/Microsoft.NETCore.App/7.0.3/System.Private.CoreLib.dll
String:       /Users/marcel/Library/Trial/NamespaceDescriptors/v2/factorPackSets/756326b5-143a-4bcd-a25e-00a9c3da94d6/factorPacks/SIRI_UNDERSTANDING_ASR_ASSISTANT/_refs/revlinks/link-F8F45818-A484-4929-A70A-F953C1FEA83D/_refs/revlinks/link-369B8F28-C4F9-4EA5-A15D-43C82E13636F/_refs/revlinks/link-369B8F28-C4F9-4EA5-A15D-43C82E13636F/_refs/revlinks/link-369B8F28-C4F9-4EA5-A15D-43C82E13636F/_refs/revlinks/link-9B4796C0-45FD-4536-ACCB-DD6662604896/_refs/revlinks/link-BA0CB0FE-ADC5-4F5F-B949-B79E3D579645/_refs/revlinks/link-F8F45818-A484-4929-A70A-F953C1FEA83D
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
0000000105684488  40002e5        8         System.Int32  1 instance              547 _stringLength
000000010568bf58  40002e6        c          System.Char  1 instance               2f _firstChar
00000001064d30d8  40002e4       d8        System.String  0   static 0000000280004240 Empty

If you can get someone familiar with the relevant code to take a look at this, I would very much appreciate it!

parnasm commented 1 year ago

My Trial folder is also attached here:

Trial.zip

It contains 446 items amounting to only 1.9MB. My copy of "Find Any File" tells me there are no references to "dotnet" within, but by following the links starting with the object I dumped in the previous post, you can easily find the cycle.

parnasm commented 1 year ago

For completeness, here's the Activity Monitor Sample of the Worker toy app, taken just before the app was killed by the OS:

Sample of Worker-04-20-2023.txt

davidfowl commented 1 year ago

@parnasm can you use the dotnet diagnostic tools to collect more information? https://learn.microsoft.com/en-us/dotnet/core/diagnostics/

Specifically, dotnet-counters and dotnet-trace.
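
Something along these lines against the running process should be enough to start with (the pid is illustrative):

dotnet-counters monitor -p <pid>
dotnet-trace collect -p <pid>    (stop it once you've caught the bad behavior)

The counters view should at least show whether GC heap size and allocation rate climb gradually or spike all at once.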

amcasey commented 1 year ago

And yes, I managed to catch the toy app doing exactly the same thing this morning.

To confirm, that's the console app? I couldn't see how this could be specific to aspnetcore, but it's nice to have it verified.

It contains 446 items amounting to only 1.9MB. My copy of "Find Any File" tells me there are no references to "dotnet" within, but by following the links starting with the object I dumped in the previous post, you can easily find the cycle.

Based on further reading, I don't think that folder is actually for flighting - at least, not for general purpose flighting - so I no longer have any reason to believe dotnet might be using it. Thanks for checking though!

If you can get someone familiar with the relevant code to take a look at this, I would very much appreciate it!

We are well outside my wheelhouse now. I think the best way to put you in touch with those experts would be to move this bug to the dotnet/runtime repo. To set expectations, things will probably slow down a bit as it finds its way to the right people.

amcasey commented 1 year ago

@parnasm can you use the dotnet diagnostic tools to collect more information? https://learn.microsoft.com/en-us/dotnet/core/diagnostics/

Specifically, dotnet-counters and dotnet-trace.

@davidfowl What are you looking for in that output? The repro runs for 24+ hours, so the output from those tools might be pretty unwieldy.

amcasey commented 1 year ago

Oops, I thought I'd have time to summarize this issue for runtime but I have a meeting. I'll be back shortly with an overview.

parnasm commented 1 year ago

Not the console app, but the simple Worker test attached to my original report, i.e. this:

public class Worker : BackgroundService
{
    [...]

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            _logger.LogInformation("Worker running at: {time}", DateTimeOffset.Now);
            await Task.Delay(1000, stoppingToken);
        }
    }
}

I have run this as well as our production app under the Xcode Allocations and Leaks instruments, and they can both run totally flat for >24 hrs before they suddenly blow up as described.

amcasey commented 1 year ago

Not the console app, but the simple Worker test attached to my original report

Nuts. That's not quite as tidy, but I think we probably still need some runtime experts to weigh in.

The sudden allocation before the crash is consistent with my mental model too - something introduces a cycle in a watched directory and it blows up.

amcasey commented 1 year ago

Thread overview:

On macOS, create a launchd global daemon that runs a simple worker service (i.e. dotnet new worker), which prints a timestamped log message once per second. Run it for a long time (~24 hours) and it is eventually killed by the OS for some combination of using too much CPU and too much memory. Based on dumps collected shortly before it is killed, we believe that a file watcher (of unknown origin) is picking up a change in ~/Library/Trial - a directory that triald updates in the background during periods of low activity, especially overnight - at a time when that directory contains cyclical symlinks (example paths, example directory structure). This causes it to allocate strings for many paths and (presumably) consume a lot of CPU traversing an infinite directory graph.

So, the main open question is what is watching ~/Library/Trial and why?

amcasey commented 1 year ago

Looks like I was too hasty (chiefly because I thought we had a repro in a vanilla console app). In my non-reproing service, these are the gcroots of the only FileSystemWatcher instance:

HandleTable:
    0000000106fe13a8 (strong handle)
          -> 00019f7d7420     System.Object[] 
          -> 0001877e9468     Microsoft.Extensions.Logging.EventSource.LoggingEventSource (static variable: Microsoft.Extensions.Logging.EventSource.ExceptionInfo.<Empty>k__BackingField)
          -> 000187802c58     System.Threading.CancellationTokenSource 
          -> 000187802c88     System.Threading.CancellationTokenSource+Registrations 
          -> 000187802cc8     System.Threading.CancellationTokenSource+CallbackNode 
          -> 000187802b78     Microsoft.Extensions.Primitives.ChangeToken+ChangeTokenRegistration<System.String> 
          -> 000187802b38     System.Action<System.String> 
          -> 000187802950     Microsoft.Extensions.Options.OptionsMonitor<Microsoft.Extensions.Logging.LoggerFilterOptions> 
          -> 0001878026a0     Microsoft.Extensions.Options.OptionsFactory<Microsoft.Extensions.Logging.LoggerFilterOptions> 
          -> 000187802560     Microsoft.Extensions.Options.IConfigureOptions<Microsoft.Extensions.Logging.LoggerFilterOptions>[] 
          -> 0001877e8cb0     Microsoft.Extensions.Logging.LoggerFilterConfigureOptions 
          -> 0001877e8b78     Microsoft.Extensions.Configuration.ConfigurationSection 
          -> 0001877e6a60     Microsoft.Extensions.Configuration.ConfigurationRoot 
          -> 0001877e4e28     System.Collections.Generic.List<Microsoft.Extensions.Configuration.IConfigurationProvider> 
          -> 0001877e4e68     Microsoft.Extensions.Configuration.IConfigurationProvider[] 
          -> 0001877e4ea0     Microsoft.Extensions.Configuration.Json.JsonConfigurationProvider 
          -> 0001877e4d40     Microsoft.Extensions.Configuration.Json.JsonConfigurationSource 
          -> 0001877e4a18     Microsoft.Extensions.FileProviders.PhysicalFileProvider 
          -> 0001877e6eb0     Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher 
          -> 0001877e50a8     System.IO.FileSystemWatcher 

    0000000106fe13e8 (strong handle)
          -> 00019f7d3018     System.Object[] 
          -> 000187814228     System.Threading.TimerQueue[] (static variable: System.ValueTuple<System.Int64, System.DateTime>.s_tickCountToTimeMap)
          -> 000187814570     System.Threading.TimerQueue 
          -> 000187bc6668     System.Threading.TimerQueueTimer 
          -> 000187bc6610     System.Threading.Tasks.Task+DelayPromiseWithCancellation 
          -> 000187818850     System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>+AsyncStateMachineBox<gh48548.Worker+<ExecuteAsync>d__2> 
          -> 0001878188b0     System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>+AsyncStateMachineBox<Microsoft.Extensions.Hosting.Internal.Host+<TryExecuteBackgroundServiceAsync>d__13> 
          -> 000187808878     gh48548.Worker 
          -> 000187809b70     System.Threading.CancellationTokenSource+Linked1CancellationTokenSource 
          -> 000187808910     System.Threading.CancellationTokenSource+CallbackNode 
          -> 000187809bb0     System.Threading.CancellationTokenSource+Registrations 
          -> 0001878075a0     System.Threading.CancellationTokenSource+Linked1CancellationTokenSource 
          -> 000187807678     System.Threading.CancellationTokenSource+CallbackNode 
          -> 0001878077f0     System.Threading.CancellationTokenSource+CallbackNode 
          -> 000187806f48     Microsoft.Extensions.Hosting.Internal.ConsoleLifetime 
          -> 000187803ff0     Microsoft.Extensions.Hosting.Internal.ApplicationLifetime 
          -> 000187803378     Microsoft.Extensions.Logging.Logger<Microsoft.Extensions.Hosting.Internal.ApplicationLifetime> 
          -> 000187803c48     Microsoft.Extensions.Logging.Logger 
          -> 000187803c70     Microsoft.Extensions.Logging.LoggerInformation[] 
          -> 000187803ce8     Microsoft.Extensions.Logging.Console.ConsoleLogger 
          -> 0001877fda78     Microsoft.Extensions.Logging.Console.SimpleConsoleFormatter 
          -> 0001877fdc88     Microsoft.Extensions.Options.OptionsMonitor<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions>+ChangeTrackerDisposable 
          -> 0001877fd8a8     Microsoft.Extensions.Options.OptionsMonitor<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions> 
          -> 0001877fd5b0     Microsoft.Extensions.Options.OptionsFactory<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions> 
          -> 0001877fd400     Microsoft.Extensions.Options.IConfigureOptions<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions>[] 
          -> 0001877fd448     Microsoft.Extensions.Logging.ConsoleLoggerFormatterConfigureOptions<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatter, Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions> 
          -> 0001877fd4a0     System.Action<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions> 
          -> 0001877fd488     Microsoft.Extensions.Options.ConfigureFromConfigurationOptions<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions>+<>c__DisplayClass0_0 
          -> 0001877fd460     Microsoft.Extensions.Configuration.ConfigurationSection 
          -> 0001877fb5c8     Microsoft.Extensions.Configuration.ConfigurationRoot 
          -> 0001877fb530     System.Collections.Generic.List<Microsoft.Extensions.Configuration.IConfigurationProvider> 
          -> 0001877fb570     Microsoft.Extensions.Configuration.IConfigurationProvider[] 
          -> 0001877fb550     Microsoft.Extensions.Configuration.ChainedConfigurationProvider 
          -> 0001877fb4a8     Microsoft.Extensions.Configuration.ConfigurationSection 
          -> 0001877e6a60     Microsoft.Extensions.Configuration.ConfigurationRoot 
          -> 0001877e4e28     System.Collections.Generic.List<Microsoft.Extensions.Configuration.IConfigurationProvider> 
          -> 0001877e4e68     Microsoft.Extensions.Configuration.IConfigurationProvider[] 
          -> 0001877e4ea0     Microsoft.Extensions.Configuration.Json.JsonConfigurationProvider 
          -> 0001877e4d40     Microsoft.Extensions.Configuration.Json.JsonConfigurationSource 
          -> 0001877e4a18     Microsoft.Extensions.FileProviders.PhysicalFileProvider 
          -> 0001877e6eb0     Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher 
          -> 0001877e50a8     System.IO.FileSystemWatcher 
amcasey commented 1 year ago

As a point of interest, triald does not appear to be running on the mac I've been using for repro attempts.

amcasey commented 1 year ago

I grabbed dumps of the worker running as a service and running locally. Each has a single FileSystemWatcher. The local process is watching the directory from which I launched the process (or possibly the cwd), whereas the service is watching /. 😲

Edit: I did another experiment and it looks like the local process is watching the working directory.

parnasm commented 1 year ago

Obviously, triald is running on my machine.

How did you determine that the service is watching /?

amcasey commented 1 year ago

In the dump, you can use dumpheap -type FileSystemWatcher to find the MT, then dumpheap -mt XXXX to find the address, then dumpobj YYYY to find the _directory field, and then dumpobj ZZZZ to see the string value.
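
Spelled out as a dotnet-dump analyze session, that's roughly the following, where XXXX/YYYY/ZZZZ are the values you read off the previous command's output:

dumpheap -type FileSystemWatcher    (note the MethodTable for System.IO.FileSystemWatcher)
dumpheap -mt XXXX                   (lists the instance address)
dumpobj YYYY                        (shows the fields; note the value of _directory)
dumpobj ZZZZ                        (prints the watched path)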

parnasm commented 1 year ago

FWIW, I can see the same thing; it is watching /.

Also, at the risk of supplying data we don't need, I'll add that I've long suspected the FileSystemWatcher to be the source of the problem, so at one time (many months ago) I explicitly added a new FileSystemWatcher to the Worker app and had it watch "/" as well. However, without knowing what the trigger was for the bad behavior, this alone did not cause any problem. I think it pretty clearly comes down to the triald cyclical file system links.

amcasey commented 1 year ago

It's going to take some time to disentangle this, but I think you can probably mitigate the issue by setting your content root explicitly so that it doesn't try to watch the whole world. Normally, I'd validate that before suggesting it, but I have to run and you seem both capable and motivated. πŸ˜‰
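
A minimal sketch of what I mean, assuming the stock Host.CreateDefaultBuilder shape from the worker template (adjust for however your Program.cs is actually laid out):

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

IHost host = Host.CreateDefaultBuilder(args)
    // A launchd daemon starts with "/" as its working directory, which becomes the
    // content root and therefore the directory the appsettings.json reload watcher
    // ends up watching. Point the content root at the app's own folder instead.
    .UseContentRoot(AppContext.BaseDirectory)
    .ConfigureServices(services =>
    {
        services.AddHostedService<Worker>();
    })
    .Build();

host.Run();

Setting a WorkingDirectory key in the launchd plist should have a similar effect, but UseContentRoot keeps the fix inside the app.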

parnasm commented 1 year ago

I'll give it a try, thanks for your help!