dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Blazor WASM crashes after a time with Assertion: should not be reached at /__w/1/s/src/mono/mono/mini/../sgen/sgen-scan-object.h:91 558006 @ dotnet.6.0.0.csqdfrwtlv.js:1 #62054

Closed carlbm closed 1 year ago

carlbm commented 2 years ago

Description

After leaving a blazor wasm app open for a length of time (and through sleep/wake cycles) the app will eventually crash (normally 8-12 hours, but can be quicker).

The console looks like:

image

With Assertion: should not be reached at /__w/1/s/src/mono/mono/mini/../sgen/sgen-scan-object.h:91

558006 @ dotnet.6.0.0.csqdfrwtlv.js:1

Reproduction Steps

Create a WASM app that has timed interaction with a server, then leave it open for a while.

Expected behavior

The app should continue to work

Actual behavior

The app crashes. There is nothing the user can do except for reloading the page

Regression?

Unsure

Known Workarounds

No response

Configuration

Running on .NET 6 on Windows, x64. Chrome browser, Version 95.0.4638.69 (Official Build) (64-bit). Unsure about other browsers; only tested in Chrome.

Other information

No response

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 2 years ago

Tagging subscribers to 'arch-wasm': @lewing See info in area-owners.md if you want to be subscribed.

Author: carlbm
Assignees: -
Labels: `arch-wasm`, `untriaged`, `area-Codegen-AOT-mono`
Milestone: -
ghost commented 2 years ago

Tagging subscribers to this area: @brzvlad See info in area-owners.md if you want to be subscribed.

Author: carlbm
Assignees: -
Labels: `arch-wasm`, `untriaged`, `area-GC-mono`
Milestone: -
lambdageek commented 2 years ago

@carlbm Do you have an app that you could share that demonstrates the issue?

kg commented 2 years ago

If you can't share an app, can you describe what the app does? Are there timers running, does it keep sockets open? Does it issue network requests? When you say sleep/wake cycles do you mean you're sleeping and waking your PC?

ghost commented 2 years ago

This issue has been marked needs more info since it may be missing important information. Please refer to our contribution guidelines for tips on how to report issues effectively.

carlbm commented 2 years ago

Hi. Yes, the sleep/wake cycles are my PC going to sleep, although with further analysis it appears that this isn't necessary to trigger the crash. When I run the app locally, I can have multiple browsers accessing the site with no crash for multiple days. When the site has been deployed to an Azure App Service (the build uses 'dotnet publish ...') it can crash within minutes. It does seem that it only crashes when the browser/tab does not have focus.

In terms of the app, the client has a SignalR connection and multiple gRPC connections to the server. The gRPC connections are based on a timer running in the client and are not kept open.

Here's a list of dependencies and versions:

<PackageReference Include="IdentityModel" Version="5.1.0" />
<PackageReference Include="Microsoft.AspNetCore.Components.WebAssembly" Version="6.0.0" />
<PackageReference Include="Microsoft.AspNetCore.Components.WebAssembly.DevServer" Version="6.0.0" 
       PrivateAssets="all" />
<PackageReference Include="Microsoft.AspNetCore.Components.WebAssembly.Authentication" Version="6.0.0" />
<PackageReference Include="Microsoft.AspNetCore.SignalR.Client" Version="6.0.0" />
<PackageReference Include="Microsoft.Extensions.Http" Version="5.0.0" />

<PackageReference Include="Microsoft.Extensions.Http.Polly" Version="5.0.1" />
<PackageReference Include="Microsoft.TypeScript.MSBuild" Version="4.4.2">
  <PrivateAssets>all</PrivateAssets>
  <IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>

<PackageReference Include="Excubo.Generators.Blazor" Version="1.14.1" />

<PackageReference Include="Grpc.Net.Client.Web" Version="2.40.0" />
<PackageReference Include="Grpc.Net.Client" Version="2.40.0" />

<PackageReference Include="Append.Blazor.Notifications" Version="1.1.0" />
<PackageReference Include="blazor-dragdrop" Version="2.3.0" />
<PackageReference Include="BlazorAnimate" Version="3.0.0" />
<PackageReference Include="Blazored.LocalStorage" Version="4.1.5" />
<PackageReference Include="Blazorise.Bootstrap" Version="0.9.4.6" />
<PackageReference Include="Blazorise.Icons.FontAwesome" Version="0.9.4.6" />
<PackageReference Include="ChartJs.Blazor" Version="1.1.0" />
<PackageReference Include="CurrieTechnologies.Razor.Clipboard" Version="1.3.1" />
<PackageReference Include="Fluxor" Version="4.1.0" />
<PackageReference Include="Fluxor.Blazor.Web" Version="4.1.0" />

<PackageReference Include="HtmlSanitizer" Version="6.0.441" />
<PackageReference Include="Humanizer.Core" Version="2.11.10" />
<PackageReference Include="MatBlazor" Version="2.9.0-develop-042" />
<PackageReference Include="Plotly.Blazor" Version="2.3.1" />
<PackageReference Include="protobuf-net.NodaTime" Version="3.0.101" />
<PackageReference Include="Sotsera.Blazor.Toaster" Version="3.0.0" />
<PackageReference Include="TimeZoneConverter" Version="3.5.0" />
<PackageReference Include="TinyMCE.Blazor" Version="0.0.7" />
<PackageReference Include="Z.Blazor.Diagrams" Version="2.1.5" />
<PackageReference Include="JetBrains.Annotations" Version="2021.2.0" />
<PackageReference Include="NodaTime" Version="3.0.7" />
<PackageReference Include="protobuf-net.BuildTools" Version="3.0.101">
  <PrivateAssets>all</PrivateAssets>
  <IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
<PackageReference Include="protobuf-net.Grpc" Version="1.0.152" />
<PackageReference Include="protobuf-net.NodaTime" Version="3.0.101" />
<PackageReference Include="System.ServiceModel.Primitives" Version="4.8.1" />

Hopefully this helps.

carlbm commented 2 years ago

This is happening with a different message too.

What is the best way for me to work out what's causing this? It's affecting my users fairly badly, with some reports of it crashing while they were using the site (i.e. with the tab in focus).

image

carlbm commented 2 years ago

Or alternatively a way to catch this so that there can be an elegant retry mechanism/notification put in place?

kg commented 2 years ago

This is typically a memory corruption issue, so there wouldn't be any way to handle it without taking down the app. You might be able to do some sort of atexit handler to reload the tab, but I'm not sure.

carlbm commented 2 years ago

What is the best way to deal with this then? I can give you the url of the published site if it helps? Or even share the project privately?

kg commented 2 years ago

If you provide us a working version of it that we can run (edit) in our browsers, that might make it faster for us to investigate. I'm not sure how quickly we can investigate it just using the list of deps you sent us since there are so many - any one of them could be causing heap corruption if they misuse our APIs. (I'm not saying this is not our bug - it could be a bug in Blazor or the managed runtime - but the amount of third party code involved will make it harder to investigate.) Because this problem involves the GC and heap corruption I'm not a specialist here so I can't provide estimates on how time-consuming this is to find and fix.

I would try registering unhandled error handlers in your JS: https://developer.mozilla.org/en-US/docs/Web/API/Window/unhandledrejection_event and https://developer.mozilla.org/en-US/docs/Web/API/Window/error_event - and using those to show a message to the user that the tab has crashed so they can reload it. You should also log the error to your server by POSTing it to some endpoint that will record it. Having this is generally good practice since bugs in your application could produce unhandled errors with no UI feedback (even if this particular case isn't a bug in your application).

You could potentially auto-reload the tab as a short-term fix here, but I don't advise doing that unless absolutely necessary - it would make troubleshooting some types of problem harder and might also mean users stop reporting certain types of bugs. Also if you decide to do auto-reload, note that the 'error' event can fire for things other than script errors, so you'll need to make sure you're not auto-reloading when (for example) an image fails to load from your CDN.
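A minimal sketch of the error-handler approach described above, in TypeScript. The names `toCrashReport`, `installCrashReporter`, and `ErrorEventTarget` are hypothetical, and the sketch assumes the runtime failure surfaces as a window `error` or `unhandledrejection` event, which this thread does not confirm. It treats an `error` event without a message as a non-script failure (for example an image that failed to load) and ignores it, so that only script errors are reported.

```typescript
// Minimal event shapes so the sketch type-checks without DOM typings.
type AnyEvent = {
  type: string;
  message?: string;   // present on script ErrorEvents
  error?: unknown;    // the thrown value, when available
  reason?: unknown;   // rejection reason for unhandledrejection
  target?: unknown;   // element for resource load failures
};
type Listener = (ev: AnyEvent) => void;
interface ErrorEventTarget {
  addEventListener(type: string, listener: Listener): void;
}

type CrashReport = {
  kind: "error" | "unhandledrejection";
  message: string;
  stack: string | undefined;
};

// Returns a report for script failures, or null for events that should be
// ignored (such as a resource that failed to load).
function toCrashReport(ev: AnyEvent): CrashReport | null {
  if (ev.type === "unhandledrejection") {
    const reason = ev.reason;
    return {
      kind: "unhandledrejection",
      message: reason instanceof Error ? reason.message : String(reason),
      stack: reason instanceof Error ? reason.stack : undefined,
    };
  }
  if (ev.type === "error") {
    // Assumption: script errors carry a non-empty message; resource load
    // failures arrive as plain events without one.
    if (typeof ev.message !== "string" || ev.message.length === 0) {
      return null;
    }
    const err = ev.error;
    return {
      kind: "error",
      message: ev.message,
      stack: err instanceof Error ? err.stack : undefined,
    };
  }
  return null;
}

// Registers both handlers and forwards script failures to the callback.
function installCrashReporter(
  source: ErrorEventTarget,
  onCrash: (report: CrashReport) => void
): void {
  const handle: Listener = (ev) => {
    const report = toCrashReport(ev);
    if (report !== null) {
      onCrash(report);
    }
  };
  source.addEventListener("error", handle);
  source.addEventListener("unhandledrejection", handle);
}
```

In a page, this could be wired up with `installCrashReporter(window as unknown as ErrorEventTarget, report => { ... })`, where the callback POSTs the report to a logging endpoint of your own and shows a "please reload" banner. Keeping the callback separate from the classification makes it easy to leave auto-reload out, or to add it later only for script errors.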

carlbm commented 2 years ago

What's the best way to get in touch? I can give you the URL of the website? Or I can send you the project

kg commented 2 years ago

> What's the best way to get in touch? I can give you the URL of the website? Or I can send you the project

If you don't want every person who views this bug to have the link, you can email it to me at kagadd@microsoft.com and I will forward it to the dotnet contributors who work on wasm/blazor.

carlbm commented 2 years ago

I've removed the following packages and the issue seems to have gone:

<PackageReference Include="Grpc.Net.Client.Web" Version="2.41.0" />
<PackageReference Include="Grpc.Net.Client" Version="2.41.0" />
<PackageReference Include="protobuf-net.NodaTime" Version="3.0.101" />

I'm busy making the application work, but based on the above I should be able to put together a minimal sample at some point soon

I'll add @JamesNK too in case he has any insight or knows about any issues with blazor wasm

Thanks for your help so far @kg

kg commented 2 years ago

Glad to hear you found a workaround! Based on the involvement of Grpc.Net.Client, it's possible it could be touching a bug in our websocket stack, so it would still be worthwhile to have a minimal sample so we can try to identify whether it's a bug in the runtime.

JamesNK commented 2 years ago

I’ve never heard of this bug before.

FYI Grpc.Net.Client doesn’t use websockets. Network calls use HttpClient (which uses fetch internally).

ldsenow commented 2 years ago

I have this issue as well. I am using gRPC, and when I make calls to the server side, Blazor occasionally crashes.

JamesNK commented 2 years ago

Issue on grpc-dotnet: https://github.com/grpc/grpc-dotnet/issues/1526

It seems like this problem is in wasm. The only situation that is impacted is .NET 6 + wasm + gRPC. I don't think the gRPC library is the problem because it works fine on .NET 5, or on .NET 6 in a regular .NET app.

kg commented 2 years ago

That helps a lot, we can ideally examine what changed specifically between 5 and 6 in the wasm stack to identify what could be happening here.

ldsenow commented 2 years ago

image

.net 6 Launch via ctrl+f5 (kestrel)

image

.net 6 publish to IIS

image

.net5 publish to IIS

Huge difference in the memory heap!

It explains why I don't experience the memory leak while I am developing with Kestrel, but it crashes in production, which is hosted under IIS.

JamesNK commented 2 years ago

I think this problem is specific to protobuf-net.Grpc in Blazor WASM. Both people reporting this problem are using it.

@mgravell

mgravell commented 2 years ago

Eesh. If this does somehow relate to protobuf-net, that's vexing, but I'm going to struggle to offer much guidance without help - I'm delighted that protobuf-net works in Blazor WASM, but I don't know a lot of the details about what voodoo is happening under the covers there, and I'd struggle to decipher or usefully opine upon the specific crash assertion. I'm willing, but somewhat lacking in WASM knowledge! There's also the time element: if this scenario is very hard to reproduce, often requiring large numbers of hours: I don't know how I can begin to look at that.

So: if there are things I can do to usefully help, I can make myself available, but: I don't know how I would even start to investigate this. If there is a Blazor WASM guru that understands those bits, I can make myself available for any library-specific context/changes.

kg commented 2 years ago

Since I don't have a repro case on hand, it'll take me a bit longer to reproduce this and start looking into it in depth. Probably in January.

ldsenow commented 2 years ago

I will try to arrange a repo for you.

Charnock commented 2 years ago

I think I've managed to get a minimal repro for this error here.

To repro:

hakenr commented 2 years ago

We are also hitting this issue with our applications (Blazor WASM, .NET6, protobuf-net gRPC client).

My very first observations:

  1. The memory leak seems to be there no matter if we are using .NET5 or .NET6. It seems that the browser heap size gets bigger when there are simultaneous gRPC requests made, possibly a race condition. I'm still not sure whether the memory leak is caused by protobuf-net, Grpc.Net.Client or the HttpClient wasm implementation, but I will try to investigate further.
  2. In .NET5, on my computer, with our application, the browser heap gets bigger and bigger up to 2196 MB (2 GB for system/JSArrayBufferData) and then the Blazor WASM application crashes with OutOfMemoryException.
  3. In .NET6, on my computer, with the same application (upgraded to .NET6), the browser heap grows up to approx. 240 MB and then the WASM Mono Garbage Collector crashes as described above (Assertion: should not be reached at /__w/1/s/src/mono/mono/mini/../sgen/sgen-scan-object.h:91 with Uncaught RuntimeError: memory access out of bounds).

From my point of view these are probably two separate issues:

  1. A memory leak which was already there before .NET6. It is somewhere in protobuf-net / Grpc.Net.Client / HttpClient and occurs when there are simultaneous calls to the server. (Cc @mgravell)
  2. A new bug in WASM Mono Garbage Collector in .NET6 where the GC crashes when trying to deal with the memory pressure. (Cc @migueldeicaza, @SteveSandersonMS)

As it is very important for me to have both these issues resolved ASAP, I will try to investigate further:

  1. I will try to downgrade the repro app provided by @Charnock to .NET5 to reproduce the memory-leak in .NET5.
  2. I will try to reproduce the memory-leak with x86 gRPC client to be able to use available tools to analyze the memory dump and find the cause.
  3. I will try to reproduce the WASM Mono Garbage Collector issue without gRPC, just by plain memory allocation.

...let me know if you observed anything interesting which might help me with these experiments.

EDIT: I don't think the issue is related to IIS on server-side. It is probably just that it helps you hit the race-condition on client-side.

EDIT2: It seems the app has to be published for the issue to occur. When run from Visual Studio (no matter whether Kestrel/IIS Express, Debug/Release), the GC issue is not hit. I deployed the @Charnock's repro here, the error occurs immediately: https://WasmGrpcErrorReproNet6.azurewebsites.net/.

EDIT3: The .NET5 port of the original repro is published here: https://WasmGrpcErrorReproNet5.azurewebsites.net/ (source code: https://github.com/hakenr/BlazorWasmGrpcMemoryRepro/tree/net5). The very first tests do not confirm my previous observation that the memory-leak itself was already there in .NET5. Will investigate further...

lewing commented 2 years ago

Does the problem occur when PublishTrimmed=false?

Regarding the GC, there are differences in when it is able to run between .NET6 and .NET5. If the problem isn't related to trimming, we can look there next.

hakenr commented 2 years ago

> Does the problem occur when PublishTrimmed=false?

Yes, this was the first thing I tried, as I expected it to be the cause (as almost always when something very weird happens only in the published version of a Blazor WASM app :-)).

(If it helps, I can publish the testing app with PublishTrimmed=false.)

cwevers commented 2 years ago

Fix this please, come on

mgravell commented 2 years ago

@cwevers with the best will in the world, this isn't a simple "just go change some code" thing; it still isn't clear to me what is happening. If you'd care to propose a fix, I'm all ears.

kg commented 2 years ago

We have lots of quality improvements in the works specifically for GC safety that will hopefully address crashes of this type and potentially also address some memory leaks. Unfortunately we haven't identified a specific cause for this specific issue report yet.

hakenr commented 2 years ago

Sometimes, under the same circumstances (after several gRPC calls), the Mono GC failure is a little bit different from the one described above:

image

It is the sgen-gc.c:3993 with RuntimeError: memory access out of bounds

...hope this might help to build the overall picture of the troubles in GC.

hakenr commented 2 years ago

My new observations:

  1. When I change the gRPC service creation to local instantiation (without use of Dependency Injection), I'm no longer able to reproduce the issue.
  2. When I change the Dependency Injection registrations to Transient (both GrpcChannel and the IDataService itself), the issue is still there:
  3. From the object tree of objects reachable from WebAssemblyHost, it seems that in NET5 the _disposables List is empty, whereas in NET6 there are plenty of objects including the GrpcChannel. The opposite applies for ResolvedServices. There were implementation changes in ServiceProviderEngineScope between NET5 and NET6. I'm absolutely not sure whether this can be somehow related.

cc @mgravell

BerndNK commented 2 years ago

We face the same issue in our application and we also use Blazor Wasm, protobuf-net.Grpc, Net 6 and DI.

I cannot provide a fix, but I can at least share my observations and suggestions.

@hakenr A key difference I suspect in your repository with the local instantiation, is that the channel gets disposed after usage. See https://github.com/hakenr/BlazorWasmGrpcMemoryRepro/blob/caceac22ce835d046c8788336da1db4b1afdb5bb/WasmGrpcErrorRepro/Client/Pages/Index.razor#L36-L48 **using** var channel = Grpc.Net.Client.GrpcChannel.ForAddress...

While the transient channel, to my knowledge, should not be disposed. Since Blazor won't do this automatically, unless you use OwningComponentBase. So might I suggest implementing IDisposable in Index.razor and disposing DataService, to see whether the issue persists?

Another thing I observed is that the issue only appears when the app is published. Locally I was never able to reproduce the issue. So it would be interesting to hear whether you experience the same, @hakenr?

I also suspected Assembly trimming and/or compression. However even disabling both did not resolve the issue.

HTTP2 and HTTP1.1 also do not seem to make a difference. In our app we use HTTP2 and @hakenr's example uses HTTP1.1. Still, it might be worth a try to add new HttpClient(handler) {DefaultRequestVersion = Version.Parse("2.0")} to your second example.

Another thing I noticed in @hakenr's Transient example is that sending 4 x 125.000 slowly (clicking the button once every 2 seconds or so) does not result in the error. However, pressing the button rapidly three times, that is 3 x 125.000, does reproduce the error. So it seems like an internal buffer overflows when there are multiple simultaneous requests, or something similar?

hakenr commented 2 years ago

Hi @BerndNK, thanks for stepping in with your observations and suggestions.

> A key difference I suspect in your repository with the local instantiation, is that the channel gets disposed after usage.

I noticed that too, but it worked with DI in .NET5. Some changes were made to the DI implementation in .NET6, so if something could have gone wrong there...

> While the transient channel, to my knowledge, should not be disposed. Since Blazor won't do this automatically, unless you use OwningComponentBase. So might I suggest implementing IDisposable in Index.razor and disposing DataService, to see whether the issue persists?

Good tip, will definitely try.

> Another thing I observed is that the issue only appears when the app is published. Locally I was never able to reproduce the issue. So it would be interesting to hear whether you experience the same, @hakenr?

Same here.

> I also suspected Assembly trimming and/or compression. However even disabling both did not resolve the issue.

Same here.

...I'll be back when I try the DI and Dispose option. (I suspect that GC is not able to handle FinalizationQueue fast enough and it's breaking down somewhere in there.)

BerndNK commented 2 years ago

> I noticed that too, but it worked with DI in .NET5. Some changes were made to the DI implementation in .NET6, so if something could have gone wrong there...

That's a good point. In our application I did try the .NET7 Preview 1 and the behavior was exactly the same. I did however not try a rollback to .NET 5. I'll try that in our app and report back here whether I can confirm that the issue is Net 6 related.

hakenr commented 2 years ago

> While the transient channel, to my knowledge, should not be disposed. Since Blazor won't do this automatically, unless you use OwningComponentBase. So might I suggest implementing IDisposable in Index.razor and disposing DataService, to see whether the issue persists?

Well, there is probably no chance the DI+Dispose might help as there is the same DataService instance being used in Index.razor repeatedly without removing (disposing) the Index.razor component itself.

I would have to try to get some channel factory+disposal logic in the DataService calls itself, which I don't believe will be directly achievable (the interceptor won't probably help here). Right @mgravell?

I would have to change the protobuf-net (or even Grpc.Client) implementation to try disposing the channel (or maybe something else) within every gRPC call.

> In our application I did try the .NET7 Preview 1 and the behavior was exactly the same. I did however not try a rollback to .NET 5. I'll try that in our app and report back here whether I can confirm that the issue is Net 6 related.

.NET5 works like a charm. We have several apps in production without any single issue. Well, I was able to achieve OutOfMemoryException in one case (at 2GBs), but still not sure what happened there (behaved more like regular memory leak, without crashing the .NET runtime itself).

kg commented 2 years ago

While I still can't promise anything, we've identified a couple bugs that could cause GC problems like what you're experiencing, so I hope to have fixes for testing in an upcoming preview.

hakenr commented 2 years ago

@kg great! Is there any chance to test those fixed bits now? Will it be .NET7 or can we hope for .NET6 fix?

kg commented 2 years ago

These changes will be significant enough that a backport to .NET6 would be difficult. The fixes are not landed yet, so it's not possible to test them.

hakenr commented 2 years ago

@kg Please let me know when those changes are available for testing. I still have doubts if we're not fighting two bugs - runtime issues in GC, but also some problems on the gRPC/protobuf-net/aspnetcore-DI side that manifest themselves in this.

hakenr commented 2 years ago

@kg Also, knowing more about what was fixed might help us find a workaround for .NET6 so we can survive until .NET7 gets released.

kg commented 2 years ago

The issues I've identified on our side are mostly in the C and JavaScript layers, so it is unlikely there will be C# side workarounds - but we will definitely keep you posted and share testing builds as soon as we can.

BerndNK commented 2 years ago

> I did however not try a rollback to .NET 5. I'll try that in our app and report back here whether I can confirm that the issue is Net 6 related.

Our app uses a lot of Net 6 features, so I gave up on this. However, I managed to find a workaround for this issue. At least for our app. Simply deactivating optimization seemed to have fixed the issue. I did this, since I noticed that the issue only appeared when the app was published. So simply adding <Optimize>false</Optimize> to every .csproj file apparently fixed the issue for me.

Perhaps this is also worth a try for you @hakenr? Still, this is just a workaround as the performance took quite a hit through this change.

I also tried creating the gRPC channel on each use and disposing the old instance, since I thought this was a key difference between the two examples above. This did however not have any effect, which I thought was interesting to share.

hakenr commented 2 years ago

@BerndNK, unfortunately, I cannot confirm the workaround. The basic repro app still fails when published with <Optimize>false</Optimize> set:

https://WasmGrpcErrorReproNet6OptimizeFalse.azurewebsites.net/

But the direction you mentioned is something I definitely want to investigate deeper - the difference in between published and regular version of the app (till now, nobody was able to reproduce the issue locally without publishing it).

It has to be something else. Maybe the compression. Will do some more tests...
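For anyone retracing these experiments, the publish-time switches mentioned so far in this thread are ordinary MSBuild properties. The fragment below is a sketch of where they would go, assuming a standard Blazor WebAssembly client .csproj. These are diagnostic toggles, not a confirmed fix: per the reports above, disabling trimming and compression did not help, and disabling optimization helped one app but not the repro app.

```xml
<!-- Client .csproj: diagnostic toggles only, not a confirmed fix -->
<PropertyGroup>
  <!-- Disable IL trimming on publish -->
  <PublishTrimmed>false</PublishTrimmed>
  <!-- Disable compression of the published assets -->
  <BlazorEnableCompression>false</BlazorEnableCompression>
  <!-- Disable C# compiler optimizations (costs performance) -->
  <Optimize>false</Optimize>
</PropertyGroup>
```

Changing one property at a time and republishing makes it easier to tell which switch, if any, changes the behavior.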

hakenr commented 2 years ago

Some more experiments

hakenr commented 2 years ago

@kg, any idea what can make the difference in between plain build and published version? What else can we try to make our apps work in production the way they work in development?

kg commented 2 years ago

This sort of problem will manifest at random, and unless you identify the specific source of it you can't really work around it, unfortunately.

hakenr commented 2 years ago

OK, just gathering input on what else I can try to identify the crucial difference between the "build" and "published" versions. Runtime relinking is my next target of investigation. :-D