Closed carlbm closed 1 year ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to 'arch-wasm': @lewing See info in area-owners.md if you want to be subscribed.
Author: | carlbm |
---|---|
Assignees: | - |
Labels: | `arch-wasm`, `untriaged`, `area-Codegen-AOT-mono` |
Milestone: | - |
Tagging subscribers to this area: @brzvlad See info in area-owners.md if you want to be subscribed.
Author: | carlbm |
---|---|
Assignees: | - |
Labels: | `arch-wasm`, `untriaged`, `area-GC-mono` |
Milestone: | - |
@carlbm Do you have an app that you could share that demonstrates the issue?
If you can't share an app, can you describe what the app does? Are there timers running, does it keep sockets open? Does it issue network requests? When you say sleep/wake cycles do you mean you're sleeping and waking your PC?
This issue has been marked `needs more info` since it may be missing important information. Please refer to our contribution guidelines for tips on how to report issues effectively.
Hi. Yes, sleep/wake cycles are my PC going to sleep, although on further analysis it appears that this isn't necessary to trigger the crash. When I run the app locally, I can have multiple browsers accessing the site with no crash for multiple days. When the site has been deployed to an Azure App Service (the build uses 'dotnet publish ...') it can crash within minutes. It does seem that it only crashes when the browser/tab does not have focus.
In terms of the app, the client has a SignalR connection and multiple gRPC connections to the server. The gRPC calls are triggered by a timer running in the client, and the connections are not kept open.
Here's a list of dependencies and versions:
<PackageReference Include="IdentityModel" Version="5.1.0" />
<PackageReference Include="Microsoft.AspNetCore.Components.WebAssembly" Version="6.0.0" />
<PackageReference Include="Microsoft.AspNetCore.Components.WebAssembly.DevServer" Version="6.0.0" PrivateAssets="all" />
<PackageReference Include="Microsoft.AspNetCore.Components.WebAssembly.Authentication" Version="6.0.0" />
<PackageReference Include="Microsoft.AspNetCore.SignalR.Client" Version="6.0.0" />
<PackageReference Include="Microsoft.Extensions.Http" Version="5.0.0" />
<PackageReference Include="Microsoft.Extensions.Http.Polly" Version="5.0.1" />
<PackageReference Include="Microsoft.TypeScript.MSBuild" Version="4.4.2">
<PrivateAssets>all</PrivateAssets>
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
<PackageReference Include="Excubo.Generators.Blazor" Version="1.14.1" />
<PackageReference Include="Grpc.Net.Client.Web" Version="2.40.0" />
<PackageReference Include="Grpc.Net.Client" Version="2.40.0" />
<PackageReference Include="Append.Blazor.Notifications" Version="1.1.0" />
<PackageReference Include="blazor-dragdrop" Version="2.3.0" />
<PackageReference Include="BlazorAnimate" Version="3.0.0" />
<PackageReference Include="Blazored.LocalStorage" Version="4.1.5" />
<PackageReference Include="Blazorise.Bootstrap" Version="0.9.4.6" />
<PackageReference Include="Blazorise.Icons.FontAwesome" Version="0.9.4.6" />
<PackageReference Include="ChartJs.Blazor" Version="1.1.0" />
<PackageReference Include="CurrieTechnologies.Razor.Clipboard" Version="1.3.1" />
<PackageReference Include="Fluxor" Version="4.1.0" />
<PackageReference Include="Fluxor.Blazor.Web" Version="4.1.0" />
<PackageReference Include="HtmlSanitizer" Version="6.0.441" />
<PackageReference Include="Humanizer.Core" Version="2.11.10" />
<PackageReference Include="MatBlazor" Version="2.9.0-develop-042" />
<PackageReference Include="Plotly.Blazor" Version="2.3.1" />
<PackageReference Include="protobuf-net.NodaTime" Version="3.0.101" />
<PackageReference Include="Sotsera.Blazor.Toaster" Version="3.0.0" />
<PackageReference Include="TimeZoneConverter" Version="3.5.0" />
<PackageReference Include="TinyMCE.Blazor" Version="0.0.7" />
<PackageReference Include="Z.Blazor.Diagrams" Version="2.1.5" />
<PackageReference Include="JetBrains.Annotations" Version="2021.2.0" />
<PackageReference Include="NodaTime" Version="3.0.7" />
<PackageReference Include="protobuf-net.BuildTools" Version="3.0.101">
<PrivateAssets>all</PrivateAssets>
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
<PackageReference Include="protobuf-net.Grpc" Version="1.0.152" />
<PackageReference Include="protobuf-net.NodaTime" Version="3.0.101" />
<PackageReference Include="System.ServiceModel.Primitives" Version="4.8.1" />
Hopefully this helps.
This is happening with a different message too.
What is the best way for me to work out what's causing this? It's affecting my users fairly badly, with some reports of it crashing while they were actively using the site (i.e. with the tab in focus).
Or alternatively a way to catch this so that there can be an elegant retry mechanism/notification put in place?
This is typically a memory corruption issue, so there wouldn't be any way to handle it without taking down the app. You might be able to do some sort of atexit handler to reload the tab, but I'm not sure.
What is the best way to deal with this then? I can give you the url of the published site if it helps? Or even share the project privately?
If you provide us a working version of it that we can run (edit) in our browsers, that might make it faster for us to investigate. I'm not sure how quickly we can investigate it just using the list of deps you sent us since there are so many - any one of them could be causing heap corruption if they misuse our APIs. (I'm not saying this is not our bug - it could be a bug in Blazor or the managed runtime - but the amount of third party code involved will make it harder to investigate.) Because this problem involves the GC and heap corruption I'm not a specialist here so I can't provide estimates on how time-consuming this is to find and fix.
I would try registering unhandled error handlers in your JS: https://developer.mozilla.org/en-US/docs/Web/API/Window/unhandledrejection_event and https://developer.mozilla.org/en-US/docs/Web/API/Window/error_event - and using those to show a message to the user that the tab has crashed so they can reload it. You should also log the error to your server by POSTing it to some endpoint that will record it. Having this is generally good practice since bugs in your application could produce unhandled errors with no UI feedback (even if this particular case isn't a bug in your application).
You could potentially auto-reload the tab as a short-term fix here, but I don't advise doing that unless absolutely necessary - it would make troubleshooting some types of problem harder and might also mean users stop reporting certain types of bugs. Also if you decide to do auto-reload, note that the 'error' event can fire for things other than script errors, so you'll need to make sure you're not auto-reloading when (for example) an image fails to load from your CDN.
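A minimal sketch of the two handlers suggested above, assuming a hypothetical `/api/client-errors` endpoint on your server and a plain banner as the user notification (neither comes from this thread). It reports only events that look like script failures, so an `error` event without a message, such as a failed image load, is ignored. Whether a Mono runtime assertion actually surfaces through these events is not confirmed here.

```javascript
// Hypothetical endpoint name; point this at whatever your server exposes.
const LOG_ENDPOINT = "/api/client-errors";

// Convert an 'error' or 'unhandledrejection' event into a plain report.
// Returns null for events that are not script failures (for example a
// resource load error, which is a plain Event without a `message`).
function describeFailure(event) {
  if (event.type === "unhandledrejection") {
    const reason = event.reason;
    const message = reason && reason.message ? reason.message : String(reason);
    return { kind: "unhandledrejection", message: String(message) };
  }
  if (event.type === "error" && typeof event.message === "string") {
    return {
      kind: "error",
      message: event.message,
      source: event.filename || "",
      line: event.lineno || 0,
    };
  }
  return null;
}

// Register both handlers on `target` (normally `window`).
// `send` records the report; `notify` tells the user the tab has crashed.
function installCrashReporting(target, send, notify) {
  const handler = (event) => {
    const report = describeFailure(event);
    if (report === null) return; // not a script failure: do nothing
    send(report);
    notify(report);
  };
  target.addEventListener("error", handler);
  target.addEventListener("unhandledrejection", handler);
  return handler;
}

// Browser wiring; skipped when there is no `window` (e.g. under Node).
if (typeof window !== "undefined") {
  installCrashReporting(
    window,
    (report) =>
      fetch(LOG_ENDPOINT, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(report),
      }).catch(() => {}), // a failed log must not raise another rejection
    () => {
      const banner = document.createElement("div");
      banner.textContent = "This page has stopped working. Please reload the tab.";
      document.body.prepend(banner);
    }
  );
}
```

If you do decide to auto-reload, the same `describeFailure` check is where you would gate it, so that only script failures trigger a reload.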
What's the best way to get in touch? I can give you the URL of the website? Or I can send you the project
If you don't want every person who views this bug to have the link, you can email it to me at kagadd@microsoft.com and I will forward it to the dotnet contributors who work on wasm/blazor.
I've removed the following packages and the issue seems to have gone:
<PackageReference Include="Grpc.Net.Client.Web" Version="2.41.0" />
<PackageReference Include="Grpc.Net.Client" Version="2.41.0" />
<PackageReference Include="protobuf-net.NodaTime" Version="3.0.101" />
I'm busy making the application work, but based on the above I should be able to put together a minimal sample at some point soon
I'll add @JamesNK too in case he has any insight or knows about any issues with blazor wasm
Thanks for your help so far @kg
Glad to hear you found a workaround! Based on the involvement of Grpc.Net.Client, it's possible it could be touching a bug in our websocket stack, so it would still be worthwhile to have a minimal sample so we can try to identify whether it's a bug in the runtime.
I’ve never heard of this bug before.
FYI Grpc.Net.Client doesn’t use websockets. Network calls use HttpClient (which uses fetch internally).
I have this issue as well. I am using gRPC, and when I make calls to the server side, Blazor occasionally crashes.
Issue on grpc-dotnet: https://github.com/grpc/grpc-dotnet/issues/1526
It seems like this problem is in wasm. The only situation that is impacted is .NET 6 + wasm + gRPC. I don't think the gRPC library is the problem because it works fine on .NET 5, or on .NET 6 in a regular .NET app.
That helps a lot, we can ideally examine what changed specifically between 5 and 6 in the wasm stack to identify what could be happening here.
.net 6 Launch via ctrl+f5 (kestrel)
.net 6 publish to IIS
.net5 publish to IIS
Huge difference in the memory heap!
It explains why I don't experience the memory leak while I am developing with Kestrel, but it crashes in production, which is hosted under IIS.
I think this problem is specific to protobuf-net.Grpc in Blazor WASM. Both people reporting this problem are using it.
@mgravell
Eesh. If this does somehow relate to protobuf-net, that's vexing, but I'm going to struggle to offer much guidance without help - I'm delighted that protobuf-net works in Blazor WASM, but I don't know a lot of the details about what voodoo is happening under the covers there, and I'd struggle to decipher or usefully opine upon the specific crash assertion. I'm willing, but somewhat lacking in WASM knowledge! There's also the time element: if this scenario is very hard to reproduce, often requiring large numbers of hours: I don't know how I can begin to look at that.
So: if there are things I can do to usefully help, I can make myself available, but: I don't know how I would even start to investigate this. If there is a Blazor WASM guru that understands those bits, I can make myself available for any library-specific context/changes.
Since I don't have a repro case on hand, it'll take me a bit longer to reproduce this and start looking into it in depth. Probably in January.
I will try to arrange a repo for you.
I think I've managed to get a minimal repro for this error here.
To repro:
1. Run `docker build . -t repro` from the directory with the Dockerfile.
2. Run `docker run -it -p 8080:8080 repro`.
3. Open `http://localhost:8080`.
4. … `IEnumerable<int>` filled with `-1`, and the Count configures how large the enumerable is.

We are also hitting this issue with our applications (Blazor WASM, .NET6, protobuf-net gRPC client).
My very first observations:
`Assertion: should not be reached at /__w/1/s/src/mono/mono/mini/../sgen/sgen-scan-object.h:91` with `Uncaught RuntimeError: memory access out of bounds`).
From my point of view these are probably two separate issues:
As it is very important for me to have both these issues resolved ASAP, I will try to investigate further:
...let me know if you observed anything interesting which might help me with these experiments.
EDIT: I don't think the issue is related to IIS on the server side. It probably just helps you hit the race condition on the client side.
EDIT2: It seems the app has to be published for the issue to occur. When run from Visual Studio (no matter whether Kestrel/IIS Express, Debug/Release), the GC issue is not hit. I deployed @Charnock's repro here, and the error occurs immediately: https://WasmGrpcErrorReproNet6.azurewebsites.net/.
EDIT3: The .NET5 port of the original repro is published here: https://WasmGrpcErrorReproNet5.azurewebsites.net/ (source code: https://github.com/hakenr/BlazorWasmGrpcMemoryRepro/tree/net5). The very first tests do not confirm my previous observation that the memory leak itself was already there in .NET5. Will investigate further...
Does the problem occur when PublishTrimmed=false?
Regarding the GC, there are differences between .NET6 and .NET5 in when it is able to run. If the problem isn't related to trimming, we can look there next.
Does the problem occur when PublishTrimmed=false?
Yes, this was the first thing I tried, as I expected it to be the cause (as almost always when something very weird happens only in the published version of a Blazor WASM app :-)).
(If it helps, I can publish the testing app with `PublishTrimmed=false`.)
Fix this please, come on
@cwevers with the best will in the world, this isn't a simple "just go change some code" thing; it still isn't clear to me what is happening. If you'd care to propose a fix, I'm all ears.
We have lots of quality improvements in the works specifically for GC safety that will hopefully address crashes of this type and potentially also address some memory leaks. Unfortunately we haven't identified a specific cause for this specific issue report yet.
Sometimes, under the same circumstances (after several gRPC calls), the Mono GC failure is a little different from the one described above:
It is the `sgen-gc.c:3993` with `RuntimeError: memory access out of bounds`.
...hope this might help to build the overall picture of the troubles in GC.
My new observations:
cc @mgravell
We face the same issue in our application and we also use Blazor Wasm, protobuf-net.Grpc, Net 6 and DI.
I cannot provide a fix, but at least share my observations and suggestions.
@hakenr A key difference I suspect in your repository with the local instantiation, is that the channel gets disposed after usage.
See
https://github.com/hakenr/BlazorWasmGrpcMemoryRepro/blob/caceac22ce835d046c8788336da1db4b1afdb5bb/WasmGrpcErrorRepro/Client/Pages/Index.razor#L36-L48
**using** var channel = Grpc.Net.Client.GrpcChannel.ForAddress...
While the transient channel, to my knowledge, should not be disposed. Since blazor won't do this automatically, unless you use OwningComponentBase
So might I suggest to inherit `IDisposable` in Index.razor and dispose `DataService` and see whether the issue persists?
Another thing I observed is that the issue only appears when the app is published. Locally I was never able to reproduce the issue. So it would be interesting to hear whether you experience the same, @hakenr?
I also suspected Assembly trimming and/or compression. However even disabling both did not resolve the issue.
HTTP2 and HTTP1.1 also do not seem to make a difference. In our app we use HTTP2, and @hakenr's example uses HTTP1.1.
Still, it might be worth a try to add `new HttpClient(handler) {DefaultRequestVersion = Version.Parse("2.0")}` to your second example.
Another thing I noticed in @hakenr's Transient example is that sending 4 x 125.000 slowly (clicking the button once every 2 seconds or so) does not result in the error. However, pressing the button rapidly three times, that is 3 x 125.000, does reproduce the error. So it seems like an internal buffer overflows when multiple requests run simultaneously, or something similar?
Hi @BerndNK, thanks for stepping in with your observations and suggestions.
A key difference I suspect in your repository with the local instantiation, is that the channel gets disposed after usage.
I noticed that too, but it worked with DI in .NET5. Some changes were made to the DI implementation in .NET6, so if something could have gone wrong there...
While the transient channel, to my knowledge, should not be disposed. Since blazor won't do this automatically, unless you use OwningComponentBase So might I suggest to inherit IDisposable in Index.razor and dispose DataService and see whether the issue persists?
Good tip, will definitely try.
Another thing I observed is that the issue only appears when the app is published. Locally I was never able to reproduce the issue. So it would be interesting to hear whether you experience the same, @hakenr?
Same here.
I also suspected Assembly trimming and/or compression. However even disabling both did not resolve the issue.
Same here.
...I'll be back when I try the DI and Dispose option. (I suspect that GC is not able to handle FinalizationQueue fast enough and it's breaking down somewhere in there.)
I noticed that too, but it worked with DI in .NET5. Some changes were made to the DI implementation in .NET6, so if something could have gone wrong there...
That's a good point. In our application I did try the .NET7 Preview 1 and the behavior was exactly the same. I did however not try a rollback to .NET 5. I'll try that in our app and report back here whether I can confirm that the issue is Net 6 related.
While the transient channel, to my knowledge, should not be disposed. Since blazor won't do this automatically, unless you use OwningComponentBase So might I suggest to inherit IDisposable in Index.razor and dispose DataService and see whether the issue persists?
Well, there is probably no chance the DI+Dispose might help as there is the same DataService instance being used in Index.razor repeatedly without removing (disposing) the Index.razor component itself.
I would have to try to get some channel factory+disposal logic in the DataService calls itself, which I don't believe will be directly achievable (the interceptor won't probably help here). Right @mgravell?
I would have to change the protobuf-net (or even Grpc.Client) implementation to try disposing the channel (or maybe something else) within every gRPC call.
In our application I did try the .NET7 Preview 1 and the behavior was exactly the same. I did however not try a rollback to .NET 5. I'll try that in our app and report back here whether I can confirm that the issue is Net 6 related.
.NET5 works like a charm. We have several apps in production without any single issue. Well, I was able to achieve OutOfMemoryException in one case (at 2GBs), but still not sure what happened there (behaved more like regular memory leak, without crashing the .NET runtime itself).
While I still can't promise anything, we've identified a couple bugs that could cause GC problems like what you're experiencing, so I hope to have fixes for testing in an upcoming preview.
@kg great! Is there any chance to test those fixed bits now? Will it be .NET7 or can we hope for .NET6 fix?
These changes will be significant enough that a backport to .NET6 would be difficult. The fixes are not landed yet, so it's not possible to test them.
@kg Please let me know when those changes are available for testing. I still have doubts if we're not fighting two bugs - runtime issues in GC, but also some problems on the gRPC/protobuf-net/aspnetcore-DI side that manifest themselves in this.
@kg Also better knowing what was fixed might help us find a workaround for .NET6 to survive until .NET gets released.
The issues I've identified on our side are mostly in the C and JavaScript layers, so it is unlikely there will be C# side workarounds - but we will definitely keep you posted and share testing builds as soon as we can.
I did however not try a rollback to .NET 5. I'll try that in our app and report back here whether I can confirm that the issue is Net 6 related.
Our app uses a lot of Net 6 features, so I gave up on this.
However, I managed to find a workaround for this issue. At least for our app.
Simply deactivating optimization seems to have fixed the issue. I did this because I noticed that the issue only appeared when the app was published. So simply adding `<Optimize>false</Optimize>` to every .csproj file apparently fixed the issue for me.
Perhaps this is also worth a try for you @hakenr? Still, this is just a workaround as the performance took quite a hit through this change.
I also tried creating the GRPC channel on each use and disposing the old instance, since I thought this was a key difference in the two example above. This did however not have any effect, which I thought was interesting to share.
@BerndNK, unfortunately, I cannot confirm the workaround. The basic repro app, with `<Optimize>false</Optimize>` set, still fails when published:
https://WasmGrpcErrorReproNet6OptimizeFalse.azurewebsites.net/
But the direction you mentioned is something I definitely want to investigate deeper - the difference in between published and regular version of the app (till now, nobody was able to reproduce the issue locally without publishing it).
It has to be something else. Maybe the compression. Will do some more tests...
Some more experiments:
- `<Optimize>false</Optimize>` does not help,
- `Debug` build does not help,
- `<BlazorEnableCompression>false</BlazorEnableCompression>` does not help,
- `<BlazorCacheBootResources>false</BlazorCacheBootResources>` does not help,
- `<RunAOTCompilation>true</RunAOTCompilation>` does not help.

...still trying to find out the difference between "build" and "publish" version.

@kg, any idea what can make the difference in between plain build and published version? What else can we try to make our apps work in production the way they work in development?
This sort of problem will manifest at random, and unless you identify the specific source of it you can't really work around it, unfortunately.
OK, just gathering inputs what else can I try to identify the crucial difference in between "build" and "published" version. Runtime relinking is my next target of investigation. :-D
Description
After leaving a blazor wasm app open for a length of time (and through sleep/wake cycles) the app will eventually crash (normally 8-12 hours, but can be quicker).
The console looks like:
With Assertion: should not be reached at /__w/1/s/src/mono/mono/mini/../sgen/sgen-scan-object.h:91
558006 @ dotnet.6.0.0.csqdfrwtlv.js:1
Reproduction Steps
Create WASM app that has timed interaction with a server, leave for a while.
Expected behavior
The app should continue to work
Actual behavior
The app crashes. There is nothing the user can do except for reloading the page
Regression?
Unsure
Known Workarounds
No response
Configuration
Running on .Net 6 Windows, chrome browser Version 95.0.4638.69 (Official Build) (64-bit) x64 unsure tested in chrome
Other information
No response