Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.
I managed to repro the AVE (the same stacktrace) by running a given test in a loop, but this time there is no heap corruption:
0:012> !verifyheap
No heap corruption detected.
Also, my question from point 5 about potential problems with using onCancel directly instead of the callback argument is not relevant - it reproduced without it.
Here is the memory dump: https://drive.google.com/file/d/1EosHaFy4KDfggT2q3adXYF3jhLDNq2uV/view?usp=sharing
It was created as follows:
procdump.exe -ma -e 1 -f C0000005.ACCESS_VIOLATION -g -x C:\workspace\memory-dumps "C:\workspace\ravendb-5.2\test\Tryouts\bin\Release\net6.0\Tryouts.exe"
I will continue the investigation on my side.
Thanks for reporting the issue. We will look at the dump. Is the standalone repro easily shareable?
Thanks for looking into this. In order to reproduce (it's not consistent), please run the `Tryouts` project in Release mode. The repro is basically running the same test multiple times. See my last commit there - https://github.com/arekpalinski/ravendb/commit/7b9fd16c930699593c917360acd30a3da3f3c3d5. On my end it usually reproduced after 2500 - 3000 iterations (on a machine with 32 GB of memory), but I have just run it on a different machine (16 GB of memory) and it reproduced after 600 iterations.
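To make the shape of the repro concrete, here is a minimal, hypothetical harness - the real loop and test live in the Tryouts project from the linked commit, and `RunSingleTestIteration` is a placeholder name, not an actual method in that repo:

```csharp
using System;

public static class ReproHarness
{
    // Placeholder for one execution of the actual test from the Tryouts project.
    static void RunSingleTestIteration()
    {
    }

    public static void Main()
    {
        // Run the same test over and over; per the comment above, the AVE showed up
        // somewhere between roughly 600 and 3000 iterations depending on the machine.
        for (int i = 0; i < 10_000; i++)
        {
            if (i % 100 == 0)
                Console.WriteLine($"Iteration {i}");

            RunSingleTestIteration();
        }
    }
}
```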
Hi @mangod9. Did you have a chance to check if my repro steps result in an AVE on your end?
No, haven't had a chance to investigate further. Will try to look into it next week.
Thank you. We got another failure; this time it happened on an x86 build running the tests. The stacktrace is a bit different but the GC is still involved:
EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 7251b419 (coreclr!WKS::my_get_size+0x0000000b)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000001
NumberParameters: 2
Parameter[0]: 00000000
Parameter[1]: 00000000
Attempt to read from address 00000000
coreclr!WKS::gc_heap::mark_object_simple+0x69
coreclr!WKS::gc_heap::mark_through_cards_for_segments+0x413
coreclr!WKS::gc_heap::mark_phase+0x1bf
coreclr!WKS::gc_heap::gc1+0x6b
coreclr!WKS::gc_heap::garbage_collect+0x175
coreclr!WKS::GCHeap::GarbageCollectGeneration+0xea
coreclr!WKS::gc_heap::trigger_gc_for_alloc+0x19
coreclr!WKS::gc_heap::try_allocate_more_space+0x19c
coreclr!WKS::gc_heap::allocate_more_space+0x18
coreclr!WKS::GCHeap::Alloc+0x4f
coreclr!AllocateObject+0x117
coreclr!JIT_New+0xbe
I can share the dump if needed but I think the repro steps that I shared are more valuable.
I think we started to see this since we moved to .NET 6. But as I said, we do unmanaged memory allocations on our own, so this might be us causing it as well (we investigated it on our side and don't see how we could cause it, but it's still an option).
I think it might be important that my repro resulted in an AVE in some GC code, but the heap was not corrupted (yet?).
Hi, any update on this?
I just cloned the repro and seem to be able to repro it after some time. When I look at the dump you shared and the local repro, the corrupted object pointer seems to be located between these two objects. Could you please confirm that for all repros that is the case, and if so is there anything special about these?
0:320> !lno 0x000001a5`d7f9c5e0
Before: 000001a5d7f9c350 664 (0x298) Raven.Server.Documents.Queries.AST.MatchPath[]
After: 000001a5d7f9c5e8 32 (0x20) System.Collections.Generic.List`1[[Sparrow.StringSegment, Sparrow]]
Heap local consistency confirmed.
From looking closer, this doesn't look like a heap corruption but rather an issue with reporting gcinfo on the stack, possibly some race condition leading to an off-by-8 condition.
Exactly, the repro I provided doesn't cause a heap corruption. That's why I commented:
I think it might be important that my repro resulted in an AVE in some GC code, but the heap was not corrupted (yet?).
We use unmanaged code, so potentially heap corruption could be the root cause. But given that it reproduces without corrupting the heap, I think the issue isn't caused by our unmanaged code.
Hi @mangod9. Do you have any leads on that? Is it a race condition as you suspected?
Yeah, appears to be an issue with gcinfo. Still trying to track down the issue; will try enabling stress logs and see if we can figure it out. Btw, do you happen to know if this started happening after moving to .NET 6?
I think we didn't see it before (on .NET 5), but I'm not absolutely sure that's the case. It happens occasionally, in contrast to the repro code, which fails pretty consistently.
Actually, I noticed that it hadn't happened to us for some time. Meanwhile we upgraded the dotnet version, so I thought it could be fixed. However, I have run the repro code again (using dotnet 6.0.3) and it's still happening.
We run a massive app on asp.net core and we started seeing similar random errors like this too after upgrading from .NET 5 to .NET 6:
Application: w3wp.exe
CoreCLR Version: 6.0.322.12309
.NET Version: 6.0.3
Description: The process was terminated due to an internal error in the .NET Runtime at IP 00007FF932A2A5A9 (00007FF932940000) with exit code 80131506.
I'm currently reverting some nuget packages we upgraded to narrow down the possible causes.
UPD: it turns out in our case it was ImageSharp 2.1; reverting to 1.0.4 resolved the issue. I'm going to leave this comment here in case anyone bumps into the same issue.
Got back to investigating this again. Looking through the repro, it appears to be a codegen optimization issue with how gcinfo is reported. If I comment out this line: https://github.com/arekpalinski/ravendb/blob/7b9fd16c930699593c917360acd30a3da3f3c3d5/src/Raven.Server/Documents/Queries/Parser/QueryParser.cs#L391
and inline-define `prev` here: https://github.com/arekpalinski/ravendb/blob/7b9fd16c930699593c917360acd30a3da3f3c3d5/src/Raven.Server/Documents/Queries/Parser/QueryParser.cs#L465 with
if (_synteticWithQueries?.TryGetValue(alias, out SynteticWithQuery prev) == true && !prev.IsEdge)
the failure didn't repro for me (at least with 10k iterations). @arekpalinski, could you please try that and check if it still reproduces for you? Thanks.
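For readers without the repo open, here is a minimal sketch of the shape of that change - the type and member names below are simplified stand-ins, not the actual RavenDB code:

```csharp
using System.Collections.Generic;

// Simplified stand-in for SynteticWithQuery: a struct whose fields include object
// references, which the GC must track while the struct lives on the stack.
struct WithQuery
{
    public List<string> Edges;
    public bool IsEdge;
}

class WorkaroundSketch
{
    Dictionary<string, WithQuery> _queries;   // may be null, hence the '?.' below

    // Original shape: 'prev' is declared (hoisted) up front and assigned later
    // via an 'out' argument behind a null-conditional call.
    bool OriginalShape(string alias)
    {
        WithQuery prev = default;
        if (_queries?.TryGetValue(alias, out prev) == true && !prev.IsEdge)
            return true;
        return false;
    }

    // Workaround shape: drop the hoisted declaration and declare the local inline
    // at the call site, as in the snippet quoted above (relies on C# 10's improved
    // definite-assignment analysis, the default for net6.0 projects).
    bool WorkaroundShape(string alias)
    {
        if (_queries?.TryGetValue(alias, out WithQuery prev) == true && !prev.IsEdge)
            return true;
        return false;
    }
}
```

The point of the change is to remove the hoisted, conditionally-assigned struct local, which matches the later diagnosis of missing initialization of local structs in certain cases.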
I can confirm that it didn't repro on my end as well after making those changes. I have also opened the PR to our repo with that workaround (https://github.com/ravendb/ravendb/pull/13968).
That indeed looks like it resolves the problem. Thank you for narrowing it down.
Ok, good to know. We will continue to investigate what's causing the bug, though I haven't been able to narrow down the repro.
Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.
| Author: | arekpalinski |
|---|---|
| Assignees: | - |
| Labels: | `area-CodeGen-coreclr` |
| Milestone: | 6.0.x |
Moving to the codegen area, since this looks to be an issue with missing initialization of local structs in certain cases.
I have a prospective fix and a simpler repro case.
6.0.6 is out now, so if you get a chance, please verify it properly fixes this issue.
@arekpalinski I'm going to assume this is fixed; please re-open if you discover that's not the case.
We're approaching the release of a new version soon. So for now we'll stay with the workaround just in case. Afterwards I will revert it and will get back to you if I see any issues.
When running our test suite we got a crash:
The process was terminated due to an internal error in the .NET Runtime at IP 00007FFDC50AB5FC (00007FFDC5010000) with exit code 80131506.
We have configured automatic memory dump creation, which resulted in the following memory dump:
https://drive.google.com/file/d/19S1k74Foe9V6A03hRwIuebE42GVQUirI/view?usp=sharing
In our project (github.com/ravendb/ravendb) we use unmanaged memory directly, so it might be that it's because of our code.
The following analysis has been done so far in WinDbg.
!analyze -v
I see an `onCancel` member in the crashing stacktrace, so it's likely the following from `TimeoutManager.cs`:
https://github.com/ravendb/ravendb/blob/193624d559fe2e6525cc383de362c83d19aacffd/src/Sparrow/Utils/TimeoutManager.cs#L139
It is a System.Action`1[[System.Object, System.Private.CoreLib]], so my suspicion is that it's this action: `tcs => onCancel.TrySetCanceled()`. The attempt to get its `_target` gives an address that matches the output of `verifyheap` - `bad member 0000022D04C05821 at 0000022DA4010000` - so we know that the corrupted member is `_target`.
We use the `onCancel` variable in `tcs => onCancel.TrySetCanceled()` instead of using the callback action `tcs => ((TaskCompletionSource<object>)tcs).TrySetCanceled()`, but effectively it's the same thing. Could it cause any GC problems and result in something like that?
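For context, here is a minimal sketch of the two registration variants being contrasted - the names are illustrative and simplified, not copied from the TimeoutManager.cs code linked above:

```csharp
using System.Threading;
using System.Threading.Tasks;

static class CancelRegistrationSketch
{
    // Variant in question: the lambda captures 'onCancel' from the enclosing scope,
    // so the registration allocates a closure and ignores its state argument.
    public static void RegisterCapturing(CancellationToken token, TaskCompletionSource<object> onCancel)
    {
        token.Register(tcs => onCancel.TrySetCanceled(), null);
    }

    // Alternative: pass the TCS as the callback state and cast it back inside the callback.
    // As the question notes, the two are effectively equivalent; another comment in this
    // thread reports the crash reproduced without this pattern, so it is not the root cause.
    public static void RegisterWithState(CancellationToken token, TaskCompletionSource<object> onCancel)
    {
        token.Register(state => ((TaskCompletionSource<object>)state!).TrySetCanceled(), onCancel);
    }
}
```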