dotnet / runtime

Many Valgrind Issues on linux #52872

Open trampster opened 3 years ago

trampster commented 3 years ago

Description

A completely empty .net console project ( public static void Main() {} ) produces many Valgrind issues on linux, including uninitialised-value errors and leaked memory (see the log below).

I found this because I was trying to use valgrind to debug an issue with some native (pinvoke) interop. But there were so many .net issues that it was impossible to find the ones from my interop (valgrind stops reporting after it reaches an issue limit).

valgrind.log

==122230== Memcheck, a memory error detector
==122230== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==122230== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==122230== Command: ./IntegrationTests
==122230==

NOTE: I have skipped most of the file as it exceeds the 65 k limit on GitHub; the full file is attached.

==122230==
==122230== Conditional jump or move depends on uninitialised value(s)
==122230==    at 0x5CAD289: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5C58B652: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/System.Private.CoreLib.dll)
==122230==    by 0x5C38BAE9: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/System.Private.CoreLib.dll)
==122230==    by 0x5DF2AB6: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5C418AA: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5B08830: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5C8695A: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5F7165D: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x4868608: start_thread (pthread_create.c:477)
==122230==    by 0x4CF7292: clone (clone.S:95)
==122230==
==122230== Use of uninitialised value of size 8
==122230==    at 0x5CAD2A5: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5C58B652: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/System.Private.CoreLib.dll)
==122230==    by 0x5C38BAE9: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/System.Private.CoreLib.dll)
==122230==    by 0x5DF2AB6: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5C418AA: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5B08830: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5C8695A: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5F7165D: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x4868608: start_thread (pthread_create.c:477)
==122230==    by 0x4CF7292: clone (clone.S:95)
==122230==
==122230== Use of uninitialised value of size 8
==122230==    at 0x5CAD366: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5C58B652: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/System.Private.CoreLib.dll)
==122230==    by 0x5C38BAE9: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/System.Private.CoreLib.dll)
==122230==    by 0x5DF2AB6: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5C418AA: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5B08830: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5C8695A: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x5F7165D: ??? (in /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/libcoreclr.so)
==122230==    by 0x4868608: start_thread (pthread_create.c:477)
==122230==    by 0x4CF7292: clone (clone.S:95)
==122230==
==122230==
==122230== HEAP SUMMARY:
==122230==     in use at exit: 2,161,722 bytes in 2,482 blocks
==122230==   total heap usage: 24,315 allocs, 21,833 frees, 8,585,401 bytes allocated
==122230==
==122230== LEAK SUMMARY:
==122230==    definitely lost: 60 bytes in 1 blocks
==122230==    indirectly lost: 0 bytes in 0 blocks
==122230==      possibly lost: 6,561 bytes in 20 blocks
==122230==    still reachable: 2,155,101 bytes in 2,461 blocks
==122230==         suppressed: 0 bytes in 0 blocks
==122230== Rerun with --leak-check=full to see details of leaked memory
==122230==
==122230== Use --track-origins=yes to see where uninitialised values come from
==122230== For lists of detected and suppressed errors, rerun with: -s
==122230== ERROR SUMMARY: 13735 errors from 908 contexts (suppressed: 0 from 0)

Configuration

dotnet --version gives 5.0.203
OS: Ubuntu 20.04
Installed using the .deb

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

jkotas commented 3 years ago

@tmds Was this taken care of by the fixes you made earlier this year?

danmoseley commented 3 years ago

@trampster it would be interesting to know the results with the latest 6.0 preview.

trampster commented 3 years ago

There appears to be no .deb (or any other installer) available for linux in the preview https://dotnet.microsoft.com/download/dotnet/6.0

tmds commented 3 years ago

@trampster you can download a binary from this website: https://dotnet.microsoft.com/download/dotnet/6.0.

.NET leaks some 'global' allocated memory, so it is expected to see some definitely lost. The uninitialized errors should be fixed.

danmoseley commented 3 years ago

I haven't used Valgrind: if we intentionally (?) leak memory on exit, is there an established way to "baseline" such intentional leaks, so we can easily spot new issues if they emerge in future?

trampster commented 3 years ago

FYI, 'definitely lost' means the memory was unreachable at program exit; that is, no pointer could reach it. (I've yet to see a good reason to do this on purpose.)

This is different from 'still reachable', which means there was still a pointer to the memory at program exit. Leaving that memory for the OS to clean up is acceptable.
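To make the distinction concrete, here is a minimal Python sketch that pulls the LEAK SUMMARY categories out of a Memcheck log so the unreachable bytes can be separated from the 'still reachable' ones. The function names are made up for illustration, and the regular expression only assumes the line shape seen in the logs in this thread.

```python
import re

# Matches lines such as "==122230==    definitely lost: 60 bytes in 1 blocks".
# The "(suppressed: 0 from 0)" part of ERROR SUMMARY does not match, because
# it lacks the "bytes in ... blocks" wording.
LEAK_LINE = re.compile(
    r"(definitely lost|indirectly lost|possibly lost|still reachable|suppressed):"
    r"\s*([\d,]+) bytes in ([\d,]+) blocks"
)


def parse_leak_summary(log_text):
    """Return {category: (bytes, blocks)} for every LEAK SUMMARY category found."""
    summary = {}
    for category, nbytes, nblocks in LEAK_LINE.findall(log_text):
        summary[category] = (
            int(nbytes.replace(",", "")),
            int(nblocks.replace(",", "")),
        )
    return summary


def unreachable_bytes(summary):
    """Bytes with no live pointer chain at exit (definitely + indirectly lost)."""
    definitely = summary.get("definitely lost", (0, 0))[0]
    indirectly = summary.get("indirectly lost", (0, 0))[0]
    return definitely + indirectly
```

Applied to the .NET 5 log above, `unreachable_bytes` would report the 60 'definitely lost' bytes and leave the roughly 2 MB of 'still reachable' memory out of the total.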

trampster commented 3 years ago

Releasing a product as important as .net with Use of uninitialised values and Conditional jumps on uninitialised values is a big problem.

It is errors like these that people exploit to compromise software.

I would recommend running valgrind in your CI and gating your release on it. It is not acceptable to release with these errors.

trampster commented 3 years ago

I ran valgrind on .net 6.

There are still many 'Use of uninitialised values' and 'Conditional jumps on uninitialised values'

And there are more 'definitely lost bytes'

(Sample from end of log, complete log is attached)

==162133== Use of uninitialised value of size 8
==162133==    at 0x5BA3E000: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/System.Private.CoreLib.dll)
==162133==    by 0x5B8912C3: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/System.Private.CoreLib.dll)
==162133==    by 0x55F78A6: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x5456DCA: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x5310590: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x548CB9A: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x578AFCD: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x4868608: start_thread (pthread_create.c:477)
==162133==    by 0x4CF7292: clone (clone.S:95)
==162133==
==162133== Use of uninitialised value of size 8
==162133==    at 0x5BA3E006: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/System.Private.CoreLib.dll)
==162133==    by 0x5B8912C3: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/System.Private.CoreLib.dll)
==162133==    by 0x55F78A6: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x5456DCA: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x5310590: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x548CB9A: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x578AFCD: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x4868608: start_thread (pthread_create.c:477)
==162133==    by 0x4CF7292: clone (clone.S:95)
==162133==
==162133== Conditional jump or move depends on uninitialised value(s)
==162133==    at 0x54B1D89: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x5BA3E022: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/System.Private.CoreLib.dll)
==162133==    by 0x5B8912C3: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/System.Private.CoreLib.dll)
==162133==    by 0x55F78A6: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x5456DCA: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x5310590: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x548CB9A: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x578AFCD: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x4868608: start_thread (pthread_create.c:477)
==162133==    by 0x4CF7292: clone (clone.S:95)
==162133==
==162133== Use of uninitialised value of size 8
==162133==    at 0x54B1DA5: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x5BA3E022: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/System.Private.CoreLib.dll)
==162133==    by 0x5B8912C3: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/System.Private.CoreLib.dll)
==162133==    by 0x55F78A6: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x5456DCA: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x5310590: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x548CB9A: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x578AFCD: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x4868608: start_thread (pthread_create.c:477)
==162133==    by 0x4CF7292: clone (clone.S:95)
==162133==
==162133== Use of uninitialised value of size 8
==162133==    at 0x54B1E66: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x5BA3E022: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/System.Private.CoreLib.dll)
==162133==    by 0x5B8912C3: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/System.Private.CoreLib.dll)
==162133==    by 0x55F78A6: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x5456DCA: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x5310590: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x548CB9A: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x578AFCD: ??? (in /home/daniel/dotnet/shared/Microsoft.NETCore.App/6.0.0-preview.3.21201.4/libcoreclr.so)
==162133==    by 0x4868608: start_thread (pthread_create.c:477)
==162133==    by 0x4CF7292: clone (clone.S:95)
==162133==
==162133==
==162133== HEAP SUMMARY:
==162133==     in use at exit: 2,135,150 bytes in 2,349 blocks
==162133==   total heap usage: 16,517 allocs, 14,168 frees, 6,857,844 bytes allocated
==162133==
==162133== LEAK SUMMARY:
==162133==    definitely lost: 140 bytes in 4 blocks
==162133==    indirectly lost: 0 bytes in 0 blocks
==162133==      possibly lost: 4,719 bytes in 14 blocks
==162133==    still reachable: 2,130,291 bytes in 2,331 blocks
==162133==         suppressed: 0 bytes in 0 blocks
==162133== Rerun with --leak-check=full to see details of leaked memory
==162133==
==162133== Use --track-origins=yes to see where uninitialised values come from
==162133== For lists of detected and suppressed errors, rerun with: -s
==162133== ERROR SUMMARY: 12823 errors from 743 contexts (suppressed: 0 from 0)

dotnet6valgrind.log

am11 commented 3 years ago

"baseline" such intentional leaks

e.g. since 5.0 preview 3, there are 542 more cond jump cases found in .NET 6 preview 5, when merely running dotnet new console (net6: 718, net5: 176)
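Counts like these can be reproduced from a saved log with a few lines of Python. The sketch below tallies Memcheck error headers by kind; it is an illustration only, not necessarily how the figures above were obtained, and `count_error_kinds` is a hypothetical helper name. Note that Memcheck prints each distinct context once by default, so this counts reported contexts, not total occurrences.

```python
import re
from collections import Counter

# Error headers start right after the "==<pid>==" prefix; stack frames
# ("at ..." / "by ...") and summary lines do not match this pattern.
HEADER = re.compile(
    r"^==\d+==\s+("
    r"Conditional jump or move depends on uninitialised value\(s\)"
    r"|Use of uninitialised value of size \d+"
    r")\s*$"
)


def count_error_kinds(log_lines):
    """Count how many times each uninitialised-value header appears in the log."""
    counts = Counter()
    for line in log_lines:
        match = HEADER.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Running it over two logs (for example with `count_error_kinds(open(path))`) gives per-kind numbers that can be compared between runtime versions.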

I have also noticed platforms taking valgrind-reported memleaks seriously and keeping the number at 0 (e.g. llvm/clang does that), no matter how trivial.

danmoseley commented 3 years ago

(BTW, I did not mean to imply it is doing this intentionally, I'm an observer in this issue.)

cc @GrabYourPitchforks

tmds commented 3 years ago

A completely empty .net console project ( public static void Main() {} )

I reported and fixed a few issues here: https://github.com/dotnet/runtime/issues/46905. I just verified these haven't regressed.

I haven't used Valgrind

valgrind didn't work well on .NET Core before: it didn't properly handle some code generated by the JIT and crashed. I reported a bug for it: https://bugs.kde.org/show_bug.cgi?id=422174 which got fixed less than a year ago.

is there an established way to "baseline" such intentional leaks, so we can easily spot new issues if they emerge in future?

Yes, valgrind can do this using 'suppression' files.
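As a rough illustration of what such a baseline could look like, the Python sketch below writes suppression entries for errors whose innermost stack frame is inside libcoreclr.so. The entry layout (name, `Memcheck:<kind>`, then `obj:` frame patterns) follows Valgrind's suppression-file format; the entry names, the object glob, and the choice of error kinds are assumptions made for this example, not something the runtime ships.

```python
def make_suppression(name, kind, obj_glob):
    """Build one Memcheck suppression entry.

    With a single frame line, the entry matches errors whose innermost
    (topmost) stack frame lies in an object file matching obj_glob.
    """
    return "\n".join([
        "{",
        f"   {name}",
        f"   Memcheck:{kind}",
        f"   obj:{obj_glob}",
        "}",
    ])


def runtime_suppressions(obj_glob="*/libcoreclr.so"):
    """Entries for the two error kinds that dominate the logs in this thread.

    Cond   -> "Conditional jump or move depends on uninitialised value(s)"
    Value8 -> "Use of uninitialised value of size 8"
    The names below are hypothetical labels chosen for this sketch.
    """
    kinds = ["Cond", "Value8"]
    entries = [
        make_suppression(f"dotnet-runtime-{kind.lower()}", kind, obj_glob)
        for kind in kinds
    ]
    return "\n".join(entries) + "\n"
```

Matching only on the innermost frame is deliberate: an error raised inside a P/Invoked native library still has that library as its top frame, so it would not be hidden by these entries. Errors that surface in JIT-compiled code reported under System.Private.CoreLib.dll would need their own entries.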

I have also noticed platforms taking valgrind-reported memleaks seriously and keeping the number at 0 (e.g. llvm/clang does that), no matter how trivial.

I have worked on projects like this too. It is nice when you are in a state where you are at zero and remain at zero.

When there are many existing issues, it's different. Fixing them is important, but when they don't cause issues it doesn't become urgent. It's a matter of making time and improving things. I intend to make some time for it again.

It's a similar story for Coverity issues.

trampster commented 3 years ago

Given that this issue isn't getting resolved anytime soon, how would you recommend I debug the segfault I'm getting after I pinvoke to some native code?

The segfault itself is in .net code (it occurs while throwing/handling a .net exception), but I assume it's caused by memory corrupted by the native code I pinvoked to, or by the pinvoke itself.

Valgrind doesn't help because there is a reporting limit and .net Valgrind issues fill it up before it can get to my native code.

am11 commented 3 years ago

To chase down segmentation faults, I'd recommend installing the SOS plugin and running your application under the lldb debugger:

# step0: install lldb from package manager

# step1: install sos plugin
$ dotnet tool install --global dotnet-sos
$ dotnet-sos install
# also possible to install non- --globally, and also possible to fetch the
# standalone sos binary without the need for dotnet SDK..  see
# github/dotnet/diagnostics repo for details, docs, issues surrounding sos

# step2: the actual debugging
$ lldb /path/to/dirX/yourapplication arg1 arg2 ..
# in lldb REPL
(lldb) setsymbolserver -directory /path/to/dirX
(lldb) process handle SIGSEGV --notify true --pass true --stop true
(lldb) run
# when sigsegv occurs, these are your friends for analysis:
(lldb) bt # lldb command for backtrace
(lldb) bt all # lldb command for bt all threads
(lldb) clrstack -a # sos command that reads managed frames and lists managed methods, local variables etc.

To get all symbols properly resolved, yourapplication.pdb (for managed debug info), the PDBs of managed dependency assemblies, and libYouArePInvoking.so.dbg (native debug symbol file) need to be present next to their corresponding binaries.

trampster commented 3 years ago

What do I put for the setsymbolserver directory?

This is what I get for the backtrace:

  * frame #0: 0x00007ffff7a9218b libc.so.6`raise + 203
    frame #1: 0x00007ffff7a71859 libc.so.6`abort + 299
    frame #2: 0x00007ffff737216e libcoreclr.so`___lldb_unnamed_symbol15183$$libcoreclr.so + 30
    frame #3: 0x00007ffff73720bc libcoreclr.so`___lldb_unnamed_symbol15179$$libcoreclr.so + 220
    frame #4: 0x00007ffff715429d libcoreclr.so`___lldb_unnamed_symbol7947$$libcoreclr.so + 957
    frame #5: 0x00007ffff7154326 libcoreclr.so`___lldb_unnamed_symbol7948$$libcoreclr.so + 134
    frame #6: 0x00007ffff70b2b4b libcoreclr.so`___lldb_unnamed_symbol5960$$libcoreclr.so + 539
    frame #7: 0x00007fff7d8bb7dd
    frame #8: 0x00007fff7e8e9561
    frame #9: 0x00007fff7dcf5ffb
    frame #10: 0x00007ffff71f6ab7 libcoreclr.so`___lldb_unnamed_symbol9785$$libcoreclr.so + 124
    frame #11: 0x00007ffff70458ab libcoreclr.so`___lldb_unnamed_symbol4377$$libcoreclr.so + 1643
    frame #12: 0x00007ffff6f18fca libcoreclr.so`___lldb_unnamed_symbol302$$libcoreclr.so + 890
    frame #13: 0x00007ffff6f19319 libcoreclr.so`___lldb_unnamed_symbol303$$libcoreclr.so + 393
    frame #14: 0x00007ffff6f57733 libcoreclr.so`___lldb_unnamed_symbol1149$$libcoreclr.so + 627
    frame #15: 0x00007ffff6f0228d libcoreclr.so`coreclr_execute_assembly + 413
    frame #16: 0x00007ffff75a0b4a libhostpolicy.so`___lldb_unnamed_symbol134$$libhostpolicy.so + 826
    frame #17: 0x00007ffff75a0fb1 libhostpolicy.so`___lldb_unnamed_symbol135$$libhostpolicy.so + 49
    frame #18: 0x00007ffff75a19ed libhostpolicy.so`corehost_main + 173
    frame #19: 0x00007ffff7800fd2 libhostfxr.so`___lldb_unnamed_symbol182$$libhostfxr.so + 1746
    frame #20: 0x00007ffff77ff72b libhostfxr.so`___lldb_unnamed_symbol180$$libhostfxr.so + 667
    frame #21: 0x00007ffff77fbfe4 libhostfxr.so`hostfxr_main_startupinfo + 148
    frame #22: 0x0000555555564fc5 IntegrationTests`___lldb_unnamed_symbol136$$IntegrationTests + 1045
    frame #23: 0x00005555555654f0 IntegrationTests`___lldb_unnamed_symbol137$$IntegrationTests + 144
    frame #24: 0x00007ffff7a730b3 libc.so.6`__libc_start_main + 243
    frame #25: 0x0000555555558eaa IntegrationTests`___lldb_unnamed_symbol11$$IntegrationTests + 41

The CLR stack gives:

OS Thread Id: 0x4f5d9 (1)
Child SP         IP               Call Site
00007FFFFFFFCC10 00007ffff7a9218b [HelperMethodFrame: 00007fffffffcc10]
00007FFFFFFFCD80 00007FFF7D8BB7DD System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/ExceptionServices/ExceptionDispatchInfo.cs @ 56]
    PARAMETERS:
        this =

00007FFFFFFFCD90 00007FFF7D884835 System.Threading.Tasks.Task+<>c.b__1400(System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 1883]
    PARAMETERS:
        this =
        state =

00007FFFFFFFCDA0 00007FFF7DD012D9 Tait.Pttox.Tests.SingleThreadSynchronizationContext.Run(System.Func`1) [/home/daniel/Work/hydra-src/apps/Xamarin-PTToX/tests/IntegrationTests/SingleThreadedSynchronizationContext.cs @ 35]
    PARAMETERS:
        this (0x00007FFFFFFFCE08) = 0x00007fff480212c0
        action (0x00007FFFFFFFCE00) = 0x00007fff48074680
    LOCALS:
        0x00007FFFFFFFCDF8 = 0x00007fff480746c0
        0x00007FFFFFFFCDE8 = 0x00007fff4807c3f0
        0x00007FFFFFFFCDE4 = 0x0000000000000000
        0x00007FFFFFFFCDE0 = 0x0000000000000000
        0x00007FFFFFFFCDDC = 0x0000000000000001

00007FFFFFFFCE20 00007FFF7E8E9561 Tait.Pttox.Tests.PttoxSteps.IChangeGroupsTo(System.String) [/home/daniel/Work/hydra-src/apps/Xamarin-PTToX/tests/IntegrationTests/Steps/PttoxSteps.cs @ 126]
    PARAMETERS:
        this (0x00007FFFFFFFCE48) = 0x00007fff48021288
        groupName (0x00007FFFFFFFCE40) = 0x00007fff4800b170
    LOCALS:
        0x00007FFFFFFFCE38 = 0x00007fff48074660

00007FFFFFFFCE60 00007FFF7DCF5FFB Tait.Pttox.Tests.Program.Main() [/home/daniel/Work/hydra-src/apps/Xamarin-PTToX/tests/IntegrationTests/Program.cs @ 16]
    LOCALS:
        0x00007FFFFFFFCE78 = 0x00007fff48021288

I managed to filter out all the .net valgrind issues using a suppression file, and was able to confirm that our native code isn't producing any valgrind warnings. So the problem is either in .net or in our interop code.

The segfault happens in .net code while raising/handling an exception, but only if I have done the pinvoke first.

am11 commented 3 years ago

I have never seen that one. If you installed dotnet-sdk from package manager, then simply unset DOTNET_ROOT environment variable and it should work fine.

What do I put for the setsymbolserver directory?

Absolute path to the directory that contains your application's PDB file(s). But you can skip setsymbolserver if yours is not a single-file application.

trampster commented 3 years ago

@am11 That fixed the warning.

Any idea why my backtrace has unnamed symbol instead of anything useful?

am11 commented 3 years ago

___lldb_unnamed_symbol

Looks like SOS didn't download the symbols when you first ran lldb. It downloads the symbols when the app is first launched, so it probably timed out. You can download them manually using another tool:

$ dotnet tool install --global dotnet-symbol
$ dotnet-symbol $(command -v dotnet)
# assuming you are on ubuntu and you have installed dotnet from package manager
$ dotnet-symbol /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/*.so

This will download symbols for dotnet(1) and all the shared object files for 5.0.6. Adjust the paths to match your installation. The end result is that libcoreclr.so.dbg (the native symbols file) is placed next to libcoreclr.so in your installation directory. The same goes for the rest of the .so files.

trampster commented 3 years ago

dotnet install --global dotnet-symbol

Could not execute because the specified command or file was not found. Possible reasons for this include:

am11 commented 3 years ago

Typo

trampster commented 3 years ago

ERROR: Access to the path '/usr/bin/dotnet.dbg' is denied. -> Permission denied

Do I need to be root? (That doesn't seem like a good idea.)

am11 commented 3 years ago

You can skip dotnet(1) if you don't care about corehost (the driver that initializes coreclr). However, for /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.6/, you may want to chown (change ownership of) that directory to your user if you aren't comfortable with sudo. It downloads ._debug files from the dotnet symstore blobs to a temporary location, then renames them to .dbg and moves them next to the binaries. If it times out, you can pass --timeout 20 (the unit is minutes) to the dotnet-symbol command.

trampster commented 3 years ago

I have the native backtrace. As expected, it is handling the exception when it dies:

(lldb) bt

> * thread #1, name = 'IntegrationTest', stop reason = signal SIGABRT
>   * frame #0: 0x00007ffff7a9218b libc.so.6`raise + 203
>     frame #1: 0x00007ffff7a71859 libc.so.6`abort + 299
>     frame #2: 0x00007ffff737216e libcoreclr.so`::PROCAbort() at process.cpp:3473:5
>     frame #3: 0x00007ffff73720bc libcoreclr.so`PROCEndProcess(hProcess=<unavailable>, uExitCode=<unavailable>, bTerminateUnconditionally=<unavailable>) at process.cpp:1473:13
>     frame #4: 0x00007ffff715429d libcoreclr.so`UnwindManagedExceptionPass1(ex=<unavailable>, frameContext=<unavailable>) at exceptionhandling.cpp:0
>     frame #5: 0x00007ffff7154326 libcoreclr.so`DispatchManagedException(ex=0x00007fffffffcd40, isHardwareException=<unavailable>) at exceptionhandling.cpp:4686:17
>     frame #6: 0x00007ffff70b2b4b libcoreclr.so`IL_Throw(obj=<unavailable>) at jithelpers.cpp:4195:5
>     frame #7: 0x00007fff7d8fb7dd
>     frame #8: 0x00007fff7f0dbc31
>     frame #9: 0x00007fff7dd35ffb
>     frame #10: 0x00007ffff71f6ab7 libcoreclr.so`CallDescrWorkerInternal at unixasmmacrosamd64.inc:838
>     frame #11: 0x00007ffff70458ab libcoreclr.so`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) at callhelpers.cpp:68:5
>     frame #12: 0x00007ffff7045850 libcoreclr.so`MethodDescCallSite::CallTargetWorker(this=<unavailable>, pArguments=0x00007fffffffd130, pReturnValue=0x0000000000000000, cbReturnValue=0) at callhelpers.cpp:544
>     frame #13: 0x00007ffff6f18fca libcoreclr.so`RunMain(MethodDesc*, short, int*, PtrArray**) [inlined] MethodDescCallSite::Call(this=0x00007fffffffd198, pArguments=0x00007fffffffd130) at callhelpers.h:458:9
>     frame #14: 0x00007ffff6f18fc1 libcoreclr.so`RunMain(MethodDesc*, short, int*, PtrArray**) at assembly.cpp:1464
>     frame #15: 0x00007ffff6f18e4a libcoreclr.so`RunMain(MethodDesc*, short, int*, PtrArray**) [inlined] RunMain(this=<unavailable>, pParam=<unavailable>)::$_0::operator()(Param*) const::'lambda'(Param*)::operator()(Param*) const at assembly.cpp:1536
>     frame #16: 0x00007ffff6f18e4a libcoreclr.so`RunMain(MethodDesc*, short, int*, PtrArray**) at assembly.cpp:1538
>     frame #17: 0x00007ffff6f18e37 libcoreclr.so`RunMain(pFD=<unavailable>, numSkipArgs=1, piRetVal=<unavailable>, stringArgs=<unavailable>) at assembly.cpp:1538
>     frame #18: 0x00007ffff6f19319 libcoreclr.so`Assembly::ExecuteMainMethod(this=0x00005555557bfaa0, stringArgs=0x00007fffffffd580, waitForOtherThreads=YES) at assembly.cpp:1648:18
>     frame #19: 0x00007ffff6f57733 libcoreclr.so`CorHost2::ExecuteAssembly(this=<unavailable>, dwAppDomainId=<unavailable>, pwzAssemblyPath=u"/home/daniel/Work/hydra-src/apps/Xamarin-PTToX/tests/IntegrationTests/bin/Debug/net5.0/IntegrationTests.dll", argc=<unavailable>, argv=0x0000000000000000, pReturnValue=0x00007fffffffd6e0) at corhost.cpp:384:39
>     frame #20: 0x00007ffff6f0228d libcoreclr.so`::coreclr_execute_assembly(hostHandle=0x00005555557c96f0, domainId=1, argc=<unavailable>, argv=0x0000000000000000, managedAssemblyPath=<unavailable>, exitCode=0x00007fffffffd6e0) at unixinterface.cpp:431:24
>     frame #21: 0x00007ffff75a0b4a libhostpolicy.so`run_app_for_context(context=<unavailable>, argc=<unavailable>, argv=0x0000000000000000) at hostpolicy.cpp:240:32
>     frame #22: 0x00007ffff75a0fb1 libhostpolicy.so`run_app(argc=0, argv=0x00007fffffffdce0) at hostpolicy.cpp:275:12
>     frame #23: 0x00007ffff75a19ed libhostpolicy.so`::corehost_main(argc=<unavailable>, argv=0x00007fffffffdcd8) at hostpolicy.cpp:408:12
>     frame #24: 0x00007ffff7800fd2 libhostfxr.so`___lldb_unnamed_symbol182$$libhostfxr.so + 1746
>     frame #25: 0x00007ffff77ff72b libhostfxr.so`___lldb_unnamed_symbol180$$libhostfxr.so + 667
>     frame #26: 0x00007ffff77fbfe4 libhostfxr.so`hostfxr_main_startupinfo + 148
>     frame #27: 0x0000555555564fc5 IntegrationTests`___lldb_unnamed_symbol136$$IntegrationTests + 1045
>     frame #28: 0x00005555555654f0 IntegrationTests`___lldb_unnamed_symbol137$$IntegrationTests + 144
>     frame #29: 0x00007ffff7a730b3 libc.so.6`__libc_start_main + 243
>     frame #30: 0x0000555555558eaa IntegrationTests`___lldb_unnamed_symbol11$$IntegrationTests + 41

danmoseley commented 3 years ago

@trampster I edited your post to add triple back ticks (`) above and below to format it.

perlun commented 3 weeks ago

For reference, this is still an issue with .NET 8 (seen in my GitLab pipeline here: https://gitlab.perlang.org/perlang/perlang/-/jobs/264). Haven't tested with .NET 9 yet but unless some effort has been put into resolving this, I would doubt that the problems have magically gone away.

Yes, valgrind can do this using 'suppression' files.

I think this :point_up: would be the way to go. In fact, providing a set of suppression files would be very useful since (I guess) those could also be used by us who are more "consumers" of the platform than actually developing it. It means we could still use valgrind for our own code (particularly related to P/Invoke and such where it becomes relevant), suppressing the "false positives" that have nothing to do with the particular problems we are debugging.

Having said that, the current state is not "incredibly bad". Valgrind can still be used with .NET (as long as you add the COMPlus_GCHeapHardLimit=C800000 environment variable), but you perhaps can't get the full value out of it because you can't really enable all the leak checking.

(Edit: If you just want to get rid of the "still reachable" warnings, I think that --show-reachable=no could be of help. Not an advanced Valgrind user myself, it's just helpful when working (wrestling) with unmanaged C++ code. :see_no_evil:)

perlun commented 3 weeks ago

(btw, please someone fix the typo in the issue title. :slightly_smiling_face:)

am11 commented 3 weeks ago

@perlun for one, Mismatched free() count is 0 with dotnet9/10 builds. :)

On linux-musl-arm64, here is the summary:

8.0.403: ==2155== ERROR SUMMARY: 26058 errors from 756 contexts (suppressed: 0 from 0)
10.0.100-alpha.1.24508.1: ==1874== ERROR SUMMARY: 15087 errors from 968 contexts (suppressed: 0 from 0)
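For anyone tracking these numbers over time, a small Python sketch like the following can diff two ERROR SUMMARY lines. The helper names are invented for this example, and the pattern assumes the summary wording shown above.

```python
import re

# Matches "ERROR SUMMARY: 26058 errors from 756 contexts (...)".
ERROR_SUMMARY = re.compile(r"ERROR SUMMARY: (\d+) errors from (\d+) contexts")


def parse_error_summary(line):
    """Return (errors, contexts) from a line containing an ERROR SUMMARY."""
    match = ERROR_SUMMARY.search(line)
    if match is None:
        raise ValueError("no ERROR SUMMARY found in line")
    return int(match.group(1)), int(match.group(2))


def compare_summaries(old_line, new_line):
    """Return the change in errors and contexts going from old_line to new_line."""
    old_errors, old_contexts = parse_error_summary(old_line)
    new_errors, new_contexts = parse_error_summary(new_line)
    return {
        "errors": new_errors - old_errors,
        "contexts": new_contexts - old_contexts,
    }
```

For the two summaries above this gives 10,971 fewer errors but 212 more distinct contexts, so the total error count and the number of contexts do not necessarily move in the same direction.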

still things to improve, but overall, things are improving on this front.

COMPlus_GCHeapHardLimit=C800000

nit: slightly better DOTNET_GCHeapHardLimit=C800000 since COMPlus_ prefix is legacy.

perlun commented 3 weeks ago

still things to improve, but overall, things are improving on this front.

Nice to hear that @am11! 🥇 Much appreciated. Do we even run Valgrind in CI perhaps?

(and thanks for the DOTNET_GCHeapHardLimit suggestion too. 🙇 The COMPlus approach was something I learned from https://github.com/dotnet/runtime/issues/76986#issuecomment-1420014949 I think)

am11 commented 3 weeks ago

Do we even run Valgrind in CI perhaps?

We don't currently run Valgrind in CI because the system is at its limit. However, @janvorli and others occasionally review Valgrind reports manually.

To resolve symbols for Valgrind, you can use the dotnet-symbol tool:

$ dotnet tool install --global dotnet-symbol
$ dotnet-symbol --timeout 20 docs/examples/quickstart/hello_world.per
$ dotnet-symbol --timeout 20 src/Perlang.ConsoleApp/bin/Debug/net8.0/perlang

# Now, run Valgrind with the following command:
$ DOTNET_GCHeapHardLimit=C800000 valgrind --undef-value-errors=no \
    --error-exitcode=1 --leak-check=full --show-leak-kinds=all \
    src/Perlang.ConsoleApp/bin/Debug/net8.0/perlang docs/examples/quickstart/hello_world.per

This will report issues like:

==16620== Use of uninitialized value of size 8
==16620==    at 0x5E1380: RawSetMethodTable (src/coreclr/vm/object.h:148)
==16620==    by 0x5E1380: SetMethodTable (src/coreclr/vm/object.h:154)
==16620==    by 0x5E1380: JIT_NewS_MP_FastPortable(CORINFO_CLASS_STRUCT_*) (src/coreclr/vm/jithelpers.cpp:1237)

and

==16620== Conditional jump or move depends on uninitialized value(s)
==16620==    at 0x5E1EB4: JIT_NewArr1VC_MP_FastPortable(CORINFO_CLASS_STRUCT_*, long) (src/coreclr/vm/jithelpers.cpp:1467)

It seems like these types of warnings are related to JIT helper calls in the execution engine (EE), and something in the EE-to-JIT transition isn't fully complying with Valgrind's expectations.

perlun commented 2 weeks ago

Thanks again @am11, for taking your time to go into detail about my use case like this! 🙇 Beyond expectations. 🌟 (Not sure the dotnet-symbol --timeout 20 docs/examples/quickstart/hello_world.per will work though, because this is the source code for a Perlang program and not a "normal" .NET binary? But your comment is useful anyway and might help others.)

perlun commented 2 days ago

still things to improve, but overall, things are improving on this front.

For reference, here's the Valgrind suppression file I added to my project for now: https://gitlab.perlang.org/perlang/perlang/-/blob/21ab6e771e171e0410f0b61c5d6377e09a583e3e/scripts/valgrind-suppressions.txt. This makes it possible for me to Valgrind-check my C++-based shared library, being called from a C# executable, without false positives unrelated to my own code base. 🎉

In case anyone else is working on a similar project, feel free to use it as you wish. Use --suppressions=/path/to/file.supp in the Valgrind command line to enable an additional suppression file (Valgrind provides some default suppressions, located in /usr/libexec/valgrind/default.supp on my system). You can also use --gen-suppressions=yes to generate suppressions automatically for a program being executed (and then just copy those suppressions into your suppression file).
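The options mentioned above can be assembled into a command line with a short Python sketch like the one below. The helper names and the example file paths are hypothetical. `--error-limit=no` is not mentioned in this thread; it is a standard Valgrind option, added here on the assumption that you want reporting to continue past the default error cap. `--gen-suppressions=all` prints a suppression for every error without prompting, whereas `yes` asks interactively.

```python
import os


def valgrind_command(program_args, suppression_file=None, generate=False):
    """Build an argv list for running a program under Memcheck."""
    argv = [
        "valgrind",
        "--leak-check=full",
        "--error-limit=no",  # assumption: keep reporting past the default cap
    ]
    if suppression_file:
        # Adds this file on top of Valgrind's default suppressions.
        argv.append(f"--suppressions={suppression_file}")
    if generate:
        # Emit a ready-to-copy suppression after each reported error.
        argv.append("--gen-suppressions=all")
    return argv + list(program_args)


def valgrind_env(base_env=None):
    """Copy the environment and add the GC heap limit discussed in this thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["DOTNET_GCHeapHardLimit"] = "C800000"
    return env
```

A possible use is `subprocess.run(valgrind_command(["./IntegrationTests"], "dotnet.supp"), env=valgrind_env())`, where `dotnet.supp` stands in for whatever suppression file you maintain.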