dotnet / diagnostics

This repository contains the source code for various .NET Core runtime diagnostic tools and documents.
MIT License

Need help with SOS commands testing on RISC-V. #4716

Closed viewizard closed 5 months ago

viewizard commented 5 months ago

Hello,

we are currently testing SOS commands on Linux riscv64. All the commands tested so far work on riscv64 (all related commits are already in the diagnostics and clrmd upstreams). Unfortunately, we can't figure out how to test the last remaining commands. Could you please help with some advice or C# example code for:

1) analyzeoom (Displays the info of the last OOM that occurred on an allocation request to the GC heap.)

Should this be related to a managed exception, or to a real (Linux kernel side) out-of-memory condition in, for example, user native code that is called by user managed code?

2) crashinfo (Displays the crash details that created the dump.)

I was not able to get any info (all I see is ERROR: No crash info to display) with unhandled managed exceptions or with SIGABRT sent to the process directly (the dump was created by a kernel event with createdump added to /proc/sys/kernel/core_pattern).

3) dumpalc (Displays details about a collectible AssemblyLoadContext into which the specified object is loaded.)

I'm not really sure how this should be tested. I wrote an example test with var context = new AssemblyLoadContext("MyContext", isCollectible: true); ... DLL load + code execution, but was not able to figure out which object address should be used. Even the address of the AssemblyLoadContext object itself shows me:

Name:        System.Runtime.Loader.DefaultAssemblyLoadContext
The managed instance of this context doesn't exist yet

4) dumpsig (Dumps the signature of a method or field specified by sigaddr moduleaddr.) dumpsigelem (Dumps a single element of a signature object.)

Is there any way I could find a signature address to test these commands?

5) notreachableinrange (A helper command for !finalizerqueue)

Could you please point me to which addresses should be used here? Every time I try some addresses, I see only something like this:

> notreachableinrange 000ada008aa8 000ada008af0                                                                                                         
Warning: 3f358e6e00 is not a valid object
         Address               MT           Size
    003f358e6e00                                 
Warning: 0 is not a valid object
    000000000000                                 
Warning: 1f is not a valid object
    00000000001f                                 
Warning: 10100000003 is not a valid object
    010100000003                                 
Warning: 0 is not a valid object
    000000000000                                 
Warning: 0 is not a valid object
    000000000000                                 
Warning: 100000001 is not a valid object
    000100000001                                 
Warning: 0 is not a valid object
    000000000000                                 
Warning: 0 is not a valid object
    000000000000                                 

Statistics:
          MT Count TotalSize Class Name
004801100202     1         0 <unknown_type_4801100202>
000000000000     8         0 <unknown_type_0>
Total 9 objects, 0 bytes

CC @wscho77 @HJLeee @gbalykov

mikem8361 commented 5 months ago

For analyzeoom check with @leculver.

For dumpalc, dumpsig, and dumpsigelem, I'm not sure. They are not well tested even on other platforms.

crashinfo is like printexception for Native AOT apps that have crashed with an unhandled exception.

notreachableinrange is a test helper so I wouldn't worry about it.

leculver commented 5 months ago

I don't know that we have strong advice on analyzeoom. At its heart, there are some fields in the middle of the GC code which tell you why we hit a particular GC-related OutOfMemoryException.

So, for example, if some random component in a library or the BCL throws OutOfMemoryException for some reason, !analyzeoom wasn't meant for that. Instead, it's for when the GC itself attempted to allocate more memory and failed.

AFAIK, the only way to "test" this would be to somehow allocate a lot of memory in the process. This could simply be List<int[]> l = new(); while (true) l.Add(new int[100_000]); and then seeing what shakes out. Or you could directly P/Invoke mmap to allocate memory until it tells you it can't, then try to allocate managed objects for a while.
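A minimal sketch of the allocation-loop idea above, written out as a small helper. The class name OomRepro, the method name Fill, and the maxArrays cap are hypothetical and only there for illustration; they are not part of SOS or the runtime.

```csharp
using System;
using System.Collections.Generic;

public static class OomRepro
{
    // Hypothetical helper: allocates int arrays and keeps every one of them
    // reachable, so the GC cannot reclaim anything and has to keep growing
    // the heap. Returns the number of arrays allocated when maxArrays is hit.
    // If the GC cannot satisfy a request first, OutOfMemoryException
    // propagates to the caller.
    public static int Fill(int elementsPerArray, int maxArrays)
    {
        var retained = new List<int[]>();
        while (retained.Count < maxArrays)
        {
            // Both the array allocation and the list growth go through the GC.
            retained.Add(new int[elementsPerArray]);
        }
        GC.KeepAlive(retained);
        return retained.Count;
    }
}
```

Calling OomRepro.Fill(100_000, int.MaxValue) from Main and leaving the exception unhandled should end the process with a GC-originated OutOfMemoryException, after which a dump can be collected and analyzeoom tried on it. One caveat that is an assumption on my side rather than something stated in this thread: on Linux the kernel OOM killer may terminate the process before the GC reports a failure, and capping the heap with the DOTNET_GCHeapHardLimit setting (a hexadecimal byte count) is one way to make the GC itself fail first.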

Note that the actual values that analyzeoom reports may be different from run to run, even on a regular x64 platform, because what the GC was doing while it failed will determine what the command says.

As long as you can get analyzeoom to say something, it's probably working fine.

viewizard commented 5 months ago

@leculver @mikem8361 thank you very much!