dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

Tracking issue for illumos and Solaris x86-64 port work #34944

Open am11 opened 4 years ago

am11 commented 4 years ago

Cut from https://github.com/dotnet/runtime/issues/4173.

Given below is a high-level list of work items for Solaris x86-64 port:

gwr commented 5 months ago

I'd like to compare the behavior of this same code running on Linux (under gdb). Unfortunately, my (native) build on Linux is not completing. See: https://gist.github.com/gwr/3520dfbf14190e9225e8214f434ca38e/raw/LinuxBuild01.txt Can anyone suggest what's going wrong with that build? Thanks!

am11 commented 5 months ago

File not found: '/g/ws/dotnet/runtime/THIRD-PARTY-NOTICES.TXT'. [/g/ws/dotnet/runtime/src/coreclr/.nuget/Microsoft.NETCore.ILAsm/Microsoft.NETCore.ILAsm.pkgproj]

That file definitely exists, right? It's the intermittent issue with nuget https://github.com/NuGet/Home/issues/13572 (too many inodes). Just rebuild the packs subset ./build.sh packs -c Debug -gcc --keepnativesymbols true a few times until it builds the .tar.gz we are interested in. 😅

gwr commented 5 months ago

Yeah, that doesn't seem to be working for me. It keeps failing the nuget steps. Is there anything else I can do to work around that on this Linux VM?

am11 commented 5 months ago

You can directly use corerun (an internal test host) instead of dotnet.

$ cd runtime
$ src/tests/build.sh -generatelayoutonly -p:LibrariesConfiguration=Debug

then:

$ gdb --args artifacts/tests/coreclr/linux.x64.Debug/Tests/Core_Root/corerun \
    ../helloworld/bin/Debug/net9.0/helloworld.dll

am11 commented 5 months ago

@gwr, sometimes we also have stray dotnet processes; killing them helps: pkill -KILL dotnet (to reclaim the inodes and other resources).

gwr commented 5 months ago

Thanks. The test/coreclr thing did what I needed. With that and comparing behaviors, I believe I have a good fix to get rid of the need for the DOTNET_GCHeapHardLimit override. Pushed to: https://github.com/dotnet/runtime/compare/main...gwr:dotnet-runtime:illumos1

gwr commented 5 months ago

Now that helloworld is working OK, can you please remind me what test and debug steps to take next? eg. on System.Diagnostics.Process? For starters, after I build, I don't see an illumos dll in the artifacts. Help, @am11 ? Are you on matrix.org by any chance? (element IRC)

am11 commented 5 months ago

@gwr https://github.com/dotnet/runtime/issues/34944#issuecomment-2197520665 has a rough sketch.

Unless illumos and solaris differ, we can keep it under sunos rather than separate. Replace src/libraries/System.Diagnostics.Process/src/System.Diagnostics.Process.csproj with https://gist.github.com/am11/4b943df8712c6ce257a22b3aafad29f7. Basically I made a copy of freebsd lines with sunos. Of course you will need to create those files physically as well for the project build to succeed. :)

gwr commented 5 months ago

I could still use some pointers on how to attempt a build of these libs:

src/libraries/System.Diagnostics.Process/src/System.Diagnostics.Process.csproj
src/libraries/System.IO.FileSystem.Watcher/src/System.IO.FileSystem.Watcher.csproj
src/libraries/System.Net.Security/src/System.Net.Security.csproj

(@am11?) Thanks!

am11 commented 5 months ago

@gwr, my previous comment has the starting point. The prereq is to understand what other platform implementations are doing to determine which features are needed and which stack is suitable. You may find feature disparity across platforms in a few cases, so this work also requires understanding what cannot be implemented behind the public-facing APIs and marking those APIs with attributes like [UnsupportedOSPlatformGuard("illumos"), UnsupportedOSPlatformGuard("solaris")].

gwr commented 5 months ago

I've done some C# and can look at and understand what the other platforms are doing. However, when I try to build System.Diagnostics.Process, nothing even appears to attempt building anything for illumos. I guess maybe there's some configuration stuff (cmake?) that needs to change? Here's what I see:

gwr@ubuntu18:/g/ws/dotnet/runtime$ ./dotnet.sh build -p:TargetOS=illumos src/libraries/System.Diagnostics.Process/src

  Determining projects to restore...
  All projects are up-to-date for restore.
  ILLink.RoslynAnalyzer -> /g/ws/dotnet/runtime/artifacts/bin/ILLink.RoslynAnalyzer/Debug/netstandard2.0/ILLink.RoslynAnalyzer.dll
  ILLink.CodeFixProvider -> /g/ws/dotnet/runtime/artifacts/bin/ILLink.CodeFixProvider/Debug/netstandard2.0/ILLink.CodeFixProvider.dll
  ILCompiler.DependencyAnalysisFramework -> /g/ws/dotnet/runtime/artifacts/bin/ILCompiler.DependencyAnalysisFramework/Debug/ILCompiler.DependencyAnalysisFramework.dll
  Mono.Linker -> /g/ws/dotnet/runtime/artifacts/bin/Mono.Linker/ref/Debug/net9.0/illink.dll
  Mono.Linker -> /g/ws/dotnet/runtime/artifacts/bin/Mono.Linker/Debug/net9.0/illink.dll
  ILLink.Tasks -> /g/ws/dotnet/runtime/artifacts/bin/ILLink.Tasks/Debug/net9.0/ILLink.Tasks.dll
  Microsoft.Interop.SourceGeneration -> /g/ws/dotnet/runtime/artifacts/bin/Microsoft.Interop.SourceGeneration/Debug/netstandard2.0/Microsoft.Interop.SourceGeneration.dll
  LibraryImportGenerator -> /g/ws/dotnet/runtime/artifacts/bin/LibraryImportGenerator/Debug/netstandard2.0/Microsoft.Interop.LibraryImportGenerator.dll
  ComInterfaceGenerator -> /g/ws/dotnet/runtime/artifacts/bin/ComInterfaceGenerator/Debug/netstandard2.0/Microsoft.Interop.ComInterfaceGenerator.dll
  ILLink.RoslynAnalyzer -> /g/ws/dotnet/runtime/artifacts/bin/ILLink.RoslynAnalyzer/Debug/netstandard2.0/ILLink.RoslynAnalyzer.dll
  System.Runtime -> /g/ws/dotnet/runtime/artifacts/bin/System.Runtime/ref/Debug/net9.0/System.Runtime.dll
  System.ComponentModel -> /g/ws/dotnet/runtime/artifacts/bin/System.ComponentModel/ref/Debug/net9.0/System.ComponentModel.dll
  System.Diagnostics.FileVersionInfo -> /g/ws/dotnet/runtime/artifacts/bin/System.Diagnostics.FileVersionInfo/ref/Debug/net9.0/System.Diagnostics.FileVersionInfo.dll
  System.Collections -> /g/ws/dotnet/runtime/artifacts/bin/System.Collections/ref/Debug/net9.0/System.Collections.dll
  System.Collections.NonGeneric -> /g/ws/dotnet/runtime/artifacts/bin/System.Collections.NonGeneric/ref/Debug/net9.0/System.Collections.NonGeneric.dll
  System.ObjectModel -> /g/ws/dotnet/runtime/artifacts/bin/System.ObjectModel/ref/Debug/net9.0/System.ObjectModel.dll
  System.Runtime.InteropServices -> /g/ws/dotnet/runtime/artifacts/bin/System.Runtime.InteropServices/ref/Debug/net9.0/System.Runtime.InteropServices.dll
  System.ComponentModel.Primitives -> /g/ws/dotnet/runtime/artifacts/bin/System.ComponentModel.Primitives/ref/Debug/net9.0/System.ComponentModel.Primitives.dll
  System.Collections.Specialized -> /g/ws/dotnet/runtime/artifacts/bin/System.Collections.Specialized/ref/Debug/net9.0/System.Collections.Specialized.dll
  System.Diagnostics.Process -> /g/ws/dotnet/runtime/artifacts/bin/System.Diagnostics.Process/ref/Debug/net9.0/System.Diagnostics.Process.dll
  System.Diagnostics.Process -> /g/ws/dotnet/runtime/artifacts/bin/System.Diagnostics.Process/Debug/net9.0-ios/System.Diagnostics.Process.dll
  System.Diagnostics.Process -> /g/ws/dotnet/runtime/artifacts/bin/System.Diagnostics.Process/Debug/net9.0-maccatalyst/System.Diagnostics.Process.dll
  System.Diagnostics.Process -> /g/ws/dotnet/runtime/artifacts/bin/System.Diagnostics.Process/Debug/net9.0-windows/System.Diagnostics.Process.dll
  System.Diagnostics.Process -> /g/ws/dotnet/runtime/artifacts/bin/System.Diagnostics.Process/Debug/net9.0-linux/System.Diagnostics.Process.dll
  System.Diagnostics.Process -> /g/ws/dotnet/runtime/artifacts/bin/System.Diagnostics.Process/Debug/net9.0-tvos/System.Diagnostics.Process.dll
  System.Diagnostics.Process -> /g/ws/dotnet/runtime/artifacts/bin/System.Diagnostics.Process/Debug/net9.0-osx/System.Diagnostics.Process.dll
  System.Diagnostics.Process -> /g/ws/dotnet/runtime/artifacts/bin/System.Diagnostics.Process/Debug/net9.0-freebsd/System.Diagnostics.Process.dll
  System.Diagnostics.Process -> /g/ws/dotnet/runtime/artifacts/bin/System.Diagnostics.Process/Debug/net9.0/System.Diagnostics.Process.dll

Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:01:04.07
gwr@ubuntu18:/g/ws/dotnet/runtime$ 

Note there's no "illumos" in any of that. I want to make it at least try to build for illumos. What am I missing? Thanks again!

gwr commented 5 months ago

@gwr #34944 (comment) has a rough sketch.

Oh. Missed this. Thanks.

gwr commented 5 months ago

OK, I'm not very familiar with .csproj files. Thanks for the help with that. Is there any guidance on the layout of things under: src/libraries/Common/src/Interop/ E.g. `Linux/System.Native.vsLinux/*.cs` and others.

What are good tests for these libraries etc? Instructions?

Oh yeah: Are these libraries necessary for self-hosting? (native build) My work would be easier once I can build native.

Thanks.

am11 commented 5 months ago

Is there any guidance on the layout of things under: src/libraries/Common/src/Interop/

https://github.com/dotnet/runtime/blob/4ef65f869207154a4ad6a513bad798f8a96b7f61/docs/coding-guidelines/interop-guidelines.md

e.g. we added https://github.com/dotnet/runtime/blob/4ef65f869207154a4ad6a513bad798f8a96b7f61/src/libraries/Common/src/Interop/SunOS/procfs/Interop.ProcFsStat.TryReadProcessStatusInfo.cs#L18-L20

its C code lives here: https://github.com/dotnet/runtime/blob/4ef65f869207154a4ad6a513bad798f8a96b7f61/src/native/libs/System.Native/pal_io.c#L1823

Linux procfs is a bit "special" (src/libraries/Common/src/Interop/Linux/procfs) because those are text files and we read them directly from C# without interop with C. illumos procfs is binary based, therefore we need the regular interop.

Oh yeah: Are these libraries necessary for self-hosting? (native build)

Yes; they are necessary to complete the shared framework (sfx), here is why:

gwr commented 5 months ago

OK, some progress here. Any test and debug tips? https://github.com/dotnet/runtime/compare/main...gwr:dotnet-runtime:illumos2

am11 commented 5 months ago

Testing is a bit tricky, since the test executor itself can spawn a child process and fail due to the classic chicken-egg situation (we are porting the System.Diagnostics.Process which implements process spawning). You can give it a try.

On linux:

$ ./dotnet.sh build -p:TargetOS=illumos -p:CrossBuild=true src/libraries/System.Diagnostics.Process/tests

Then copy artifacts/bin/System.Diagnostics.Process.Tests/Debug/net9.0-unix to the illumos machine, say ~/projects/runtime-tests/System.Diagnostics.Process.Tests. To run:

DOTNET_REMOTEEXECUTOR_SUPPORTED=0 dotnet \
  ~/projects/runtime-tests/System.Diagnostics.Process.Tests/Debug/net9.0-unix/xunit.console.dll \
  ~/projects/runtime-tests/System.Diagnostics.Process.Tests/Debug/net9.0-unix/System.Diagnostics.Process.Tests.dll \
  -notrait category=nonillumostests -notrait category=nonsolaristests \
  -notrait category=OuterLoop -notrait category=failing

If this complains about targetframework 9.0.0-preview... etc. replace it in xunit.console.runtimeconfig.json and System.Diagnostics.Process.Tests.runtimeconfig.json (as we did in helloworld.runtimeconfig.json earlier).
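If the version has to be replaced in several files, the edit can be scripted. This is only a minimal sketch: it assumes the usual runtimeconfig.json layout (a runtimeOptions object holding a framework entry and/or a frameworks list, each with name and version), and the helper names retarget_framework and retarget_file are made up for illustration.

```python
import json
from pathlib import Path


def retarget_framework(config: dict, version: str,
                       name: str = "Microsoft.NETCore.App") -> dict:
    """Return a copy of a runtimeconfig dict with the framework version replaced.

    Assumes the common layout: runtimeOptions.framework (single entry) and/or
    runtimeOptions.frameworks (list), each entry holding "name" and "version".
    """
    updated = json.loads(json.dumps(config))  # cheap deep copy of JSON data
    options = updated.get("runtimeOptions", {})
    entries = []
    if isinstance(options.get("framework"), dict):
        entries.append(options["framework"])
    entries.extend(e for e in options.get("frameworks", []) if isinstance(e, dict))
    for entry in entries:
        if entry.get("name") == name:
            entry["version"] = version
    return updated


def retarget_file(path: Path, version: str) -> None:
    """Rewrite one *.runtimeconfig.json file in place (hypothetical helper)."""
    config = json.loads(path.read_text())
    path.write_text(json.dumps(retarget_framework(config, version), indent=2) + "\n")
```

Call retarget_file once per file (xunit.console.runtimeconfig.json and System.Diagnostics.Process.Tests.runtimeconfig.json), passing whatever version the runtime installed on the illumos machine reports.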

Once the ball starts rolling, you can look at [PlatformSpecific(TestPlatforms.Linux)] etc., which may be applicable on illumos, e.g.

https://github.com/dotnet/runtime/blob/1fe7d189db4a49bc676ddb206456709e089c2293/src/libraries/System.Diagnostics.Process/tests/ProcessTests.cs#L1667 to include the platform (TestPlatforms.illumos and TestPlatforms.Solaris are the supported enum values). Similarly, the skip platform condition looks like: https://github.com/dotnet/runtime/blob/1fe7d189db4a49bc676ddb206456709e089c2293/src/libraries/System.Diagnostics.Process/tests/ProcessTests.cs#L605

gwr commented 5 months ago

Thanks. I'm debugging. Is there a way to ask dotnet to pause during (or shortly after) initialization so I can attach to the process with gdb? It's difficult to get the environment and all the args setup if I let gdb actually try to start the program. I think I saw a pause for debug attach somewhere...

Eg. maybe like #2456 proposes?

Thanks

am11 commented 5 months ago

For managed (C#) code. It requires a few things.

For native (C/C++/assembly) runtime code debugging, just set a breakpoint and continue or use something like while (true) { if (ptrace(PTRACE_TRACEME, 0, nullptr, nullptr) == -1) break; }

I'd use the poor man's printf-debugging technique (using Console.WriteLine("I'm here!"); etc. in C# and printf in C/C++) for now to get the base set of libraries ported.

gwr commented 5 months ago

Linux procfs is a bit "special" (src/libraries/Common/src/Interop/Linux/procfs) because those are text files and we read them directly from C# without interop with C. illumos procfs is binary based, therefore we need the regular interop.

BTW, SunOS and illumos have the same style of /proc/pid/* that Linux has. We should be able to do similarly as the Linux code if we want.

am11 commented 5 months ago

Last I checked it has a binary interface unlike linux, i.e. you can do stuff like cat /proc/$$/meminfo on linux but can't cat /proc/$$/psinfo on illumos where it requires reading with structs.

gwr commented 5 months ago

Last I checked it has a binary interface unlike linux, i.e. you can do stuff like cat /proc/$$/meminfo on linux but can't cat /proc/$$/psinfo on illumos where it [requires reading with structs]

Ah right. Yeah, the content that flows over those file descriptors is not human readable. (and on the plus side, does not require any text parsing:)
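To make the text-versus-binary difference concrete, here is a rough sketch in Python. The record layout below is invented for illustration (two 32-bit integers and a 16-byte name); it is not the real illumos psinfo_t, and the function names are hypothetical.

```python
import struct

# Hypothetical fixed-layout record, loosely in the spirit of a binary procfs
# file. This is NOT the real illumos psinfo_t layout: pid (int32),
# parent pid (int32), then a 16-byte NUL-padded command name.
RECORD = struct.Struct("=ii16s")


def parse_text_status(text: str) -> dict:
    """Linux style: 'Key:<whitespace>value' lines that need text parsing."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def parse_binary_record(data: bytes) -> dict:
    """illumos style: a fixed-size struct decoded by offset, no text parsing."""
    pid, ppid, raw_name = RECORD.unpack(data[:RECORD.size])
    name = raw_name.split(b"\0", 1)[0].decode("ascii")
    return {"pid": pid, "ppid": ppid, "name": name}
```

The text variant returns strings that still need conversion, while the binary variant yields typed fields directly but depends on knowing the exact struct layout, which is why the C interop layer is used for it.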

am11 commented 5 months ago

Yup, note that interop layer also incurs some cost (it adds additional thunks / frames for marshaling). So reading it as text file in C# on linux with non-allocate-y text parsing is working ok. Also, System.Diagnostics.Process is not performance critical; i.e. end-users are most likely not going to put process spawning on performance-sensitive path in their code (so I believe correctness is more important than perf for this lib).

gwr commented 5 months ago

For managed (C#) code. It requires a few things.

  • Already ported: HP libunwind (in-tree copy is at src/native/external/libunwind),
  • gdb is not supported [...] so we need llvm-toolchain or just lldb,

We have most of llvm/clang (current is clang-18). I don't see the "lldb" debugger. I guess that's still todo.

... and libSOS, which has a lldbplugin [...] If llvm-toolchain is ported on illumos [...], we can bring it onboard. It will require some tweaking in rootfs toolchain etc. but it's a nontrivial task.

Hopefully we can stick with gcc for the rootfs toolchain for a while.

For native (C/C++/assembly) runtime code debugging, just set a breakpoint and continue or use something like [... ptrace, sleep, ...]

I've been doing that, but I'm having trouble coming up with a good place to put breaks, eg after all the exec and dll loading happens. Any suggestions where's a good place for a startup breakpoint?

gwr commented 5 months ago

Trying to debug with gdb looks like a lost cause

(gdb) run sdp-test/net9.0-unix/xunit.console.dll \
    sdp-test/net9.0-unix/System.Diagnostics.Process.Tests.dll \
    -notrait category=nonillumostests \
    -notrait category=nonsolaristests \
    -notrait category=OuterLoop \
    -notrait category=failing
Starting program: /tank/ws/dnt/dotnet sdp-test/net9.0-unix/xunit.console.dll \
    sdp-test/net9.0-unix/System.Diagnostics.Process.Tests.dll \
    -notrait category=nonillumostests \
    -notrait category=nonsolaristests \
    -notrait category=OuterLoop \
    -notrait category=failing
[Thread debugging using libthread_db enabled]
Thread 2 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1 (LWP 1)]
0x00007ffef94b8667 in ?? ()
(gdb) where
#0  0x00007ffef94b8667 in ?? ()
#1  0x0000000000000047 in ?? ()
#2  0x0000000000000001 in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb) 

Though if I continue, it does give me a backtrace of the C# code:

Continuing.
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at System.IO.Enumeration.FileSystemEnumerableFactory+<>c__DisplayClass2_0.<UserFiles>b__1(System.IO.Enumeration.FileSystemEntry ByRef)
   at System.IO.Enumeration.FileSystemEnumerable`1+DelegateEnumerator[[System.__Canon, System.Private.CoreLib, Version=9.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].ShouldIncludeEntry(System.IO.Enumeration.FileSystemEntry ByRef)
   at System.IO.Enumeration.FileSystemEnumerator`1[[System.__Canon, System.Private.CoreLib, Version=9.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].MoveNext()
   at System.Collections.Generic.List`1[[System.__Canon, System.Private.CoreLib, Version=9.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]]..ctor(System.Collections.Generic.IEnumerable`1<System.__Canon>)
   at System.IO.Directory.GetFiles(System.String, System.String, System.IO.EnumerationOptions)
   at System.IO.Directory.GetFiles(System.String, System.String)
   at Xunit.ConsoleClient.ConsoleRunner.GetAvailableRunnerReporters()
   at Xunit.ConsoleClient.ConsoleRunner.EntryPoint(System.String[])
   at Xunit.ConsoleClient.Program.Main(System.String[])

Thread 2 received signal SIGABRT, Aborted.
0x00007fffaf3fb6aa in _lwp_kill () from /lib/64/libc.so.1

Is that all I have to work with here? (until lldb)

am11 commented 5 months ago

The exception stacktrace will show up without gdb as well. The exception is pointing to this method: https://github.com/dotnet/runtime/blob/64efe2654c8455e7591aa07e7e8505064f571fc4/src/libraries/System.Private.CoreLib/src/System/IO/Enumeration/FileSystemEnumerableFactory.cs#L114

You can probably repro it with helloworld app using this in Program.cs

EnumerationOptions options = new()
{
    IgnoreInaccessible = false,
    RecurseSubdirectories = true
};

foreach (var file in Directory.GetFiles("/tmp", "*", options))
{
    Console.WriteLine(file);
}

publish helloworld from linux, copy to illumos and run.

AustinWise commented 5 months ago

The exception stacktrace will show up without gdb as well. The exception is pointing to this method: https://github.com/dotnet/runtime/blob/64efe2654c8455e7591aa07e7e8505064f571fc4/src/libraries/System.Private.CoreLib/src/System/IO/Enumeration/FileSystemEnumerableFactory.cs#L114

You can probably repro it with helloworld app using this in Program.cs


EnumerationOptions options = new()
{
    IgnoreInaccessible = false,
    RecurseSubdirectories = true
};

foreach (var file in Directory.GetFiles("/tmp", "*", options))
{
    Console.WriteLine(file);
}

publish helloworld from linux, copy to illumos and run.

This is pretty much the same repro I wrote for https://github.com/dotnet/runtime/issues/104448 . With that fix, running xunit library tests works.

Sorry for not being more clear that was the problem that PR fixes, I was a bit rushed to get some 4th things.

gwr commented 5 months ago

This is pretty much the same repro I wrote for #104448 . With that fix, running xunit library tests works.

Sorry for not being more clear that was the problem that PR fixes, I was a bit rushed to get some 4th things.

Thanks. I pulled your fixes for #104447 and #104448 to my local working branch. Here's what I get now:

$ DOTNET_REMOTEEXECUTOR_SUPPORTED=0 \
./dotnet sdp-test/net9.0-unix/xunit.console.dll \
    sdp-test/net9.0-unix/System.Diagnostics.Process.Tests.dll \
    -notrait category=nonillumostests \
    -notrait category=nonsolaristests \
    -notrait category=OuterLoop \
    -notrait category=failing
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at System.IO.Enumeration.FileSystemEnumerableFactory+<>c__DisplayClass2_0.<UserFiles>b__1(System.IO.Enumeration.FileSystemEntry ByRef)
   at System.IO.Enumeration.FileSystemEnumerable`1+DelegateEnumerator[[System.__Canon, System.Private.CoreLib, Version=9.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].ShouldIncludeEntry(System.IO.Enumeration.FileSystemEntry ByRef)
   at System.IO.Enumeration.FileSystemEnumerator`1[[System.__Canon, System.Private.CoreLib, Version=9.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].MoveNext()
   at System.Collections.Generic.List`1[[System.__Canon, System.Private.CoreLib, Version=9.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]]..ctor(System.Collections.Generic.IEnumerable`1<System.__Canon>)
   at System.IO.Directory.GetFiles(System.String, System.String, System.IO.EnumerationOptions)
   at System.IO.Directory.GetFiles(System.String, System.String)
   at Xunit.ConsoleClient.ConsoleRunner.GetAvailableRunnerReporters()
   at Xunit.ConsoleClient.ConsoleRunner.EntryPoint(System.String[])
   at Xunit.ConsoleClient.Program.Main(System.String[])
.run-test: line 13: 19114: Abort(coredump)
Abort

How do I track those names back to the source code? Are those something my "demangle" command could make sense of, the way it works for C++ code?

gwr commented 5 months ago

Would it be useful for us to have a "feature" branch or something? Then I wouldn't have to cherry-pick your fixes out of the PRs, or you mine. :)

am11 commented 5 months ago

The second stacktrace seems to be same as the first one?

gwr commented 5 months ago

The second stacktrace seems to be same as the first one?

Oh. Right. Huh...

gwr commented 5 months ago

I could use some help tracking the flow from (for example) the files changed in #104448 into any temporary objects and then the deliverables I copy onto the target. It looks like the changed object (and behavior) is not getting onto my test setup.

For example, the key change is in pal_io.c so I looked for that:

cd .../artifacts
$ find . -name 'pal_io.*' -print
./obj/native/net9.0-illumos-Debug-x64/System.Native/CMakeFiles/System.Native-Static.dir/pal_io.c.o.d
./obj/native/net9.0-illumos-Debug-x64/System.Native/CMakeFiles/System.Native-Static.dir/pal_io.c.o
./obj/native/net9.0-illumos-Debug-x64/System.Native/CMakeFiles/System.Native.dir/pal_io.c.o.d
./obj/native/net9.0-illumos-Debug-x64/System.Native/CMakeFiles/System.Native.dir/pal_io.c.o
./obj/coreclr/illumos.x64.Debug/libs-native/System.Native/CMakeFiles/System.Native-Static.dir/pal_io.c.o.d
./obj/coreclr/illumos.x64.Debug/libs-native/System.Native/CMakeFiles/System.Native-Static.dir/pal_io.c.o

So does that land in the dotnet program? or where? Thanks

am11 commented 5 months ago

In this case, it's called libSystem.Native.so (as it is in src/native/libs/System.Native which has the CMakeLists.txt file with project(System.Native) directive), so I'd copy assets from find artifacts/bin -iname 'libSystem.Native*' onto the VM

artifacts/obj is the intermediate objects directory; its contents participate in building the product binaries that go in artifacts/bin and later artifacts/packages.
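One way to check the build side of that flow is to compare timestamps between the two directories. The following is only a sketch under the assumption that a product binary older than its object file means the link step has not rerun; the helper names are invented, and the patterns in the usage note are the ones from this thread.

```python
from pathlib import Path


def newest_mtime(root: Path, pattern: str) -> float:
    """Newest modification time among files under root matching pattern (0.0 if none)."""
    return max((p.stat().st_mtime for p in root.rglob(pattern) if p.is_file()),
               default=0.0)


def product_is_stale(artifacts: Path, obj_pattern: str, bin_pattern: str) -> bool:
    """True if an intermediate object is newer than every matching product binary."""
    obj_time = newest_mtime(artifacts / "obj", obj_pattern)
    bin_time = newest_mtime(artifacts / "bin", bin_pattern)
    return obj_time > bin_time
```

For the case above this would be product_is_stale(Path("artifacts"), "pal_io.c.o", "libSystem.Native*"). It says nothing about whether the files on the target machine are current; those still have to be copied over.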

Separately (not for each change like this one), it's a good idea to refresh the environment from time to time to avoid later surprises: rm -rf artifacts on linux, rebuild clr+libs+packs subsets, copy the runtime tar.gz over to the illumos machine, and recreate ~/.dotnet (a helper script might come in handy to automate it).

AustinWise commented 5 months ago

I could use some help tracking the flow from (for example) the files changed in https://github.com/dotnet/runtime/pull/104448 into any temporary objects and then the deliverables I copy onto the target.

Personally what I've been doing is doing a full ./build.sh clr+libs+packs -cross -os illumos and then copying over artifacts/packages/Debug/Shipping/dotnet-runtime-9.0.0-dev-illumos-x64.tar.gz to the target. It's a little slow, but it appears to be reliable.

How do I track those names back to the source code? Are those something my "demangle" command could make sense of, the way it works for C++ code?

One thing you can do to get line numbers in these managed backtraces is to copy the symbol files over to the target. They live in .pdb files. So if you put System.Private.CoreLib.pdb next to System.Private.CoreLib.dll, the runtime will automatically add the file paths and line numbers to the backtraces. You can find these PDB files in artifacts/packages/Debug/Shipping/Microsoft.NETCore.App.Runtime.illumos-x64.9.0.0-dev.symbols.nupkg. This is just a zip file. The structure is a little different from the dotnet-runtime-9.0.0-dev-illumos-x64.tar.gz, but you should be able to figure out how to copy the PDB files next to their corresponding DLL files. (Maybe there is a command line option to include these PDB files in the tar.gz file, but I have not checked.)

I'm not aware of a standalone demangling program. There is a library for it. The readme describes several types of mangling, so it could be useful:

https://github.com/benaadams/Ben.Demystifier

For the specific example:

System.IO.Enumeration.FileSystemEnumerableFactory+<>c__DisplayClass2_0.<UserFiles>b__1

The + indicates the start of a nested class. The <> at the start of a class name indicates a compiler-generated class. In this case DisplayClass means it is a closure of a lambda method. The name of the method where this lambda was defined is part of the name (UserFiles). So to put it all together, in the class FileSystemEnumerableFactory there is a method UserFiles that declared a lambda function, and that lambda is currently executing. So here. (It is worth noting that normally it should not be possible for this method to cause an access violation (aka segv). This indicates that memory corruption damaged the managed reference.)
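The reading rules above can be turned into a tiny decoder. This is a sketch, not a general demangler: the regular expression covers only the closure/lambda shape seen in this backtrace, and describe_frame is a made-up name.

```python
import re

# Matches e.g. "Outer+<>c__DisplayClass2_0.<UserFiles>b__1".
# The pattern covers only this closure/lambda shape, not every
# compiler-generated name.
_LAMBDA = re.compile(
    r"^(?P<outer>[\w.`]+)\+<>c(?:__DisplayClass\w+)?\.<(?P<method>\w+)>b__\w+$"
)


def describe_frame(name: str) -> str:
    """Describe a compiler-generated lambda frame, or return the name unchanged."""
    match = _LAMBDA.match(name)
    if match is None:
        return name
    return f"lambda declared in {match['outer']}.{match['method']}"
```

Frames that do not match, such as System.IO.Directory.GetFiles, are returned as-is, so the function can be applied to every line of a backtrace.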

am11 commented 5 months ago

@AustinWise if stacktrace is the same as before then either https://github.com/dotnet/runtime/pull/104448 fix didn't work, or test was done with old binaries. Maybe try running the same xunit.console.dll command to see if it repros on your box?

gwr commented 5 months ago

Yeah, the xunit.console.dll on the test system (after copying as above) shows old dates. Will try removing the artifacts directory.

AustinWise commented 5 months ago

@AustinWise if stacktrace is the same as before then either #104448 fix didn't work, or test was done with old binaries. Maybe try running the same xunit.console.dll command to see if it repros on your box?

I checked the System.Diagnostics.Process tests to see if there was anything different. The runner gets past the test discovery phase without hitting the crash.

For what it's worth, the crash reproduced 100% of the time before my fix and reproduced 0% of the time after the fix. I have tested the fix both on SmartOS and OpenIndiana.

AustinWise commented 5 months ago

FYI on a gdb problem I'm having: .NET translates SIGFPE into DivideByZeroException. I noticed that when I'm attached to a process using GDB, this translation breaks. Something zeros out the siginfo->si_code that .NET relies upon to classify these signals. I'm not sure if this is a .NET problem, GDB problem, or illumos problem. Since I don't want to deal with that rabbit hole right now, I've hacked in a fix so I can keep using GDB: https://github.com/AustinWise/runtime/commit/f9f5886aac8caaa5254ad5509665bf987125f97b

am11 commented 5 months ago

Cool. Callstack was showing GetAvailableRunnerReporters(), which runs at the beginning before the tests execution. Hopefully, it will work for @gwr as well after the fresh build.

AustinWise commented 4 months ago

I noticed a problem with exception handling. .NET translates SIGSEGV into NullReferenceException. The sigsegv_handler is configured to use an alternate stack with sigaltstack. This handler does not behave like a normal signal handler: it switches the stack back to the original stack and resumes executing code. It never returns from the signal handler. On Linux this works fine: linux does not keep track of whether or not a signal handler returned after using the alternate stack. illumos, however, sets a bit called SS_ONSTACK when dispatching to a signal handler on an alternate stack and clears this bit when the handler returns. Before dispatching a signal, it checks to see if the SS_ONSTACK bit is set. If it is set, the alternate stack is not used.

.NET assumes that the alternate stack is always used for signal handlers. This means when it uses SwitchStackAndExecuteHandler to switch stacks, it is actually just moving up the stack a bit. This causes the siginfo and siginfo context parameters passed to the signal handler to be clobbered. Sadness ensues.
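The difference in dispatch behavior can be written down as a toy model. This is a simplification of the behavior as described in this comment, not kernel or runtime code; the class and function names are invented.

```python
from dataclasses import dataclass


@dataclass
class SignalDispatchModel:
    """Toy model of how a kernel picks the stack for a signal handler.

    tracks_onstack=True models the illumos behaviour described above (an
    SS_ONSTACK bit set on dispatch, cleared on handler return);
    tracks_onstack=False models the Linux behaviour described above.
    """
    tracks_onstack: bool
    onstack: bool = False

    def dispatch(self) -> str:
        """Deliver a signal and report which stack the handler runs on."""
        if self.tracks_onstack and self.onstack:
            return "current"  # bit still set: the alternate stack is skipped
        self.onstack = True
        return "alternate"

    def handler_returned(self) -> None:
        """A normal handler return clears the bit."""
        self.onstack = False


def stacks_for_two_faults(model: SignalDispatchModel, handler_returns: bool) -> list:
    """Stacks used for two consecutive faults on one thread."""
    used = []
    for _ in range(2):
        used.append(model.dispatch())
        if handler_returns:
            model.handler_returned()
    return used
```

With a handler that never returns, the illumos-style model gives ["alternate", "current"] for two faults, while the Linux-style model gives ["alternate", "alternate"]. The second fault landing on the current stack is the case the runtime does not expect.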

Here is a minimal C# reproduction program: https://github.com/AustinWise/CrashRepro/blob/master/csharp/Program.cs . It should print "Did not crash.". On illumos it will either crash with an unhandled AccessViolationException or an unhandled SIGSEGV. There is also a library test that triggers this behavior:

dotnet xunit.console.dll System.Runtime.Tests.dll -method "System.Tests.TupleTests.Equals_GetHashCode"

There is an existing environment variable that is supposed to work around this: DOTNET_EnableAlternateStackCheck=1 . However it appears this check does not work correctly. It checks to see if the point at which execution was interrupted by the signal is on an alternate stack. It should probably check whether the current stack the signal handler is using is the alternate stack. I have a commit that makes IsRunningOnAlternateStack more accurate and makes the aforementioned test program behave correctly: https://github.com/AustinWise/runtime/commit/6417f82ee3097bdbd8c78d16bd1ae610115fb98f

I'm not sure what the correct fix would be. Not use alternate stacks on illumos? Switch stacks by manipulating the context passed to the signal handler and returning from signal handler?

AustinWise commented 4 months ago

@gwr

I took a stab at the System.Diagnostics.Process support. The first commit sets up the build system and the function definitions needed. They all still throw PlatformNotSupportedException: https://github.com/AustinWise/runtime/commit/361f64a6abb0d7420c5f4249f7d22a6ad5015670

The second commit is hacky and incomplete. It is enough to get the RemoteExecutor working, which unblocks running a lot of tests: https://github.com/AustinWise/runtime/commit/c48ae3d4e3e350df59d9d41777ce2aaa5474663d Note that some elements of it are copy-pasted from the linux version. While linux uses a text based format and illumos uses a binary format, the general structure is similar.

I suspect I'm going to be busy for the next couple of weeks and won't have time to push this work forward during that time. I achieved my personal goal of getting the System.Runtime.Tests mostly working when run on my branch. The remaining failures look like they are caused by time zone data, but I have not looked into these deeply to confirm:

gwr commented 4 months ago

FYI on a gdb problem I'm having: .NET translates SIGFPE into DivideByZeroException. I noticed that when I'm attached to a process using GDB, this translation breaks. Something zeros out the siginfo->si_code that .NET relies upon to classify these signals. I'm not sure if this is a .NET problem, GDB problem, or illumos problem. Since I don't want to deal with that rabbit hole right now, I've hacked in a fix so I can keep using GDB: AustinWise@f9f5886

I've been doing some work on gdb, and I might like to look at this too. Is there any small reproduction environment available for looking at what gdb is doing with this?

AustinWise commented 4 months ago

Here is a minimal C# program that reproduces the problem, reduced from this System.Runtime.Tests case:

using System;
using System.Runtime.CompilerServices;

try
{
    Console.WriteLine(TestDiv(1, 0));
}
catch (DivideByZeroException)
{
    Console.WriteLine("PASS");
}

[MethodImpl(MethodImplOptions.NoInlining)]
static long TestDiv(long a, long b)
{
    return a / b;
}

It runs fine without GDB attached (prints "PASS"). When GDB is attached, it crashes with this error:

Process terminated. InternalError
   at System.Environment.<FailFast>g____PInvoke|11_0(System.Runtime.CompilerServices.StackCrawlMarkHandle, UInt16*, System.Runtime.CompilerServices.ObjectHandleOnStack, UInt16*)
   at System.Environment.FailFast(System.Runtime.CompilerServices.StackCrawlMarkHandle, System.String, System.Runtime.CompilerServices.ObjectHandleOnStack, System.String)
   at System.Environment.FailFast(System.Threading.StackCrawlMark ByRef, System.String, System.Exception, System.String)
   at System.Environment.FailFast(System.String)
   at System.Runtime.EH.FallbackFailFast(System.Runtime.RhFailFastReason, System.Object)
   at System.Runtime.EH.FailFastViaClasslib(System.Runtime.RhFailFastReason, System.Object, IntPtr)
   at System.Runtime.EH.RhThrowHwEx(UInt32, ExInfo ByRef)
   at Program.<<Main>$>g__TestDiv|0_0(Int64, Int64)
   at Program.<Main>$(System.String[])

This crash is reproducible on both my SmartOS and OpenIndiana systems, which are using GDB 7 and 14 respectively.
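To help separate the three suspects (.NET, GDB, illumos), the classification step can also be exercised in plain C without the runtime. The sketch below is not the runtime's PAL code. All function and enum names are hypothetical, and it assumes x86-64, where an integer divide by zero traps and is delivered as SIGFPE. It records the si_code the handler sees and maps it the way a runtime might when choosing which exception to raise.

```c
#define _POSIX_C_SOURCE 200809L /* expose sigaction, siginfo_t, sigsetjmp */
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

typedef enum { KIND_DIV_BY_ZERO, KIND_OVERFLOW, KIND_UNCLASSIFIED } fpe_kind;

static sigjmp_buf resume_point;
static volatile sig_atomic_t last_si_code;

/* Map a SIGFPE si_code to a category. Anything unexpected (including a
   zeroed si_code) falls through to KIND_UNCLASSIFIED. */
static fpe_kind classify_sigfpe(int si_code)
{
    switch (si_code) {
    case FPE_INTDIV: return KIND_DIV_BY_ZERO;
    case FPE_INTOVF: return KIND_OVERFLOW;
    default:         return KIND_UNCLASSIFIED;
    }
}

static void on_sigfpe(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)context;
    last_si_code = info->si_code;
    /* Returning would re-execute the faulting instruction, so jump out. */
    siglongjmp(resume_point, 1);
}

/* Hardware fault: volatile keeps the compiler from folding the division. */
static void trigger_hardware_divide(void)
{
    volatile long a = 1, b = 0;
    volatile long quotient = a / b;
    (void)quotient;
}

/* Software-sent SIGFPE: si_code is a "sent by a process" code, not FPE_*. */
static void trigger_raise(void)
{
    raise(SIGFPE);
}

/* Install the handler, run `trigger`, and return the si_code that was seen
   (0 if no signal arrived). */
static int observe_si_code(void (*trigger)(void))
{
    struct sigaction action, previous;
    int rc;

    memset(&action, 0, sizeof action);
    action.sa_sigaction = on_sigfpe;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    rc = sigaction(SIGFPE, &action, &previous);
    assert(rc == 0);

    last_si_code = 0;
    if (sigsetjmp(resume_point, 1) == 0)
        trigger();

    sigaction(SIGFPE, &previous, NULL);
    return (int)last_si_code;
}
```

Printing observe_si_code(trigger_hardware_divide) from a small main, once normally and once under GDB, would show whether si_code already arrives zeroed with no .NET code involved. trigger_raise is there for comparison: it produces a si_code that is not one of the FPE_* values, so it goes down the same unclassified path the crash seems to take.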

am11 commented 4 months ago

This isn't new. Let's discuss the signals issue where it belongs: https://github.com/dotnet/runtime/issues/35362 and keep this tracking issue limited to high-level milestones. When you run the PAL tests, you will find the platform differences.

gwr commented 4 months ago

OK. Sorry for making this ticket a bit "chatty". If I could have an email for you, I could use that for some of the "how do I..." questions and the like instead of making yet more noise here. My email is in all my commits. Thanks.

am11 commented 4 months ago

@gwr, I only meant to keep this issue as the main tracking one and branch off to separate issues (https://github.com/dotnet/runtime/issues) or discussions (https://github.com/dotnet/runtime/discussions) for specific concerns. This way we can call for help from other community members and area owners. In the current state of this thread, it is not easy to track each conversation, and mentioning someone on an issue labeled area-Meta would not be effective.

(also, I do not know all the answers, but I can help navigating things -- preferably on GitHub in open forums)

gwr commented 4 months ago

I took a stab at the System.Diagnostics.Process support. The first commit sets up the build system and the function definitions needed. They all still throw PlatformNotSupportedException: ...

That's interesting. Your "skeleton" looks somewhat like the Linux code (confirmed below). I was trying to work from the FreeBSD code and share the same BSD parts that Apple and FreeBSD share; e.g., the resource control calls should work the same on illumos.

The second commit is hacky and incomplete. It is enough to get the RemoteExecutor working, which unblocks running a lot of tests: ... Note that some elements of it are copy-pasted from the Linux version. While Linux uses a text-based format and illumos uses a binary format, the general structure is similar.

I've built what's on your branch, and can now reproduce your test results. Thanks.

gwr commented 4 months ago

I've made good progress thanks to help from @AustinWise (thanks again!). No more failures in System.Diagnostics.Process.Tests.

Here's a github compare link for the latest: https://github.com/dotnet/runtime/compare/main...gwr:illumos5

Should I start opening pull requests for all of those changes? Or how best to proceed?

Any guidance on what to work on next among those libraries?

am11 commented 4 months ago

Great progress @gwr! I think you can open a PR for review. Note that maintainers may be busy with .NET 9 preview 7 preparations, so it may take a while. @AustinWise and I can take a look.

Also note that there is one illumos fix I ninja'd in https://github.com/dotnet/runtime/pull/105178, which is blocked due to p7 prep (Environment.SunOS file).

gwr commented 4 months ago

Here's the PR for code that runs System.Diagnostics.Process.Tests (skips but no fails) https://github.com/dotnet/runtime/pull/105403

BTW, I tried rebasing on main from Mon. this week and ran into problems downloading stuff. Not sure why, but it didn't seem to have anything to do with my changes.

gwr commented 3 months ago

Note this needs https://github.com/dotnet/runtime/pull/105207 integrated before it's fully functional including exception handling. [ since merged ]