dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Port corehost to QNX7 #33374

Open guesshe opened 4 years ago

guesshe commented 4 years ago

Hi,

I am trying to port the entire runtime to qnx7 platform on x64 arch. I am able to build coreclr but it won't run unless I have dotnet executable built. Any suggestions on how to build corehost for qnx?

guesshe commented 4 years ago

@am11 Thanks for your reply! With this change applied, the program gets stuck at this method. Any idea how I can enable debug logging in libcoreclr.so?

am11 commented 4 years ago

@guesshe, for native runtime code, you can try using lldb by doing something like:

#!/usr/bin/env bash

$(command -v lldb) /path/to/yourapp

# inside lldb REPL, catch all C++ exceptions
(lldb) break set -E C++
(lldb) r
(lldb) bt

If you want to include the stack trace from the managed side as well, you first need to build the SOS plugin (libsosplugin.so) for LLDB from https://github.com/dotnet/diagnostics#building-the-repository, then:

#!/usr/bin/env bash

$(command -v lldb) /path/to/yourapp

# now inside the lldb REPL
break set -E C++
plugin load /path/to/libsosplugin.so
run
# will break on first C++ exception
dumpstack

For more info on diagnostics, there is much more content in the dotnet/diagnostics repo. Note that there is currently no SOS plugin for gdb; only lldb is supported (https://github.com/dotnet/diagnostics/issues/272).

guesshe commented 4 years ago

@am11 Thanks! It turned out that SA_RESTART is not supported on QNX. I removed this flag and it got a bit further. Now I am hitting the following crash. I guess it has something to do with the runtime host, but I am not sure. Any ideas?
Process 1585176 (dotnet) terminated SIGSEGV code=2 fltno=11 ip=0000000100fb14dd(/tmp/publish/publish/libcoreclr.so@GetCLRRuntimeHost+0x000000000013f099) mapaddr=00000000002354dd. ref=0000000101b1fb40
Memory fault (core dumped)

am11 commented 4 years ago

> So I solved my issue by using clang3.9 as assembler.

I think mixing clang and gcc toolchains is problematic. I'd try to fix the broken toolchain first and use either clang or gcc for the entire build.

guesshe commented 4 years ago

@am11 Oh, thanks! Even for the assembler? I will go back and fix the assembler issue. Any idea about the runtime crash? I thought it has something to do with build in host list.

am11 commented 4 years ago

@guesshe, the reason I mentioned using the same toolchain after looking at the segmentation fault is that we have previously been hit by SIGSEGV, and it is very hard to troubleshoot and understand the root cause in such cases. So it is best if the entire product is built with the same toolchain, to rule out such unrelated/external culprits.

> Any idea about the runtime crash? I thought it has something to do with build in host list.

I did not get a chance to look deeper, but if you hit it after rebuilding the runtime with gcc, e.g.

# workaround for gcc5
CFLAGS=-Wa,--divide CXXFLAGS=-Wa,--divide ./build.sh -configuration debug

or after building the entire product with clang or gcc (v7 or above) if possible, could you then attach a debugger and collect some data?

guesshe commented 4 years ago

@am11 Any idea why the build generated empty files with names like this ???@??@8?@????@@@???????????????????

guesshe commented 4 years ago

@am11 So I fixed the assembler issue by setting CMAKE_ASM_FLAGS to -Wa,--divide, but I am still getting this crash:
Process 1699864 (dotnet) terminated SIGSEGV code=2 fltno=11 ip=0000000100fb14dd(/tmp/publish/publish/libcoreclr.so@GetCLRRuntimeHost+0x000000000013f099) mapaddr=00000000002354dd. ref=0000000101b28b40
I am trying to find an lldb build that supports QNX.

wfurt commented 4 years ago

> @am11 Any idea why the build generated empty files with names like this ???@??@8?@????@@@???????????????????

Unicode? Is LANG/LC_ALL supported on QNX? What is the file location?

am11 commented 4 years ago

> fltno=11

@guesshe, I searched for just this string (verbatim) on Google and, surprisingly, the majority of hits on the first result page were QNX related. This article describes how they solved such a SIGSEGV with fltno=11 in a simple app on QNX using dladdr(3). Perhaps you would need to adjust some linker flags to get the paging policy right, or maybe make some code changes; I am not sure. However, one thing I would try is to comment out this line https://github.com/dotnet/runtime/blob/363b7add1906547eeba681b3f3ec3f686a603dee/eng/native/configureplatform.cmake#L343 and rebuild, in order to verify whether or not it is due to -fPIC.

guesshe commented 4 years ago

@am11 Thanks! I will try to figure it out.

guesshe commented 4 years ago

@wfurt This shows up on my Linux host, not on the target. I thought these could be build output files, but they are all empty.

am11 commented 4 years ago

@guesshe, I remember when NetBSD folks ported coreclr:

janvorli commented 4 years ago

That's the way I would recommend too (and we did it the same way when we were porting .NET Core to Linux 5 years ago)

guesshe commented 4 years ago

@am11 @janvorli Thanks! I will follow this path.

guesshe commented 4 years ago

@am11 I am continuing my debugging journey and getting the test suite running on QNX. I suspect my issue has something to do with how the function GetCLRRuntimeHost is called, but I don't know how this is related to the cruntime implementation.

guesshe commented 4 years ago

@am11 @janvorli So I managed to build and run the pal_test suite on my QNX VM. I got this result, but I doubt it is valid, as I saw some processes crash during the test execution. Does that produce a PASS status? I had to modify the bash script to be able to execute it in a ksh environment, but that was not a big change. Next I will focus on fixing the crashes I saw during the test execution. Most of them happened in strlen or reported "Unable to set thread priority to 0 (error 22)".

Finished running PAL tests: PAL Test Results: Passed: 726 Failed: 0

guesshe commented 4 years ago

@am11 @jkotas Do I need the managed library System.Private.CoreLib.dll for coreclr to work?

janvorli commented 4 years ago

You don't need it for PAL tests, but you need it for the next steps. This is the core managed library containing all the basic functionality and glue between the managed and native parts of the runtime.

As for some PAL tests failing and the results still showing that no tests have failed, this is strange and seems like we may have a bug not recognizing crashes as failures.
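One way a test runner can avoid counting crashes as passes is to inspect each test's exit status. The sketch below is a hypothetical helper, not the actual PAL test script; `classify_exit_status` and `run_one_test` are illustrative names. It assumes the shell reports a signal-terminated child as 128 plus the signal number, which holds for bash and dash; ksh93 uses 256 plus the signal number, so the convention should be checked on the target's ksh.

```shell
# Hypothetical helpers, not taken from the PAL test runner.

# Map an exit status to a verdict. Assumes the "128 + signal number"
# convention for children killed by a signal (bash, dash); ksh93 differs.
classify_exit_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "PASS"
    elif [ "$status" -gt 128 ]; then
        echo "CRASH (signal $((status - 128)))"
    else
        echo "FAIL (exit code $status)"
    fi
}

# Run one test executable quietly and print its verdict.
run_one_test() {
    "$@" >/dev/null 2>&1
    classify_exit_status $?
}
```

With this in place, a loop such as `for t in paltest_*; do run_one_test "./$t"; done` would report a SIGSEGV (signal 11) or SIGTRAP (signal 5) as a crash rather than silently counting it as a pass.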

guesshe commented 4 years ago

@janvorli @am11 @jkotas Here are the two types of crashes I saw during testing: one related to the strlen function and the other to thread priority.
Process 114688025 (paltest_fprintf_test2) terminated SIGSEGV code=1 fltno=11 ip=0000000100078e10(/usr/lib/ldqnx-64.so.2@strlen+0x0000000000000000) mapaddr=0000000000078e10. ref=0000000000000000
Memory fault (core dumped)

.{1-807d485} ASSERT [THREAD ] at /home//GitRepo/dotnet_runtime_nto/src/coreclr/src/pal/src/thread/thread.cpp.1263: Unable to set thread priority to 0 (error 22) Process 166789145 (paltest_criticalsectionfunctions_test2) terminated SIGTRAP code=1 fltno=3 ip=00000000080b9f57(/mnt/river/tmp/pal_tests/src/pal/tests/palsuite/threading/CriticalSectionFunctions/test2/paltest_criticalsectionfunctions_test2@DebugBreak+0x000000000005c031) mapaddr=0000000000071f57.

Here is the crash when I tried to launch my helloworld.dll using corerun:
Process 172077080 (corerun) terminated SIGSEGV code=2 fltno=11 ip=00000001010834cb(/mnt/river/tmp/libcoreclr.so@GetCLRRuntimeHost+0x000000000013f087) mapaddr=00000000002354cb. ref=0000000101bfab40
Memory fault (core dumped)
I don't quite understand how the function GetCLRRuntimeHost works. I did have to comment out one source file named ./src/pal/src/thread/context.cpp due to register access difference between QNX and Linux; might this be the issue? I plan to revisit it later, as it doesn't seem to be an easy fix.

janvorli commented 4 years ago

Making the context.cpp stuff work is essential, primarily for hardware exception handling and for GC thread suspension.

As for the failing PAL tests, you can run the specific tests under a debugger and see why it fails or crashes. Each PAL test is a standalone executable that can be run.

guesshe commented 4 years ago

@janvorli Thanks! Here is a question from our lead developer, while I am working on getting the context.cpp file compiled. I have to change the register access for the QNX target. His question is "Is it possible to compile without hardware floating point support? Might help there if there is a compile option for software floating point instead of hw floating point -- there would be no need to save and restore FP registers"

janvorli commented 4 years ago

> Is it possible to compile without hardware floating point support?

Unfortunately not. The JIT uses xmm registers a lot.

guesshe commented 4 years ago

@janvorli Thanks!

guesshe commented 4 years ago

@am11 @janvorli I fixed the register access issue and enabled context.cpp in my build. However, I am facing a new linker issue, even though I have -fPIC in my compilation flags and CMAKE_POSITION_INDEPENDENT_CODE set to TRUE in the project. Any suggestions here?
/x86_64-pc-nto-qnx7.0.0-ld: ../../../pal/src/libcoreclrpal.a(context2.S.o): warning: relocation against `CONTEXT_CaptureContext' in readonly section `.text'.
/x86_64/usr/bin/x86_64-pc-nto-qnx7.0.0-ld: ../../../pal/src/libcoreclrpal.a(context2.S.o): relocation R_X86_64_PC32 against symbol `CONTEXT_CaptureContext' can not be used when making a shared object; recompile with -fPIC
/home/rihe/qnx700/host/linux/x86_64/usr/bin/x86_64-pc-nto-qnx7.0.0-ld: final link failed: Bad value
cc: /home/rihe/qnx700/host/linux/x86_64/usr/bin/x86_64-pc-nto-qnx7.0.0-ld error 1

am11 commented 4 years ago

@guesshe, does adding this line https://github.com/am11/runtime/blob/208143dbb181782119e74441a536c9a8efc29808/eng/native/configureplatform.cmake#L290 at the same place and recompiling (after rm -rf artifacts) help? This is currently what I am doing for Solaris bringup (still very much work in progress), and it fixed a similar relocation error for me.

am11 commented 4 years ago

Also if you could show the diff in context2.S, we will understand the error better. Maybe suffixing @gotpcrel will fix the issue.

guesshe commented 4 years ago

@am11 I tried this solution and the result is the same. I didn't make any changes to context2.S file under amd64. What do you mean by suffixing @gotpcrel ?

am11 commented 4 years ago

> I did have to comment out one source file named ./src/pal/src/thread/context.cpp due to register access difference between QNX and Linux ... I fixed the register access issue and enabled context.cpp ... /x86_64/usr/bin/x86_64-pc-nto-qnx7.0.0-ld: ../../../pal/src/libcoreclrpal.a(context2.S.o): relocation R_X86_64_PC32 against symbol `CONTEXT_CaptureContext'

@guesshe, I mean the output of git diff src/coreclr/src/pal/src/thread/context.cpp, to see how you fixed the context issue. Also, are you building master or a release/3x branch?

guesshe commented 4 years ago

@am11 @janvorli It seems superpmi is not as critical as pal. Can I disable these sub-projects to test the functionality of pal? superpmi-shim-collector superpmi-shim-counter superpmi-shim-simple

guesshe commented 4 years ago

@am11 The way I did it was by adding QNX-specific register access macros, like the following. It is very similar to FreeBSD. I have to include QNX-specific header files, but I am not sure if I can share the header file, as it is not under any open-source license.
+#elif defined(QNX)

guesshe commented 4 years ago

OK. Now I have disabled the superpmi sub-projects and it builds. But I got a new crash:
Process 173518872 (corerun) terminated SIGSEGV code=2 fltno=11 ip=000000010108352d(/mnt/river/tmp/libcoreclr.so@registerTMCloneTable+0x00000000000118b2) mapaddr=000000000023552d. ref=0000000101bfcb40
Memory fault (core dumped)
Any suggestions on how to debug this?

am11 commented 4 years ago

https://github.com/dotnet/runtime/issues/33374#issuecomment-609059509 did you try something to fix fltno=11?

guesshe commented 4 years ago

@am11 I didn't try anything specific. I brought back the context.cpp file and added qnx as targetOS. Now it complains about registerTMCloneTable, but I can't find this function in coreclr.

janvorli commented 4 years ago

@guesshe do all PAL tests pass now? If they don't, there is not much sense in trying to run corerun. Btw, until you get everything running, I would recommend running it under gdb (or lldb if you have one on QNX). You are very unlikely to figure out the problems just by executing the code and reasoning based on the crash code. You'll need to view the stack trace, local variables, etc. Maybe you do that already, but from your questions above, it seemed you are just trying to run it without a debugger.

guesshe commented 4 years ago

@janvorli Thanks! The only crashes I saw are in strlen and thread priority. I think strlen is fine, but thread priority might be an issue. I am setting up a debugger at the same time. I am trying to get the dump and load it in the gdb tool on the host. I had a version conflict issue yesterday and will try to resolve it today. I got some help from our lead developer regarding this crash. He said the qcc compiler supports transactional memory in its runtime, but all the symbols are namespaced with the prefix _ITM_ (as in _ITM_registerTMCloneTable). How is this symbol defined in the binary?

janvorli commented 4 years ago

We don't call such a function directly from our code and when I've googled for it, it seems it comes from usage of register_tm_clones function that we also don't use. So I guess it comes from the standard C library or something like that.

guesshe commented 4 years ago

@janvorli Thanks! I got some feedback from our kernel developer. Hopefully it will help with understanding the issue. QNX does not use transactional memory so it has nothing to do with libc.

There is a weak function _ITM_registerTMCloneTable() that gets called by register_tm_clones() in libgcc (the compiler's supplied runtime library). Because it's a weak symbol, it's OK to not resolve it and the library will just skip the call to it.

Is it possible libcoreclr is being built with some option that turns unresolved weak symbols into an error?
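One way to check the weak-symbol explanation is to list the undefined weak symbols of the library. This is a minimal sketch, assuming a binutils-style `nm` for the target is available; `list_weak_undefined` is a hypothetical helper name. It filters `nm` output for entries of type `w` or `v`, the letters `nm` uses for weak symbols that have no definition in the file.

```shell
# Hypothetical helper: reads `nm` output on stdin and prints the names of
# undefined weak symbols. Such entries carry no address column, so the
# type letter ("w" or "v") is the first field and the name is the second.
list_weak_undefined() {
    awk '$1 == "w" || $1 == "v" { print $2 }'
}

# Example invocation (the path is illustrative):
#   nm -D /path/to/libcoreclr.so | list_weak_undefined
```

If _ITM_registerTMCloneTable shows up in that list, it is an optional reference that the dynamic linker may leave unresolved, which matches the explanation above.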

janvorli commented 4 years ago

> Is it possible libcoreclr is being built with some option that turns unresolved weak symbols into an error?

No, there is nothing like that. However, looking again at the error ip=000000010108352d(/mnt/river/tmp/libcoreclr.so@registerTMCloneTable+0x00000000000118b2, I've just realized it has probably nothing to do with that symbol. The offset (0x00000000000118b2) is too far away from that symbol to be in the same function. I think that what happens is that it fails at some place where there are no symbols available and it ends up reporting the closest symbol it finds, which by a mere chance ends up being the registerTMCloneTable.
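The nearest-symbol effect can be reproduced offline from a symbol listing. This is a minimal sketch, assuming input in the format printed by `nm -n` (address, type, name); `nearest_symbol` is a hypothetical helper that picks the closest symbol at or below a given address, much as a crash reporter without full symbols would. Treating the `mapaddr` value from the crash line as a library-relative address is also an assumption.

```shell
# Hypothetical helper: nearest_symbol ADDR
# Reads "hexaddr type name" lines (as printed by `nm -n`) on stdin and
# prints the closest symbol at or below ADDR, plus the offset from it.
nearest_symbol() {
    target=$(( $1 ))
    best_name=""
    best_addr=0
    while read -r addr type name; do
        # Undefined symbols have no address column, leaving "name" empty.
        [ -n "$name" ] || continue
        value=$(( 0x$addr ))
        if [ "$value" -le "$target" ] && [ "$value" -ge "$best_addr" ]; then
            best_addr=$value
            best_name=$name
        fi
    done
    [ -n "$best_name" ] || return 1
    printf '%s+0x%x\n' "$best_name" $(( target - best_addr ))
}

# Example invocation (path and address are illustrative):
#   nm -n /path/to/libcoreclr.so | nearest_symbol 0x23552d
```

An offset far larger than the plausible size of the named function suggests the symbol is only the nearest exported name, not the function that actually faulted.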

guesshe commented 4 years ago

Thanks! I will first fix the thread priority issue and then put this binary in gdb and debug it. Do you think the strlen crash is also related? I am not sure if I can fix the strlen crash; it might be some limited support issue.

Regards

River He


janvorli commented 4 years ago

I don't see why something as simple as strlen should be problematic, so it seems we end up getting a wrong character pointer (maybe a NULL) somewhere and passing it to strlen later. So the strlen failure is just an indicator of a problem somewhere else.

guesshe commented 4 years ago

@janvorli I fixed the thread priority issue. Now, apart from the strlen issues, I am getting the following exceptions. However, these exceptions are not counted as failed tests; I still get 726 test cases passed and 0 failures.
...
'paltest_namedmutex_test1' failed at line 397. Expression: m != nullptr
'paltest_namedmutex_test1' failed at line 463. Expression: m2 != nullptr
'paltest_namedmutex_test1' failed at line 556. Expression: m != nullptr
'paltest_namedmutex_test1' failed at line 670. Expression: m != nullptr
'paltest_namedmutex_test1' failed at line 287. Expression: parentEvents[i] != nullptr
'paltest_namedmutex_test1' failed at line 695. Expression: InitializeParent(testName, parentEvents, childEvents)
'paltest_namedmutex_test1' failed at line 930. Expression: AbandonTests_Parent()
'paltest_namedmutex_test1' failed at line 273. Expression: WaitForSingleObject(childRunningEvent, FailTimeoutMilliseconds) == WAIT_OBJECT_0
'paltest_namedmutex_test1' failed at line 320. Expression: AcquireChildRunningEvent(testName, childRunningEvent)
'paltest_namedmutex_test1' failed at line 759. Expression: InitializeChild(testName, childRunningEvent, parentEvents, childEvents)

guesshe commented 4 years ago

@am11 @janvorli Is building without the stress log supported? If I set -DFEATURE_NO_STRESSLOG, will this disable the feature?

janvorli commented 4 years ago

@guesshe you can set that, but I am not sure why you would want to do that.

guesshe commented 4 years ago

@janvorli @wfurt @jkotas With the help of our kernel developers, we managed to fix this crash and another stack issue. Now it proceeded to a point that looks very promising.

./corerun -c /lib hello_world_dotnet_core_qnx_netcore5_0.dll

coreclr_initialize failed - status: 0x80004005
Following the porting notes from @wfurt, I downloaded the netcore 5.0 SDK using snap and published with the netcoreapp5.0 target framework. However, I still get the same issue. The commit I checked out from master is 62112b0abb36654775552842231dc48a0d032655. Any suggestions? Is this because I am on master and not on the preview branch?

wfurt commented 4 years ago

That maps to E_FAIL and there are many places where this can fail. You can try to set COREHOST_TRACE=1 and check if that provides any hints. (I assume you disabled r2r, right?) I don't think the branch matters.
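For reference, an HRESULT packs a severity bit (bit 31), a facility (bits 16 to 26) and a 16-bit code. The sketch below decodes those fields in the shell; `decode_hresult` is a hypothetical helper and assumes the shell does 64-bit arithmetic. The status 0x80004005 decodes to a failure with facility 0 and code 0x4005, which is the generic E_FAIL, so the value alone does not identify the failing step.

```shell
# Hypothetical helper: split an HRESULT into severity, facility and code.
# Assumes shell arithmetic wide enough to hold a 32-bit unsigned value.
decode_hresult() {
    hr=$(( $1 ))
    printf 'severity=%d facility=%d code=0x%04x\n' \
        $(( (hr >> 31) & 1 )) \
        $(( (hr >> 16) & 0x7ff )) \
        $(( hr & 0xffff ))
}

# Example: decode_hresult 0x80004005
```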

guesshe commented 4 years ago

@wfurt Thanks! What is r2rm? Does this failure mean the cruntime is passed?

wfurt commented 4 years ago

There was a typo. R2R -> Ready To Run. With crossgen, we may put in native bits to make startup faster. Because of that, you may not be able to simply copy assemblies targeted at another platform. It should work for the hello world app, but I'm wondering how you got the BCL assemblies. Back then, I used COMPlus_ZapDisable=1 and COMPlus_ReadyToRun=0 when trying to use Linux assemblies on FreeBSD. @janvorli or @jkotas may know better if that is still applicable.
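A sketch of setting these two variables for a single run only, assuming the runtime being built still honors them (which the comment above leaves open); `run_without_r2r` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: run a command with precompiled (R2R) code disabled.
# The variable names come from the comment above; whether the runtime
# under test still reads them is an assumption.
run_without_r2r() {
    env COMPlus_ZapDisable=1 COMPlus_ReadyToRun=0 "$@"
}

# Example invocation, mirroring the corerun command used earlier:
#   run_without_r2r ./corerun -c /lib hello_world_dotnet_core_qnx_netcore5_0.dll
```

Using `env` keeps the settings scoped to the launched process, so later commands in the same shell session are unaffected.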

guesshe commented 4 years ago

@wfurt Is that an environment variable? I don't recall setting that. For the BCL assemblies, I plan to upload the built tools and source code to the target and build there directly instead of cross-compiling.

wfurt commented 4 years ago

Yes, environment. I'm not quite sure what you mean by the previous post. In order to build assemblies you need to have a working dotnet CLI, and the C# compiler is written (mostly) in C#. corerun cannot function without System.Private.CoreLib.dll (and perhaps others), so the question is how did you get one?