Open guesshe opened 4 years ago
@am11 Thanks for your reply! With this change applied, the program gets stuck at this method. Any idea how I can enable debug logging in libcoreclr.so?
@guesshe, for native runtime code, you can try using lldb by doing something like:
#!/usr/bin/env bash
$(command -v lldb) /path/to/yourapp
# inside lldb REPL, catch all C++ exceptions
(lldb) break set -E C++
(lldb) r
(lldb) bt
If you want to include the stack trace from the managed side as well, you would first need to build the SOS plugin (libsosplugin.so) for LLDB from https://github.com/dotnet/diagnostics#building-the-repository, then:
#!/usr/bin/env bash
$(command -v lldb) /path/to/yourapp
# now inside the lldb REPL
break set -E C++
plugin load /path/to/libsosplugin.so
run
# will break on first C++ exception
dumpstack
For more info on diagnostics, there is much more content in the dotnet/diagnostics repo. Note that there is currently no gdb SOS plugin; only lldb is supported (https://github.com/dotnet/diagnostics/issues/272).
@am11 Thanks! It turned out SA_RESTART is not supported on QNX. I removed this flag and it got a bit further. Now I am hitting the following crash. I guess it has something to do with the runtime host, but I am not sure. Any ideas?
Process 1585176 (dotnet) terminated SIGSEGV code=2 fltno=11 ip=0000000100fb14dd(/tmp/publish/publish/libcoreclr.so@GetCLRRuntimeHost+0x000000000013f099) mapaddr=00000000002354dd. ref=0000000101b1fb40
Memory fault (core dumped)
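A minimal sketch in C of making the flag conditional rather than deleting it outright. It assumes the QNX headers simply do not define SA_RESTART; if they define it but reject it at run time, a platform check would be needed instead. `install_handler`, `on_signal`, and `g_signal_seen` are hypothetical names, not PAL functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler so callers can observe that it ran. */
static volatile sig_atomic_t g_signal_seen = 0;

static void on_signal(int signo)
{
    (void)signo;
    g_signal_seen = 1;
}

/* Installs `handler` for `signo`. SA_RESTART is requested only when the
 * platform headers define it (assumption: they do not on the QNX target). */
static int install_handler(int signo, void (*handler)(int))
{
    struct sigaction sa;

    assert(handler != NULL);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
#ifdef SA_RESTART
    sa.sa_flags |= SA_RESTART;
#endif
    return sigaction(signo, &sa, NULL);
}
```

Without SA_RESTART, blocking calls interrupted by the handler can return -1 with errno set to EINTR, so callers that relied on automatic restart need their own retry loops.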
So I solved my issue by using clang3.9 as assembler.
I think mixing clang and gcc toolchains is problematic. I'd try to fix the broken toolchain first and use either clang or gcc for the entire build.
@am11 oh. Thanks! even for assembler? I will go back and fix the assembler issue. Any idea about the runtime crash? I thought it has something to do with build in host list.
@guesshe, the reason I mentioned using the same toolchain after looking at the segmentation fault is that we have previously been hit by SIGSEGV, and it is very hard to troubleshoot and understand the root cause in such a case. So it is best if the entire product is built with the same toolchain, to rule out such unrelated/external culprits.
Any idea about the runtime crash? I thought it has something to do with build in host list.
I did not get a chance to look deeper, but if you hit it after rebuilding the runtime with gcc, e.g.
# workaround for gcc5
CFLAGS=-Wa,--divide CXXFLAGS=-Wa,--divide ./build.sh -configuration debug
or build entire product with clang or gcc (v7 or above) if possible, then could you attach debugger and collect some data?
@am11 Any idea why the build generated empty files with names like this ???@??@8?@????@@@???????????????????
@am11 So I fixed the assembler issue by setting CMAKE_ASM_FLAGS to -Wa,--divide but I am still having this crash Process 1699864 (dotnet) terminated SIGSEGV code=2 fltno=11 ip=0000000100fb14dd(/tmp/publish/publish/libcoreclr.so@GetCLRRuntimeHost+0x000000000013f099) mapaddr=00000000002354dd. ref=0000000101b28b40 I am trying to find qnx supported lldb.
@am11 Any idea why the build generated empty files with names like this ???@??@8?@????@@@???????????????????
Unicode? is LANG/LC_ALL supported on QNX? What is the file location?
fltno=11
@guesshe, I searched for just this string (verbatim) on Google and, surprisingly, found that the majority of hits on the first result page were QNX related. This article describes how they solved such a SIGSEGV with fltno=11 issue in a simple app on QNX using dladdr(3). Perhaps you would need to adjust some linker flags to get the paging policy right, or maybe make some code changes; I am not sure. However, one thing I would try is to comment out this line https://github.com/dotnet/runtime/blob/363b7add1906547eeba681b3f3ec3f686a603dee/eng/native/configureplatform.cmake#L343 and rebuild, in order to verify whether or not it is due to -fPIC.
@am11 Thanks! I will try to figure it out.
@wfurt This shows up in my host linux. Not on target. I thought these could be build output files but they are all empty.
@guesshe, I remember when NetBSD folks ported coreclr:
the first thing that was done was to pass all platform abstraction layer (PAL) tests, which exercise the CRT functions used by the runtime: https://github.com/dotnet/runtime/blob/59be94b69845ecfbd5a694483c2a4853e99cc64b/docs/workflow/testing/coreclr/unix-test-instructions.md#pal-tests
and then run a simple hello world app using corerun (a basic host that complies with the runtime): https://github.com/dotnet/runtime/blob/7d67d17a9f49ad5f365467fcd3bf0d25f2b9349a/docs/workflow/building/coreclr/linux-instructions.md
iff we get this far, then run the coreclr tests, see src/coreclr/build-test.sh
That's the way I would recommend too (and we did it the same way when we were porting .NET Core to Linux 5 years ago)
@am11 @janvorli Thanks! I will follow this path.
@am11 continuing my debugging journey and getting the test suite running in QNX. I suspect my issue has something to do with how this function is called GetCLRRuntimeHost but I don't know how this related to cruntime implementation.
@am11 @janvorli So I managed to build and run the pal_test suite on my QNX VM. I got this result, but I doubt it is valid, as I saw some processes crash during the test execution. Does that produce a PASS status? I had to modify the bash script to be able to execute in a ksh environment, but that was not a big change. Next I will focus on fixing the crashes I saw during the test execution. Most of them happened at strlen and Unable to set thread priority to 0 (error 22)
Finished running PAL tests: PAL Test Results: Passed: 726 Failed: 0
@am11 @jkotas Do I need this managed library? System.Private.CoreLib.dll for coreclr to work?
You don't need it for PAL tests, but you need it for the next steps. This is the core managed library containing all the basic functionality and glue between the managed and native parts of the runtime.
As for some PAL tests failing and the results still showing that no tests have failed, this is strange and seems like we may have a bug not recognizing crashes as failures.
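One way a runner can avoid counting a crashed test as a pass is to look at how the child process ended, not only at its exit code. The following is a minimal sketch in C of that idea. It is not the actual PAL test runner (which is a shell script), and `classify_wait_status`, `run_child`, the `body_*` functions, and the outcome names are all hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum { TEST_PASSED, TEST_FAILED, TEST_CRASHED } test_outcome;

/* A child killed by a signal (SIGSEGV, SIGABRT, ...) is a crash, never a pass. */
static test_outcome classify_wait_status(int status)
{
    if (WIFSIGNALED(status))
        return TEST_CRASHED;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return TEST_PASSED;
    return TEST_FAILED;
}

/* Runs `body` in a forked child and classifies how the child ended. */
static test_outcome run_child(void (*body)(void))
{
    int status = 0;
    pid_t pid;

    assert(body != NULL);
    pid = fork();
    if (pid < 0)
        return TEST_FAILED;
    if (pid == 0) {
        body();
        _exit(0);           /* body returned normally */
    }
    if (waitpid(pid, &status, 0) < 0)
        return TEST_FAILED;
    return classify_wait_status(status);
}

/* Stand-ins for test executables. */
static void body_ok(void)    { }
static void body_fail(void)  { _exit(1); }
static void body_crash(void) { raise(SIGKILL); }
```

With this classification, `run_child(body_crash)` is reported as a crash even though no nonzero exit code was ever returned by the child itself.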
@janvorli @am11 @jkotas Here are two types of crashes I saw during testing. One is related to the strlen function and the other to thread priority.
Process 114688025 (paltest_fprintf_test2) terminated SIGSEGV code=1 fltno=11 ip=0000000100078e10(/usr/lib/ldqnx-64.so.2@strlen+0x0000000000000000) mapaddr=0000000000078e10. ref=0000000000000000
Memory fault (core dumped)
.{1-807d485} ASSERT [THREAD ] at /home/
Here is the crash when I tried to launch my helloworld.dll using corerun.
Process 172077080 (corerun) terminated SIGSEGV code=2 fltno=11 ip=00000001010834cb(/mnt/river/tmp/libcoreclr.so@GetCLRRuntimeHost+0x000000000013f087) mapaddr=00000000002354cb. ref=0000000101bfab40
Memory fault (core dumped)
I didn't quite understand how the function GetCLRRuntimeHost works. I did have to comment out one source file named ./src/pal/src/thread/context.cpp due to register access differences between QNX and Linux; might this be the issue? I plan to revisit it later, as it doesn't seem to be an easy fix.
Making the context.cpp stuff work is essential, primarily for hardware exception handling and for GC thread suspension.
As for the failing PAL tests, you can run the specific tests under a debugger and see why it fails or crashes. Each PAL test is a standalone executable that can be run.
@janvorli Thanks! Here is a question from our lead developer while I am working on getting the context.cpp file compiled. I have to change the register access for the QNX target. His question is "Is it possible to compile without hardware floating point support? Might help there if there is a compile option for software floating point instead of hw floating point -- there would be no need to save and restore FP registers"
Is it possible to compile without hardware floating point support?
Unfortunately not. The JIT uses xmm registers a lot.
@janvorli Thanks!
@am11 @janvorli I fixed the register access issue and enabled context.cpp in my build. However, I am facing a new linker issue. But I do have -fPIC in my compilation flag and in the project I have CMAKE_POSITION_INDEPENDENT_CODE set to TRUE. Any suggestions here?
/x86_64-pc-nto-qnx7.0.0-ld: ../../../pal/src/libcoreclrpal.a(context2.S.o): warning: relocation against `CONTEXT_CaptureContext' in readonly section `.text'.
/x86_64/usr/bin/x86_64-pc-nto-qnx7.0.0-ld: ../../../pal/src/libcoreclrpal.a(context2.S.o): relocation R_X86_64_PC32 against symbol `CONTEXT_CaptureContext' can not be used when making a shared object; recompile with -fPIC
/home/rihe/qnx700/host/linux/x86_64/usr/bin/x86_64-pc-nto-qnx7.0.0-ld: final link failed: Bad value
cc: /home/rihe/qnx700/host/linux/x86_64/usr/bin/x86_64-pc-nto-qnx7.0.0-ld error 1
@guesshe, does adding this line https://github.com/am11/runtime/blob/208143dbb181782119e74441a536c9a8efc29808/eng/native/configureplatform.cmake#L290 at the same place and recompiling (after rm -rf artifacts) help? This is currently what I am doing for the Solaris bringup (still very much a work in progress), and it fixed a similar relocation error for me.
Also, if you could show the diff in context2.S, we will understand the error better. Maybe suffixing @gotpcrel will fix the issue.
@am11 I tried this solution and the result is the same. I didn't make any changes to context2.S file under amd64. What do you mean by suffixing @gotpcrel ?
I did have to comment out one source file named ./src/pal/src/thread/context.cpp due to register access difference between QNX and Linux ... I fixed the register access issue and enabled context.cpp ... /x86_64/usr/bin/x86_64-pc-nto-qnx7.0.0-ld: ../../../pal/src/libcoreclrpal.a(context2.S.o): relocation R_X86_64_PC32 against symbol `CONTEXT_CaptureContext'
@guesshe, I mean the output of git diff src/coreclr/src/pal/src/thread/context.cpp, to see how you fixed the context issue. Also, are you building master or the release/3x branch?
@am11 @janvorli It seems superpmi is not as critical as pal. Can I disable these sub-projects to test the functionality of pal? superpmi-shim-collector superpmi-shim-counter superpmi-shim-simple
@am11 The way I did it was by adding QNX-specific register access macros, as follows; it is very similar to FreeBSD. I have to include QNX-specific header files, but I am not sure if I can share them, as they are not under any open-source license.
+#elif defined(QNX)
+#define FPSTATE(uc) ((uc)->uc_mcontext.fpu.fxsave_area)
+#define FPREG_ControlWord(uc) (FPSTATE(uc).fpu_control_word)
+#define FPREG_StatusWord(uc) (FPSTATE(uc).fpu_status_word)
+#define FPREG_TagWord(uc) (FPSTATE(uc).fpu_tag_word)
+#define FPREG_MxCsr(uc) (FPSTATE(uc).mxcsr)
+#define FPREG_MxCsr_Mask(uc) (FPSTATE(uc).mxcsr_mask)
+#define FPREG_ErrorOffset(uc) *(DWORD*) &(FPSTATE(uc).fpu_rip)
+#define FPREG_ErrorSelector(uc) *((WORD*) &(FPSTATE(uc).fpu_rip) + 2)
+#define FPREG_DataOffset(uc) *(DWORD*) &(FPSTATE(uc).fpu_rdp)
+#define FPREG_DataSelector(uc) *((WORD*) &(FPSTATE(uc).fpu_rdp) + 2)
+#define FPREG_Xmm(uc, index) *(M128A*) &(FPSTATE(uc).xmm_regs[index])
+#define FPREG_St(uc, index) *(M128A*) &(FPSTATE(uc).st_regs[index])
OK. Now I have disabled the superpmi sub-projects and it builds. But I got a new crash.
Process 173518872 (corerun) terminated SIGSEGV code=2 fltno=11 ip=000000010108352d(/mnt/river/tmp/libcoreclr.so@registerTMCloneTable+0x00000000000118b2) mapaddr=000000000023552d. ref=0000000101bfcb40
Memory fault (core dumped)
Any suggestions on how to debug this?
https://github.com/dotnet/runtime/issues/33374#issuecomment-609059509 did you try something to fix fltno=11?
@am11 I didn't try anything specific. I brought back the context.cpp file and added qnx as a targetOS. Now it complains about registerTMCloneTable, but I can't find this function in coreclr.
@guesshe do all PAL tests pass now? If they don't, there is not much sense in trying to run corerun. Btw, until you get everything running, I would recommend running it under gdb (or lldb if you have one on QNX). You are very unlikely to figure out the problems just by executing the code and reasoning from the crash code. You'll need to view the stack trace, local variables, etc. Maybe you do that already, but from your questions above, it seemed you are just trying to run it without a debugger.
@janvorli Thanks! The only crashes I saw are strlen and thread priority. I think strlen is fine, but thread priority might be an issue. I am setting up a debugger at the same time. I am trying to get the dump and load it in the host gdb tool. I had a version conflict issue yesterday and will try to resolve it today. I got some help from our lead developer regarding this crash. He said the qcc compiler supports transactional memory in its runtime, but all the symbols are namespaced with the prefix _ITM_ (as in _ITM_registerTMCloneTable). How does this symbol get defined in the binary?
We don't call such a function directly from our code and when I've googled for it, it seems it comes from usage of register_tm_clones function that we also don't use. So I guess it comes from the standard C library or something like that.
@janvorli Thanks! I got some feedback from our kernel developer. Hopefully it will help with understanding the issue. QNX does not use transactional memory so it has nothing to do with libc.
There is a weak function _ITM_registerTMCloneTable() that gets called by register_tm_clones() in libgcc (the compiler's supplied runtime library). Because it's a weak symbol, it's OK to not resolve it and the library will just skip the call to it.
Is it possible libcoreclr is being built with some option that turns unresolved weak symbols into an error?
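The weak-symbol behaviour described above can be sketched in C as follows. This assumes a GCC-compatible compiler and an ELF toolchain (for the `__attribute__((weak))` syntax). `hypothetical_optional_hook` and `call_hook_if_present` are made-up names standing in for `_ITM_registerTMCloneTable` and its caller; this is not the libgcc source.

```c
#include <stddef.h>

/* Declared weak and defined nowhere: the link still succeeds, and the
 * symbol's address evaluates to NULL at run time. */
extern void hypothetical_optional_hook(void) __attribute__((weak));

/* Calls the hook only when some library actually provides it.
 * Returns 1 if the hook was called, 0 if it was skipped. */
static int call_hook_if_present(void)
{
    if (hypothetical_optional_hook != NULL) {
        hypothetical_optional_hook();
        return 1;
    }
    return 0;
}
```

Because nothing defines the hook here, `call_hook_if_present()` returns 0 and the call is skipped, which matches the description of an unresolved weak symbol being harmless.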
Is it possible libcoreclr is being built with some option that turns unresolved weak symbols into an error?
No, there is nothing like that. However, looking again at the error ip=000000010108352d(/mnt/river/tmp/libcoreclr.so@registerTMCloneTable+0x00000000000118b2, I've just realized it probably has nothing to do with that symbol. The offset (0x00000000000118b2) is too far away from that symbol to be in the same function. I think what happens is that it fails at some place where there are no symbols available, and it ends up reporting the closest symbol it finds, which by mere chance ends up being registerTMCloneTable.
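To illustrate the nearest-symbol effect, here is a small sketch in C of how a crash reporter might attribute an address when only a few exported symbols are known. The `symbol` type, both function names, and the size threshold are hypothetical; the QNX reporter's real logic is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint64_t    addr;   /* start address of the symbol */
} symbol;

/* Returns the symbol with the highest start address <= addr, or NULL if
 * every known symbol starts after addr. The table need not be sorted. */
static const symbol *nearest_symbol(const symbol *table, size_t count, uint64_t addr)
{
    const symbol *best = NULL;

    assert(table != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (table[i].addr <= addr && (best == NULL || table[i].addr > best->addr))
            best = &table[i];
    }
    return best;
}

/* Heuristic: an offset larger than any plausible function size suggests the
 * name is only the closest known symbol, not the function that faulted. */
static int attribution_is_suspicious(const symbol *sym, uint64_t addr, uint64_t max_plausible_size)
{
    return sym == NULL || (addr - sym->addr) > max_plausible_size;
}
```

An offset such as 0x118b2 (about 70 KB) past a tiny registration stub would be flagged by any reasonable threshold, so the reported name should be treated as a landmark rather than the faulting function.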
Thanks! I will first fix the thread priority issue and then put this binary in gdb and debug. Do you think the strlen crash is also related? I am not sure if I can fix the strlen one; it might be some limited support issue.
Regards
River He
I don't see why something as simple as strlen should be problematic, so it seems we end up getting wrong character pointer (maybe a NULL) somewhere and passing it to the strlen later. So the strlen failing is just an indicator of a problem somewhere else.
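As a debugging aid (not a fix), a wrapper like the following could be dropped in temporarily to find which caller hands strlen a NULL pointer. This is a sketch in C; `checked_strlen` and its `call_site` label are hypothetical and not part of the PAL.

```c
#include <stdio.h>
#include <string.h>

/* Returns the string length, or -1 after logging when `s` is NULL.
 * `call_site` is a free-form label used only in the log message. */
static long checked_strlen(const char *s, const char *call_site)
{
    if (s == NULL) {
        fprintf(stderr, "NULL string passed to strlen at %s\n",
                call_site != NULL ? call_site : "(unknown)");
        return -1;
    }
    return (long)strlen(s);
}
```

The crash record above shows ref=0000000000000000 at strlen+0, which is consistent with a NULL argument; the log line from a wrapper like this would point at the code that produced it.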
@janvorli I fixed the thread priority issue. Now, apart from the strlen issues, I am getting the following exceptions. However, these exceptions are not counted as failed tests; I still get 726 test cases passed and 0 failures.
...
'paltest_namedmutex_test1' failed at line 397. Expression: m != nullptr
'paltest_namedmutex_test1' failed at line 463. Expression: m2 != nullptr
'paltest_namedmutex_test1' failed at line 556. Expression: m != nullptr
'paltest_namedmutex_test1' failed at line 670. Expression: m != nullptr
'paltest_namedmutex_test1' failed at line 287. Expression: parentEvents[i] != nullptr
'paltest_namedmutex_test1' failed at line 695. Expression: InitializeParent(testName, parentEvents, childEvents)
'paltest_namedmutex_test1' failed at line 930. Expression: AbandonTests_Parent()
'paltest_namedmutex_test1' failed at line 273. Expression: WaitForSingleObject(childRunningEvent, FailTimeoutMilliseconds) == WAIT_OBJECT_0
'paltest_namedmutex_test1' failed at line 320. Expression: AcquireChildRunningEvent(testName, childRunningEvent)
'paltest_namedmutex_test1' failed at line 759. Expression: InitializeChild(testName, childRunningEvent, parentEvents, childEvents)
@am11 @janvorli Is feature no stress_log supported? If I set -DFEATURE_NO_STRESSLOG, will this disable the feature?
@guesshe you can set that, but I am not sure why would you want to do that.
@janvorli @wfurt @jkotas With the help of our kernel developers, we managed to fix this crash and another stack issue. Now it proceeded to a point that looks very promising.
coreclr_initialize failed - status: 0x80004005 By reading porting notes from @wfurt, I downloaded netcore 5 sdk 5.0 using snap and published to netcoreapp5.0 targetframework. However, I still got the same issue. The commit I checkout from master is 62112b0abb36654775552842231dc48a0d032655. Any suggestions? Is this because I am on master not on the preview branch?
That maps to E_FAIL and there are many places where this can fail. You can try to set COREHOST_TRACE=1 and check if that provides any hints. (I assume you disabled r2r, right?) I don't think the branch matters.
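For reference, the status value can be taken apart by hand. A small sketch in C using the standard HRESULT bit layout (bit 31 is the failure bit, bits 16 to 26 the facility, bits 0 to 15 the code); `hresult_fields` and `decode_hresult` are hypothetical helper names, not runtime APIs.

```c
#include <stdint.h>

typedef struct {
    unsigned failed;    /* bit 31: 1 means failure */
    unsigned facility;  /* bits 16-26: which subsystem defined the code */
    unsigned code;      /* bits 0-15: the error code itself */
} hresult_fields;

/* Splits a 32-bit HRESULT into its severity, facility, and code fields. */
static hresult_fields decode_hresult(uint32_t hr)
{
    hresult_fields f;
    f.failed   = (hr >> 31) & 0x1u;
    f.facility = (hr >> 16) & 0x7FFu;
    f.code     = hr & 0xFFFFu;
    return f;
}
```

For 0x80004005 this yields failed = 1, facility = 0, and code = 0x4005, that is, a generic failure with no facility attached, which is why the value alone does not narrow down where initialization failed.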
@wfurt Thanks! What is r2rm? Does this failure mean the cruntime is passed?
There was a typo. R2R -> Ready To Run. With crossgen, we may put in native bits to make startup faster. Because of that, you may not be able to simply copy assemblies targeted for another platform. It should work for the hello world app, but I'm wondering how you got the BCL assemblies. Back then, I used COMPlus_ZapDisable=1 and COMPlus_ReadyToRun=0 when trying to use Linux assemblies on FreeBSD. @janvorli or @jkotas may know better if that is still applicable.
@wfurt Is that an environment variable? I don't recall I set that. For BCL assemblies, I plan to upload the built tools and source code to target and build from there directly instead of cross-compiling.
Yes, environment variables. I'm not quite sure what you mean by the previous post. In order to build the assemblies, you need a working dotnet CLI, and the C# compiler is written (mostly) in C#. corerun cannot function without System.Private.CoreLib.dll (and perhaps others), so the question is: how did you get one?
Hi,
I am trying to port the entire runtime to qnx7 platform on x64 arch. I am able to build coreclr but it won't run unless I have dotnet executable built. Any suggestions on how to build corehost for qnx?