Closed: @tcorrin closed this issue 4 years ago
@wfurt can you please take a look?
Do you have steps or a simplified repro @tcorrin? What exact OS and curl version do you have?
I cannot make much progress @karelz unless I get more info from @tcorrin. It is hard to guess without a good example and repro steps.
OK, closing until we get more info.
OS Version:
CentOS Linux release 7.3.1611 (Core)
Curl version:
[vq@192 ~]$ curl --version
curl 7.50.1 (x86_64-pc-linux-gnu) libcurl/7.50.1 OpenSSL/1.0.1e zlib/1.2.7
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: IPv6 Largefile NTLM NTLM_WB SSL libz UnixSockets
Unfortunately we have been unable to reproduce this apart from in the environment where we are seeing it: our software will communicate without issue for 1 or 2 hours and then segfault. I can upload one of the coredumps to my Google Drive for you to take a look at if that helps?
Dumps might be good to look at. The trouble is that it may very well be a bug in your interop code - memory corruption or something. Or maybe a bug in libcurl. Investigating these things is time-consuming and we obviously do not scale to do that for every single one-off failure, unless there is a sign/hint that it might truly be a .NET Core problem that may affect more than 1 customer.
Can you please do first-level analysis of the dump? The callstack above looks incomplete to me. If you need guidance for .NET debugging on Linux, let us know.
The coredump can be downloaded from this link: https://drive.google.com/open?id=0B8FcUSx2JcXXdEl4YlZWZzR2Qjg
We are now seeing a second instance of this issue, so we are keen to get to the bottom of it.
Any guidance you could provide for debugging the coredump on linux would be appreciated.
So far I have analysed the dump in gdb and in lldb with SOS as per the instructions here: http://blogs.microsoft.co.il/sasha/2017/02/26/analyzing-a-net-core-core-dump-on-linux/ - this did not yield any further useful information.
@wfurt @janvorli can you please look at the issue?
@tcorrin when you hit the issue again, please upload another dump. Having more dumps usually helps. Can you please also verify that this happened on 2 different machines (or at least VMs)?
Correct, this was 2 different machines
Reopened dotnet/corefx#20177.
@jchannon just to be super clear - 2 different physical machines, or 2 different VMs?
2 different customers so 2 different machines and 2 different VMs :)
OK @karelz, I'll take a look at the dumps. Curl_resolv_timeout() is interesting. It seems like we are already on some error path - and that may be the reason why it is hard to reproduce as a standalone app.
@wfurt please ping me if you need any help with loading the dumps.
Hello @tcorrin. I'm trying to set up an environment identical to yours, so I can properly resolve symbols. I have CentOS 7.3.1611, but even if I try to update to the latest I get curl-7.29.0-35.el7.centos.x86_64 (openssl-devel-1.0.1e-60.el7_3.1). How did you get curl 7.50.1 on the system?
Also, can you give me the output of dotnet --info? Getting all versions exactly right is critical for getting more useful data from the core dump.
BTW I was thinking about it more @jchannon and I'm wondering if you could set up an identical VM for investigation. If you have any code used in an attempt to reproduce, I would like to see it as well. That may give me a clue what the app is trying to do and how it is structured.
Hi. We can give you a VM as our customer has it, or we can give you a VM without our apps on it - whatever is easiest for you. The way we get curl 7.50.1 is to build it from source. The issue is I don't think it's going to be a case of running our apps and having them fall over. We have 100+ customers with our software and only 2 have found a segfault.
@jchannon what motivated you to build curl from source? Is there any reason why you don't use 'default' versions available (and I think recommended by our install steps)?
I can't remember why we chose that version now, I'm afraid. As far as I know there are no curl version recommendations, but I could be wrong.
If I get a VM with the app, can I run it @jchannon? If so that would be great. If I can reproduce the crash I'm positive we can solve this.
dotnet was built against the stock libcurl. Using a different version may or may not be a problem.
As a minimum, I'll need a debug version of your custom build - or at least the configuration script so we can produce an identical version.
We can give you a VM via an OVF file, but you'll need to connect to our VPN as the app needs to talk to another machine which it pings as it sends HTTP requests. If you are able to do that let me know; if not, we can come up with another plan.
Is there a specific libcurl version we should be checking for?
I would start with the version that comes with the OS @mikeh688. That is what is used to build dotnet, so it would have the widest deployment and test coverage.
Can you host the VM and give me ssh access @jchannon? I can load an OVF as well, but as you pointed out it may need some other service. There are two levels of troubleshooting: the first is to get an identical system with debug symbols so we can crack the core files. It needs to match 100% as far as the used libraries and the application (dotnet) are concerned - without that, the symbols would just be garbage. The second level is the ability to reproduce the crash in a somewhat controlled environment. With that, one could instrument code and use other tricks to get more insight.
I would strongly encourage you to try it with stock curl. With that you can simply grab the debuginfo packages and be sure the symbols match the used binaries. Even if there was a good reason to upgrade curl, this will help with the investigation.
As a first step, I suggest we take your suggestion and get back to an image running on the stock libcurl. We'll give that to the customer facing the issue and see how we get on. If it clears the condition, great. If not, we'll work with you and get you an image; we don't know what induces the problem - it can strike at any point over a period of several weeks, and the solution consists of multiple, relatively complex parts.
| dotnet was built against the stock libcurl. Using a different version may or may not be a problem.
Can I get an exact version for when you say "stock" curl, so I can make sure we have the right one please @wfurt?
Get 7.29.0-35 (or 7.29.0 if you get it from the curl site directly).
Because the version of curl that ships with CentOS is compiled with NSS instead of OpenSSL, we need to use an OpenSSL-compiled version of curl (see https://github.com/dotnet/corefx/issues/9728 for more information). At the moment we are using 7.50 compiled by hand with OpenSSL, which (in theory) should be more stable and more up to date than 7.29.
Sounds like a legitimate reason. In that case we need symbols (at minimum) for that specific hand-built curl. Without symbols, there is not much we can do :(
Slight change in plan since the last post: we're going to produce a build using libcurl/NSS and therefore be exactly the same as what you test against. We'll start from there as our reference and ask the customer to test using that.
[EDIT] Removing email reply by @karelz
We understand that NSS has certain disadvantages. Using an OpenSSL-based build is totally fine - we just need symbols, that's all. Of course, ultimately it is your decision which route to go: produce a new build with symbols, or use NSS.
First, curl_global_init and Curl_resolv_timeout are not thread-safe.
Make sure curl_global_init and curl_global_cleanup are invoked on your main thread. curl_easy_init will invoke curl_global_init, and there is a global variable named 'initialized' that is modified by curl_global_init - but it is not safe to modify a global variable without a lock.
Second, sigsetjmp(curl_jmpenv, 1) and siglongjmp(curl_jmpenv, 1) are invoked inside Curl_resolv_timeout, and curl_jmpenv is a global variable. To disable the sigsetjmp/siglongjmp, you should do this: curl_easy_setopt(curl, CURLOPT_NOSIGNAL, 1);
There is now also a .NET Core 2.0 package available for CentOS. This includes a more recent version of libcurl that uses OpenSSL. To install the package:
yum install centos-release-dotnet
yum install rh-dotnet20
To use the package:
scl enable rh-dotnet20 bash
dotnet --info
Did we find some solution or more isolated repro? I'm going to close this unless we get more information.
Closing. If there is more evidence of any problems, please let us know and we can reopen if there is more info available. Thanks!
We are getting an issue running our netcore 1.1 app on CentOS 7 where we get a libcurl segfault. The generated core dump gave the following stack trace:
Reading up on this error (https://curl.haxx.se/mail/lib-2013-05/0079.html, https://curl.haxx.se/mail/lib-2014-01/0098.html), it seems that it has something to do with curl and multi-threading and can be solved by setting CURLOPT_NOSIGNAL to 1. However, the netcore code seems to be doing that here: https://github.com/dotnet/corefx/blob/release/1.1.0/src/System.Net.Http/src/System/Net/Http/Unix/CurlHandler.EasyRequest.cs#L261 - so I am confused at how my netcore app could be running into this issue.