dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Libcurl segfault when using httpclient #21912

Closed tcorrin closed 4 years ago

tcorrin commented 7 years ago

We are getting an issue running our netcore 1.1 app on centos 7 where we get a libcurl segfault. The generated core dump gave the following stack trace:

(gdb) bt
#0  0x00007f4b4e5bc28d in addbyter () from /lib64/libcurl.so.4
#1  0x00007f4b4e5bc649 in dprintf_formatf () from /lib64/libcurl.so.4
#2  0x00007f4b4e5bd8a5 in curl_mvsnprintf () from /lib64/libcurl.so.4
#3  0x00007f4b4e5ac94e in Curl_failf () from /lib64/libcurl.so.4
#4  0x00007f4b4e5a2f17 in Curl_resolv_timeout () from /lib64/libcurl.so.4
#5  0x00007f4b527b3bd0 in ?? ()
#6  0x00007f4b527b3a70 in ?? ()
#7  0x00007f4b50041e28 in ?? ()

Reading up on this error ( https://curl.haxx.se/mail/lib-2013-05/0079.html and https://curl.haxx.se/mail/lib-2014-01/0098.html ), it seems to be related to curl and multithreading, and it can be solved by setting CURLOPT_NOSIGNAL to 1. However, the netcore code already seems to do that here: https://github.com/dotnet/corefx/blob/release/1.1.0/src/System.Net.Http/src/System/Net/Http/Unix/CurlHandler.EasyRequest.cs#L261 so I am confused about how my netcore app could be running into this issue.

karelz commented 7 years ago

@wfurt can you please take a look?

wfurt commented 7 years ago

Do you have steps or a simplified repro, @tcorrin? What exact OS and curl version do you have?

wfurt commented 7 years ago

I cannot make much progress, @karelz, unless I get more info from @tcorrin. It is hard to guess without a good example and repro steps.

karelz commented 7 years ago

OK, closing until we get more info.

tcorrin commented 7 years ago

OS Version: CentOS Linux release 7.3.1611 (Core)

Curl version:

[vq@192 ~]$ curl --version
curl 7.50.1 (x86_64-pc-linux-gnu) libcurl/7.50.1 OpenSSL/1.0.1e zlib/1.2.7
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: IPv6 Largefile NTLM NTLM_WB SSL libz UnixSockets

Unfortunately we have been unable to reproduce this apart from in the environment we are seeing it where our software will communicate without issue for 1 or 2 hours and then have the segfault. I can upload one of the coredumps to my googledrive for you to take a look at if that helps?

karelz commented 7 years ago

Dumps might be good to look at. The trouble is that it may very well be a bug in your interop code. Memory corruption or something. Or maybe a bug in libcurl. Investigating these things is time-consuming and we obviously do not scale to do that for every single one-off failure, unless there is a sign/hint that it might truly be a .NET Core problem and it may affect more than 1 customer.

Can you please do first-level analysis of the dump? The callstack above looks incomplete to me. If you need guidance for .NET debugging on Linux, let us know.
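
For a first-level pass, one way to get more than a single-thread `bt` is to dump the stack of every thread in the core. This is only a sketch: it assumes gdb is installed, and the function name and both paths are placeholders rather than anything from this thread. Frames in JIT-compiled managed code may still show up as `??` in plain gdb.

```shell
#!/bin/sh
# Sketch: print backtraces for every thread recorded in a core dump.
# Arguments are placeholders: the dotnet host binary that crashed, and the core file.
dump_all_backtraces() {
    exe=$1
    core=$2
    if [ ! -r "$exe" ] || [ ! -r "$core" ]; then
        echo "usage: dump_all_backtraces <dotnet-binary> <core-file>" >&2
        return 2
    fi
    # -batch makes gdb exit after running the -ex commands;
    # "thread apply all bt" runs bt once per thread.
    gdb -batch -ex "thread apply all bt" "$exe" "$core"
}

# Example call (hypothetical paths):
#   dump_all_backtraces /usr/bin/dotnet ./core.12345 > all-threads.txt
```

Redirecting the output to a file makes it easier to attach the full set of stacks to the issue instead of a single thread.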

tcorrin commented 7 years ago

The coredump can be downloaded from this link: https://drive.google.com/open?id=0B8FcUSx2JcXXdEl4YlZWZzR2Qjg We are now seeing a second instance of this issue so we are keen to get to the bottom of it.

Any guidance you could provide for debugging the coredump on linux would be appreciated.

So far I have analysed the dump in gdb and lldb with sos as per the instructions here: http://blogs.microsoft.co.il/sasha/2017/02/26/analyzing-a-net-core-core-dump-on-linux/ but this did not yield any further useful information.

karelz commented 7 years ago

@wfurt @janvorli can you please look at the issue?

@tcorrin when you hit the issue again, please upload another dump. Having more dumps usually helps. Can you please also verify that this happened on 2 different machines (or at least VMs)?

jchannon commented 7 years ago

Correct, this was 2 different machines

karelz commented 7 years ago

@jchannon just to be super clear - 2 different physical machines, or 2 different VMs?

jchannon commented 7 years ago

2 different customers so 2 different machines and 2 different VMs :)

wfurt commented 7 years ago

OK @karelz, I'll take a look at the dumps. Curl_resolv_timeout() is interesting. It seems like we are already on some error path, and that may be the reason why it is hard to reproduce as a standalone app.

janvorli commented 7 years ago

@wfurt please ping me if you need any help with loading the dumps.

wfurt commented 7 years ago

Hello @tcorrin. I'm trying to set up an environment identical to yours so I can properly resolve symbols. I have CentOS 7.3.1611, but even if I update to the latest packages I get curl-7.29.0-35.el7.centos.x86_64 (openssl-devel-1.0.1e-60.el7_3.1). How did you get curl 7.50.1 on the system?

Also can you give me output of dotnet --info ? Getting all versions exactly right is critical for getting more useful data from core.
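
When collecting versions, it can help to pull the libcurl version and TLS backend out of the first line of `curl --version`, since that is what separates an OpenSSL-based build from an NSS-based one. The helper below is a hypothetical sketch (the function name is made up), and it takes the line as an argument so it can be run against saved output. It only recognizes the `OpenSSL/` and `NSS/` tokens.

```shell
#!/bin/sh
# Hypothetical helper: summarize the first line of `curl --version`, e.g.
#   curl 7.50.1 (x86_64-pc-linux-gnu) libcurl/7.50.1 OpenSSL/1.0.1e zlib/1.2.7
curl_build_summary() {
    line=$1
    libcurl=unknown
    tls=unknown
    # Walk the whitespace-separated tokens and pick out the ones we care about.
    for token in $line; do
        case $token in
            libcurl/*) libcurl=${token#libcurl/} ;;
            OpenSSL/*|NSS/*) tls=$token ;;
        esac
    done
    printf 'libcurl=%s tls=%s\n' "$libcurl" "$tls"
}

# On a live system (assumes curl is on PATH):
#   curl_build_summary "$(curl --version | head -n 1)"
```

Note that this reports what the `curl` command-line tool is linked against, which is not necessarily the libcurl that the dotnet process loads.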

wfurt commented 7 years ago

BTW I was thinking about it more @jchannon and I'm wondering if you could set up identical VM for investigation. If you have any code used in attempt to reproduce I would like to see it as well. That may give me clue what the app is trying to do and how it is structured.

jchannon commented 7 years ago

Hi. We can give you a VM as our customer has it, or we can give you a VM without our apps on it. Whatever is easiest for you. The way we get curl 7.50.1 is to build from source. The issue is I don't think it's going to be a case of running our apps and watching it fall over. We have 100+ customers with our software and only 2 have hit a segfault.

karelz commented 7 years ago

@jchannon what motivated you to build curl from source? Is there any reason why you don't use 'default' versions available (and I think recommended by our install steps)?

jchannon commented 7 years ago

I can't remember why that version now I'm afraid. As far as I know there are no curl version recommendations but could be wrong

wfurt commented 7 years ago

If I get a VM with the app, can I run it, @jchannon? If so, that would be great. If I can reproduce the crash I'm positive we can solve this.

The dotnet was built against stock libcurl. Using a different version may or may not be a problem.
As a minimum, I'll need a debug version of your custom build, or at least the configuration script so we can produce an identical version.
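
To confirm which libcurl file a process actually has mapped (stock or custom build), the `/proc/<pid>/maps` listing can be filtered. The helper below is a hypothetical sketch: the function name is made up, it takes the maps file as an argument, and it assumes library paths contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct libcurl files mapped into a process.
# $1 is a file in /proc/<pid>/maps format, where the 6th field is the mapped path.
libcurl_paths_from_maps() {
    awk '$6 ~ /libcurl/ { print $6 }' "$1" | sort -u
}

# Example call (replace 12345 with the pid of the running dotnet process):
#   libcurl_paths_from_maps /proc/12345/maps
```

If the path printed is not the one expected, the debug symbols need to match that file rather than the one the `curl` command reports.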

jchannon commented 7 years ago

We can give you a VM via an OVF file, but you'll need to connect to our VPN, as the app needs to talk to another machine which it pings as it sends HTTP requests to it. If you are able to do that, let me know. If not, we can come up with another plan.

mikeh688 commented 7 years ago

is there a specific libcurl version we should be checking for?

wfurt commented 7 years ago

I would start with version coming with OS @mikeh688 . That is what is used to build dotnet. That would have widest deployment and test coverage.

wfurt commented 7 years ago

Can you host the VM and give me ssh access, @jchannon? I can load the OVF as well, but as you pointed out it may need some other service. There are two levels of troubleshooting. The first is to get an identical system with debug symbols so we can crack the core files. It needs to match 100% as far as the libraries used and the application (dotnet) are concerned. Without that, the symbols would just be garbage. The second level is the ability to reproduce the crash in a somewhat controlled environment. With that, one could instrument the code and use other tricks to get more insight.

I would strongly encourage you to try it with stock curl. With that you can simply grab debuginfo packages and be sure the symbols do match used binaries. Even if there was good reason to upgrade curl, it will help with this investigation.

mikeh688 commented 7 years ago

as a first step, i suggest we take your suggestion and get back to an image running on the stock libcurl. we'll give that to the customer facing an issue and see how we get on. if it clears the condition, great. If not, we'll work with you and get you an image; we dont know what induces the problem; it can strike at any point over a period of several weeks and the solution consists of multiple, relatively complex, parts.

Yantrio commented 7 years ago

| The dotnet was built against stock libcurl. Using different version may or may not be problem.

can I get an exact version for when you say "stock" for curl so I can make sure we have the right one please @wfurt?

wfurt commented 7 years ago

get the 7.29.0-35 (or 7.29.0 if you get it from curl site directly)

Yantrio commented 7 years ago

Because the version of curl that ships with CentOS is compiled with NSS instead of OpenSSL, we need to use an OpenSSL-compiled version of curl (see https://github.com/dotnet/corefx/issues/9728 for more information). At the moment we are using 7.50 compiled by hand with OpenSSL, which (in theory) should be more stable and more up to date than 7.29.

karelz commented 7 years ago

Sounds like a legitimate reason. In that case we need symbols (at minimum) for that specific hand-built curl. Without symbols, there is not much we can do :(

mikeh688 commented 7 years ago

slight change in plan since the last post; we’re going to produce a build using libcurl/nss and therefore be exactly the same as you test against. we’ll start from there as our reference and ask the customer to test using that.

[EDIT] Removing email reply by @karelz

karelz commented 7 years ago

We understand that NSS has certain disadvantages. Using OpenSSL-based build is totally fine. We just need symbols, that's all. Of course, ultimately it is your decision which route to go -- rebuild new build with symbols or use NSS.

lmingzhi618 commented 6 years ago

First, curl_global_init and Curl_resolv_timeout are not thread-safe.

  1. Make sure curl_global_init and curl_global_cleanup are invoked in your main thread. curl_easy_init will invoke curl_global_init, which modifies a global variable named 'initialized', and it is not safe to modify a global variable without a lock.

  2. Curl_resolv_timeout is not thread-safe either. It calls sigsetjmp(curl_jmpenv, 1) and siglongjmp(curl_jmpenv, 1), but curl_jmpenv is a global variable. To disable the sigsetjmp/siglongjmp path, you should do this: curl_easy_setopt(curl, CURLOPT_NOSIGNAL, 1);

tmds commented 6 years ago

There is now also a .NET Core 2.0 package available for CentOS. This includes a more recent version of libcurl that uses OpenSSL. To install the package:

yum install centos-release-dotnet
yum install rh-dotnet20 

To use the package:

scl enable rh-dotnet20 bash
dotnet --info

wfurt commented 6 years ago

Did we find some solution or more isolated repro? I'm going to close this unless we get more information.

karelz commented 6 years ago

Closing. If there is more evidence of any problems, please let us know and we can reopen if there is more info available. Thanks!