dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

HttpContent Stream Read() and ReadAsync() consume 100% CPU on Scientific Linux 7.2 #17731

Closed crozone closed 4 years ago

crozone commented 8 years ago

Running 1.0.0-preview2-003121 on Scientific Linux release 7.2 (Nitrogen), which dotnet --info identifies as (OS Name: rhel, OS Version: 7.2, OS Platform: Linux, RID: rhel.7.2-x64).

In order to read from an SSE EventSource (which is basically just an HTTP GET that doesn't close immediately), my code currently:

  1. Creates an HttpClient
  2. Sets up an HttpRequestMessage and uses it in a SendAsync() on the HttpClient to get an HttpResponseMessage
  3. awaits HttpResponseMessage.Content.ReadAsStreamAsync()
  4. Enters into a loop calling ReadAsync() on the Stream, which is then fed into a decoder.
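
Steps 1 to 3 can be sketched roughly as follows. This is not the code from the attached repro project: the names `SseClientSketch` and `OpenEventStreamAsync` and the `Accept: text/event-stream` header are assumptions made for illustration. `HttpCompletionOption.ResponseHeadersRead` is used so that `SendAsync()` completes once the headers arrive rather than trying to buffer a body that never ends.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class SseClientSketch
{
    // Hypothetical helper: sends the GET (step 2) and returns the response
    // body as a stream (step 3). The caller owns the returned stream and
    // runs the read loop from step 4 on it.
    public static async Task<Stream> OpenEventStreamAsync(HttpClient client, string url)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        // Assumption: the server expects the usual SSE media type.
        request.Headers.Accept.Add(new MediaTypeWithQualityHeaderValue("text/event-stream"));

        // Return as soon as the headers are available; the body stays open.
        HttpResponseMessage response = await client.SendAsync(
            request, HttpCompletionOption.ResponseHeadersRead);
        response.EnsureSuccessStatusCode();

        return await response.Content.ReadAsStreamAsync();
    }
}
```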

The issue is that on RHEL 7.2, await ReadAsync() blocks (as expected), but with 100% CPU used by the dotnet process. This also occurs if the ReadAsync() is replaced by the standard blocking Read(). To my untrained eye, it appears to be spin waiting or spin locking, or something in that vein.

I have tested and confirmed that the issue does not manifest itself on Windows 10 (build 14372) or Ubuntu 14.04.

I have attached a small example project that reproduces the issue. It is unfortunately slightly cumbersome to run, since an event source must be hosted for it to read, and I can't find an existing example event source server to point it to.

RHELBugTest.zip

davidsh commented 8 years ago

cc: @stephentoub

stephentoub commented 8 years ago

@crozone, thanks for the bug report, including repro code. I just tried this out with .NET Core on both Windows and on Ubuntu (I don't have a Scientific Linux setup handy). I see 100% CPU usage on both of those, but not in the call to ReadAsync. Rather, the calls to ReadAsync are completing immediately and returning 0, indicating that the response stream has ended and there's nothing more to be read. The spinning is coming from the while (true) loop in your repro that just repeatedly calls ReadAsync even if it's already returned 0:

            while (true) {
                // *** ISSUE ***
                // TODO: This is where we get 100% CPU on RHEL
                int byteCount = await eventStream.ReadAsync(buffer, 0, buffer.Length);
                // *** END ISSUE ***

                char[] chars = new char[decoder.GetCharCount(buffer, 0, byteCount)];
                decoder.GetChars(buffer, 0, byteCount, chars, 0);
                Console.Write(new string(chars));
            }

I tried it using http://www.w3schools.com/HTML/demo_sse.php. Maybe there's something special about this endpoint vs yours that's causing a difference?
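
For comparison, a version of that loop that stops once ReadAsync returns 0 might look like the sketch below. The class and method names are hypothetical, and UTF-8 is an assumption (the repro does not say which encoding its decoder uses). The stateful `Decoder` is kept so that a multi-byte character split across two reads is still decoded correctly.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

public static class EventStreamReaderSketch
{
    // Reads until the server closes the connection, writing decoded text
    // to `output`. Returns the total number of bytes read.
    public static async Task<long> CopyDecodedAsync(Stream eventStream, TextWriter output)
    {
        var buffer = new byte[4096];
        // Large enough for any decode of one full buffer.
        var chars = new char[Encoding.UTF8.GetMaxCharCount(buffer.Length)];
        Decoder decoder = Encoding.UTF8.GetDecoder(); // assumption: UTF-8 payload
        long total = 0;

        while (true)
        {
            int byteCount = await eventStream.ReadAsync(buffer, 0, buffer.Length);
            if (byteCount == 0)
            {
                // 0 means end of stream. Calling ReadAsync again would just
                // return 0 immediately, which is what makes the loop spin.
                break;
            }

            total += byteCount;
            int charCount = decoder.GetChars(buffer, 0, byteCount, chars, 0);
            output.Write(chars, 0, charCount);
        }

        return total;
    }
}
```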

crozone commented 8 years ago

After further investigation, I'm now confident that I can reproduce this issue on RHEL SL, and that the issue doesn't occur on Ubuntu 14.04 and Windows 10. Against my event source (which is a non-public event source hosted at https://api.particle.io/v1/devices/events/, which I believe runs on top of node.js), the code correctly awaits eventStream.ReadAsync, and only continues when there is at least one byte of data to be read. It only appears to return 0 bytes after the connection has been closed. Again, on Ubuntu and Windows this uses virtually no CPU; on RHEL SL it pegs the core at 100%.

I believe the reason you were seeing the 100% CPU from the while(true) loop in your repro attempt was because the w3schools event source demo doesn't behave like a typical event source would - it closes itself almost immediately, causing bytesRead to return 0 and produce an infinite reconnection loop (poor repro code on my part). I've updated the repro code to be more verbose in its debug output, and added an await Task.Delay(1000) on reconnect to prevent any tight loops.
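
The reconnect delay described above could be structured along these lines. This is a sketch rather than the code in the updated repro: `ReconnectSketch`, `RunAsync` and the `connectAndRead` delegate are hypothetical names, and the cancellation token is an addition so the loop can be stopped cleanly.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ReconnectSketch
{
    // Calls `connectAndRead` repeatedly, pausing between attempts so that a
    // server which closes the connection immediately cannot cause a tight loop.
    // Returns the number of attempts made before cancellation.
    public static async Task<int> RunAsync(
        Func<Task> connectAndRead, TimeSpan pause, CancellationToken token)
    {
        int attempts = 0;
        while (!token.IsCancellationRequested)
        {
            attempts++;
            try
            {
                await connectAndRead();
            }
            catch (Exception ex)
            {
                // A failed attempt is logged and retried after the pause.
                Console.Error.WriteLine("Connection attempt failed: " + ex.Message);
            }

            try
            {
                await Task.Delay(pause, token);
            }
            catch (OperationCanceledException)
            {
                break; // cancelled while waiting to reconnect
            }
        }
        return attempts;
    }
}
```

With the repro's one-second delay, `pause` would be `TimeSpan.FromSeconds(1)`.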

To investigate further, I ran an strace -f against the process on both Ubuntu and RHEL SL to see what calls were being made by both processes, and got significantly different results. Ubuntu appears to perform relatively few futex calls with a few wait4 calls thrown in between them. SL performs similar futex calls, but with many poll and clock_gettime(CLOCK_MONOTONIC... syscalls in between them. Alarmingly, the strace output for the program running for 15 seconds on Ubuntu was 200 KB (about 3000 syscalls); on RHEL SL it was 11.5 MB (about 115000 syscalls). It definitely looks like RHEL SL is spinning.
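
One way to compare two such traces is to tally calls per syscall name. The helper below is a hypothetical sketch, not a tool used in this thread; it assumes the usual strace -f line shapes (an optional "[pid N]" or bare pid prefix followed by "name(") and skips "<... name resumed>" continuation lines so each call is counted once.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StraceTallySketch
{
    // Matches the start of a call, for example "[pid  4643] futex(0x1285614, ..."
    // or "4643 futex(..." or plain "poll(...". Continuation lines such as
    // "[pid  4643] <... futex resumed> ) = 0" do not match and are not counted.
    private static readonly Regex CallStart = new Regex(
        @"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(",
        RegexOptions.Compiled);

    public static Dictionary<string, int> CountSyscalls(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        foreach (string line in lines)
        {
            Match m = CallStart.Match(line);
            if (!m.Success)
            {
                continue;
            }

            string name = m.Groups[1].Value;
            int current;
            counts.TryGetValue(name, out current); // leaves 0 when absent
            counts[name] = current + 1;
        }
        return counts;
    }
}
```

Feeding it `System.IO.File.ReadLines(path)` for each of the two trace files gives per-syscall totals that can be compared side by side.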

I have attached the new repro code, as well as the program's STDOUT output and the strace outputs from the program running on both Ubuntu and RHEL SL (only the first few thousand lines from RHEL SL for size reasons).

New Repro Project: RHELBugTest2.zip

Program output: software-output.txt

Complete Ubuntu strace -f: ubuntu-14.04-AsyncRead-strace.txt

Partial RHEL SL strace -f: redhat-7.2-AsyncRead-strace-partial.txt

EDIT: Changed RHEL to SL, since technically this is a Scientific Linux 7.2 build, not actual RHEL 7.2 (although they're built from practically the same source).

CRCinAU commented 8 years ago

Dropping by as the sysadmin that runs said EL servers if any input is required as to any system config or other details that may be required. Don't be afraid to ask for info :)

stephentoub commented 8 years ago

@crozone and @CRCinAU, thanks for following up.

@crozone, would you be able to include an EventListener like https://github.com/dotnet/corefx/blob/master/src/Common/tests/System/Diagnostics/Tracing/ConsoleEventListener.cs in your repro, wrapping your main method in something like:

using (new ConsoleEventListener("Http"))
{
    ...
}

I'm curious to see what the log shows while this spinning is happening. This will turn on both libcurl's logging as well as additional logging we do in the System.Net.Http component, and will route it all to stdout (if you'd prefer to route it to a file, you could of course edit the ConsoleEventListener to write the data wherever you like).
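
For readers without the linked file to hand, the general shape of such a listener is sketched below. This is a simplified stand-in, not a copy of the corefx ConsoleEventListener: the class name, the substring filter and the output format (loosely modelled on the "[Source-Id] (payload)." lines such a log contains) are all assumptions.

```csharp
using System;
using System.Diagnostics.Tracing;
using System.IO;

// Hypothetical, simplified listener: enables every EventSource whose name
// contains the given filter and writes each event to a TextWriter.
public sealed class FilteringEventListener : EventListener
{
    private readonly string _nameFilter;
    private readonly TextWriter _output;

    public FilteringEventListener(string nameFilter, TextWriter output = null)
    {
        _nameFilter = nameFilter;
        _output = output ?? Console.Out;

        // Sources that already exist may have been announced before
        // _nameFilter was assigned, so enable matching ones explicitly.
        foreach (EventSource source in EventSource.GetSources())
        {
            EnableIfMatching(source);
        }
    }

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        EnableIfMatching(eventSource);
    }

    private void EnableIfMatching(EventSource source)
    {
        // _nameFilter is still null if this runs from the base constructor.
        if (_nameFilter != null && source.Name.Contains(_nameFilter))
        {
            EnableEvents(source, EventLevel.Verbose);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        string payload = eventData.Payload == null
            ? string.Empty
            : string.Join(", ", eventData.Payload);

        lock (_output)
        {
            _output.WriteLine("[" + eventData.EventSource.Name + "-" + eventData.EventId + "] (" + payload + ").");
        }
    }
}
```

Because EventListener is disposable, it can be wrapped around the code under investigation with a using block, as in the snippet above.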

Thanks!

crozone commented 8 years ago

Okay, I placed the using (new ConsoleEventListener("Http")){} around the event source read loop, and ran it on both the Ubuntu and Scientific Linux machines.

Here are the outputs (with private Bearer tokens redacted 8-) )

Ubuntu: ubuntu-debug.txt

Scientific Linux: sl-debug.txt

They both drop into very similar loops which repeat every 15 seconds as LineFeed characters are sent down from the server (as heartbeats).

The main difference I noticed is that the Ubuntu machine spits out [Microsoft-System-Net-Http-6] (3, 0, WaitForWork, Wait wake-up)., then [Microsoft-System-Net-Http-6] (3, 1, HandleIncomingRequests, Type: Unpause)., and then continues to produce [Microsoft-System-Net-Http-6] (3, 0, WaitForWork, Wait timeout). every few seconds until the next NewLine character is read.

On the Scientific Linux system, it never outputs a "WaitForWork", it just hits [Microsoft-System-Net-Http-6] (3, 1, HandleIncomingRequests, Type: Unpause). and then sits there until the NewLine is sent.

Is it possible that this behaviour is being caused by a difference in libcurl? libcurl appears to be "libcurl/7.29.0" on the Scientific Linux installation, and "libcurl/7.35.0" on the Ubuntu installation.

TingluoHuang commented 8 years ago

Hi, our customer hit a similar issue on CentOS; our coreclr console app is making a long-poll REST call. https://github.com/Microsoft/vsts-agent/issues/454

ericeil commented 8 years ago

@TingluoHuang thanks for the extra data! Do you know which libcurl version is involved in your customer's case?

ericeil commented 8 years ago

I'm so far unable to reproduce this myself, but this is likely because I don't have the same event source.

@crozone, thank you for sending the logs; I'm still looking through them. It may also be useful to collect a perf trace: https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/linux-performance-tracing.md.

Also, have you seen this behavior on the RTM .NET Core release?

TingluoHuang commented 8 years ago

Loop in my customer. :) @ppanyukov, can you provide the information @ericeil wanted?

Thanks, Ting

ppanyukov commented 8 years ago

@ericeil the version of libcurl is this:

libcurl-7.29.0-25.el7.centos.x86_64

I will get a newer version and see if the problem still exists.

ppanyukov commented 8 years ago

@ericeil This looks to be definitely related to libcurl.

I have updated to libcurl-7.50.0-2.0.cf.rhel7.x86_64 and the problem has gone away.

The syscall stats look way more healthy now:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 70.00    0.028000          17      1672       255 futex
 20.00    0.008000         242        33           poll
 10.00    0.004000         571         7         4 restart_syscall
  0.00    0.000000           0       352           mprotect
  0.00    0.000000           0         1           madvise
  0.00    0.000000           0       178           wait4
  0.00    0.000000           0        36           gettimeofday
  0.00    0.000000           0        36           getrusage
  0.00    0.000000           0       177           gettid
  0.00    0.000000           0       748           clock_gettime
------ ----------- ----------- --------- --------- ----------------
100.00    0.040000                  3240       259 total

0.00user 0.02system 0:17.90elapsed 0%CPU (0avgtext+0avgdata 2444maxresident)k
0inputs+0outputs (0major+159minor)pagefaults 0swaps

Here are the actual sequences:

[pid 21068] wait4(267,  <unfinished ...>
[pid  4643] <... futex resumed> )       = 0
[pid 21068] <... wait4 resumed> 0x7ffa937fd5e4, WNOHANG, NULL) = 0
[pid  4643] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 21068] mprotect(0x7ffb4d794000, 4096, PROT_READ|PROT_WRITE <unfinished ...>
[pid  4643] <... clock_gettime resumed> {1469706279, 262498381}) = 0
[pid 21068] <... mprotect resumed> )    = 0
[pid 21068] mprotect(0x7ffb4d794000, 4096, PROT_NONE <unfinished ...>
[pid  4643] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 21068] <... mprotect resumed> )    = 0
[pid  4643] <... clock_gettime resumed> {1469706279, 262641983}) = 0
[pid 21068] gettid( <unfinished ...>
[pid  4643] futex(0x1285614, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 7989, {1469706279, 362641983}, ffffffff <unfinished ...>
[pid 21068] <... gettid resumed> )      = 915
[pid 21068] clock_gettime(CLOCK_MONOTONIC, {83226, 548594884}) = 0
[pid 21068] clock_gettime(CLOCK_REALTIME, {1469706279, 262839286}) = 0
[pid 21068] futex(0x7ffa84003044, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 149, {1469706299, 262839286}, ffffffff <unfinished ...>
[pid  4643] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  4643] futex(0x12855e8, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  4643] futex(0x7ffaa0091664, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ffaa0091660, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
[pid  4682] <... futex resumed> )       = 0
[pid  4643] <... futex resumed> )       = 1
[pid  4682] futex(0x7ffaa0091638, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  4643] futex(0x7ffaa0091638, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4682] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  4643] <... futex resumed> )       = 0
[pid  4682] futex(0x7ffaa0091638, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  4643] futex(0x1285614, FUTEX_WAIT_PRIVATE, 7991, NULL <unfinished ...>
[pid  4682] futex(0x1285614, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x1285610, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
[pid  4643] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  4682] <... futex resumed> )       = 0
[pid  4643] futex(0x12855e8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  4682] futex(0x12855e8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4643] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  4682] <... futex resumed> )       = 0
[pid  4643] futex(0x12855e8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4682] wait4(267,  <unfinished ...>

For those who want to play around with this version of libcurl on EL7, here is what I did to get it:

# EPEL repo is for libnghttp2.so.14 required by new libcurl
echo "Installing curl" \
    && yum -y install epel-release \
    && rpm -Uvh http://www.city-fan.org/ftp/contrib/yum-repo/rhel7/x86_64/city-fan.org-release-1-13.rhel7.noarch.rpm \
    && yum -y install libcurl

# rpm -qa | grep libcurl
libcurl-7.50.0-2.0.cf.rhel7.x86_64

Of course, this repository and this build of libcurl will never be approved for production in most shops, I think, so let's not have this as a solution please :)

ppanyukov commented 8 years ago

Oh and this may be something in underlying libraries, not libcurl itself.

For completeness, here is the list of dependencies.

The libcurl-7.50.0-2.0.cf.rhel7.x86_64 pulls in these:

=============================================================================================
 Package             Arch           Version                       Repository            Size
=============================================================================================
Updating:
 libcurl             x86_64         7.50.0-2.0.cf.rhel7           city-fan.org         377 k
Installing for dependencies:
 libicu              x86_64         50.1.2-15.el7                 base                 6.9 M
 libmetalink         x86_64         0.1.2-9.rhel7                 city-fan.org          25 k
 libnghttp2          x86_64         1.7.1-1.el7                   epel                  61 k
 libpsl              x86_64         0.7.0-1.el7                   city-fan.org          45 k
Updating for dependencies:
 curl                x86_64         7.50.0-2.0.cf.rhel7           city-fan.org         414 k
 libssh2             x86_64         1.7.0-5.0.cf.rhel7            city-fan.org         102 k

Transaction Summary
=============================================================================================
Install             ( 4 Dependent packages)
Upgrade  1 Package  (+2 Dependent packages)

Library dependencies. Standard Centos 7.2: libcurl-7.29.0-25.el7.centos.x86_64:

# ldd /lib64/libcurl.so.4
    linux-vdso.so.1 =>  (0x00007ffedfeef000)
    libidn.so.11 => /lib64/libidn.so.11 (0x00007f88331a8000)
    libssh2.so.1 => /lib64/libssh2.so.1 (0x00007f8832f7e000)
    libssl3.so => /lib64/libssl3.so (0x00007f8832d3a000)
    libsmime3.so => /lib64/libsmime3.so (0x00007f8832b13000)
    libnss3.so => /lib64/libnss3.so (0x00007f88327ed000)
    libnssutil3.so => /lib64/libnssutil3.so (0x00007f88325c0000)
    libplds4.so => /lib64/libplds4.so (0x00007f88323bc000)
    libplc4.so => /lib64/libplc4.so (0x00007f88321b7000)
    libnspr4.so => /lib64/libnspr4.so (0x00007f8831f78000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f8831d5c000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f8831b58000)
    libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00007f883190b000)
    libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00007f8831626000)
    libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00007f88313f4000)
    libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007f88311ef000)
    liblber-2.4.so.2 => /lib64/liblber-2.4.so.2 (0x00007f8830fe0000)
    libldap-2.4.so.2 => /lib64/libldap-2.4.so.2 (0x00007f8830d8d000)
    libz.so.1 => /lib64/libz.so.1 (0x00007f8830b76000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f88307b4000)
    libssl.so.10 => /lib64/libssl.so.10 (0x00007f8830547000)
    libcrypto.so.10 => /lib64/libcrypto.so.10 (0x00007f883015e000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f882ff56000)
    /lib64/ld-linux-x86-64.so.2 (0x0000559d442ae000)
    libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00007f882fd46000)
    libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00007f882fb42000)
    libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f882f928000)
    libsasl2.so.3 => /lib64/libsasl2.so.3 (0x00007f882f70a000)
    libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f882f4e5000)
    libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f882f2ad000)
    libpcre.so.1 => /lib64/libpcre.so.1 (0x00007f882f04c000)
    liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f882ee27000)
    libfreebl3.so => /lib64/libfreebl3.so (0x00007f882ec23000)

Library dependencies. Updated libcurl-7.50.0-2.0.cf.rhel7.x86_64:

# ldd /lib64/libcurl.so.4
    linux-vdso.so.1 =>  (0x00007ffc2e384000)
    libnghttp2.so.14 => /lib64/libnghttp2.so.14 (0x00007fddeea36000)
    libidn.so.11 => /lib64/libidn.so.11 (0x00007fddee803000)
    libssh2.so.1 => /lib64/libssh2.so.1 (0x00007fddee5d5000)
    libpsl.so.0 => /lib64/libpsl.so.0 (0x00007fddee35d000)
    libssl3.so => /lib64/libssl3.so (0x00007fddee11a000)
    libsmime3.so => /lib64/libsmime3.so (0x00007fddedef2000)
    libnss3.so => /lib64/libnss3.so (0x00007fddedbcc000)
    libnssutil3.so => /lib64/libnssutil3.so (0x00007fdded9a0000)
    libplds4.so => /lib64/libplds4.so (0x00007fdded79b000)
    libplc4.so => /lib64/libplc4.so (0x00007fdded596000)
    libnspr4.so => /lib64/libnspr4.so (0x00007fdded358000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fdded13b000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007fddecf37000)
    libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00007fddecceb000)
    libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00007fddeca05000)
    libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00007fddec7d3000)
    libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007fddec5cf000)
    liblber-2.4.so.2 => /lib64/liblber-2.4.so.2 (0x00007fddec3bf000)
    libldap-2.4.so.2 => /lib64/libldap-2.4.so.2 (0x00007fddec16c000)
    libz.so.1 => /lib64/libz.so.1 (0x00007fddebf56000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fddebb93000)
    libssl.so.10 => /lib64/libssl.so.10 (0x00007fddeb926000)
    libcrypto.so.10 => /lib64/libcrypto.so.10 (0x00007fddeb53e000)
    libicuuc.so.50 => /lib64/libicuuc.so.50 (0x00007fddeb1c4000)
    libicudata.so.50 => /lib64/libicudata.so.50 (0x00007fdde9bf0000)
    librt.so.1 => /lib64/librt.so.1 (0x00007fdde99e7000)
    /lib64/ld-linux-x86-64.so.2 (0x0000558fea944000)
    libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00007fdde97d8000)
    libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00007fdde95d4000)
    libresolv.so.2 => /lib64/libresolv.so.2 (0x00007fdde93b9000)
    libsasl2.so.3 => /lib64/libsasl2.so.3 (0x00007fdde919c000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fdde8e93000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fdde8b91000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fdde897b000)
    libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fdde8755000)
    libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007fdde851e000)
    libpcre.so.1 => /lib64/libpcre.so.1 (0x00007fdde82bc000)
    liblzma.so.5 => /lib64/liblzma.so.5 (0x00007fdde8097000)
    libfreebl3.so => /lib64/libfreebl3.so (0x00007fdde7e94000)

ericeil commented 8 years ago

Thanks for trying that, @ppanyukov!

@ppanyukov, @crozone, do either of you have an SSE source I can use to reproduce this myself? Either a public server somewhere, or a local server I can run, would work.

ppanyukov commented 8 years ago

@ericeil I can build one for you easily with all the VSTS agent things we are using in Azure if you give me your public ssh key. Email ppanyukov at googlemail.com, we can discuss offline.

The tricky part with our setup is you would need a visualstudio.com thing to do builds, which I can also setup for you. Or you can get your own. Again, can discuss offline.

PS. Oh, and what is SSE source? Ah never mind, got it. Well the visualstudio.com would be SSE source for us. Otherwise we don't have anything ready.

ericeil commented 8 years ago

@ppanyukov, thanks for the offer of help! I'm going to be away from GitHub for the next couple of weeks, but I'll follow up with you after that.

CRCinAU commented 8 years ago

Just wanting to give this a bit of a prod to make sure it doesn't get forgotten :)

crozone commented 8 years ago

I'm going to write a basic event source server using kestrel for everyone to test against - give me a day or so and I'll have something to repro with.

ericeil commented 8 years ago

> I'm going to write a basic event source server using kestrel for everyone to test against

Thanks, @crozone! I'm back from vacation now, so I can take a look at this; just let me know when you have something I can try.

crozone commented 8 years ago

https://github.com/crozone/EventSourceDemo

All done, this is a basic event source server that hosts an event source at the URL /EventSource, and pushes down the current time in the data every second.

It is currently implemented by using the standard MVC pipeline to route requests to the EventSource action. The action sets the correct header types and then enters into a loop, async waiting on a semaphore slim to indicate that a message queue has been populated. The loop pops messages off the queue and pushes them as event source formatted plaintext down the response stream. AFAIK, this behaves correctly, but if anyone has a better way of implementing an event source (without involving SignalR), let me know.
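
The queue-plus-semaphore loop described above might look roughly like this. It is a sketch, not the code in the EventSourceDemo repository: the class and member names are hypothetical, and it only handles single-line messages (a multi-line message would need a "data: " prefix on every line).

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public sealed class EventSourceFeedSketch
{
    private readonly ConcurrentQueue<string> _messages = new ConcurrentQueue<string>();
    private readonly SemaphoreSlim _signal = new SemaphoreSlim(0);

    // Producer side: queue a message and wake the pump.
    public void Publish(string message)
    {
        _messages.Enqueue(message);
        _signal.Release();
    }

    // Consumer side: wait asynchronously for messages and write each one to
    // the response stream as an event ("data: <text>" plus a blank line).
    public async Task PumpAsync(Stream response, CancellationToken token)
    {
        try
        {
            while (true)
            {
                await _signal.WaitAsync(token);

                string message;
                if (_messages.TryDequeue(out message))
                {
                    byte[] frame = Encoding.UTF8.GetBytes("data: " + message + "\n\n");
                    await response.WriteAsync(frame, 0, frame.Length, token);
                    await response.FlushAsync(token); // push the event out now
                }
            }
        }
        catch (OperationCanceledException)
        {
            // The client went away or the server is shutting down.
        }
    }
}
```

Waiting on the semaphore rather than polling the queue means the loop uses no CPU between messages.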

The Index page also has some javascript that connects to it and displays the messages as part of the DOM - this works on Firefox and Chrome, but not on Edge just yet (since Edge doesn't support event sources).

ericeil commented 8 years ago

Thank you again, @crozone. With this event source, I'm able to reproduce the problem on my machine! I'll investigate the CPU usage now....

ericeil commented 8 years ago

We seem to get stuck in a loop calling curl_multi_wait, which apparently returns immediately. My current hunch is that this is due to the problem fixed in libcurl with this commit. The "extra" fd we pass in has data available for reading, but we don't realize this because the revents field for that fd never gets set correctly. So we don't actually read the data from the fd, so it always polls true.

It looks like that libcurl issue was fixed in libcurl 7.32.0. Would using that version be a viable option for everyone here?

CRCinAU commented 8 years ago

Not an option here - as it would vary the installations from the upstream vendor.

I have however started a bugzilla request with RedHat in an attempt to have the fix backported into the official RHEL packages: https://bugzilla.redhat.com/show_bug.cgi?id=1367614

If someone closer to the debugging of this issue than me is able to add more technical details to assist the RH team, this may be helpful to them.

crozone commented 8 years ago

We should probably proceed by grabbing the libcurl/7.29.0 source, making a patch for it (add the code from https://github.com/curl/curl/commit/6d30f8ebed34e7276c2a59ee20d466bff17fee56), build and test, and then submit that to the bugzilla report so it can be patched.

CRCinAU commented 8 years ago

Ok - this is turning out to be more complex than I thought.... The EL7 packages already have a heap of patches that touch lib/multi.c. One of those patches is interfering with my application of the ported patch to these sources.

To make things more complex, because of previous patches, I can't even just rip the entire routine out and put in the modified one from 7.32.

As I'm not a native C coder, I'm only guessing as to what is going on when trying to manually port it back - so I'm a little out of my depth in knowing whether what I'm attempting is correct.

For reference of the issue history, the current tree used to create the RHEL package is at: https://git.centos.org/tree/rpms!curl/5522008c68b4e4b077c312f163d6f925e752437c

You'll see the many patches in the SOURCES directory. It might be better off waiting for someone much more familiar with libcurl to take a peek at this.

CRCinAU commented 8 years ago

In related news, seems this is already known in Bugzilla by RH in a different report: https://bugzilla.redhat.com/show_bug.cgi?id=1347904

In a nutshell, scheduled to be fixed in RHEL 7.4.

Looking at the listed commits in the other bug report, it may actually come close to what I've done in the patch in my previous comment.

As such, people probably have 2 options now - test with my patch above (which @crozone and I will try), or wait for the RHEL 7.4 release to drop.

crozone commented 8 years ago

:money_with_wings: sweet.

CRCinAU commented 8 years ago

Ok - test results are a failure.... Seems that it causes a segfault in the dotnet runtime now in the HTTP library.

Unless someone wants to go through and cherry-pick the patches from RH BZ #1347904, it may well be a case of waiting until RHEL 7.4 drops.

(I've deleted my comment with patch so nobody else wastes their time down this path)

stephentoub commented 8 years ago

Another option would be to add our own poll call here: https://github.com/dotnet/corefx/blob/7e2bd07936179c192e682d979b2938b4a7e32030/src/Native/Unix/System.Net.Http.Native/pal_multi.cpp#L72 This would paper-over the old libcurl bug at the expense of an additional poll call in some circumstances.
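
The idea can be illustrated with a toy model. Everything below is hypothetical and simulated in managed code; it is not the native change and does not call libcurl. `BuggyWait` stands in for a wait call that returns a non-zero count when the extra fd is readable but never marks that fd as ready, and `PollDirectly` stands in for a zero-timeout poll of the fd.

```csharp
using System;

public static class ExtraPollSketch
{
    // Toy stand-in for the extra file descriptor.
    public sealed class FakeFd
    {
        public bool HasData;        // what the kernel knows
        public bool ReportedReady;  // what the wait call reports back (like revents)
    }

    // Models the old behaviour: returns immediately with a count of 1 when
    // the fd is readable, but never sets ReportedReady.
    public static int BuggyWait(FakeFd extra)
    {
        extra.ReportedReady = false;
        return extra.HasData ? 1 : 0;
    }

    // Models asking the OS directly whether the fd is readable.
    public static bool PollDirectly(FakeFd fd)
    {
        return fd.HasData;
    }

    // Runs the worker loop until the fd is drained or maxPasses is reached.
    // Returns the number of passes needed, or -1 if the loop only spun.
    public static int PassesUntilDrained(FakeFd extra, bool addOwnPoll, int maxPasses)
    {
        for (int pass = 1; pass <= maxPasses; pass++)
        {
            int ready = BuggyWait(extra);
            bool readable = extra.ReportedReady;

            // The workaround: something is ready but the extra fd was not
            // flagged, so check it ourselves at the cost of one more poll.
            if (!readable && ready > 0 && addOwnPoll)
            {
                readable = PollDirectly(extra);
            }

            if (readable)
            {
                extra.HasData = false; // reading the data clears readiness
                return pass;
            }
        }
        return -1;
    }
}
```

Without the extra poll the model never drains the fd, so every wait returns immediately; with it, the fd is drained on the first pass.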

ericeil commented 8 years ago

> Another option would be to add our own poll call

Yes, I was thinking the same thing. I'll put together a PR.

CRCinAU commented 8 years ago

Ok, I cherry picked the commits against the RHEL version of curl 7.29. These built successfully and seem to fix the problem as documented in this thread.

I uploaded the new packages to: http://au1.mirror.crc.id.au/repo/el7-testing/x86_64/

I called these: curl-7.29.0-25.1.el7 libcurl-7.29.0-25.1.el7

From what I can gather, the fixed redhat version will be curl-7.29.0-32.el7. This means when this package hits, the ones I built will be replaced by the official redhat version.

From initial testing, seems we no longer get 100% CPU usage, however handing over to @crozone for functionality testing...

CRCinAU commented 8 years ago

@stephentoub & @ericeil - my thoughts at the moment would be to leave this as is. RedHat will have a fixed package in distribution at some point in the timeline. I've done a shortcut by patching the existing curl packages and happy to have these packages available for anyone to test.

I think adding extra complexity to the dotnet core stuff may be just extra cruft. Happy to hear thoughts from others - but if this is fixed by my packages AND RedHat will have an update that fixes the problem in the near future, then I feel this is the better path forward.

Also, would be good if others involved can test & advise if it fixes the problem in their use case.