dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.96k stars 4.65k forks source link

NamedPipeClientStream high cpu usage when server unavailable #22876

Closed tarekgh closed 4 years ago

tarekgh commented 7 years ago

From @coderb on July 20, 2017 11:31

NamedPipeClientStream does a tight infinite loop (spins without blocking) if Connect(Timeout.Infinite) is called and the server is unavailable (not running or the pipe is busy).

I would suggest that the retry loop should yield the cpu (Thread.Sleep/Task.Delay) for say 250ms.

        public void Connect(int timeout) {
            CheckConnectOperationsClient();

            if (timeout < 0 && timeout != Timeout.Infinite) {
                throw new ArgumentOutOfRangeException("timeout", SR.GetString(SR.ArgumentOutOfRange_InvalidTimeout));
            }

            UnsafeNativeMethods.SECURITY_ATTRIBUTES secAttrs = PipeStream.GetSecAttrs(m_inheritability);

            int _pipeFlags = (int)m_pipeOptions;
            if (m_impersonationLevel != TokenImpersonationLevel.None) {
                _pipeFlags |= UnsafeNativeMethods.SECURITY_SQOS_PRESENT;
                _pipeFlags |= (((int)m_impersonationLevel - 1) << 16);
            }

            // This is the main connection loop. It will loop until the timeout expires.  Most of the 
            // time, we will be waiting in the WaitNamedPipe win32 blocking function; however, there are
            // cases when we will need to loop: 1) The server is not created (WaitNamedPipe returns 
            // straight away in such cases), and 2) when another client connects to our server in between 
            // our WaitNamedPipe and CreateFile calls.
            int startTime = Environment.TickCount;
            int elapsed = 0;
            do {
                // Wait for pipe to become free (this will block unless the pipe does not exist).
                if (!UnsafeNativeMethods.WaitNamedPipe(m_normalizedPipePath, timeout - elapsed)) {
                    int errorCode = Marshal.GetLastWin32Error();

                    // Server is not yet created so let's keep looping.
                    if (errorCode == UnsafeNativeMethods.ERROR_FILE_NOT_FOUND) {
                        continue; 
                    }

Copied from original issue: Microsoft/dotnet#445

tarekgh commented 7 years ago

cc @JeremyKuhne @ianhays

JeremyKuhne commented 7 years ago

This is a NetFX issue. CoreFX spin waits here. @ianhays

ianhays commented 7 years ago

Closing this as it isn't .net core relevant.

I've backported this fix to full .net and it should be included in the release after the next one.

coderb commented 7 years ago

now that's what i call service!!! thank you!!!

coderb commented 7 years ago

so i see the corefx code uses SpinWait.SpinOnce() on each loop iteration.

this strikes me as ultra agressive and suprising behavior. the API from the outside would appear to block ala a Socket.Connect() when in reality it is going to eat tons of CPU. SpinOnce() at best will Sleep(1) which is virtually a no-op when the other loop activity is significant as in this case.

why not start by sleeping 50ms and then bump it up to 250ms after a certain number of iterations? if the user is waiting a for a server then an extra 250ms shouldn't be noticed. But a pegged CPU due to spin waiting surely will and is quite unfriendly.

stephentoub commented 7 years ago

when in reality it is going to eat tons of CPU

SpinWait has fallback behavior, such that i'll end up doing Thread.Sleeps: image

coderb commented 7 years ago

Your test is analogous to

while (true) {
   Thread.Sleep(1);
}

but really what is happing is

while (true) {
   DoSignificantWork()
  Thread.Sleep(1);
}

what matters is the ratio of work to sleeping to figure out CPU load. SpinWait is inappropriate when you are doing a lot of CPU bound work in the loop. Eg 99ms workload and sleep 1ms -> 99% CPU utilitzation. #NotGood!

(that said i didn't actually test anything so i may be full of sh*t).

SpinWait on my system:

    public void SpinOnce()
    {
      if (this.NextSpinWillYield)
      {
        CdsSyncEtwBCLProvider.Log.SpinWait_NextSpinWillYield();
        int num = this.m_count >= 10 ? this.m_count - 10 : this.m_count;
        if (num % 20 == 19)
          Thread.Sleep(1);
        else if (num % 5 == 4)
          Thread.Sleep(0);
        else
          Thread.Yield();
      }
      else
        Thread.SpinWait(4 << this.m_count);
      this.m_count = this.m_count == int.MaxValue ? 10 : this.m_count + 1;
    }
stephentoub commented 7 years ago

Eg 99ms workload and sleep 1ms -> 99% CPU utilitzation

It generally won't sleep for 1ms... more likely 10-15ms. And the work required in WaitNamedPipe if it doesn't block is small.

but really what is happing is

Can you please share a repro on .NET Core 2.0? I see this: image

coderb commented 7 years ago

ok, i hear you and it probably is ok on windows. the thing is there are multiple assumptions built in that are fragile- 10-15ms timeslice architecturally depenedent on the windows scheduler. not sure what would happen on linux especially on a kernel that might be build with NOHZ or realtime / low latency (1ms scheduler if i recall). we're also assuming TryConnect() is fast. i just don't see why SpinWait is chosen other than it being a 1-liner to code rather than an adaptive backoff or even a straight up Thread.Sleep(50). thanks for entertaining my gripes!

stephentoub commented 7 years ago

i just don't see why SpinWait is chosen other than it being a 1-liner to code rather than an adaptive backoff or even a straight up Thread.Sleep(50)

Because there's a downside to longer waits, even with a backoff scheme, and we've not yet seen a case that's harmed by using the smallest sleep possible. If you have one, please do share, but otherwise in the absence of an actual problem we'd prefer to optimize for the cases we know to exist and to keep the code as simple as possible. One of the most common scenarios that hits this code path is a process forking off another that it then wants to communicate with over a named pipe; generally the delay before the server creates the listening pipe is very small, and any additional delay incurred by the client artificially sleeping for a longer period of time can actually add up to very noticeable delays if done frequently enough. The spin wait fix is simple, easy to understand, and addresses the problems we've seen (the biggest problem we saw was with the C# compilation server, when on a single core box, the client could end up starving the server's ability to create the very pipe the client was waiting for). If there are other problems and you can share repros that highlight them, we'll want to revisit.

coderb commented 7 years ago

kudos @stephentoub . i get it. i appreciate you taking the time to explain. what you say makes complete sense. i was a little wound up by getting hit with the 100% cpu bug on netfx but you've clearly thought it through. thank you.

stephentoub commented 7 years ago

Thanks, @coderb. If after trying the fix you do find situations where it's still problematic, please do let us know :)