Intermittent: Empty Process stdout

sirkadirov commented 6 years ago

Story

When running user-defined console-only processes using System.Diagnostics.Process simultaneously in parallel threads and redirecting their outputs, in some cases you can get empty or non-full output data (from redirected STDOUTs).

Platforms

All platforms (see quoted file in https://github.com/dotnet/runtime/issues/27128#issuecomment-583153102 and other platform-specific implementations).

Reproduction code

The error is intermittent, to reproduce it quickly, you need to run multiple instances of the reproduction program on a computer with multi-core processor (4+ threads recommended). Some "tests" would pass, but some of them would fail.

You can specify the path in process.StartInfo.FileName to any console program, that exits automatically. In the given example, I use the standard Windows console program ipconfig.exe.

using System;
using System.IO;
using System.Text;
using System.Threading;
using System.Diagnostics;

class EmptyStdOutRepro
{

    public static void Main()
    {
        new EmptyStdOutRepro().Run();
    }

    public void Run()
    {

        while (true)
            Exec();

    }

    private void Exec()
    {

        int lineCount = 0;
        string output = "";

        Process process = new Process();

        process.StartInfo.FileName = "ipconfig.exe";

        process.StartInfo.UseShellExecute = false;
        process.StartInfo.RedirectStandardOutput = true;
        process.StartInfo.CreateNoWindow = true;
        process.StartInfo.ErrorDialog = false;

        process.OutputDataReceived += new DataReceivedEventHandler((sender, e) =>
        {

            if (!String.IsNullOrEmpty(e.Data))
            {
                lineCount++;
                output += ("\n[" + lineCount + "]: " + e.Data);
            }

        });

        process.Start();

        process.BeginOutputReadLine();
        process.WaitForExit(1000); // Unpredictable behavior

        if (output != "")
            Console.WriteLine("PASSED!" + output.Length);
        else
            Console.WriteLine("FAILED!" + output.Length);

        process.WaitForExit();
        process.Close();

    }

}

Initial reproduction, needs minimization

https://github.com/thesirkadirov/simplepm-server/tree/master/SProgramRunner

karelz commented 6 years ago

Are you able to reproduce it on both Windows and Linux? Are you able to reproduce it on more than 1 machine? Is the large repro sensitive to HW configuration?

sirkadirov commented 6 years ago

Hi, @karelz! I just moved my project from .NET framework/Mono due to not implremented System.Diagnostics features in second. After that, I noticed this problem.

Yes, this error rep on both OSes, tested on VMs - reproduces...

Maybe this error is mine, so now I'm rewriting that module. If problem returns, I let you know.

No, not sensitive, this must work on all hardware.

danmoseley commented 6 years ago

I'll close this meantime, feel free to reopen if you determine it is .NET Core.

karelz commented 6 years ago

OK, let us know if you see it is a real problem in .NET Core.

sirkadirov commented 6 years ago

OK, I'll stay in touch.

sirkadirov commented 6 years ago

Rewrote the "testing subsystem" from scratch. Nothing helped.
Now testing it to get more information.
Also, I moved this "subsystem" from .NET Core to .NET Standard 2.0.3.
Repro coming soon, I hope to minimize it.

sirkadirov commented 6 years ago

@danmosemsft @karelz Checked many times, with different ways of stream reading - problem exists...

Base stream reader used

using System.IO;
using System.Threading;
using System.Threading.Tasks;

namespace SProgramRunner
{

    public class AdvancedStreamReader
    {

        private const int BufferSize = 4096;

        private Stream _source;
        private MemoryStream _outputMemoryStream;

        private CancellationTokenSource _cancellationToken;

        public AdvancedStreamReader(Stream source)
        {

            _source = source;
            _outputMemoryStream = new MemoryStream();

        }

        public void Run()
        {

            _cancellationToken = new CancellationTokenSource();

            Task.Run(async () =>
            {

                var buffer = new byte[BufferSize];

                while (true)
                {

                    _cancellationToken.Token.ThrowIfCancellationRequested();

                    var count = await _source.ReadAsync(buffer, 0, BufferSize, _cancellationToken.Token);

                    if (count <= 0)
                        break;

                    await _outputMemoryStream.WriteAsync(buffer, 0, count, _cancellationToken.Token);

                    await _outputMemoryStream.FlushAsync(_cancellationToken.Token);

                }

            }, _cancellationToken.Token);

        }

        public byte[] KillAndGet()
        {

            _cancellationToken.Cancel();

            return _outputMemoryStream.ToArray();

        }

    }

}

Exec this reader

process.Start();
_outputStreamReader = new AdvancedStreamReader(_process.StandardOutput.BaseStream);
_outputStreamReader.Run();

on process exited:

_programRunningResult.ProgramOutputData = _outputStreamReader.KillAndGet();

Error - empty output on small count of tests from 50-100 or more... And this is not my code.

Hope this can help to identify the problem. Now working on repro.

danmoseley commented 6 years ago

@stephentoub any thoughts come to mind?

stephentoub commented 6 years ago

any thoughts come to mind?

Not until we have a repro :)

That said, accessing StreamReader.BaseStream is a bit suspicious. The StreamReader could have buffered data from the stream, such that it's been consumed from the Stream and is sitting in the StreamReader's buffer. The shared code isn't doing any manipulation of the reader, so this may not be the issue, but that's why having a repro that actually shows the problem is important.

sirkadirov commented 6 years ago

@stephentoub I'll re-try to create a working repro.

duncanawoods commented 4 years ago

I suspect you might be seeing a known problem where there is a race condition between Process.Exited and Process.OutputDataReceived events.

See the remarks here:

https://docs.microsoft.com/en-us/dotnet/api/system.diagnostics.process.waitforexit?redirectedfrom=MSDN&view=netcore-3.1#System_Diagnostics_Process_WaitForExit_System_Int32

When standard output has been redirected to asynchronous event handlers, it is possible that output processing will not have completed when this method returns. To ensure that asynchronous event handling has been completed, call the WaitForExit() overload that takes no parameter after receiving a true from this overload.

If you are calling Process.WaitForExit(int timeout), you actually have to call it again without the timeout to ensure that stdout has been flushed e.g. change this:

return Process.WaitForExit(timeout);

to this:

bool result = Process.WaitForExit(timeout);
if (result)
{
    Process.WaitForExit();
}
return result;

This has to be a really common gotcha.

I've just found it in https://github.com/daveaglick/Buildalyzer/issues/125.

danmoseley commented 4 years ago

I'm not familiar with this problem but since the parameterless version is just calling the other one with an infinite timeout. I'm guessing which you choose doesn't matter ie the text could probably be better worded as

To ensure that asynchronous event handling has been completed, call WaitForExit() after receiving a true from this overload.

duncanawoods commented 4 years ago

I'm guessing which you choose doesn't matter

There is an explicit difference in the descriptions of the two overloads so I would be fairly certain that the underlying implementation in how they handle stdout does differ in key ways. It might be within the same low-level method ultimately by choosing a different branch based on whether there is a timeout.

WaitForExit()

This overload ensures that all processing has been completed, including the handling of asynchronous events for redirected standard output

WaitForExit(Int32)

When standard output has been redirected to asynchronous event handlers, it is possible that output processing will not have completed when this method returns. To ensure that asynchronous event handling has been completed, call the WaitForExit() overload that takes no parameter after receiving a true from this overload

sirkadirov commented 4 years ago

When standard output has been redirected to asynchronous event handlers, it is possible that output processing will not have completed when this method returns. To ensure that asynchronous event handling has been completed, call the WaitForExit() overload that takes no parameter after receiving a true from this overload

If such a problem exists, but is extremely rare, it may be manifested when my code is executed due to many concurrent Process.Start(...) usages.

@duncanawoods Thank you for your desire to help me with this problem. I will check your solution as quick as possible.

danmoseley commented 4 years ago

@duncanawoods appears this is what it refers to - it is a little counter intuitive that waiting for infinity and waiting for not-quite-infinity have potentially different behavior.

https://github.com/dotnet/runtime/blob/719d627a9e79ad9ed0255f1c6be1fc49cdffc88a/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Windows.cs#L174-L181

duncanawoods commented 4 years ago

@danmosemsft yep that's it. I looked at another FOSS project I use, https://github.com/SimonCropp/TextCopy, and it also fails to call WaitForExit twice after using a timeout. I fear this bug is lurking EVERYWHERE so well worth thinking about what can be done e.g. deprecating it, warnings etc. It's really almost perfectly evil: a bizarre requirement that will cause unpredictable failures only in high load scenarios. Gah!

it may be manifested when my code is executed due to many concurrent Process.Start(...) usages

@sirkadirov yep those were the conditions under which I saw it it: 10+ parallel invocations on a 12 core machine (Ryzen 3900X). I have used this library many times on quad-core machines and never saw this failure before.

danmoseley commented 4 years ago

It looks like we inherited this from long ago. If there was a non-breaking way to improve things, great but I'm not sure there is (?) The Process class is slippery.

sirkadirov commented 4 years ago

@duncanawoods My code with additional Process.WaitForExit() works well on a small amount of tests. Need more time to verify whether it isn't only a partial solution, though.

@danmosemsft It is not a popular problem, the solution of which will bring more trouble than help. In my opinion, the best idea is to add some notes to the project's documentation.

I don't think that .NET Core platform is designed for this case. Third-party library will be a better solution for working deeply with processes.

duncanawoods commented 4 years ago

It looks like we inherited this from long ago. If there was a non-breaking way to improve things, great but I'm not sure there is (?)

I would wager that if we searched GitHub for all uses of WaitForExit(int timeout) combined with reading standard-out, we would find this bug in > 50% of the cases. I know that nuget.client had to fix it some time ago. https://github.com/NuGet/Home/issues/1902.

I also suspect that the frequency of observed stability problems will increase as core-counts increase and a lot of users won't track it down to this. I don't know what measures you use to decide on changes but to me this API is a guaranteed bug in user's code and must be fixed.

I appreciate the difficulty changing something so fundamental so I would suggest:

Mark WaitForExit(int timeout) as deprecated with a link to explanation and resolution in the warning. No need to delete it.
Create a new overload WaitForExit(int timeout, ProcessStreamControl processStreamControl) where the second argument is an enum to choose either the current behaviour or the expected behaviour e.g.

public enum ProcessStreamControl
{
    WaitForStreamsToFlush,
    ExitImmediately
}

Those names suck and yes it's ugly but it's clear and won't break anything.

sirkadirov commented 4 years ago

@duncanawoods I support your opinion. Had similar idea, but I don't know whether it will be accepted due to compatibility limitations.

Even if a decision is made to change the API, this little one will take too much time, also, there may be problems when porting old projects from .NET framework to .NET Core 3.x & .NET 5 (?). On the other side, we should quickly change some problematic APIs before .NET 5 release.

@danmosemsft Do you think we should create an issue in dotnet/standard regarding this problem?

SimonCropp commented 4 years ago

@duncanawoods i pushed a new version of textcopy with the double wait hack. let me know id you still experience the problem

SimonCropp commented 4 years ago

would having a new async process api fix this problem??

duncanawoods commented 4 years ago

Hi @SimonCropp Fantastic, thanks for making the fix.

The good news is that I have already tried to replicate the problem with the unfixed TextCopy with massively parallel copy-pasting of huge amounts of text but couldn't. I suspect that there are conditions on the called executable to reproduce the problem e.g. it does some work or yields before writing to stdout. I imagine that xclip is just a single system call so it's very unlikely for us to see the race condition. Better safe than sorry, so again, thanks for making the change!

Great idea for an async API to replace the blocking wait.

SimonCropp commented 4 years ago

@adamsitnik @danmosemsft given the extra info. should this be moved from "question" to "bug" label, and "needs more info" label be removed?

SimonCropp commented 4 years ago

as an aside. does this mean this bug exists in powershell? https://github.com/PowerShell/PowerShell/blob/master/src/Microsoft.PowerShell.Commands.Management/commands/management/Clipboard.cs#L53 cc @adityapatwardhan

adityapatwardhan commented 4 years ago

Looks like it. I will have a detailed look tomorrow. Thanks for reporting.

SimonCropp commented 4 years ago

again @adamsitnik @danmosemsft given the extra info. should this be moved from "question" to "bug" label, and "needs more info" label be removed?

danmoseley commented 4 years ago

Sure, removed them. We don't use those labels consistently anyway.

duncanawoods commented 4 years ago

Can it be treated as a bug please? It is causing bugs in Microsoft's own tools.

danmoseley commented 4 years ago

cc @adamsitnik @eiriktsarpalis (per https://github.com/dotnet/runtime/blob/master/docs/area-owners.md) to respond to that

rrelyea commented 4 years ago

NuGet is being asked to do this double wait trick. If this is a problem, would prefer to have runtime fix, rather than put workarounds everywhere. Guidance? @SimonCropp opened: https://github.com/NuGet/Home/issues/10189

sirkadirov commented 4 years ago

NuGet is being asked to do this double wait trick. If this is a problem, would prefer to have runtime fix, rather than put workarounds everywhere. Guidance? @SimonCropp opened: NuGet/Home#10189

Good idea, but as I see, a fix can potentially damage legacy/ported software projects. Maybe, new API proposal could help?

SimonCropp commented 3 years ago

does WaitForExitAsync in net 5 still have the same problem?

dotnet / runtime