Open sirkadirov opened 6 years ago
Are you able to reproduce it on both Windows and Linux? Are you able to reproduce it on more than 1 machine? Is the large repro sensitive to HW configuration?
Hi, @karelz! I just moved my project from .NET framework/Mono due to not implremented System.Diagnostics features in second. After that, I noticed this problem.
Yes, this error rep on both OSes, tested on VMs - reproduces...
Maybe this error is mine, so now I'm rewriting that module. If problem returns, I let you know.
No, not sensitive, this must work on all hardware.
I'll close this meantime, feel free to reopen if you determine it is .NET Core.
OK, let us know if you see it is a real problem in .NET Core.
OK, I'll stay in touch.
@danmosemsft @karelz Checked many times, with different ways of stream reading - problem exists...
using System.IO;
using System.Threading;
using System.Threading.Tasks;
namespace SProgramRunner
{
public class AdvancedStreamReader
{
private const int BufferSize = 4096;
private Stream _source;
private MemoryStream _outputMemoryStream;
private CancellationTokenSource _cancellationToken;
public AdvancedStreamReader(Stream source)
{
_source = source;
_outputMemoryStream = new MemoryStream();
}
public void Run()
{
_cancellationToken = new CancellationTokenSource();
Task.Run(async () =>
{
var buffer = new byte[BufferSize];
while (true)
{
_cancellationToken.Token.ThrowIfCancellationRequested();
var count = await _source.ReadAsync(buffer, 0, BufferSize, _cancellationToken.Token);
if (count <= 0)
break;
await _outputMemoryStream.WriteAsync(buffer, 0, count, _cancellationToken.Token);
await _outputMemoryStream.FlushAsync(_cancellationToken.Token);
}
}, _cancellationToken.Token);
}
public byte[] KillAndGet()
{
_cancellationToken.Cancel();
return _outputMemoryStream.ToArray();
}
}
}
process.Start();
_outputStreamReader = new AdvancedStreamReader(_process.StandardOutput.BaseStream);
_outputStreamReader.Run();
on process exited:
_programRunningResult.ProgramOutputData = _outputStreamReader.KillAndGet();
Error - empty output on small count of tests from 50-100 or more... And this is not my code.
Hope this can help to identify the problem. Now working on repro.
@stephentoub any thoughts come to mind?
any thoughts come to mind?
Not until we have a repro :)
That said, accessing StreamReader.BaseStream is a bit suspicious. The StreamReader could have buffered data from the stream, such that it's been consumed from the Stream and is sitting in the StreamReader's buffer. The shared code isn't doing any manipulation of the reader, so this may not be the issue, but that's why having a repro that actually shows the problem is important.
@stephentoub I'll re-try to create a working repro.
I suspect you might be seeing a known problem where there is a race condition between Process.Exited
and Process.OutputDataReceived
events.
See the remarks here:
When standard output has been redirected to asynchronous event handlers, it is possible that output processing will not have completed when this method returns. To ensure that asynchronous event handling has been completed, call the WaitForExit() overload that takes no parameter after receiving a true from this overload.
If you are calling Process.WaitForExit(int timeout)
, you actually have to call it again without the timeout to ensure that stdout has been flushed e.g. change this:
return Process.WaitForExit(timeout);
to this:
bool result = Process.WaitForExit(timeout);
if (result)
{
Process.WaitForExit();
}
return result;
This has to be a really common gotcha.
I've just found it in https://github.com/daveaglick/Buildalyzer/issues/125.
I'm not familiar with this problem but since the parameterless version is just calling the other one with an infinite timeout. I'm guessing which you choose doesn't matter ie the text could probably be better worded as
To ensure that asynchronous event handling has been completed, call WaitForExit() after receiving a true from this overload.
I'm guessing which you choose doesn't matter
There is an explicit difference in the descriptions of the two overloads so I would be fairly certain that the underlying implementation in how they handle stdout does differ in key ways. It might be within the same low-level method ultimately by choosing a different branch based on whether there is a timeout.
WaitForExit()
This overload ensures that all processing has been completed, including the handling of asynchronous events for redirected standard output
WaitForExit(Int32)
When standard output has been redirected to asynchronous event handlers, it is possible that output processing will not have completed when this method returns. To ensure that asynchronous event handling has been completed, call the WaitForExit() overload that takes no parameter after receiving a true from this overload
When standard output has been redirected to asynchronous event handlers, it is possible that output processing will not have completed when this method returns. To ensure that asynchronous event handling has been completed, call the WaitForExit() overload that takes no parameter after receiving a true from this overload
If such a problem exists, but is extremely rare, it may be manifested when my code is executed due to many concurrent Process.Start(...)
usages.
@duncanawoods Thank you for your desire to help me with this problem. I will check your solution as quick as possible.
@duncanawoods appears this is what it refers to - it is a little counter intuitive that waiting for infinity and waiting for not-quite-infinity have potentially different behavior.
@danmosemsft yep that's it. I looked at another FOSS project I use, https://github.com/SimonCropp/TextCopy, and it also fails to call WaitForExit twice after using a timeout. I fear this bug is lurking EVERYWHERE so well worth thinking about what can be done e.g. deprecating it, warnings etc. It's really almost perfectly evil: a bizarre requirement that will cause unpredictable failures only in high load scenarios. Gah!
it may be manifested when my code is executed due to many concurrent Process.Start(...) usages
@sirkadirov yep those were the conditions under which I saw it it: 10+ parallel invocations on a 12 core machine (Ryzen 3900X). I have used this library many times on quad-core machines and never saw this failure before.
It looks like we inherited this from long ago. If there was a non-breaking way to improve things, great but I'm not sure there is (?) The Process class is slippery.
@duncanawoods My code with additional Process.WaitForExit()
works well on a small amount of tests. Need more time to verify whether it isn't only a partial solution, though.
@danmosemsft It is not a popular problem, the solution of which will bring more trouble than help. In my opinion, the best idea is to add some notes to the project's documentation.
I don't think that .NET Core platform is designed for this case. Third-party library will be a better solution for working deeply with processes.
It looks like we inherited this from long ago. If there was a non-breaking way to improve things, great but I'm not sure there is (?)
I would wager that if we searched GitHub for all uses of WaitForExit(int timeout)
combined with reading standard-out, we would find this bug in > 50% of the cases. I know that nuget.client had to fix it some time ago. https://github.com/NuGet/Home/issues/1902.
I also suspect that the frequency of observed stability problems will increase as core-counts increase and a lot of users won't track it down to this. I don't know what measures you use to decide on changes but to me this API is a guaranteed bug in user's code and must be fixed.
I appreciate the difficulty changing something so fundamental so I would suggest:
Mark WaitForExit(int timeout)
as deprecated with a link to explanation and resolution in the warning. No need to delete it.
Create a new overload WaitForExit(int timeout, ProcessStreamControl processStreamControl)
where the second argument is an enum to choose either the current behaviour or the expected behaviour e.g.
public enum ProcessStreamControl
{
WaitForStreamsToFlush,
ExitImmediately
}
Those names suck and yes it's ugly but it's clear and won't break anything.
@duncanawoods I support your opinion. Had similar idea, but I don't know whether it will be accepted due to compatibility limitations.
Even if a decision is made to change the API, this little one will take too much time, also, there may be problems when porting old projects from .NET framework to .NET Core 3.x & .NET 5 (?). On the other side, we should quickly change some problematic APIs before .NET 5 release.
@danmosemsft Do you think we should create an issue in dotnet/standard
regarding this problem?
@duncanawoods i pushed a new version of textcopy with the double wait hack. let me know id you still experience the problem
would having a new async process api fix this problem??
Hi @SimonCropp Fantastic, thanks for making the fix.
The good news is that I have already tried to replicate the problem with the unfixed TextCopy
with massively parallel copy-pasting of huge amounts of text but couldn't. I suspect that there are conditions on the called executable to reproduce the problem e.g. it does some work or yields before writing to stdout. I imagine that xclip
is just a single system call so it's very unlikely for us to see the race condition. Better safe than sorry, so again, thanks for making the change!
Great idea for an async API to replace the blocking wait.
@adamsitnik @danmosemsft given the extra info. should this be moved from "question" to "bug" label, and "needs more info" label be removed?
as an aside. does this mean this bug exists in powershell? https://github.com/PowerShell/PowerShell/blob/master/src/Microsoft.PowerShell.Commands.Management/commands/management/Clipboard.cs#L53 cc @adityapatwardhan
Looks like it. I will have a detailed look tomorrow. Thanks for reporting.
again @adamsitnik @danmosemsft given the extra info. should this be moved from "question" to "bug" label, and "needs more info" label be removed?
Sure, removed them. We don't use those labels consistently anyway.
Can it be treated as a bug please? It is causing bugs in Microsoft's own tools.
cc @adamsitnik @eiriktsarpalis (per https://github.com/dotnet/runtime/blob/master/docs/area-owners.md) to respond to that
NuGet is being asked to do this double wait trick. If this is a problem, would prefer to have runtime fix, rather than put workarounds everywhere. Guidance? @SimonCropp opened: https://github.com/NuGet/Home/issues/10189
NuGet is being asked to do this double wait trick. If this is a problem, would prefer to have runtime fix, rather than put workarounds everywhere. Guidance? @SimonCropp opened: NuGet/Home#10189
Good idea, but as I see, a fix can potentially damage legacy/ported software projects. Maybe, new API proposal could help?
does WaitForExitAsync in net 5 still have the same problem?
Story
When running user-defined console-only processes using
System.Diagnostics.Process
simultaneously in parallel threads and redirecting their outputs, in some cases you can get empty or non-full output data (from redirected STDOUTs).Platforms
All platforms (see quoted file in https://github.com/dotnet/runtime/issues/27128#issuecomment-583153102 and other platform-specific implementations).
Reproduction code
The error is intermittent, to reproduce it quickly, you need to run multiple instances of the reproduction program on a computer with multi-core processor (4+ threads recommended). Some "tests" would pass, but some of them would fail.
You can specify the path in
process.StartInfo.FileName
to any console program, that exits automatically. In the given example, I use the standard Windows console programipconfig.exe
.Initial reproduction, needs minimization
https://github.com/thesirkadirov/simplepm-server/tree/master/SProgramRunner