dotnet / runtime


Data corruption when reading HttpClient content stream #110002

Open mhenry07 opened 1 hour ago

mhenry07 commented 1 hour ago

Description

I've observed data corruption when using a certain pattern to read and process the response content stream from HttpClient (HttpResponseMessage.Content).

The pattern uses a buffer to read a large response stream while it's still downloading: in a loop, the buffer is filled from the stream, each line is processed, and occasional I/O is performed after a "batch" of lines has been received. In certain cases, I regularly see intermittent lines being corrupted after some megabytes have been processed.

There is a chance the issue is on the ASP.NET side, but since the behavior is so dependent on the read pattern and also occurs against fake-gcs-server, I suspect it's on the client side, and in particular I suspect it's related to the response content stream.

Example log message when data corruption occurs in the reproduction (note how the actual text differs from the expected text):

Reader: Line was corrupted at row 73,036, ~6,416,181 bytes:
Actual:    'abc,73036,def,CCC,ghi,01/01/0001 00:01:13abc,73032,def,YYYYYYYYYYYY,ghi,01/01/0001 00:01:13 +00:00,jkl,01/01/0001 20:17:12 +00:00,mno'
Expected:  'abc,73036,def,CCC,ghi,01/01/0001 00:01:13 +00:00,jkl,01/01/0001 20:17:16 +00:00,mno'

Reproduction Steps

I created a solution that reproduces the issue. It's on GitHub at mhenry07/http-content-stream-repro, and I've attached a ZIP file: HttpContentStreamRepro-v1.zip.

The reproduction has a few options for comparing how different values and variations affect the behavior; they can be set in HttpContentStreamRepro.Console/Program.cs (see the README.md).

Rough pseudocode to illustrate the idea:

var buffer = new byte[ChunkSize];
var offset = 0; // bytes carried over from the previous iteration (incomplete trailing line)
var row = 0L;
var response = await httpClient.GetAsync("/values.csv", HttpCompletionOption.ResponseHeadersRead);
var stream = await response.Content.ReadAsStreamAsync();
while (true)
{
    // fill the remainder of the buffer from the response stream
    var length = await ReadUntilFullAsync(stream, buffer.AsMemory(offset));
    if (offset + length == 0) break;

    // process each complete line, simulating occasional I/O every 100 rows
    var memory = buffer.AsMemory(0, offset + length);
    while (TryReadLine(ref memory, out var line))
        if (row++ % 100 == 0)
            await Task.Delay(15);

    // copy any incomplete trailing line to the start of the buffer for the next read
    offset = memory.Length;
    if (memory.Length > 0) memory.CopyTo(buffer);
}

// reads from the stream until the buffer is full or the stream ends
static async ValueTask<int> ReadUntilFullAsync(Stream stream, Memory<byte> buffer)
{
    var count = 0;
    while (count < buffer.Length)
    {
        var read = await stream.ReadAsync(buffer[count..]);
        if (read == 0) break;
        count += read;
    }
    return count;
}
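
TryReadLine isn't shown above; roughly, it slices one newline-terminated line off the front of the remaining memory and leaves the rest for the next iteration. A minimal sketch of the Memory<byte> variant, assuming '\n'-delimited rows (an approximation, not the exact code from the repro):

static bool TryReadLine(ref Memory<byte> memory, out ReadOnlyMemory<byte> line)
{
    // find the next newline; if there isn't one, the caller keeps the leftover bytes
    var index = memory.Span.IndexOf((byte)'\n');
    if (index < 0)
    {
        line = default;
        return false;
    }

    // return the line and advance past the delimiter
    line = memory[..index];
    memory = memory[(index + 1)..];
    return true;
}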

I've observed the same behavior when using System.IO.Pipelines to consume the stream; that variation and several more are implemented in another GitHub repository: mhenry07/aspire-repro.

Pseudocode for alternate System.IO.Pipelines reproduction (see PipeBuffer.cs in my alternate repo):

async Task FillPipeAsync(PipeWriter writer)
{
    var buffer = new byte[ChunkSize];
    using var writerStream = writer.AsStream();
    var response = await httpClient.GetAsync("/values.csv", HttpCompletionOption.ResponseHeadersRead);
    using var responseStream = await response.Content.ReadAsStreamAsync();
    while (true)
    {
        // fill the chunk buffer from the response stream, then copy it into the pipe
        var length = await ReadUntilFullAsync(responseStream, buffer);
        if (length == 0) break;
        await writerStream.WriteAsync(buffer.AsMemory(0, length));
    }
    await writer.CompleteAsync();
}

async Task ReadPipeAsync(PipeReader reader)
{
    var row = 0L;
    while (true)
    {
        var result = await reader.ReadAsync(default);
        var buffer = result.Buffer;
        // process complete lines, simulating occasional I/O every 100 rows
        while (TryReadLine(ref buffer, out var line))
            if (row++ % 100 == 0)
                await Task.Delay(15);

        // mark everything up to buffer.Start as consumed and the rest as examined,
        // keeping any incomplete trailing line buffered for the next read
        reader.AdvanceTo(buffer.Start, buffer.End);
        if (result.IsCompleted) break;
    }

    await reader.CompleteAsync();
}
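
For completeness, a typical way to wire the two halves together is a plain Pipe, along these lines (a sketch; the repro's actual setup is in PipeBuffer.cs):

// requires System.IO.Pipelines
var pipe = new Pipe();
await Task.WhenAll(
    FillPipeAsync(pipe.Writer),
    ReadPipeAsync(pipe.Reader));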

Expected behavior

Reading the stream should not result in data corruption, and it should not be necessary to load the full contents of a large response into memory before processing it.
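
For contrast, fully buffering the response before processing it (the approach that should not be necessary) would look roughly like this:

// for contrast only: the entire body is held in memory before any processing,
// which is what the streaming pattern above is meant to avoid for large responses
var text = await httpClient.GetStringAsync("/values.csv");
foreach (var line in text.Split('\n'))
{
    // process line...
}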

Actual behavior

After reading multiple megabytes, some lines get corrupted.

Example error logs showing actual versus expected lines when the issue occurs:

Reader: Line was corrupted at row 73,036, ~6,416,181 bytes:
Actual:    'abc,73036,def,CCC,ghi,01/01/0001 00:01:13abc,73032,def,YYYYYYYYYYYY,ghi,01/01/0001 00:01:13 +00:00,jkl,01/01/0001 20:17:12 +00:00,mno'
Expected:  'abc,73036,def,CCC,ghi,01/01/0001 00:01:13 +00:00,jkl,01/01/0001 20:17:16 +00:00,mno'
Reader: Line was corrupted at row 72,678, ~6,384,622 bytes:
Actual:    'abc,72663,def,TTTTTTT,ghi,01/01/0001 00:01:12 +00:00,jkl,01/01/0001 20:11:03 +00:00,mno'
Expected:  'abc,72678,def,IIIIIIIII,ghi,01/01/0001 00:01:12 +00:00,jkl,01/01/0001 20:11:18 +00:00,mno'

Regression?

I don't think so. I first observed this just after updating Visual Studio to 17.12, with the debugger attached, and originally filed it as Visual Studio feedback. However, I have since been able to reproduce the issue on VS 17.11, with and without the debugger attached, and in both debug and release builds. I now suspect it's a .NET issue rather than a Visual Studio issue.

Known Workarounds

Configuration

Also observed with the Visual Studio debugger both attached and detached, on Visual Studio 17.11 and 17.12.

Other information

I also have another GitHub repository with more example implementations, both ones that reproduce the issue and ones that don't: mhenry07/aspire-repro.

dotnet-policy-service[bot] commented 1 hour ago

Tagging subscribers to this area: @dotnet/ncl. See info in area-owners.md if you want to be subscribed.