dotnet / runtime


Data corruption when reading HttpClient content stream #110002

Open mhenry07 opened 1 hour ago

mhenry07 commented 1 hour ago

Description

I've observed data corruption when using a certain pattern to read and process the response content stream from HttpClient (HttpResponseMessage.Content).

The pattern uses a buffer to read a large response stream while it's still downloading: in a loop, the buffer is filled from the stream, each line is processed, and occasional I/O is performed after a "batch" of lines has been received. In certain cases, I regularly see intermittent lines being corrupted after some megabytes have been processed.

There is a chance the issue is on the ASP.NET side, but since the behavior is so dependent on the read pattern and also occurs against fake-gcs-server, I suspect it's on the client side, and in particular I suspect it's related to the response content stream.

Example log message when data corruption occurs in the reproduction (note how the actual text differs from the expected text):

Reader: Line was corrupted at row 73,036, ~6,416,181 bytes:
Actual:    'abc,73036,def,CCC,ghi,01/01/0001 00:01:13abc,73032,def,YYYYYYYYYYYY,ghi,01/01/0001 00:01:13 +00:00,jkl,01/01/0001 20:17:12 +00:00,mno'
Expected:  'abc,73036,def,CCC,ghi,01/01/0001 00:01:13 +00:00,jkl,01/01/0001 20:17:16 +00:00,mno'

Reproduction Steps

I created a solution that reproduces the issue. It's on GitHub at mhenry07/http-content-stream-repro, and I've attached a ZIP file: HttpContentStreamRepro-v1.zip.

The reproduction has a few options for comparing how different values and variations affect the behavior; they can be set in HttpContentStreamRepro.Console/Program.cs (see the README.md).

Rough pseudocode to illustrate the idea:

var buffer = new byte[ChunkSize];
var offset = 0; // bytes carried over from the previous iteration (incomplete trailing line)
var row = 0L;
var response = await httpClient.GetAsync("/values.csv", HttpCompletionOption.ResponseHeadersRead);
var stream = await response.Content.ReadAsStreamAsync();
while (true)
{
    // fill the remainder of the buffer from the response stream
    var length = await ReadUntilFullAsync(stream, buffer.AsMemory(offset));
    if (offset + length == 0) break;

    // process each complete line, simulating occasional I/O every 100 rows
    var memory = buffer.AsMemory(0, offset + length);
    while (TryReadLine(ref memory, out var line))
        if (row++ % 100 == 0)
            await Task.Delay(15);

    // copy any incomplete trailing line to the start of the buffer for the next read
    offset = memory.Length;
    if (memory.Length > 0) memory.CopyTo(buffer);
}

// reads from the stream until the buffer is full or the stream ends
static async ValueTask<int> ReadUntilFullAsync(Stream stream, Memory<byte> buffer)
{
    var count = 0;
    while (count < buffer.Length)
    {
        var read = await stream.ReadAsync(buffer[count..]);
        if (read == 0) break;
        count += read;
    }
    return count;
}
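
TryReadLine isn't shown above; roughly, it slices one newline-terminated line off the front of the remaining memory and leaves the rest for the next iteration. A minimal sketch of the Memory<byte> variant, assuming '\n'-delimited rows (an approximation, not the exact code from the repro):

static bool TryReadLine(ref Memory<byte> memory, out ReadOnlyMemory<byte> line)
{
    // find the next newline; if there isn't one, the caller keeps the leftover bytes
    var index = memory.Span.IndexOf((byte)'\n');
    if (index < 0)
    {
        line = default;
        return false;
    }

    // return the line and advance past the delimiter
    line = memory[..index];
    memory = memory[(index + 1)..];
    return true;
}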

I've observed the same behavior when using System.IO.Pipelines to consume the stream; that variation and several more are implemented in another GitHub repository: mhenry07/aspire-repro.

Pseudocode for alternate System.IO.Pipelines reproduction (see PipeBuffer.cs in my alternate repo):

async Task FillPipeAsync(PipeWriter writer)
{
    var buffer = new byte[ChunkSize];
    using var writerStream = writer.AsStream();
    var response = await httpClient.GetAsync("/values.csv", HttpCompletionOption.ResponseHeadersRead);
    using var responseStream = await response.Content.ReadAsStreamAsync();
    while (true)
    {
        // fill the chunk buffer from the response stream, then copy it into the pipe
        var length = await ReadUntilFullAsync(responseStream, buffer);
        if (length == 0) break;
        await writerStream.WriteAsync(buffer.AsMemory(0, length));
    }
    await writer.CompleteAsync();
}

async Task ReadPipeAsync(PipeReader reader)
{
    var row = 0L;
    while (true)
    {
        var result = await reader.ReadAsync(default);
        var buffer = result.Buffer;
        // process complete lines, simulating occasional I/O every 100 rows
        while (TryReadLine(ref buffer, out var line))
            if (row++ % 100 == 0)
                await Task.Delay(15);

        // mark everything up to buffer.Start as consumed and the rest as examined,
        // keeping any incomplete trailing line buffered for the next read
        reader.AdvanceTo(buffer.Start, buffer.End);
        if (result.IsCompleted) break;
    }

    await reader.CompleteAsync();
}
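
For completeness, a typical way to wire the two halves together is a plain Pipe, along these lines (a sketch; the repro's actual setup is in PipeBuffer.cs):

// requires System.IO.Pipelines
var pipe = new Pipe();
await Task.WhenAll(
    FillPipeAsync(pipe.Writer),
    ReadPipeAsync(pipe.Reader));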

Expected behavior

Reading the stream should not result in data corruption, and it should not be necessary to load the full contents of a large response into memory before processing it.
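
For contrast, fully buffering the response before processing it (the approach that should not be necessary) would look roughly like this:

// for contrast only: the entire body is held in memory before any processing,
// which is what the streaming pattern above is meant to avoid for large responses
var text = await httpClient.GetStringAsync("/values.csv");
foreach (var line in text.Split('\n'))
{
    // process line...
}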

Actual behavior

After reading multiple megabytes, some lines get corrupted.

Example error logs showing actual versus expected lines when the issue occurs:

Reader: Line was corrupted at row 73,036, ~6,416,181 bytes:
Actual:    'abc,73036,def,CCC,ghi,01/01/0001 00:01:13abc,73032,def,YYYYYYYYYYYY,ghi,01/01/0001 00:01:13 +00:00,jkl,01/01/0001 20:17:12 +00:00,mno'
Expected:  'abc,73036,def,CCC,ghi,01/01/0001 00:01:13 +00:00,jkl,01/01/0001 20:17:16 +00:00,mno'
Reader: Line was corrupted at row 72,678, ~6,384,622 bytes:
Actual:    'abc,72663,def,TTTTTTT,ghi,01/01/0001 00:01:12 +00:00,jkl,01/01/0001 20:11:03 +00:00,mno'
Expected:  'abc,72678,def,IIIIIIIII,ghi,01/01/0001 00:01:12 +00:00,jkl,01/01/0001 20:11:18 +00:00,mno'

Regression?

I don't think so. I first observed this just after updating Visual Studio to 17.12, with the debugger attached, and originally filed it as Visual Studio feedback. However, I have since been able to reproduce the issue on VS 17.11, with and without the debugger attached, and in both debug and release builds. I now suspect it's a .NET issue rather than a Visual Studio issue.

Known Workarounds

Configuration

Also observed with the Visual Studio debugger both attached and detached, on Visual Studio 17.11 and 17.12.

Other information

I also have another GitHub repository with more example implementations, both ones that reproduce the issue and ones that don't: mhenry07/aspire-repro.

dotnet-policy-service[bot] commented 1 hour ago

Tagging subscribers to this area: @dotnet/ncl. See info in area-owners.md if you want to be subscribed.