dotnet/runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

[API Proposal]: Add a TeeReader to allow doing multiple computations from the same stream without buffering to disk or re-reading stream #72798

Open zlepper opened 2 years ago

zlepper commented 2 years ago

Background and motivation

A common use case I encounter when dealing with uploaded files is forwarding them to something like Azure Blob Storage while also hashing them for checksumming. Another related use case is hashing a file with multiple hash algorithms without loading it from disk more than once.

Right now, I have two options, as far as I know:

  1. Enable request buffering and have AspNetCore write the whole uploaded file to disk before invoking my controller method. Then I can access the uploaded file multiple times, forward it to Blob storage, and hash it, either sequentially or in parallel.
    • Pros:
      • Very easy to use
      • Minimal risk of shooting yourself in the foot
    • Cons:
      • Requires waiting for the entire file to arrive on the server before the controller is invoked, which can slow things dramatically
      • Only works for AspNetCore; a Worker application (e.g. for batch processing) would still be stuck
      • Requires the server to have disk access. While some disk access is generally available, many slow/concurrent uploads of large files could quickly eat all the server's disk space.
  2. Accept the actual stream, and upload it to storage. Then, once it has been uploaded to storage, "download" the file again to do the hashing.
    • Pros
      • Avoids touching the disk, unlike option 1; thus, it doesn't matter how long the entire upload takes
      • Still little risk of shooting yourself in the foot
      • Could work outside of the AspNetCore context.
    • Cons
      • The place the stream was sent to needs to be able to support downloading it again
      • This forces all file processing to be sequential, slowing the wall-clock response time.
How we do multiple file hashing right now:

```csharp
using var md5 = MD5.Create();
using var sha1 = SHA1.Create();
var hashes = new List<HashAlgorithm> { md5, sha1 };
foreach (var h in hashes)
    h.Initialize();

var buffer = new byte[32768]; // Instantiate buffer to copy bytes.
await using (var file = new FileStream(filename, FileMode.Open))
{
    int bytesRead;
    // file.Read() might return fewer bytes than requested, even though we have not reached the end of the stream.
    while ((bytesRead = await file.ReadAsync(buffer, cancellationToken)) > 0)
    {
        foreach (var hasher in hashes)
            hasher.TransformBlock(buffer, 0, bytesRead, null, 0);
    }
}

// Finalize the calculated hashes.
foreach (var hasher in hashes)
    hasher.TransformFinalBlock(Array.Empty<byte>(), 0, 0);

// Return the calculated hash values as upper-case hexadecimal strings.
return (Convert.ToHexString(sha1.Hash!), Convert.ToHexString(md5.Hash!));
```

API Proposal

Adapted from Go's TeeReader (https://pkg.go.dev/io#TeeReader), plus `BinaryReader` from C#:

```csharp
namespace System.IO;

public class TeeReader : Stream
{
    public TeeReader(Stream readStream, Stream writeStream);
    public TeeReader(Stream readStream, Stream writeStream, bool leaveOpen);
    // ... All read/copy methods implemented here, but without signature changes ...
}
```

I'm not entirely sure about the `leaveOpen` flag, but I thought it might help with consistency with signatures like the constructors of `BinaryReader`.

In theory, this could easily be expanded to also support writing to multiple streams at the same time, such as the `EchoStream` from https://www.codeproject.com/Articles/3922/EchoStream-An-Echo-Tee-Stream-for-NET. However, I feel that would require more guarantees from the supplied streams than just `readStream.CanRead == true && writeStream.CanWrite == true`.
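
For illustration, a minimal sketch of the semantics I have in mind, modeled on Go's TeeReader: every read from the source is mirrored to the write stream before being returned to the caller. The class name and member set below are assumptions for the sketch, not a committed design:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: mirrors every byte read from readStream into writeStream.
public sealed class TeeReaderSketch : Stream
{
    private readonly Stream _readStream;
    private readonly Stream _writeStream;

    public TeeReaderSketch(Stream readStream, Stream writeStream)
    {
        _readStream = readStream;
        _writeStream = writeStream;
    }

    public override bool CanRead => true;
    public override bool CanWrite => false;
    public override bool CanSeek => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }

    public override async ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
    {
        int bytesRead = await _readStream.ReadAsync(buffer, cancellationToken);
        if (bytesRead > 0)
        {
            // Mirror exactly the bytes handed to the caller.
            await _writeStream.WriteAsync(buffer.Slice(0, bytesRead), cancellationToken);
        }
        return bytesRead;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int bytesRead = _readStream.Read(buffer, offset, count);
        if (bytesRead > 0)
            _writeStream.Write(buffer, offset, bytesRead);
        return bytesRead;
    }

    public override void Flush() => _writeStream.Flush();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```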

API Usage

```csharp
await using var sourceStream = File.OpenRead("Somefile.txt"); // Or possibly the AspNetCore Request.Body stream
await using var uploadStream = File.OpenWrite("uploaded-file.txt");

await using var teeReader = new TeeReader(sourceStream, uploadStream);
using var sha256 = SHA256.Create();
var sha256Hash = await sha256.ComputeHashAsync(teeReader);
```
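
The multi-algorithm hashing case from the motivation could then be handled in a single pass by pointing the tee at a write-mode `CryptoStream` over `Stream.Null` (a sketch, assuming the proposed `TeeReader` type exists):

```csharp
using var md5 = MD5.Create();
using var sha1 = SHA1.Create();

// A write-mode CryptoStream feeds every written byte into the MD5 transform.
await using var md5Sink = new CryptoStream(Stream.Null, md5, CryptoStreamMode.Write);

await using var sourceStream = File.OpenRead("Somefile.txt");
await using var tee = new TeeReader(sourceStream, md5Sink);

// One pass over the file: the SHA1 computation drives the reads,
// and every byte it consumes is mirrored into the MD5 sink.
var sha1Hash = await sha1.ComputeHashAsync(tee);
await md5Sink.FlushFinalBlockAsync();
var md5Hash = md5.Hash!;
```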

Another example, if you need to provide a readable stream to something and therefore need to "flip" the direction, uses a `Pipe`:

```csharp
// Create a pipe to "flip" the stream direction
var uploadPipe = new Pipe();
await using var uploadReadStream = uploadPipe.Reader.AsStream();
await using var uploadWriteStream = uploadPipe.Writer.AsStream();

// Start the upload reading from the flipped stream. The task has to be stored instead of awaited here,
// since awaiting it before feeding the pipe would deadlock.
var uploadTask = _storageWrapper.Upload(uploadReadStream, cancellationToken);

await using var hashStream = new TeeReader(Request.Body, uploadWriteStream);
using var sha256 = SHA256.Create();
var sha256Hash = await sha256.ComputeHashAsync(hashStream, cancellationToken);

// Complete the write side so the upload observes end-of-stream.
await uploadPipe.Writer.CompleteAsync();

// Finally await the upload task so we can be sure it also finished successfully.
var storageETag = await uploadTask;
```

Alternative Designs

Microknights has a project called SplitStream, which achieves somewhat the same effect: https://github.com/microknights/SplitStream

```csharp
using (var inputSplitStream = new ReadableSplitStream(inputSourceStream))

using (var inputFileStream = inputSplitStream.GetForwardReadOnlyStream())
using (var outputFileStream = File.OpenWrite("MyFileOnAnyFilestore.bin"))

using (var inputSha1Stream = inputSplitStream.GetForwardReadOnlyStream())
using (var outputSha1Stream = SHA1.Create())
{
    inputSplitStream.StartReadAhead();

    Parallel.Invoke(
        () => {
            var bytes = outputSha1Stream.ComputeHash(inputSha1Stream);
            var checksumSha1 = string.Join("", bytes.Select(x => x.ToString("x2")));
        },
        () => {
            inputFileStream.CopyTo(outputFileStream);
        }
    );
}
```

However, this requires juggling parallel tasks rather than just a couple of streams. The benefit of this design is that it yields multiple usable read streams, rather than requiring a write stream to be passed in. "Flipping" a stream around is possible with `System.IO.Pipelines`, even though that is somewhat painful (or at least it was the last time I did it), but that problem is outside the scope of this API suggestion.

Risks

No risk for existing projects, as this is an entirely new class.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.

davidfowl commented 2 years ago

I like this idea. ASP.NET actually has a bunch of streams implemented like this (logging, the file-buffering read stream you mentioned). The scenario you mentioned, where the request body is uploaded to blob storage and hashed with two algorithms, is tricky because the TeeReader takes one read stream and one write stream. I'm envisioning what that code would look like with this change:

```csharp
app.MapPost("/blob", async (Stream body) =>
{
    Stream blobStream = GetTheBlobStream();
    var hashStream = new DoubleHashStream();
    Stream teeStream = new TeeStream(body, hashStream);
    await teeStream.CopyToAsync(blobStream);
    hashStream.Complete();

    var sha1 = hashStream.Sha1Hash;
    var md5 = hashStream.MD5Hash;

    return new { sha1, md5 };
});
```

```csharp
class DoubleHashStream : Stream
{
    // etc.
    private readonly HashAlgorithm[] _hashes = { MD5.Create(), SHA1.Create() };

    // ...
    public override ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken = default)
    {
        foreach (var hasher in _hashes)
            hasher.TransformBlock(buffer.ToArray(), 0, buffer.Length, null, 0);
        return ValueTask.CompletedTask;
    }

    public void Complete()
    {
        foreach (var hasher in _hashes)
            hasher.TransformFinalBlock(Array.Empty<byte>(), 0, 0);
    }
}
```

However "flipping" a stream around is possible with System.IO.pipelines, even though that is somewhat painful (Or at least it was last I did it)

There should be a simpler way to get a duplex stream from a Pipe.
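
Today that means hand-writing something like this (a hypothetical sketch; the `PipeDuplexStream` name and shape are illustrative, not an existing API):

```csharp
using System;
using System.IO;
using System.IO.Pipelines;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper: exposes a Pipe's reader and writer ends as one Stream.
sealed class PipeDuplexStream : Stream
{
    private readonly Stream _reader;
    private readonly Stream _writer;

    public PipeDuplexStream(Pipe pipe)
    {
        _reader = pipe.Reader.AsStream();
        _writer = pipe.Writer.AsStream();
    }

    public override bool CanRead => true;
    public override bool CanWrite => true;
    public override bool CanSeek => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }

    public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken ct = default) => _reader.ReadAsync(buffer, ct);
    public override ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken ct = default) => _writer.WriteAsync(buffer, ct);
    public override int Read(byte[] buffer, int offset, int count) => _reader.Read(buffer, offset, count);
    public override void Write(byte[] buffer, int offset, int count) => _writer.Write(buffer, offset, count);
    public override void Flush() => _writer.Flush();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```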

I'm not a fan of the names `TeeReader` or `TeeStream`, though.

@stephentoub and @adamsitnik is this part of your stream improvements list?

stephentoub commented 2 years ago

> There should be a simpler way to get a duplex stream from a Pipe.
>
> @stephentoub and @adamsitnik is this part of your stream improvements list?

Yup