dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.24k stars 4.73k forks source link

[API Proposal]: Add Method FromStream(Async) to BinaryData that does not copy #89804

Open dersia opened 1 year ago

dersia commented 1 year ago

Background and motivation

Our aspnet web api backend takes files from the frontend and uploads those to Azure Storage Blobs.

We use BlobClient.Upload and we also have to call update on the properties to set the content-type of the blob. With the new Media-Type on BinaryData (https://github.com/dotnet/runtime/pull/89605) we would like to take advantage of this feature so don't have to make multiple calls on the BlobClient, and in general I think there is benefit in using BinaryData overall. But as of today BinaryData.FromStream() copies the stream as you can see here https://github.com/dotnet/runtime/blob/f60757ad40a36b44d471f1cb72614368eb51a1fd/src/libraries/System.Memory.Data/src/System/BinaryData.cs#L173-L208

Because of this BinaryData is not useable for us. So I would like to suggest an API that would not copy the stream.

This makes BinaryData convenient to use and avoid extra allocation.

API Proposal

namespace System;

public class BinaryData
{
    // existing methods with https://github.com/dotnet/runtime/pull/89605
    public static BinaryData FromStream(Stream stream);
    public static BinaryData FromStream(Stream stream, string? mediaType);
    public static Task<BinaryData> FromStreamAsync(Stream stream, CancellationToken cancellationToken = default);
    public static Task<BinaryData> FromStreamAsync(Stream stream, string? mediaType, CancellationToken cancellationToken = default);

    // new api
    public static BinaryData FromStream(Stream stream, bool copyContent = true);
    public static BinaryData FromStream(Stream stream, string? mediaType, bool copyContent = true);
    public static Task<BinaryData> FromStreamAsync(Stream stream, bool copyContent = true, CancellationToken cancellationToken = default);
    public static Task<BinaryData> FromStreamAsync(Stream stream, string? mediaType, bool copyContent = true, CancellationToken cancellationToken = default);
}

API Usage

public async Task UploadBlob(Stream fileStream, string mediaType, CancellationToken cancellationToken = default)
{
    var data = BinrayData.FromString(fileStream, mediaType, false);
    await blobClient.UploadAsync(data, stream, true, cancellationToken).ConfigureAwait(false);
}

Alternative Designs

Alternatively FromStream could be overloaded but I think this would be a source breaking change. Another alternative would be to make FromStream not to copy at all, but I think this will be a breaking behavior and might lead to bugs, where it "just" works for users even if the underlying stream was closed in the meantime.

Risks

If called with copyContent false, this could lead to an exception, if the underlying stream is closed, but I think this is something that users are familiar with and with the bool parameter explicitly set to false, users are opting into this behavior and therefor it should be fine, if an exception is thrown.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.

Issue Details
### Background and motivation Our aspnet web api backend takes files from the frontend and uploads those to Azure Storage Blobs. We use BlobClient.Upload and we also have to call update on the properties to set the content-type of the blob. With the new Media-Type on `BinaryData` (https://github.com/dotnet/runtime/pull/89605) we would like to take advantage of this feature so don't have to make multiple calls on the BlobClient, and in general I think there is benefit in using `BinaryData` overall. But as of today `BinaryData.FromStream()` copies the stream as you can see here https://github.com/dotnet/runtime/blob/f60757ad40a36b44d471f1cb72614368eb51a1fd/src/libraries/System.Memory.Data/src/System/BinaryData.cs#L173 Because of this `BinaryData` is not useable for us. So I would like to suggest an API that would not copy the stream. This makes `BinaryData` convenient to use and avoid extra allocation. ### API Proposal ```cs namespace System; public class BinaryData { // existing methods with https://github.com/dotnet/runtime/pull/89605 public static BinaryData FromStream(Stream stream); public static BinaryData FromStream(Stream stream, string? mediaType); public static Task FromStreamAsync(Stream stream, CancellationToken cancellationToken = default); public static Task FromStreamAsync(Stream stream, string? mediaType, CancellationToken cancellationToken = default); // new api public static BinaryData FromStream(Stream stream, bool copyContent = true); public static BinaryData FromStream(Stream stream, string? mediaType, bool copyContent = true); public static Task FromStreamAsync(Stream stream, bool copyContent = true, CancellationToken cancellationToken = default); public static Task FromStreamAsync(Stream stream, string? mediaType, bool copyContent = true, CancellationToken cancellationToken = default); } ``` ### API Usage ```csharp public async Task UploadBlob(Stream fileStream, string mediaType, CancellationToken cancellationToken = default) { var data = BinrayData.FromString(fileStream, mediaType, false); await blobClient.UploadAsync(data, stream, true, cancellationToken).ConfigureAwait(false); } ``` ### Alternative Designs Alternatively `FromStream` could be overloaded but I think this would be a source breaking change. Another alternative would be to make FromStream not to copy at all, but I think this will be a breaking behavior and might lead to bugs, where it "just" works for users even if the underlying stream was closed in the meantime. ### Risks If called with copyContent false, this could lead to an exception, if the underlying stream is closed, but I think this is something that users are familiar with and with the bool parameter explicitly set to false, users are opting into this behavior and therefor it should be fine, if an exception is thrown.
Author: dersia
Assignees: -
Labels: `api-suggestion`, `area-System.IO`, `untriaged`
Milestone: -
ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.

Issue Details
### Background and motivation Our aspnet web api backend takes files from the frontend and uploads those to Azure Storage Blobs. We use BlobClient.Upload and we also have to call update on the properties to set the content-type of the blob. With the new Media-Type on `BinaryData` (https://github.com/dotnet/runtime/pull/89605) we would like to take advantage of this feature so don't have to make multiple calls on the BlobClient, and in general I think there is benefit in using `BinaryData` overall. But as of today `BinaryData.FromStream()` copies the stream as you can see here https://github.com/dotnet/runtime/blob/f60757ad40a36b44d471f1cb72614368eb51a1fd/src/libraries/System.Memory.Data/src/System/BinaryData.cs#L173-L208 Because of this `BinaryData` is not useable for us. So I would like to suggest an API that would not copy the stream. This makes `BinaryData` convenient to use and avoid extra allocation. ### API Proposal ```cs namespace System; public class BinaryData { // existing methods with https://github.com/dotnet/runtime/pull/89605 public static BinaryData FromStream(Stream stream); public static BinaryData FromStream(Stream stream, string? mediaType); public static Task FromStreamAsync(Stream stream, CancellationToken cancellationToken = default); public static Task FromStreamAsync(Stream stream, string? mediaType, CancellationToken cancellationToken = default); // new api public static BinaryData FromStream(Stream stream, bool copyContent = true); public static BinaryData FromStream(Stream stream, string? mediaType, bool copyContent = true); public static Task FromStreamAsync(Stream stream, bool copyContent = true, CancellationToken cancellationToken = default); public static Task FromStreamAsync(Stream stream, string? mediaType, bool copyContent = true, CancellationToken cancellationToken = default); } ``` ### API Usage ```csharp public async Task UploadBlob(Stream fileStream, string mediaType, CancellationToken cancellationToken = default) { var data = BinrayData.FromString(fileStream, mediaType, false); await blobClient.UploadAsync(data, stream, true, cancellationToken).ConfigureAwait(false); } ``` ### Alternative Designs Alternatively `FromStream` could be overloaded but I think this would be a source breaking change. Another alternative would be to make FromStream not to copy at all, but I think this will be a breaking behavior and might lead to bugs, where it "just" works for users even if the underlying stream was closed in the meantime. ### Risks If called with copyContent false, this could lead to an exception, if the underlying stream is closed, but I think this is something that users are familiar with and with the bool parameter explicitly set to false, users are opting into this behavior and therefor it should be fine, if an exception is thrown.
Author: dersia
Assignees: -
Labels: `api-suggestion`, `area-System.Memory`, `untriaged`
Milestone: -
stephentoub commented 9 months ago

I don't understand the proposal. What is the behavior when not copying? Are you proposing storing the Stream object and then reading from it later when the BinaryData is consumed? That seems like a non-starter, as for example it would push a lot of possibly asynchronous and exceptional work into existing members, like the implicit cast to span.

dersia commented 9 months ago

@stephentoub Sorry for the late reply.

Basically yes, I am suggesting to store the Stream object and probably with an additional leaveOpen argument. I would like to clarify why I am suggesting this.

The problem that we are having is that we have an http endpoint that takes files. these files need to be uploaded to azure blob-storage and we also need to set the ContentType of those blobs. The Azure SDKs that do upload to Azure Storage Account have multiple overloads, for example one that takes a stream and also one that takes BinaryData. The feature that was added in https://github.com/dotnet/runtime/pull/89605 allows for BinaryData to also set the ContentType during upload and we want to make use of this feature. When using the Stream overload on the BlobClient, we would also need to make another call to update the Blob Properties so we can set the ContentType. Having said that, there is already an overload on BinaryData which takes a Stream as an argument, the problem with that overload is, that it copies the original stream and this does not work with lager files and we ran into memory exceptions.

As far as I understand the Runtime has already APIs that take Streams and handle correct behavior with streams, so I don't quite understand what would make this different or exceptional in that regard, but then again, you are the expert and I hope you can enlighten me here. Also the overload that takes BinaryData with a stream argument as of today is internally creates a MemoryStream internally and copys the original stream and hase to then work with that stream, which is converted to a span so I don't quite understand why it couldn't do the same with the original stream?

Sorry again for the late reply and thank you, Sia

stephentoub commented 3 months ago

As far as I understand the Runtime has already APIs that take Streams and handle correct behavior with streams, so I don't quite understand what would make this different or exceptional in that regard

The problem is the rest of the BinaryData surface area assumes the type stores all the data, which then means that methods like ToArray, ToMemory, and the implicit cast to ReadOnlyMemory<T> would all start doing I/O, not only significantly changing their performance profile, but also leading to I/O that likely should have been done asynchronously now either being done synchronously, or worse, being done asynchronously but blocking (sync over async).

julealgon commented 3 months ago

Our aspnet web api backend takes files from the frontend and uploads those to Azure Storage Blobs.

Have you considered using pre-signed blob upload links, send those to the frontend, and have it upload directly to Azure through them?

I know this doesn't directly answer your proposal, but figured I'd mention anyways in case you were not aware of this option which tends to be substantially more efficient than roundtripping all the data twice via an indirection (your API).

KrzysztofCwalina commented 2 months ago

cc: @seanmcc-msft