Azure / azure-sdk-for-net

This repository is for active development of the Azure SDK for .NET. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/dotnet/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-net.
MIT License
5.26k stars 4.6k forks source link

[FEATURE REQ] Data Lake: Please stop Flush from being executed twice when writing a stream #18076

Closed shibayan closed 6 months ago

shibayan commented 3 years ago

Library or service name.

Azure.Storage.Files.DataLake v12.5.0

Is your feature request related to a problem? Please describe.

If you execute a write operation using a common stream like the following sample code, the same two flushes to the Storage API will be executed.

Since I am writing large amounts of data to multiple storage accounts, the overhead and error rate of this redundant flush is becoming too much to ignore.

class Program
{
    static async Task Main(string[] args)
    {
        var connectionString = "DefaultEndpointsProtocol=https;AccountName=***;AccountKey=***;EndpointSuffix=core.windows.net";

        var dataLakeServiceClient = new DataLakeServiceClient(connectionString);

        var fileSystemClient = dataLakeServiceClient.GetFileSystemClient("sampledata");

        await fileSystemClient.CreateIfNotExistsAsync();

        var fileClient = fileSystemClient.GetFileClient("sample.txt");

        // Write blob using stream
        using (var stream = await fileClient.OpenWriteAsync(false))
        using (var writer = new StreamWriter(stream))
        {
            await writer.WriteLineAsync(DateTime.Now.ToString());
        }
    }
}

image

Workaround

It seems that Flush is called every time Dispose is executed, so you can minimize the number of times Flush is executed by writing tricky code like the following, but I feel this is not the right way to write it.

// Remove using
var stream = await fileClient.OpenWriteAsync(false);

// Adding `leaveOpen` flag
using (var writer = new StreamWriter(stream, leaveOpen: true))
{
    await writer.WriteLineAsync(DateTime.Now.ToString());
}
ghost commented 3 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @sumantmehtams.

Issue Details
**Library or service name.** `Azure.Storage.Files.DataLake` v12.5.0 **Is your feature request related to a problem? Please describe.** If you execute a write operation using a common stream like the following sample code, the same two flushes to the Storage API will be executed. Since I am writing large amounts of data to multiple storage accounts, the overhead and error rate of this redundant flush is becoming too much to ignore. ```csharp class Program { static async Task Main(string[] args) { var connectionString = "DefaultEndpointsProtocol=https;AccountName=***;AccountKey=***;EndpointSuffix=core.windows.net"; var dataLakeServiceClient = new DataLakeServiceClient(connectionString); var fileSystemClient = dataLakeServiceClient.GetFileSystemClient("sampledata"); await fileSystemClient.CreateIfNotExistsAsync(); var fileClient = fileSystemClient.GetFileClient("sample.txt"); // Write blob using stream using (var stream = await fileClient.OpenWriteAsync(false)) using (var writer = new StreamWriter(stream)) { await writer.WriteLineAsync(DateTime.Now.ToString()); } } } ``` ![image](https://user-images.githubusercontent.com/1356444/105198775-5c288400-5b81-11eb-8f98-90c66f22dbb1.png) **Workaround** It seems that Flush is called every time Dispose is executed, so you can minimize the number of times Flush is executed by writing tricky code like the following, but I feel this is not the right way to write it. ```csharp // Remove using var stream = await fileClient.OpenWriteAsync(false); // Adding `leaveOpen` flag using (var writer = new StreamWriter(stream, leaveOpen: true)) { await writer.WriteLineAsync(DateTime.Now.ToString()); } ```
Author: shibayan
Assignees: -
Labels: `Client`, `Data Lake Storage Gen2`, `Service Attention`, `customer-reported`, `needs-team-attention`, `needs-triage`, `question`
Milestone: -
jsquire commented 3 years ago

Thank you for your feedback. Tagging and routing to the team best able to assist.

shibayan commented 3 years ago

Since the flush process is also subject to write transaction billing, there is a concern that unnecessary billing may have been unintentionally incurred when using the Write stream.

In C#, writing to Data Lake Storage in a stream using the CsvHelper library, which is a major library, can result in three flushes of the same request.

class Program
{
    static async Task Main(string[] args)
    {
        var connectionString = "DefaultEndpointsProtocol=https;AccountName=***;AccountKey=***;EndpointSuffix=core.windows.net";

        var dataLakeServiceClient = new DataLakeServiceClient(connectionString);

        var fileSystemClient = dataLakeServiceClient.GetFileSystemClient("sampledata");

        await fileSystemClient.CreateIfNotExistsAsync();

        var fileClient = fileSystemClient.GetFileClient("sample3.txt");

        // Write blob using stream
        using (var stream = await fileClient.OpenWriteAsync(false))
        using (var csvWriter = new CsvWriter(new StreamWriter(stream), new CsvConfiguration(CultureInfo.InvariantCulture)
        {
            HasHeaderRecord = false
        }))
        {
            for (int j = 0; j < 100; j++)
            {
                csvWriter.WriteField($"data-{j}");
            }

            await csvWriter.NextRecordAsync();
        }
    }
}

image

ghost commented 3 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.

Issue Details
**Library or service name.** `Azure.Storage.Files.DataLake` v12.5.0 **Is your feature request related to a problem? Please describe.** If you execute a write operation using a common stream like the following sample code, the same two flushes to the Storage API will be executed. Since I am writing large amounts of data to multiple storage accounts, the overhead and error rate of this redundant flush is becoming too much to ignore. ```csharp class Program { static async Task Main(string[] args) { var connectionString = "DefaultEndpointsProtocol=https;AccountName=***;AccountKey=***;EndpointSuffix=core.windows.net"; var dataLakeServiceClient = new DataLakeServiceClient(connectionString); var fileSystemClient = dataLakeServiceClient.GetFileSystemClient("sampledata"); await fileSystemClient.CreateIfNotExistsAsync(); var fileClient = fileSystemClient.GetFileClient("sample.txt"); // Write blob using stream using (var stream = await fileClient.OpenWriteAsync(false)) using (var writer = new StreamWriter(stream)) { await writer.WriteLineAsync(DateTime.Now.ToString()); } } } ``` ![image](https://user-images.githubusercontent.com/1356444/105198775-5c288400-5b81-11eb-8f98-90c66f22dbb1.png) **Workaround** It seems that Flush is called every time Dispose is executed, so you can minimize the number of times Flush is executed by writing tricky code like the following, but I feel this is not the right way to write it. ```csharp // Remove using var stream = await fileClient.OpenWriteAsync(false); // Adding `leaveOpen` flag using (var writer = new StreamWriter(stream, leaveOpen: true)) { await writer.WriteLineAsync(DateTime.Now.ToString()); } ```
Author: shibayan
Assignees: -
Labels: `Client`, `Data Lake Storage Gen2`, `Service Attention`, `Storage`, `customer-reported`, `needs-team-attention`, `question`
Milestone: -
amnguye commented 3 years ago

I agree the workaround does not seem to be the best option, but I wonder if doing that would dispose the stream anyways when the stream you get from OpenWrite disposes.

This might be the way how creating CsvWriter streams and StreamWriter work where it disposes after being used, because you're essentially having the same stream just different Writers pointing to it.

@seanmcc-msft Do you have a recommendation to prevent multiple Flush calls (Dispose being called on the stream multiple times).

shibayan commented 3 years ago

Although there may be some consistency issues, ideally, if the position has not changed since the last flush, the next flush should be skipped. In buffered writers, if there is nothing to flush, flushing will not be done.

I can still compromise if flush does not incur any charges, but it is difficult because it is actually charged.

shibayan commented 1 year ago

Unfortunately, with the current ADLS Gen 2 SDK, this issue doubles the cost of Flush API calls in a typical scenario, which has caused some of my customers to consider moving away from Azure.

JimSuplizio commented 6 months ago

Hi @shibayan, we deeply appreciate your input into this project. Regrettably, this issue has remained inactive for over 2 years, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.