Closed: shibayan closed this issue 6 months ago
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @sumantmehtams.
| Author: | shibayan |
| --- | --- |
| Assignees: | - |
| Labels: | `Client`, `Data Lake Storage Gen2`, `Service Attention`, `customer-reported`, `needs-team-attention`, `needs-triage`, `question` |
| Milestone: | - |
Thank you for your feedback. Tagging and routing to the team best able to assist.
Since each flush is also billed as a write transaction, there is a concern that unnecessary charges may be incurred unintentionally when using the write stream.
In C#, streaming a write to Data Lake Storage through CsvHelper, a widely used library, can result in three flushes for the same request.
```csharp
using System.Globalization;
using System.Threading.Tasks;
using Azure.Storage.Files.DataLake;
using CsvHelper;
using CsvHelper.Configuration;

class Program
{
    static async Task Main(string[] args)
    {
        var connectionString = "DefaultEndpointsProtocol=https;AccountName=***;AccountKey=***;EndpointSuffix=core.windows.net";
        var dataLakeServiceClient = new DataLakeServiceClient(connectionString);
        var fileSystemClient = dataLakeServiceClient.GetFileSystemClient("sampledata");

        await fileSystemClient.CreateIfNotExistsAsync();

        var fileClient = fileSystemClient.GetFileClient("sample3.txt");

        // Write blob using stream
        using (var stream = await fileClient.OpenWriteAsync(false))
        using (var csvWriter = new CsvWriter(new StreamWriter(stream), new CsvConfiguration(CultureInfo.InvariantCulture)
        {
            HasHeaderRecord = false
        }))
        {
            for (int j = 0; j < 100; j++)
            {
                csvWriter.WriteField($"data-{j}");
            }

            await csvWriter.NextRecordAsync();
        }
    }
}
```
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.
| Author: | shibayan |
| --- | --- |
| Assignees: | - |
| Labels: | `Client`, `Data Lake Storage Gen2`, `Service Attention`, `Storage`, `customer-reported`, `needs-team-attention`, `question` |
| Milestone: | - |
I agree the workaround does not seem to be the best option, but I wonder if doing that would dispose the stream anyway when the stream you get from `OpenWrite` is disposed.
This might just be how `CsvWriter` and `StreamWriter` work: each disposes the stream it wraps after being used, and since you essentially have the same underlying stream with different writers pointing to it, the stream ends up disposed multiple times.
@seanmcc-msft Do you have a recommendation for preventing multiple `Flush` calls (i.e. `Dispose` being called on the stream multiple times)?
Although there may be some consistency issues, ideally the next flush should be skipped if the position has not changed since the last flush. This is how buffered writers behave: if there is nothing to flush, no flush is performed.
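The idea above can be sketched as a small stream wrapper that forwards `Flush` to the inner stream only when the position has advanced since the last flush. This is a minimal illustration of the suggestion, not SDK code; the names `SkipRedundantFlushStream` and `CountingStream` are hypothetical, and a real ADLS write stream would need its own way to track written length:

```csharp
using System;
using System.IO;

// Hypothetical wrapper: forwards Flush to the inner stream only when
// the position has advanced since the previous forwarded flush.
class SkipRedundantFlushStream : Stream
{
    private readonly Stream _inner;
    private long _lastFlushedPosition;

    public SkipRedundantFlushStream(Stream inner) => _inner = inner;

    public override void Flush()
    {
        if (_inner.Position != _lastFlushedPosition)
        {
            _inner.Flush();                       // only billable call
            _lastFlushedPosition = _inner.Position;
        }
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) Flush();                   // final flush, skipped if nothing new
        base.Dispose(disposing);
    }

    // Plain pass-throughs required by the Stream base class.
    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => _inner.CanSeek;
    public override bool CanWrite => _inner.CanWrite;
    public override long Length => _inner.Length;
    public override long Position { get => _inner.Position; set => _inner.Position = value; }
    public override int Read(byte[] buffer, int offset, int count) => _inner.Read(buffer, offset, count);
    public override long Seek(long offset, SeekOrigin origin) => _inner.Seek(offset, origin);
    public override void SetLength(long value) => _inner.SetLength(value);
    public override void Write(byte[] buffer, int offset, int count) => _inner.Write(buffer, offset, count);
}

// Counts flushes so the skipping behavior can be observed.
class CountingStream : MemoryStream
{
    public int FlushCount;
    public override void Flush() { FlushCount++; base.Flush(); }
}

class Demo
{
    static void Main()
    {
        var counting = new CountingStream();
        using (var s = new SkipRedundantFlushStream(counting))
        {
            s.Write(new byte[] { 1, 2, 3 }, 0, 3);
            s.Flush(); // position advanced, forwarded
            s.Flush(); // no new data, skipped
        }              // Dispose flushes again, skipped
        Console.WriteLine(counting.FlushCount); // prints 1
    }
}
```

With this shape, wrapping writers can call `Flush`/`Dispose` as many times as they like and only one flush reaches the underlying service per batch of written data.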
I could compromise if flushing incurred no charges, but that is hard to accept given that each flush is actually billed.
Unfortunately, with the current ADLS Gen2 SDK, this issue doubles the cost of Flush API calls in a typical scenario, which has led some of my customers to consider moving away from Azure.
Hi @shibayan, we deeply appreciate your input into this project. Regrettably, this issue has remained inactive for over 2 years, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.
Library or service name
Azure.Storage.Files.DataLake v12.5.0

Is your feature request related to a problem? Please describe.
If you execute a write using a common streaming pattern like the following sample code, two identical flushes are sent to the Storage API.
Since I am writing large amounts of data to multiple storage accounts, the overhead and error rate of these redundant flushes are becoming too large to ignore.
Workaround
It seems that Flush is called every time Dispose is executed, so the number of Flush calls can be minimized by writing tricky code like the following, but I do not feel this is the right way to write it.
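One possible shape for such a workaround (a sketch, not necessarily the author's original code) is to construct the inner writers with `leaveOpen: true`, so that disposing `CsvWriter` and `StreamWriter` no longer disposes (and therefore no longer flushes) the ADLS stream, and only the outermost `using` produces the single billable flush:

```csharp
// Sketch under the assumption that the CsvHelper and .NET overloads with a
// leaveOpen parameter are available; fileClient is the DataLakeFileClient
// from the sample above.
using (var stream = await fileClient.OpenWriteAsync(false))
{
    using (var streamWriter = new StreamWriter(stream, Encoding.UTF8, bufferSize: 4096, leaveOpen: true))
    using (var csvWriter = new CsvWriter(streamWriter, new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        HasHeaderRecord = false
    }, leaveOpen: true))
    {
        for (int j = 0; j < 100; j++)
        {
            csvWriter.WriteField($"data-{j}");
        }

        await csvWriter.NextRecordAsync();
        await streamWriter.FlushAsync(); // push buffered text into the ADLS stream
    }
} // the single Dispose here performs the one flush against the service
```

The trade-off is that the buffered text must be flushed into the stream manually before the writers go out of scope, which is easy to forget and is presumably why the author calls this "tricky".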