Closed ragavendra closed 3 months ago
Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.
@ragavendra
The current method
ReadLinesAsync
does not accept a callback method to run once all the file's lines are loaded as IEnumerable.
I fail to see what you are saying here. There is no need for a callback, you just add whatever you need after iterating on the method:
await foreach (var line in file.ReadLinesAsync())
{
// do what you want with the line here
}
// here the entire file has been read, so add whatever you wanted here
Nice to have a method like
ReadLineAsync
accepting the line start, end and a call back method to run after each line read.
Again, why do you want callbacks if you can just do await foreach
and process each line inside? Maybe you could elaborate a bit more on how your proposed API would look like?
This helps a lot in
async
programming like theParallel.forEachAsync
method. The current approach withReadLinesAsync
gives rare benefit for theReadLines
, making it almost to be used as async
with await only.
I don't understand what you mean here. Can you rephrase/elaborate?
I fail to see what you are saying here. There is no need for a callback, you just add whatever you need after iterating on the method:
I can see that this way the lines from the file is loaded to the IEnumerable
. What if the file is like 3GB, is 3GB worth of lines loaded in memory now? Or is it better to say load lines 0 - 100
with additional startLine
and endLine
parameters to the say ReadLinesAsync
method .
No offense, my post has already been done voted and I doubt if I have even the slightest of the chances to help any further.
Thanks for the suggestion. I don't think we'd add this. Effectively with LINQ the proposed:
var res = System.IO.File.ReadLinesAsync("/tmp/tmp.txt", System.Text.Encoding.UTF8, cancellationToken.Token, some);
is the same as:
var res = System.IO.File.ReadLinesAsync("/tmp/tmp.txt", System.Text.Encoding.UTF8, cancellationToken.Token).Select(some);
We generally prefer composition over exposing new APIs that combine simple things like this.
I can see that this way the lines from the file is loaded to the IEnumerable. What if the file is like 3GB, is 3GB worth of lines loaded in memory now?
The existing ReadLinesAsync returns an IAsyncEnumerable. It does not load the entire file: when you call MoveNextAsync, it asynchronously reads the next line.
You should have misunderstandings about how (async) enumerables work. Enumerables are effectively turning callbacks into plain calls. With the following producers and consumers:
IAsyncEnumerable<string> Producer()
{
while (!EOL)
{
string line = await LoadNextLineAsync();
yield return line;
}
}
await foreach (string line in Producer())
{
DoSomeThing(line);
}
You will be running the equivalent of:
while (!EOL)
{
string line = await LoadNextLineAsync();
DoSomeThing(line);
}
The SO question you are referencing is highly legacy.
The original answer was written in 2011, with the ReadAllLines method that returns all of the lines in an array. The ReadLines(Async) method was introduced later, which returns enumerables and only loads one line when you are enumerating. DO note the difference of these two series of methods.
File.ReadLinesAsync
does not read all lines into memory and enumerate them line by line, that's what File.ReadAllLinesAsync
does. File.ReadLinesAsync
returns an IAsyncEnumerable
that does asynchronous buffered read and returns one line per iteration. File.ReadLines
also does this, but reads synchronously.
See this benchmark for comparison:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Benckmarks).Assembly).Run(args);
[ShortRunJob]
[MemoryDiagnoser]
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Benckmarks
{
[GlobalSetup]
public void Setup()
{
using StreamWriter writer = File.CreateText("test.txt");
Span<byte> line = stackalloc byte[40];
for (int i = 0; i < 1024 * 1024 * 1024 / 40; i++)
{
Random.Shared.NextBytes(line);
writer.WriteLine(Convert.ToHexString(line));
}
}
[Benchmark]
public long ReadLines()
{
IEnumerable<string> linesEnumerable = File.ReadLines("test.txt");
long length = 0;
foreach (string line in linesEnumerable)
{
length += line.Length;
}
return length;
}
[Benchmark]
public async Task<long> ReadLinesAsync()
{
IAsyncEnumerable<string> linesEnumerable = File.ReadLinesAsync("test.txt");
long length = 0;
await foreach (string line in linesEnumerable)
{
length += line.Length;
}
return length;
}
[Benchmark]
public long ReadAllLines()
{
string[] lines = File.ReadAllLines("test.txt");
long length = 0;
foreach (string line in lines)
{
length += line.Length;
}
return length;
}
[Benchmark]
public async Task<long> ReadAllLinesAsync()
{
string[] lines = await File.ReadAllLinesAsync("test.txt");
long length = 0;
foreach (string line in lines)
{
length += line.Length;
}
return length;
}
}
On my machine this gives:
Method | Mean | Gen0 | Gen1 | Gen2 | Allocated |
---|---|---|---|---|---|
ReadLines | 1.266 s | 787000.0000 | - | - | 4.6 GB |
ReadLinesAsync | 3.287 s | 1119000.0000 | 3000.0000 | - | 6.5 GB |
ReadAllLines | 5.909 s | 811000.0000 | 294000.0000 | 24000.0000 | 5.3 GB |
ReadAllLinesAsync | 13.514 s | 1134000.0000 | 411000.0000 | 20000.0000 | 7.2 GB |
I didn't find a way in Benchmark.NET to measure peak heap usage, so no direct measurements here. But notice that for ReadLines[Async]
, generation 2 collection never occurred, and generation 1 collection is significantly less frequent. This means allocations are mostly temporary, and memory pressure is significant lower.
I believe another way to validate this is to write to the file during enumeration and see what happens:
using (StreamWriter writer = File.CreateText("test.txt"))
{
writer.WriteLine("A".PadRight(4096));
writer.WriteLine("B".PadRight(4096));
writer.WriteLine("C".PadRight(4096));
writer.WriteLine("D".PadRight(4096));
}
IAsyncEnumerator<string> linesEnumerator = File.ReadLinesAsync("test.txt").GetAsyncEnumerator();
await linesEnumerator.MoveNextAsync();
Console.WriteLine(linesEnumerator.Current.Trim());
await linesEnumerator.MoveNextAsync();
Console.WriteLine(linesEnumerator.Current.Trim());
using (StreamWriter writer = File.CreateText("test.txt"))
{
writer.WriteLine("1".PadRight(4096));
writer.WriteLine("2".PadRight(4096));
writer.WriteLine("3".PadRight(4096));
writer.WriteLine("4".PadRight(4096));
}
await linesEnumerator.MoveNextAsync();
Console.WriteLine(linesEnumerator.Current.Trim());
await linesEnumerator.MoveNextAsync();
Console.WriteLine(linesEnumerator.Current.Trim());
Mutating data source during enumeration is probably a bad idea and you should avoid this in practice. But in my environment it prints:
A B C 4
As for the Parallel.ForEachAsync
you mentioned, it is not meant to be used together with IAsyncEnumerable
to replace await foreach
. It is used to perform multiple asynchronous tasks in parallel. You have to have all the data loaded in memory for them to be processed in parallel. Therefore using File.ReadLinesAsync
with Parallel.ForEachAsync
in the way OP implied doesn't seem to be a valid use case to me.
@julealgon already gave a great example on how to actually use IAsyncEnumerable
.
@ragavendra
No offense, my post has already been done voted and I doubt if I have even the slightest of the chances to help any further.
You probably saw the other folks explaining to you how the IAsyncEnumerable
overload doesn't behave how you thought it did so I won't cover that but just wanted to chime in on this statement here.
It is totally ok to be downvoted. Don't be discouraged from posting ideas because of that. Sometimes, you can make people reverse their downvotes into an upvote based on a more precise explanation or a more specific use case that you bring. This is why in my personal case I downvoted your post but asked you to clarify a few points at the same time, because there was still a chance I could reverse my vote depending on your answer.
The only thing I would suggest however is to first post these proposals in the "Discussions" area if you still have a few doubts on how some API works.
Thank you all for the valuable answers. I was able to find the solution for my situation. The problem with await
is it is blocking and it blocks even in an async
method. Having said that, I am using a separate async or a non async task to be able to do the loading using the ReadLinesAsync
or the ReadLines
( now it is irrelevant ). I however have to check if that task is completed in the main thread or the controller method does not let the task to complete or similar.
Good to know about IAsyncEnumerable
loads each line asynchroously having little impact on the resources.
The problem with await is it is blocking
await
is definitely not blocking :) You def wnat to read up on the topic here: https://learn.microsoft.com/en-us/dotnet/csharp/asynchronous-programming/
I however have to check if that task is completed in the main thread or the controller method does not let the task to complete or similar.
The entire thing the await
keyword does for you is waiting for completion. If you are in a GUI application that has an ExecutionContext
associated with your main thread, it will put the continuation to your main thread by default.
I suggest to learn more about how async-await works. Stephen's blog is awesome, but really too verbose for beginners.
From what I see @ragavendra got the needed help and as @stephentoub wrote, we won't introduce such API.
Thank you all for your help!
await
is definitely not blocking :) You def wnat to read up on the topic here: https://learn.microsoft.com/en-us/dotnet/csharp/asynchronous-programming/
This is exactly the problem that me and other programmers have :) . Look at the await
in JS here which is certainly blocking and makes sense to an average programmer, see this. There is no need to use await
in that case but to simply only to clear the compiler warning.
The existing ReadLinesAsync returns an IAsyncEnumerable. It does not load the entire file: when you call MoveNextAsync, it asynchronously reads the next line.
In the documentation, the reference is like below.
// Returns:
// The async enumerable that represents all the lines of the file, or the lines
// that are the result of a query.
public static IAsyncEnumerable<string> ReadLinesAsync(string path, Encoding encoding, CancellationToken cancellationToken = default);
It does not mention that the file is read line by line, instead one can assume that the whole file is loaded or the result of the query. There is little or no mention about it here or anywhere as well.
@ragavendra
await
is definitely not blocking :) You def wnat to read up on the topic here: https://learn.microsoft.com/en-us/dotnet/csharp/asynchronous-programming/This is exactly the problem that me and other programmers have :) . Look at the
await
in JS here which is certainly blocking and makes sense to an average programmer, see this. There is no need to useawait
in that case but to simply only to clear the compiler warning.
Using await
just to "clear the compiler warning" there would be an absolutely terrible way to do it. The correct thing to do would be to switch from Thread.Sleep
to await Task.Delay
.
I think you have a misunderstanding on what "blocking" means vs what "waiting for the response to be available" means.
When you use await
on a call, it means your code will stop there and wait for the response to become available, then resume execution. But it does this in a non-blocking way in the sense that it does not block the execution thread: it simply signals that the thread that was running your code until that point is now free to run some other code (potentially even the task to be awaited).
That's also the difference between Thread.Sleep
(which blocks the thread) vs await Task.Delay
(which does not block the thread).
The existing ReadLinesAsync returns an IAsyncEnumerable. It does not load the entire file: when you call MoveNextAsync, it asynchronously reads the next line.
In the documentation, the reference is like below.
// Returns: // The async enumerable that represents all the lines of the file, or the lines // that are the result of a query. public static IAsyncEnumerable<string> ReadLinesAsync(string path, Encoding encoding, CancellationToken cancellationToken = default);
It does not mention that the file is read line by line, instead one can assume that the whole file is loaded or the result of the query. There is little or no mention about it here or anywhere as well.
It is implicit by the fact that it returns the IAsyncEnumerable
interface. You need to understand how that interface works to understand how the results are read. I would strongly recommend looking that up specifically and getting familiar with it, then your questions should clear up when using any method that returns it.
is the same as:
var res = System.IO.File.ReadLinesAsync("/tmp/tmp.txt", System.Text.Encoding.UTF8, cancellationToken.Token).Select(some);
Does IAsyncEnumerable
have the select
?
// Returns: // The async enumerable that represents all the lines of the file, or the lines // that are the result of a query.
Is this still correct - or the lines that are the result of a query.
?
For this method, after the async
task call, the lines are not run . Full snippet below as well for .Net 8.
using System;
using System.Threading.Tasks;
using System.Collections.Generic;
using System.IO;
public class Program
{
public static void Main()
{
Console.WriteLine("Hello World");
File.WriteAllText("someFile", "line one \nline two \nline three\n");
/*
var res = File.ReadAllText("someFile");
Console.Write("File contents - " + res);
var res_ = File.ReadAllLines("someFile");
Console.Write("File contents - " + res_[0]);*/
var res_1 = File.ReadLinesAsync("someFile");
// Parallel.ForEachAsync(
SomeMethod(res_1);
Console.WriteLine("Exiting main");
}
public static void SomeMethod(IAsyncEnumerable<string> res_1)
{
var tsk = SomeMethod_1(res_1);
while (!tsk.IsCompleted)
{
// _logger.LogInformation("Status is " + tsk.Status);
}
return;
}
public static async Task SomeMethod_1(IAsyncEnumerable<string> res_1){
await Parallel.ForEachAsync(res_1, (line, cncl) => {
Console.WriteLine("Line is " + line);
// return default;
return new ValueTask(new Task(() => { }, cncl));
});
}
}
First of all I am not sure if calling ReadLinesAsync
justifies to what the method returns. Technically more or less a pointer to the first line is returned asynchronously? Nothing is read at all until, the foreach
is run on the IAsyncEnumerable
.
Literally, when I drafted the proposal it was more or less a draft itself, there were so many questions and ambiguity. However so many issues can still be opened and attended to from this post, to have a more verbose documentation and usability experience for both the entry level programmers and the so called 10x programmers :).
Does
IAsyncEnumerable
have theselect
?
It's in System.Linq.Async package.
after the
async
task call, the lines are not run
It's because this line:
return new ValueTask(new Task(() => { }, cncl));
Task created by new Task
is not started. Thus, tasks built and waiting on them would never complete.
I'd suggest you to learn more about async and tasks, to properly use them.
Background and motivation
The current method
ReadLinesAsync
does not accept a callback method to run once all the file's lines are loaded as IEnumerable. Nice to have a method likeReadLineAsync
accepting the line start, end and a call back method to run after each line read. This helps a lot inasync
programming like theParallel.forEachAsync
method. The current approach withReadLinesAsync
gives rare benefit for theReadLines
, making it almost to be used as async
with await only.API Proposal
Actual -- file
Reference -- link
API Usage
Alternative Designs
No response
Risks
No response