dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.07k stars 4.69k forks source link

[API Proposal]: System.IO.File to have a method to read a file line by line #103137

Closed ragavendra closed 3 months ago

ragavendra commented 4 months ago

Background and motivation

The current method ReadLinesAsync does not accept a callback method to run once all the file's lines are loaded as IEnumerable. Nice to have a method like ReadLineAsync accepting the line start, end and a call back method to run after each line read. This helps a lot in async programming like the Parallel.forEachAsync method. The current approach with ReadLinesAsync gives rare benefit for the ReadLines , making it almost to be used as a sync with await only.

API Proposal

namespace System.IO;

public static partial class File
{
    public static IAsyncEnumerable<string> ReadLineAsync(string path, CancellationToken cancellationToken = default, Func<string, bool> func)
            => ReadLineAsync(path, Encoding.UTF8, cancellationToken, func);
}

Actual -- file

Reference -- link

API Usage


// call back method
Func<string, bool> some = (line) => { Console.log($"Line is {line} ."); return false; };

// Calling the method
var res = System.IO.File.ReadLinesAsync("/tmp/tmp.txt", System.Text.Encoding.UTF8, cancellationToken.Token, some);

Alternative Designs

No response

Risks

No response

dotnet-policy-service[bot] commented 4 months ago

Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.

julealgon commented 4 months ago

@ragavendra

The current method ReadLinesAsync does not accept a callback method to run once all the file's lines are loaded as IEnumerable.

I fail to see what you are saying here. There is no need for a callback, you just add whatever you need after iterating on the method:

await foreach (var line in file.ReadLinesAsync())
{
    // do what you want with the line here
}

// here the entire file has been read, so add whatever you wanted here

Nice to have a method like ReadLineAsync accepting the line start, end and a call back method to run after each line read.

Again, why do you want callbacks if you can just do await foreach and process each line inside? Maybe you could elaborate a bit more on how your proposed API would look like?

This helps a lot in async programming like the Parallel.forEachAsync method. The current approach with ReadLinesAsync gives rare benefit for the ReadLines , making it almost to be used as a sync with await only.

I don't understand what you mean here. Can you rephrase/elaborate?

ragavendra commented 4 months ago

I fail to see what you are saying here. There is no need for a callback, you just add whatever you need after iterating on the method:

I can see that this way the lines from the file is loaded to the IEnumerable. What if the file is like 3GB, is 3GB worth of lines loaded in memory now? Or is it better to say load lines 0 - 100 with additional startLine and endLine parameters to the say ReadLinesAsync method .

No offense, my post has already been done voted and I doubt if I have even the slightest of the chances to help any further.

stephentoub commented 4 months ago

Thanks for the suggestion. I don't think we'd add this. Effectively with LINQ the proposed:

var res = System.IO.File.ReadLinesAsync("/tmp/tmp.txt", System.Text.Encoding.UTF8, cancellationToken.Token, some);

is the same as:

var res = System.IO.File.ReadLinesAsync("/tmp/tmp.txt", System.Text.Encoding.UTF8, cancellationToken.Token).Select(some);

We generally prefer composition over exposing new APIs that combine simple things like this.

I can see that this way the lines from the file is loaded to the IEnumerable. What if the file is like 3GB, is 3GB worth of lines loaded in memory now?

The existing ReadLinesAsync returns an IAsyncEnumerable. It does not load the entire file: when you call MoveNextAsync, it asynchronously reads the next line.

huoyaoyuan commented 4 months ago

You should have misunderstandings about how (async) enumerables work. Enumerables are effectively turning callbacks into plain calls. With the following producers and consumers:

IAsyncEnumerable<string> Producer()
{
    while (!EOL)
    {
        string line = await LoadNextLineAsync();
        yield return line;
    }
}
await foreach (string line in Producer())
{
    DoSomeThing(line);
}

You will be running the equivalent of:

while (!EOL)
{
    string line = await LoadNextLineAsync();
    DoSomeThing(line);
}
huoyaoyuan commented 4 months ago

The SO question you are referencing is highly legacy.

The original answer was written in 2011, with the ReadAllLines method that returns all of the lines in an array. The ReadLines(Async) method was introduced later, which returns enumerables and only loads one line when you are enumerating. DO note the difference of these two series of methods.

DL444 commented 4 months ago

File.ReadLinesAsync does not read all lines into memory and enumerate them line by line, that's what File.ReadAllLinesAsync does. File.ReadLinesAsync returns an IAsyncEnumerable that does asynchronous buffered read and returns one line per iteration. File.ReadLines also does this, but reads synchronously.

See this benchmark for comparison:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Benckmarks).Assembly).Run(args);

[ShortRunJob]
[MemoryDiagnoser]
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Benckmarks
{
    [GlobalSetup]
    public void Setup()
    {
        using StreamWriter writer = File.CreateText("test.txt");
        Span<byte> line = stackalloc byte[40];
        for (int i = 0; i < 1024 * 1024 * 1024 / 40; i++)
        {
            Random.Shared.NextBytes(line);
            writer.WriteLine(Convert.ToHexString(line));
        }
    }

    [Benchmark]
    public long ReadLines()
    {
        IEnumerable<string> linesEnumerable = File.ReadLines("test.txt");
        long length = 0;
        foreach (string line in linesEnumerable)
        {
            length += line.Length;
        }
        return length;
    }

    [Benchmark]
    public async Task<long> ReadLinesAsync()
    {
        IAsyncEnumerable<string> linesEnumerable = File.ReadLinesAsync("test.txt");
        long length = 0;
        await foreach (string line in linesEnumerable)
        {
            length += line.Length;
        }
        return length;
    }

    [Benchmark]
    public long ReadAllLines()
    {
        string[] lines = File.ReadAllLines("test.txt");
        long length = 0;
        foreach (string line in lines)
        {
            length += line.Length;
        }
        return length;
    }

    [Benchmark]
    public async Task<long> ReadAllLinesAsync()
    {
        string[] lines = await File.ReadAllLinesAsync("test.txt");
        long length = 0;
        foreach (string line in lines)
        {
            length += line.Length;
        }
        return length;
    }
}

On my machine this gives:

Method Mean Gen0 Gen1 Gen2 Allocated
ReadLines 1.266 s 787000.0000 - - 4.6 GB
ReadLinesAsync 3.287 s 1119000.0000 3000.0000 - 6.5 GB
ReadAllLines 5.909 s 811000.0000 294000.0000 24000.0000 5.3 GB
ReadAllLinesAsync 13.514 s 1134000.0000 411000.0000 20000.0000 7.2 GB

I didn't find a way in Benchmark.NET to measure peak heap usage, so no direct measurements here. But notice that for ReadLines[Async], generation 2 collection never occurred, and generation 1 collection is significantly less frequent. This means allocations are mostly temporary, and memory pressure is significant lower.

I believe another way to validate this is to write to the file during enumeration and see what happens:

using (StreamWriter writer = File.CreateText("test.txt"))
{
    writer.WriteLine("A".PadRight(4096));
    writer.WriteLine("B".PadRight(4096));
    writer.WriteLine("C".PadRight(4096));
    writer.WriteLine("D".PadRight(4096));
}

IAsyncEnumerator<string> linesEnumerator = File.ReadLinesAsync("test.txt").GetAsyncEnumerator();
await linesEnumerator.MoveNextAsync();
Console.WriteLine(linesEnumerator.Current.Trim());
await linesEnumerator.MoveNextAsync();
Console.WriteLine(linesEnumerator.Current.Trim());

using (StreamWriter writer = File.CreateText("test.txt"))
{
    writer.WriteLine("1".PadRight(4096));
    writer.WriteLine("2".PadRight(4096));
    writer.WriteLine("3".PadRight(4096));
    writer.WriteLine("4".PadRight(4096));
}

await linesEnumerator.MoveNextAsync();
Console.WriteLine(linesEnumerator.Current.Trim());
await linesEnumerator.MoveNextAsync();
Console.WriteLine(linesEnumerator.Current.Trim());

Mutating data source during enumeration is probably a bad idea and you should avoid this in practice. But in my environment it prints:

A B C 4

As for the Parallel.ForEachAsync you mentioned, it is not meant to be used together with IAsyncEnumerable to replace await foreach. It is used to perform multiple asynchronous tasks in parallel. You have to have all the data loaded in memory for them to be processed in parallel. Therefore using File.ReadLinesAsync with Parallel.ForEachAsync in the way OP implied doesn't seem to be a valid use case to me.

@julealgon already gave a great example on how to actually use IAsyncEnumerable.

julealgon commented 4 months ago

@ragavendra

No offense, my post has already been done voted and I doubt if I have even the slightest of the chances to help any further.

You probably saw the other folks explaining to you how the IAsyncEnumerable overload doesn't behave how you thought it did so I won't cover that but just wanted to chime in on this statement here.

It is totally ok to be downvoted. Don't be discouraged from posting ideas because of that. Sometimes, you can make people reverse their downvotes into an upvote based on a more precise explanation or a more specific use case that you bring. This is why in my personal case I downvoted your post but asked you to clarify a few points at the same time, because there was still a chance I could reverse my vote depending on your answer.

The only thing I would suggest however is to first post these proposals in the "Discussions" area if you still have a few doubts on how some API works.

ragavendra commented 4 months ago

Thank you all for the valuable answers. I was able to find the solution for my situation. The problem with await is it is blocking and it blocks even in an async method. Having said that, I am using a separate async or a non async task to be able to do the loading using the ReadLinesAsync or the ReadLines ( now it is irrelevant ). I however have to check if that task is completed in the main thread or the controller method does not let the task to complete or similar.

Good to know about IAsyncEnumerable loads each line asynchroously having little impact on the resources.

CyrusNajmabadi commented 4 months ago

The problem with await is it is blocking

await is definitely not blocking :) You def wnat to read up on the topic here: https://learn.microsoft.com/en-us/dotnet/csharp/asynchronous-programming/

huoyaoyuan commented 4 months ago

I however have to check if that task is completed in the main thread or the controller method does not let the task to complete or similar.

The entire thing the await keyword does for you is waiting for completion. If you are in a GUI application that has an ExecutionContext associated with your main thread, it will put the continuation to your main thread by default.

I suggest to learn more about how async-await works. Stephen's blog is awesome, but really too verbose for beginners.

adamsitnik commented 3 months ago

From what I see @ragavendra got the needed help and as @stephentoub wrote, we won't introduce such API.

Thank you all for your help!

ragavendra commented 3 months ago

await is definitely not blocking :) You def wnat to read up on the topic here: https://learn.microsoft.com/en-us/dotnet/csharp/asynchronous-programming/

This is exactly the problem that me and other programmers have :) . Look at the await in JS here which is certainly blocking and makes sense to an average programmer, see this. There is no need to use await in that case but to simply only to clear the compiler warning.

The existing ReadLinesAsync returns an IAsyncEnumerable. It does not load the entire file: when you call MoveNextAsync, it asynchronously reads the next line.

In the documentation, the reference is like below.

        // Returns:
        //     The async enumerable that represents all the lines of the file, or the lines
        //     that are the result of a query.
        public static IAsyncEnumerable<string> ReadLinesAsync(string path, Encoding encoding, CancellationToken cancellationToken = default);

It does not mention that the file is read line by line, instead one can assume that the whole file is loaded or the result of the query. There is little or no mention about it here or anywhere as well.

julealgon commented 3 months ago

@ragavendra

await is definitely not blocking :) You def wnat to read up on the topic here: https://learn.microsoft.com/en-us/dotnet/csharp/asynchronous-programming/

This is exactly the problem that me and other programmers have :) . Look at the await in JS here which is certainly blocking and makes sense to an average programmer, see this. There is no need to use await in that case but to simply only to clear the compiler warning.

Using await just to "clear the compiler warning" there would be an absolutely terrible way to do it. The correct thing to do would be to switch from Thread.Sleep to await Task.Delay.

I think you have a misunderstanding on what "blocking" means vs what "waiting for the response to be available" means.

When you use await on a call, it means your code will stop there and wait for the response to become available, then resume execution. But it does this in a non-blocking way in the sense that it does not block the execution thread: it simply signals that the thread that was running your code until that point is now free to run some other code (potentially even the task to be awaited).

That's also the difference between Thread.Sleep (which blocks the thread) vs await Task.Delay (which does not block the thread).

The existing ReadLinesAsync returns an IAsyncEnumerable. It does not load the entire file: when you call MoveNextAsync, it asynchronously reads the next line.

In the documentation, the reference is like below.

        // Returns:
        //     The async enumerable that represents all the lines of the file, or the lines
        //     that are the result of a query.
        public static IAsyncEnumerable<string> ReadLinesAsync(string path, Encoding encoding, CancellationToken cancellationToken = default);

It does not mention that the file is read line by line, instead one can assume that the whole file is loaded or the result of the query. There is little or no mention about it here or anywhere as well.

It is implicit by the fact that it returns the IAsyncEnumerable interface. You need to understand how that interface works to understand how the results are read. I would strongly recommend looking that up specifically and getting familiar with it, then your questions should clear up when using any method that returns it.

ragavendra commented 3 months ago

is the same as:

var res = System.IO.File.ReadLinesAsync("/tmp/tmp.txt", System.Text.Encoding.UTF8, cancellationToken.Token).Select(some);

Does IAsyncEnumerable have the select?

// Returns: // The async enumerable that represents all the lines of the file, or the lines // that are the result of a query.

Is this still correct - or the lines that are the result of a query. ?

For this method, after the async task call, the lines are not run . Full snippet below as well for .Net 8.

using System;
using System.Threading.Tasks;
using System.Collections.Generic;
using System.IO;

public class Program
{
    public static void Main()
    {
        Console.WriteLine("Hello World");
        File.WriteAllText("someFile", "line one \nline two \nline three\n");

        /*

        var res = File.ReadAllText("someFile");
        Console.Write("File contents - " + res);

        var res_ = File.ReadAllLines("someFile");
        Console.Write("File contents - " + res_[0]);*/

        var res_1 = File.ReadLinesAsync("someFile");

        // Parallel.ForEachAsync(
        SomeMethod(res_1);

        Console.WriteLine("Exiting main");
    }

    public static void SomeMethod(IAsyncEnumerable<string> res_1)
    {
        var tsk = SomeMethod_1(res_1);
         while (!tsk.IsCompleted)
         {
             // _logger.LogInformation("Status is " + tsk.Status);
         }
        return;
    }

    public static async Task SomeMethod_1(IAsyncEnumerable<string> res_1){

        await Parallel.ForEachAsync(res_1, (line, cncl) => {
                        Console.WriteLine("Line is " + line);
                        // return default;
                        return new ValueTask(new Task(() => { }, cncl));
                    });

    }
}

First of all I am not sure if calling ReadLinesAsync justifies to what the method returns. Technically more or less a pointer to the first line is returned asynchronously? Nothing is read at all until, the foreach is run on the IAsyncEnumerable.

Literally, when I drafted the proposal it was more or less a draft itself, there were so many questions and ambiguity. However so many issues can still be opened and attended to from this post, to have a more verbose documentation and usability experience for both the entry level programmers and the so called 10x programmers :).

huoyaoyuan commented 3 months ago

Does IAsyncEnumerable have the select?

It's in System.Linq.Async package.

after the async task call, the lines are not run

It's because this line:

return new ValueTask(new Task(() => { }, cncl));

Task created by new Task is not started. Thus, tasks built and waiting on them would never complete.

I'd suggest you to learn more about async and tasks, to properly use them.