bitfield / script

Making it easy to write shell-like scripts in Go
MIT License
5.55k stars 315 forks source link

Update First(n int) to read no more data than necessary #199

Closed bartdeboer closed 6 months ago

bartdeboer commented 6 months ago

This slightly modifies First and introduces Prompt that can be used for user input:

fmt.Print("User input: ")
input, _ := script.Stdin().First(1).String()
fmt.Printf("Input was %s\n", input)

or

input, _ := script.Prompt("User input: ").String()
fmt.Printf("Input was %s\n", input)

Fixes #51, #132

bitfield commented 6 months ago

Thanks @bartdeboer! A little more context would be helpful for me on this one. Prompt makes sense, but what effect does the change to First have that makes it necessary here?

bartdeboer commented 6 months ago

Hi John,

I'm happy to provide more context. The First(n int) function is designed to read the first n lines from a stream. However, the old implementation had a limitation:

func (p *Pipe) FilterScan(filter func(string, io.Writer)) *Pipe {
    return p.Filter(func(r io.Reader, w io.Writer) error {
        scanner := newScanner(r)
        for scanner.Scan() {
            filter(scanner.Text(), w)
        }
        return scanner.Err()
    })
}

func (p *Pipe) First(n int) *Pipe {
    if p.Error() != nil {
        return p
    }
    if n <= 0 {
        return NewPipe()
    }
    i := 0
    return p.FilterScan(func(line string, w io.Writer) {
        if i >= n {
            return
        }
        fmt.Fprintln(w, line)
        i++
    })
}

It filters out n number of lines, but keeps scanning until EOF even after n lines have been received. This means that when used to scan for user input we're not sending in EOF, the function doesn't break out of the scanning loop.

My modified version solves this by breaking out of the scanner once the n number of lines is exceeded:

func (p *Pipe) First(n int) *Pipe {
    if p.Error() != nil {
        return p
    }
    if n <= 0 {
        return NewPipe()
    }
    return p.Filter(func(r io.Reader, w io.Writer) error {
        scanner := newScanner(r)
        i := 0
        for scanner.Scan() {
            _, err := fmt.Fprintln(w, scanner.Text())
            if err != nil {
                return err
            }
            i++
            if i >= n {
                break
            }
        }
        return scanner.Err()
    })
}
bitfield commented 6 months ago

It filters out n number of lines, but keeps scanning until EOF even after n lines have been received.

I'm not sure this is true, if I'm understanding you correctly. For example, consider this program:

script.Stdin().First(1).Stdout()

If we run this, and type "hello" followed by a newline, we immediately see the output "hello". First is not waiting for EOF in order to produce its result. All further input lines are ignored, as you'd expect, and the pipe itself doesn't terminate until the input ends, also as you'd expect.

Can you help me out by showing an example program that doesn't work as you'd expect with the current implementation?

bartdeboer commented 6 months ago

Here's some examples of what I'm seeing:

With .String()

fmt.Print("User input: ")
input, _ := script.Stdin().First(1).String()
fmt.Printf("Input was %s\n", input)

With PR:

User input: 1st input\n
Input was 1st input

Without PR:

User input: 1st input\n
2nd input\n
3rd input\n
Ctrl+D
Input was 1st input

With .Stdout()

fmt.Print("User input: ")
bytes, _ := script.Stdin().First(1).Stdout()
fmt.Printf("Input was %d bytes\n", bytes)

With PR:

User input: 1st input\n
1st input
Input was 10 bytes

Without PR:

User input: 1st input\n
1st input
2nd input\n
3rd input\n
Ctrl+D
Input was 10 bytes

I guess herein lies the difference in behavior. If the pipe should terminate after n lines have been read or keep streaming?

I think maybe it should terminate. It isn't going to send anything more. What happens if I want to read the first 10 lines of a 1GB log file?

Considering it should work the same like head -1 I think it also terminates after n lines?

bitfield commented 6 months ago

I guess herein lies the difference in behavior. If the pipe should terminate after n lines have been read or keep streaming?

This is a very good point. As a matter of fact, no script filter currently closes the pipe by itself. I agree it seems logical that First should do that, though, since reading any further lines is guaranteed to be wasted work for all downstream stages. And head does indeed behave this way.

What do you think about this slightly more compact way of writing the same logic?

    return p.Filter(func(r io.Reader, w io.Writer) error {
        scanner := newScanner(r)
        for i := 0; i < n && scanner.Scan(); i++ {
            _, err := fmt.Fprintln(w, scanner.Text())
            if err != nil {
                return err
            }
        }
        return scanner.Err()
    })
bartdeboer commented 6 months ago

I think your version writes it more elegantly, yes.

Would you like to commit that change to the PR or shall I?

Next I think I/we should perhaps drop the Prompt() function from this PR and perhaps do that in a separate PR?

bitfield commented 6 months ago

Yes, go ahead and make the change by all means. Also, shall we document this behaviour in the doc comment? Something like:

// First produces only the first n lines of the pipe's contents, or all the
// lines if there are less than n. If n is zero or negative, there is no output
// at all. When n lines have been produced, First stops reading its input and
// sends EOF to its output.

Next I think I/we should perhaps drop the Prompt() function from this PR and perhaps do that in a separate PR?

Yes, good idea—let's make these changes atomic.

bitfield commented 6 months ago

Great! Also, we should probably have a test that shows this behaviour, shouldn't we? What about something like a strings.Reader that contains several lines, and then after reading it through a pipe with First(1) we can prove that there are still lines left in the reader.

bartdeboer commented 6 months ago

Something like this?

It checks if there's remaining bytes in the reader after running First (accounting for Scanner's 4096 buffer size).

func TestFirstHasRemainingBytesAfterNLinesOfInput(t *testing.T) {
    t.Parallel()
    lineSize, testNumLines, totalLines := 1024, 3, 9
    var input strings.Builder
    // Generate input that is bigger than Scanner's 4096 byte buffer size.
    for i := 0; i < totalLines; i++ {
        line := strings.Repeat(fmt.Sprintf("%d", i), lineSize-1) + "\n"
        input.WriteString(line)
    }
    inputReader := strings.NewReader(input.String())
    got, _ := script.NewPipe().WithReader(inputReader).First(testNumLines).String()
    if len(got) == 0 {
        t.Errorf("Expected bytes from First, got 0")
    }
    remInputBytes, err := io.ReadAll(inputReader)
    if err != nil {
        t.Fatal("Error reading remaining input bytes:", err)
    }
    if len(remInputBytes) == 0 {
        t.Errorf("Expected remaining bytes in the inputReader, got 0")
    }
}
bitfield commented 6 months ago

Great! I forgot about the buffer. What about this slightly shorter version of the same idea:

func TestFirstDoesNotConsumeUnnecessaryData(t *testing.T) {
    t.Parallel()
    // First uses a 4096-byte buffer, so will always read at least
    // that much, but no more (once N lines have been read).
    r := strings.NewReader(strings.Repeat("line\n", 1000))
    got, err := script.NewPipe().WithReader(r).First(1).String()
    if err != nil {
        t.Fatal(err)
    }
    want := "line\n"
    if want != got {
        t.Errorf("want output %q, got %q", want, got)
    }
    if r.Len() == 0 {
        t.Errorf("no data left in reader")
    }
}