trying something out here: https://github.com/elastic/beats/pull/36215
Let's get it straight first:
Looking further into it, we see that when we execute the bulk request and consume the response we call ioutil.ReadAll(resp.Body), which is deprecated and consumes more memory, especially for big responses.
The only reason it's deprecated is that it was moved to another package; it's now io.ReadAll.
We should replace this with a bytes.Buffer and maybe io.Copy to copy the response body into a consumable, reusable buffer.
Okay, I was not sure about this and collected some data:
io.ReadAll allocates a buffer of 512 bytes and then re-allocates it to a larger size if the content exceeds this size. I believe in most of our cases it would probably re-allocate.

Using io.Copy and bytes.Buffer would work in a similar way: the first time you write to a bytes.Buffer it allocates a slice to accommodate the currently written chunk; if the chunk is smaller than 64 bytes, the minimum allocation is 64 bytes. The next written chunk would probably cause another allocation. However, the buffer-growing algorithm is quite sophisticated and probably more efficient in some edge cases.
To prove the hypothesis I created this benchmark:
```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"io/ioutil"
	"testing"
)

func BenchmarkReadBytes(b *testing.B) {
	sizes := []int{
		100,         // 100 bytes
		100 * 1024,  // 100KB
		1024 * 1024, // 1MB
	}
	for _, size := range sizes {
		b.Run(fmt.Sprintf("size: %d", size), func(b *testing.B) {
			// emulate a file or an HTTP response
			generated := bytes.Repeat([]byte{'a'}, size)
			content := bytes.NewReader(generated)
			b.ResetTimer()

			b.Run("deprecated ioutil.ReadAll", func(b *testing.B) {
				for i := 0; i < b.N; i++ {
					content.Seek(0, io.SeekStart) // reset
					data, err := ioutil.ReadAll(content)
					if err != nil {
						b.Fatal(err)
					}
					if len(data) != size {
						b.Fatalf("size does not match, expected %d, actual %d", size, len(data))
					}
				}
			})

			b.Run("bytes.Buffer+io.Copy", func(b *testing.B) {
				for i := 0; i < b.N; i++ {
					content.Seek(0, io.SeekStart) // reset
					buf := bytes.NewBuffer(nil)
					_, err := io.Copy(buf, content)
					if err != nil {
						b.Fatal(err)
					}
					if buf.Len() != size {
						b.Fatalf("size does not match, expected %d, actual %d", size, buf.Len())
					}
				}
			})
		})
	}
}
```
And indeed, bytes.Buffer + io.Copy is significantly faster on every size I tested (100 bytes, 100KB, 1MB), and it allocates almost 5 times less memory:
```
go test -run=none -bench='BenchmarkReadBytes/' -benchmem -benchtime=1000x
goos: darwin
goarch: arm64
pkg: github.com/rdner/tar-test
BenchmarkReadBytes/size:_100/deprecated_ioutil.ReadAll-10        1000   226.6 ns/op       512 B/op    1 allocs/op
BenchmarkReadBytes/size:_100/bytes.Buffer+io.Copy-10             1000   173.9 ns/op       160 B/op    2 allocs/op
BenchmarkReadBytes/size:_102400/deprecated_ioutil.ReadAll-10     1000   55907 ns/op    514308 B/op   18 allocs/op
BenchmarkReadBytes/size:_102400/bytes.Buffer+io.Copy-10          1000    6875 ns/op    106544 B/op    2 allocs/op
BenchmarkReadBytes/size:_1048576/deprecated_ioutil.ReadAll-10    1000  349308 ns/op   5241107 B/op   27 allocs/op
BenchmarkReadBytes/size:_1048576/bytes.Buffer+io.Copy-10         1000   72474 ns/op   1048626 B/op    2 allocs/op
PASS
ok  github.com/rdner/tar-test 0.700s
```
Even a simple replacement of ioutil.ReadAll (io.ReadAll) with bytes.Buffer + io.Copy (no buffer re-use) can give us a significant boost in both CPU and memory performance. So, it's worth spending time on this change.
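For illustration, the simple replacement would look roughly like this (a hypothetical helper, not the actual libbeat code):

```go
import (
	"bytes"
	"io"
	"net/http"
)

// readBody drains an HTTP response body through a bytes.Buffer instead of
// the deprecated ioutil.ReadAll (illustrative sketch only).
func readBody(resp *http.Response) ([]byte, error) {
	var buf bytes.Buffer
	// io.Copy lets the source use an io.WriterTo/io.ReaderFrom fast path
	// when one is available, and bytes.Buffer grows less wastefully.
	if _, err := io.Copy(&buf, resp.Body); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```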
Additionally, we can also investigate places where we know a fixed buffer size and can re-use the same buffer multiple times, like we do in filestream fingerprinting (see the sketch below). However, this is more effort and will take significantly more time.
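A sketch of that reuse pattern, assuming a pooled buffer (hypothetical helper, not the filestream code):

```go
import (
	"bytes"
	"io"
	"sync"
)

// bufPool recycles buffers between reads, so a warm pool allocates nothing.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// readPooled drains r into a pooled buffer and hands the bytes to consume.
// consume must not retain the slice: the buffer is recycled afterwards.
func readPooled(r io.Reader, consume func([]byte) error) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // keep the grown capacity, drop old contents
	defer bufPool.Put(buf)
	if _, err := io.Copy(buf, r); err != nil {
		return err
	}
	return consume(buf.Bytes())
}
```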
@rdner well done, great summary.
I estimated how big of a change it would be:
- ioutil.ReadAll currently has 40 affected files: https://github.com/search?q=repo%3Aelastic%2Fbeats%20%22ioutil.ReadAll%22&type=code
- io.ReadAll currently has 25 affected files: https://github.com/search?q=repo%3Aelastic%2Fbeats+%22io.ReadAll%22&type=code
- ReadAll just for HTTP responses is handled in https://github.com/elastic/elastic-agent-libs/pull/183. Once it's merged and we update to this new version of elastic-agent-libs, it's just 65 files.

It's medium effort if we don't address all the linter issues in the affected files, otherwise it's going to be a way higher effort.

Just realised that we need to fix this in the https://github.com/elastic/elastic-agent-libs repository too (10 files).
I think we should prioritize the input side of Beats when consuming HTTP endpoints, especially for Metricbeat. During EAH I talked with some folks from platform SRE who were complaining about the memory usage of the "agent", which was mostly consuming metrics while running as a pod inside MKI.
Given the impact that @rdner showed here, I think we should start by replacing that part of the code, then talk with SRE-o11y and canary it to measure the impact.
@alexsapran I was planning to contribute to Go itself, replacing the ReadAll implementation there with a more efficient one. Then it's just a matter of updating to the next Go version and gaining all this performance for free. I'll keep you updated here.
@rdner I'm browsing around looking at perf issues... any progress on the fix to Go itself? Or should we go ahead and fix our ReadAll usages and get those performance gains soon? Or if we want to break this down further, are there particular usages which may give us a better ROI on our time spent refactoring?
I think we've already got SRE-o11y ready to monitor any perf gains, as Alex mentioned above.
Hey @rowlandgeoff, as @rdner is currently focused on our CI reliability, he hasn't made any progress on this for a while. If we can have your team's help to automate the triaging he is doing and fix some of the issues (not related to the tests themselves) we are having, it will allow us to come back to this important topic.
I found the optimizations discussed here to be incredibly insightful. Great work on the findings and investigation! Considering the current development cycle, Go 1.23 is frozen for the August release. Therefore, the most realistic timeframe for these changes to be included would be the Go 1.24 release in February 2025, assuming a CL with the improvements is approved and merged by then. The issue at go.dev/issues/50774 seems to have been stale for two years, so it will likely remain that way unless someone pushes the effort forward.
Readers and writers are excellent abstractions and two of the three most important interfaces in Go (alongside the empty interface). However, covering and optimizing all possible implementations and permutations of these is very difficult and time-consuming, often involving trade-offs. The Go standard library leverages these abstractions extensively, and if a particular case presents too many downsides, it is preferable to maintain stable performance in the standard library rather than optimize for a specific scenario.
That said, I would advise not waiting for this to land in the stdlib; the performance gains seem more than enough to justify keeping an optimized version of io.ReadAll around.
I spent some time looking into this (probably more than I should have), and the conclusion I reached is a big "it depends".
It is clear that io.ReadAll uses more resources than bytes.Buffer + io.Copy in general, but this is not always the case.
Most Readers in the standard library implement io.WriterTo; I'll leave a list here for posterity. This seems to be the main optimization that makes bytes.Buffer + io.Copy fast. The io.ReadAll function cannot take advantage of it.
Third-party io.Readers might not implement this optimization, and that makes reads of < 512 bytes about 4x slower, because io.ReadAll preallocates a 512-byte buffer. For some use cases you might want faster performance in this scenario. That makes me skeptical about replacing io.ReadAll everywhere, because we might not know the underlying reader and context. But in general it seems safe to do this replacement as long as we pay attention to these details.
```
$ cd $GOROOT/api; rg '\) WriteTo\(' *.txt
go1.22.txt
118:pkg net, method (*TCPConn) WriteTo(io.Writer) (int64, error) #58808
125:pkg os, method (*File) WriteTo(io.Writer) (int64, error) #58808
go1.5.txt
585:pkg go/types, method (*Scope) WriteTo(io.Writer, int, bool)
go1.8.txt
211:pkg net, method (*Buffers) WriteTo(io.Writer) (int64, error)
go1.txt
197:pkg bytes, method (*Buffer) WriteTo(io.Writer) (int64, error)
4428:pkg net, method (*IPConn) WriteTo([]uint8, Addr) (int, error)
4479:pkg net, method (*UDPConn) WriteTo([]uint8, Addr) (int, error)
4498:pkg net, method (*UnixConn) WriteTo([]uint8, Addr) (int, error)
5756:pkg runtime/pprof, method (*Profile) WriteTo(io.Writer, int) error
go1.1.txt
29:pkg bufio, method (*Reader) WriteTo(io.Writer) (int64, error)
37:pkg bufio, method (ReadWriter) WriteTo(io.Writer) (int64, error)
47:pkg bytes, method (*Reader) WriteTo(io.Writer) (int64, error)
2636:pkg strings, method (*Reader) WriteTo(io.Writer) (int64, error)
```
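The mechanism is visible in io.Copy itself; paraphrasing the standard library, it roughly does this before falling back to a plain Read/Write loop:

```go
// copyish mirrors what io.Copy does internally (paraphrased sketch).
func copyish(dst io.Writer, src io.Reader) (int64, error) {
	// Fast path 1: the source can write itself out directly
	// (*bytes.Reader, *bufio.Reader, *os.File on Go 1.22+, ...).
	if wt, ok := src.(io.WriterTo); ok {
		return wt.WriteTo(dst)
	}
	// Fast path 2: the destination can pull data in directly
	// (*bytes.Buffer implements io.ReaderFrom, for instance).
	if rf, ok := dst.(io.ReaderFrom); ok {
		return rf.ReadFrom(src)
	}
	// Slow path: shuttle bytes through an intermediate 32KB buffer.
	return io.CopyBuffer(dst, src, make([]byte, 32*1024))
}
```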
For smaller reads (up to 512 bytes) io.ReadAll is faster, since it preallocates a single 512-byte chunk of memory by default and does not need to grow it until it needs more space. The downside is that it will always allocate 512 bytes even if the io.Reader you have only yields 32 bytes of data; it trades a slightly larger memory footprint for CPU.
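For reference, io.ReadAll's implementation is short (lightly paraphrased from the Go standard library source):

```go
func ReadAll(r io.Reader) ([]byte, error) {
	b := make([]byte, 0, 512) // the fixed 512-byte preallocation
	for {
		n, err := r.Read(b[len(b):cap(b)])
		b = b[:len(b)+n]
		if err != nil {
			if err == io.EOF {
				err = nil
			}
			return b, err
		}
		if len(b) == cap(b) {
			// Out of capacity: append one byte to force a grow, letting
			// append pick the new size, then trim the length back.
			b = append(b, 0)[:len(b)]
		}
	}
}
```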
Also, there are io.Readers and there are io.Readers. Some readers implement io.WriterTo, an optimization that allows the read to avoid an intermediary buffer during io.Copy. Some readers, let's call them "dumb readers", only implement Read. For io.ReadAll this optimization does not really matter, as it only relies on Read anyway. But if the source is a dumb reader, the performance of bytes.Buffer + io.Copy becomes much worse than io.ReadAll for < 512 bytes, since it cannot take advantage of the optimization:
```
BenchmarkReadAll/size_64B/io.ReadAll/WriterTo-32    7117320   166.3 ns/op    512 B/op   1 allocs/op
BenchmarkReadAll/size_64B/io.ReadAll/Reader-32      7319155   167.4 ns/op    512 B/op   1 allocs/op
BenchmarkReadAll/size_64B/ReadAll/WriterTo-32      10699069   113.4 ns/op    112 B/op   2 allocs/op
BenchmarkReadAll/size_64B/ReadAll/Reader-32         2122534   551.0 ns/op   1584 B/op   3 allocs/op
```
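The dumb-reader case in a benchmark like the one above can be simulated by wrapping any reader in a struct that only exposes Read, hiding io.WriterTo (an assumed harness detail, not necessarily the exact one used here):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// onlyReader embeds the io.Reader interface, so only Read is promoted;
// a type assertion for io.WriterTo fails even when the wrapped reader
// (here *bytes.Reader) implements it.
type onlyReader struct{ io.Reader }

func main() {
	var r io.Reader = bytes.NewReader(make([]byte, 64))
	_, ok := r.(io.WriterTo) // true: *bytes.Reader has WriteTo
	r = onlyReader{r}
	_, okWrapped := r.(io.WriterTo) // false: the wrapper hides it
	fmt.Println(ok, okWrapped)      // prints: true false
}
```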
So in conclusion, for readers yielding < 512 bytes: bytes.Buffer + io.Copy wins if the source implements io.WriterTo, while io.ReadAll wins against a dumb reader thanks to its 512-byte preallocation.
For bigger reads it does not really matter on the io.ReadAll side whether the reader is dumb or not, since io.ReadAll never takes advantage of the optimization. bytes.Buffer + io.Copy, on the other hand, can use io.WriterTo from the source and read it in one Go (no pun intended), which is a huge perf gain. If the reader is dumb, we rely on bytes.Buffer's growing algorithm, which is less picky than the runtime's append growth ratio:
```
BenchmarkReadAll/size_100KB/io.ReadAll/WriterTo-32   13582   88505 ns/op   514304 B/op   18 allocs/op
BenchmarkReadAll/size_100KB/io.ReadAll/Reader-32     13425   89588 ns/op   514304 B/op   18 allocs/op
BenchmarkReadAll/size_100KB/ReadAll/WriterTo-32      53968   22920 ns/op   106544 B/op    2 allocs/op
BenchmarkReadAll/size_100KB/ReadAll/Reader-32        22172   54702 ns/op   261680 B/op   10 allocs/op
```
In conclusion, for > 512 byte reads: bytes.Buffer + io.Copy is faster and allocates less whether the reader implements io.WriterTo or not.
Thanks @mauri870! So what should be our next step here? Should we look into the different places where we use io.ReadAll and assess whether the average allocation is smaller or bigger than 512 bytes?
Given that most readers in the standard library do implement io.WriterTo, it seems wise for us to take advantage of it whenever we can. Unfortunately, there is such a multitude of io.Readers in the stdlib and third-party packages that it is hard to track the details; this starts to blur into micro-optimization territory IMHO. Generally, when you deal with readers you just want to read from them and not spend time figuring out which kind of reader it is or how many bytes it yields.
This is one of the reasons it is very unlikely that this would land in the stdlib: there are too many corner cases and details, and there is no way to make it faster in all scenarios; it is very dependent on the situation.
My plan is to add this optimized ReadAll to elastic-agent-libs alongside tests and benchmarks, like what @rdner did with the HTTP-optimized ReadAll. But in order to actually use it we need to identify places where it has no downsides. That means careful analysis of the surrounding context, and preferably a benchmark to confirm it is actually a speedup.
For example:
```go
f, _ := os.Open("x.txt")
b, _ := io.ReadAll(f)
```
We could just blindly replace this with bytes.Buffer + io.Copy, but before Go 1.22 os.File did not implement io.WriterTo. That means if the file was < 512 bytes, this change would make the read about 4x slower.
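A minimal sketch of what such an optimized ReadAll could look like, assuming we only fast-path sources that declare io.WriterTo (hypothetical; the real version lives in the elastic-agent-libs PR below):

```go
// ReadAll reads r until EOF. Sources that implement io.WriterTo are
// drained through a bytes.Buffer to use their fast path; everything
// else falls back to io.ReadAll and its 512-byte preallocation.
func ReadAll(r io.Reader) ([]byte, error) {
	if _, ok := r.(io.WriterTo); ok {
		var buf bytes.Buffer
		if _, err := io.Copy(&buf, r); err != nil {
			return nil, err
		}
		return buf.Bytes(), nil
	}
	return io.ReadAll(r)
}
```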
I believe this issue has deviated slightly from its original purpose. The initial goal was to improve the usage of io.ReadAll in the libbeat Elasticsearch client, which has already been addressed in https://github.com/elastic/beats/pull/36275. This is confirmed by the benchmarks introduced in https://github.com/elastic/beats/pull/40885.
While there are still many instances of io.ReadAll across multiple projects, I think it's impractical to replace all occurrences with a buffer without first confirming that doing so doesn't introduce a slowdown in the scenarios previously mentioned. I've added an optimized version of ReadAll with accompanying benchmarks: https://github.com/elastic/elastic-agent-libs/pull/229.
If we want to spend more time investigating each occurrence, we should open a separate issue for that. Perhaps those more familiar with the codebase can highlight areas where these optimizations would provide significant benefits, allowing us to focus our efforts more effectively.
Ack. Thank you @mauri870. @alexsapran @jeniawhite @rdner - any specific places we need to focus on, or should we close this issue as done? (or both ;) )
Given the focus on the OTel, it's safe to say we can close this.
While we were working on the offsite and analyzing some of the memory profiles of Filebeat, the output to ES was consuming memory for parsing the buffer from the _bulk response.

Looking further into it, we see that when we are executing the bulk request and consuming the response we call ioutil.ReadAll(resp.Body), which is deprecated and consumes more memory, especially for big responses. We should replace this with a buffer, and maybe io.Copy, to copy the response body into a consumable, reusable buffer.

While this issue points to a specific case of the usage of ReadAll, taking a look into the codebase of other Beats it looks like we can optimize those as well.

The desired output is to write some Go benchmarks that compare the before and after.

/cc @AndersonQ @pierrehilbert