trying something out here: https://github.com/elastic/beats/pull/36215
Let's get it straight first:
Looking further into it, we see that when we execute the bulk request and consume the response we call ioutil.ReadAll(resp.Body), which is deprecated and consumes more memory, especially for big responses.
The only reason it's deprecated is that it was moved to another package; it's now io.ReadAll.
We should replace this with a bytes.Buffer and maybe io.Copy to copy the response body into a consumable, reusable buffer.
Okay, I was not sure about this and collected some data:
io.ReadAll allocates a buffer of 512 bytes and then re-allocates it to a larger size if the content exceeds this size. I believe in most of our cases it would probably re-allocate.

Using io.Copy and bytes.Buffer would work in a similar way: the first time you write to a bytes.Buffer it allocates a slice to accommodate the currently written chunk; if the chunk is smaller than 64 bytes, the minimum allocation is 64 bytes. The next written chunk would probably cause another allocation. However, the buffer-growing algorithm is quite sophisticated and probably more efficient in some edge cases.
To prove the hypothesis I created this benchmark:
```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"io/ioutil"
	"testing"
)

func BenchmarkReadBytes(b *testing.B) {
	sizes := []int{
		100,         // 100 bytes
		100 * 1024,  // 100KB
		1024 * 1024, // 1MB
	}
	for _, size := range sizes {
		b.Run(fmt.Sprintf("size: %d", size), func(b *testing.B) {
			// emulate a file or an HTTP response
			generated := bytes.Repeat([]byte{'a'}, size)
			content := bytes.NewReader(generated)
			b.ResetTimer()

			b.Run("deprecated ioutil.ReadAll", func(b *testing.B) {
				for i := 0; i < b.N; i++ {
					content.Seek(0, io.SeekStart) // reset
					data, err := ioutil.ReadAll(content)
					if err != nil {
						b.Fatal(err)
					}
					if len(data) != size {
						b.Fatalf("size does not match, expected %d, actual %d", size, len(data))
					}
				}
			})

			b.Run("bytes.Buffer+io.Copy", func(b *testing.B) {
				for i := 0; i < b.N; i++ {
					content.Seek(0, io.SeekStart) // reset
					buf := bytes.NewBuffer(nil)
					_, err := io.Copy(buf, content)
					if err != nil {
						b.Fatal(err)
					}
					if buf.Len() != size {
						b.Fatalf("size does not match, expected %d, actual %d", size, buf.Len())
					}
				}
			})
		})
	}
}
```
And indeed, bytes.Buffer + io.Copy is significantly faster on every size I tested (100 bytes, 100KB, 1MB), and it allocates almost 5 times less memory:
```
go test -run=none -bench='BenchmarkReadBytes/' -benchmem -benchtime=1000x
goos: darwin
goarch: arm64
pkg: github.com/rdner/tar-test
BenchmarkReadBytes/size:_100/deprecated_ioutil.ReadAll-10        1000   226.6 ns/op       512 B/op    1 allocs/op
BenchmarkReadBytes/size:_100/bytes.Buffer+io.Copy-10             1000   173.9 ns/op       160 B/op    2 allocs/op
BenchmarkReadBytes/size:_102400/deprecated_ioutil.ReadAll-10     1000   55907 ns/op    514308 B/op   18 allocs/op
BenchmarkReadBytes/size:_102400/bytes.Buffer+io.Copy-10          1000    6875 ns/op    106544 B/op    2 allocs/op
BenchmarkReadBytes/size:_1048576/deprecated_ioutil.ReadAll-10    1000  349308 ns/op   5241107 B/op   27 allocs/op
BenchmarkReadBytes/size:_1048576/bytes.Buffer+io.Copy-10         1000   72474 ns/op   1048626 B/op    2 allocs/op
PASS
ok  github.com/rdner/tar-test 0.700s
```
Even a simple replacement of ioutil.ReadAll (io.ReadAll) with bytes.Buffer + io.Copy (no buffer re-use) can give us a significant boost in both CPU and memory performance. So, it's worth spending time on this change.
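For illustration, the simple replacement would look roughly like this (a hypothetical helper, not the actual libbeat code):

```go
import (
	"bytes"
	"io"
	"net/http"
)

// readBody drains an HTTP response body through a bytes.Buffer instead of
// the deprecated ioutil.ReadAll (illustrative sketch only).
func readBody(resp *http.Response) ([]byte, error) {
	var buf bytes.Buffer
	// io.Copy lets the source use an io.WriterTo/io.ReaderFrom fast path
	// when one is available, and bytes.Buffer grows less wastefully.
	if _, err := io.Copy(&buf, resp.Body); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```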
Additionally, we can also investigate places where we know a fixed buffer size and can re-use the same buffer multiple times, like we do in filestream fingerprinting (see the sketch below). However, this is more effort and will take significantly more time.
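A sketch of that reuse pattern, assuming a pooled buffer (hypothetical helper, not the filestream code):

```go
import (
	"bytes"
	"io"
	"sync"
)

// bufPool recycles buffers between reads, so a warm pool allocates nothing.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// readPooled drains r into a pooled buffer and hands the bytes to consume.
// consume must not retain the slice: the buffer is recycled afterwards.
func readPooled(r io.Reader, consume func([]byte) error) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // keep the grown capacity, drop old contents
	defer bufPool.Put(buf)
	if _, err := io.Copy(buf, r); err != nil {
		return err
	}
	return consume(buf.Bytes())
}
```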
@rdner well done, great summary.
I estimated how big of a change it would be:
- ioutil.ReadAll currently has 40 affected files: https://github.com/search?q=repo%3Aelastic%2Fbeats%20%22ioutil.ReadAll%22&type=code
- io.ReadAll currently has 25 affected files: https://github.com/search?q=repo%3Aelastic%2Fbeats+%22io.ReadAll%22&type=code
- ReadAll just for HTTP responses is handled in https://github.com/elastic/elastic-agent-libs/pull/183. Once it's merged and we update to this new version of elastic-agent-libs, it's just 65 files.

It's medium effort if we don't address all the linter issues in the affected files, otherwise it's going to be a way higher effort.

Just realised that we need to fix this in the https://github.com/elastic/elastic-agent-libs repository too (10 files).
I think we should prioritize the input side of Beats when consuming HTTP endpoints, especially for Metricbeat. During EAH I talked with some folks from platform SRE who were complaining about the memory usage of the "agent", which was mostly consuming metrics while running as a pod inside MKI.
Given the impact that @rdner showed here, I think we should start by replacing that part of the code, then talk with SRE-o11y and canary it to measure the impact.
@alexsapran I was planning to contribute to Go itself, replacing the ReadAll implementation there with a more efficient one. Then it's just a matter of updating to the next Go version and gaining all this performance for free. I'll keep you updated here.
@rdner I'm browsing around looking at perf issues... any progress on the fix to Go itself? Or should we go ahead and fix our ReadAll usages and get those performance gains soon? Or if we want to break this down further, are there particular usages which may give us a better ROI on our time spent refactoring?
I think we've already got SRE-o11y ready to monitor any perf gains, as Alex mentioned above.
Hey @rowlandgeoff, as @rdner is currently focused on our CI reliability, he hasn't made any progress on this for a while. If we can have your team's help to automate the triaging he is doing and fix some of the issues (not related to the tests themselves) we are having, it will allow us to come back to this important topic.
I found the optimizations discussed here to be incredibly insightful. Great work on the findings and investigation! Considering the current development cycle, Go 1.23 is frozen for the August release. Therefore, the most realistic timeframe for these changes to be included would be the Go 1.24 release in February 2025, assuming a CL with the improvements is approved and merged by then. The issue at go.dev/issues/50774 seems to have been stale for two years, so it will likely remain that way unless someone pushes the effort forward.
Readers and writers are excellent abstractions and two of the three most important interfaces in Go (alongside the empty interface). However, covering and optimizing all possible implementations and permutations of these is very difficult and time-consuming, often involving trade-offs. The Go standard library leverages these abstractions extensively, and if a particular case presents too many downsides, it is preferable to maintain stable performance in the standard library rather than optimize for a specific scenario.
That said, I would advise not waiting for this to land in the stdlib; the performance gains seem more than enough to justify keeping an optimized version of io.ReadAll around.
I spent some time looking into this (probably more than I should have), and the conclusion I reached is a big "it depends".
It is clear that io.ReadAll uses more resources than bytes.Buffer + io.Copy in general, but this is not always the case.
Most Readers in the standard library implement io.WriterTo; I'll leave a list here for posterity. This seems to be the main optimization that makes bytes.Buffer + io.Copy fast. The io.ReadAll function cannot take advantage of it.
Third-party io.Readers might not implement this optimization, and that makes reads of < 512 bytes about 4x slower, because io.ReadAll preallocates a 512-byte buffer. For some use cases you might want faster performance in this scenario. That makes me skeptical about replacing io.ReadAll everywhere, because we might not know the underlying reader and context. But in general it seems safe to do this replacement as long as we pay attention to these details.
```
$ cd $GOROOT/api; rg '\) WriteTo\(' *.txt
go1.22.txt
118:pkg net, method (*TCPConn) WriteTo(io.Writer) (int64, error) #58808
125:pkg os, method (*File) WriteTo(io.Writer) (int64, error) #58808
go1.5.txt
585:pkg go/types, method (*Scope) WriteTo(io.Writer, int, bool)
go1.8.txt
211:pkg net, method (*Buffers) WriteTo(io.Writer) (int64, error)
go1.txt
197:pkg bytes, method (*Buffer) WriteTo(io.Writer) (int64, error)
4428:pkg net, method (*IPConn) WriteTo([]uint8, Addr) (int, error)
4479:pkg net, method (*UDPConn) WriteTo([]uint8, Addr) (int, error)
4498:pkg net, method (*UnixConn) WriteTo([]uint8, Addr) (int, error)
5756:pkg runtime/pprof, method (*Profile) WriteTo(io.Writer, int) error
go1.1.txt
29:pkg bufio, method (*Reader) WriteTo(io.Writer) (int64, error)
37:pkg bufio, method (ReadWriter) WriteTo(io.Writer) (int64, error)
47:pkg bytes, method (*Reader) WriteTo(io.Writer) (int64, error)
2636:pkg strings, method (*Reader) WriteTo(io.Writer) (int64, error)
```
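The mechanism is visible in io.Copy itself; paraphrasing the standard library, it roughly does this before falling back to a plain Read/Write loop:

```go
// copyish mirrors what io.Copy does internally (paraphrased sketch).
func copyish(dst io.Writer, src io.Reader) (int64, error) {
	// Fast path 1: the source can write itself out directly
	// (*bytes.Reader, *bufio.Reader, *os.File on Go 1.22+, ...).
	if wt, ok := src.(io.WriterTo); ok {
		return wt.WriteTo(dst)
	}
	// Fast path 2: the destination can pull data in directly
	// (*bytes.Buffer implements io.ReaderFrom, for instance).
	if rf, ok := dst.(io.ReaderFrom); ok {
		return rf.ReadFrom(src)
	}
	// Slow path: shuttle bytes through an intermediate 32KB buffer.
	return io.CopyBuffer(dst, src, make([]byte, 32*1024))
}
```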
For smaller reads (up to 512 bytes) io.ReadAll is faster, since it preallocates a single 512-byte chunk of memory by default and does not need to grow it until it needs more space. The downside is that it will always allocate 512 bytes even if the io.Reader you have only yields 32 bytes of data; it trades a slightly larger memory footprint for CPU.
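For reference, io.ReadAll's implementation is short (lightly paraphrased from the Go standard library source):

```go
func ReadAll(r io.Reader) ([]byte, error) {
	b := make([]byte, 0, 512) // the fixed 512-byte preallocation
	for {
		n, err := r.Read(b[len(b):cap(b)])
		b = b[:len(b)+n]
		if err != nil {
			if err == io.EOF {
				err = nil
			}
			return b, err
		}
		if len(b) == cap(b) {
			// Out of capacity: append one byte to force a grow, letting
			// append pick the new size, then trim the length back.
			b = append(b, 0)[:len(b)]
		}
	}
}
```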
Also, there are io.Readers and there are io.Readers. Some readers implement io.WriterTo, an optimization that allows the read to avoid an intermediary buffer during io.Copy. Some readers, let's call them "dumb readers", only implement Read. For io.ReadAll this optimization does not really matter, as it only relies on Read anyway. But if the source is a dumb reader, the performance of bytes.Buffer + io.Copy becomes much worse than io.ReadAll for < 512 bytes, since it cannot take advantage of the optimization:
```
BenchmarkReadAll/size_64B/io.ReadAll/WriterTo-32    7117320   166.3 ns/op    512 B/op   1 allocs/op
BenchmarkReadAll/size_64B/io.ReadAll/Reader-32      7319155   167.4 ns/op    512 B/op   1 allocs/op
BenchmarkReadAll/size_64B/ReadAll/WriterTo-32      10699069   113.4 ns/op    112 B/op   2 allocs/op
BenchmarkReadAll/size_64B/ReadAll/Reader-32         2122534   551.0 ns/op   1584 B/op   3 allocs/op
```
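The dumb-reader case in a benchmark like the one above can be simulated by wrapping any reader in a struct that only exposes Read, hiding io.WriterTo (an assumed harness detail, not necessarily the exact one used here):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// onlyReader embeds the io.Reader interface, so only Read is promoted;
// a type assertion for io.WriterTo fails even when the wrapped reader
// (here *bytes.Reader) implements it.
type onlyReader struct{ io.Reader }

func main() {
	var r io.Reader = bytes.NewReader(make([]byte, 64))
	_, ok := r.(io.WriterTo) // true: *bytes.Reader has WriteTo
	r = onlyReader{r}
	_, okWrapped := r.(io.WriterTo) // false: the wrapper hides it
	fmt.Println(ok, okWrapped)      // prints: true false
}
```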
So in conclusion, for readers yielding < 512 bytes: bytes.Buffer + io.Copy wins if the source implements io.WriterTo, while io.ReadAll wins against a dumb reader thanks to its 512-byte preallocation.
For bigger reads it does not really matter on the io.ReadAll side whether the reader is dumb or not, since io.ReadAll never takes advantage of the optimization. bytes.Buffer + io.Copy, on the other hand, can use io.WriterTo from the source and read it in one Go (no pun intended), which is a huge perf gain. If the reader is dumb, we rely on bytes.Buffer's growing algorithm, which is less picky than the runtime's append growth ratio:
```
BenchmarkReadAll/size_100KB/io.ReadAll/WriterTo-32   13582   88505 ns/op   514304 B/op   18 allocs/op
BenchmarkReadAll/size_100KB/io.ReadAll/Reader-32     13425   89588 ns/op   514304 B/op   18 allocs/op
BenchmarkReadAll/size_100KB/ReadAll/WriterTo-32      53968   22920 ns/op   106544 B/op    2 allocs/op
BenchmarkReadAll/size_100KB/ReadAll/Reader-32        22172   54702 ns/op   261680 B/op   10 allocs/op
```
In conclusion, for > 512 byte reads: bytes.Buffer + io.Copy is faster and allocates less whether the reader implements io.WriterTo or not.
Thanks @mauri870! So what should be our next step here? Should we look into the different places where we use io.ReadAll and assess whether the average allocation is smaller or bigger than 512 bytes?
Given that most readers in the standard library do implement io.WriterTo, it seems wise for us to take advantage of it whenever we can. Unfortunately, there is such a multitude of io.Readers in the stdlib and third-party packages that it is hard to track the details; this starts to blur into micro-optimization territory IMHO. Generally, when you deal with readers you just want to read from them and not spend time figuring out which kind of reader it is or how many bytes it yields.
This is one of the reasons it is very unlikely that this would land in the stdlib: there are too many corner cases and details, and there is no way to make it faster in all scenarios; it is very dependent on the situation.
My plan is to add this optimized ReadAll to elastic-agent-libs alongside tests and benchmarks, like what @rdner did with the HTTP-optimized ReadAll. But in order to actually use it we need to identify places where it has no downsides. That means careful analysis of the surrounding context, and preferably a benchmark to confirm it is actually a speedup.
For example:
```go
f, _ := os.Open("x.txt")
b, _ := io.ReadAll(f)
```
We could just blindly replace this with bytes.Buffer + io.Copy, but before Go 1.22 os.File did not implement io.WriterTo. That means if the file was < 512 bytes, this change would make the read about 4x slower.
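A minimal sketch of what such an optimized ReadAll could look like, assuming we only fast-path sources that declare io.WriterTo (hypothetical; the real version lives in the elastic-agent-libs PR below):

```go
// ReadAll reads r until EOF. Sources that implement io.WriterTo are
// drained through a bytes.Buffer to use their fast path; everything
// else falls back to io.ReadAll and its 512-byte preallocation.
func ReadAll(r io.Reader) ([]byte, error) {
	if _, ok := r.(io.WriterTo); ok {
		var buf bytes.Buffer
		if _, err := io.Copy(&buf, r); err != nil {
			return nil, err
		}
		return buf.Bytes(), nil
	}
	return io.ReadAll(r)
}
```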
I believe this issue has deviated slightly from its original purpose. The initial goal was to improve the usage of io.ReadAll in the libbeat Elasticsearch client, which has already been addressed in https://github.com/elastic/beats/pull/36275. This is confirmed by the benchmarks introduced in https://github.com/elastic/beats/pull/40885.
While there are still many instances of io.ReadAll across multiple projects, I think it's impractical to replace all occurrences with a buffer without first confirming that doing so doesn't introduce a slowdown in the scenarios previously mentioned. I've added an optimized version of ReadAll with accompanying benchmarks: https://github.com/elastic/elastic-agent-libs/pull/229.
If we want to spend more time investigating each occurrence, we should open a separate issue for that. Perhaps those more familiar with the codebase can highlight areas where these optimizations would provide significant benefits, allowing us to focus our efforts more effectively.
Ack. Thank you @mauri870. @alexsapran @jeniawhite @rdner - any specific places we need to focus on, or should we close this issue as done? (or both ;) )
Given the focus on the OTel, it's safe to say we can close this.
While we were working on the offsite and analyzing some of the memory profiles of Filebeat, the output to ES was consuming memory for parsing the buffer from the _bulk response.

Looking further into it, we see that when we are executing the bulk request and consuming the response we call ioutil.ReadAll(resp.Body), which is deprecated and consumes more memory, especially for big responses. We should replace this with a buffer, and maybe io.Copy, to copy the response body into a consumable, reusable buffer.

While this issue points to a specific case of the usage of ReadAll, taking a look into the codebase of other Beats it looks like we can optimize those as well.

The desired output is to write some Go benchmarks that compare the before and after.

/cc @AndersonQ @pierrehilbert