Iceber / iouring-go

Provides easy-to-use async IO interface with io_uring
MIT License

Slower reads using iouring than os for ReadFile type operations at all file sizes #27

Open johnsiilver opened 1 year ago

johnsiilver commented 1 year ago

Code discussed here can be found at: https://github.com/johnsiilver/iouringfile

Expectations: I expected faster Reads with io_uring, but found it to be slower in every benchmark. This could be due to multiple factors:

I could not get WithSQPoll() to work, as I documented in another bug. Most published io_uring benchmarks show significant savings without it, and multiples of that with it enabled. I found no examples in this repo that use it, and via Sourcegraph I could find no external repos that use it either: https://sourcegraph.com/search?q=context%3Aglobal+WithSQPoll%28%29&patternType=standard&sm=1&groupBy=repo

Here are various benchmarks I ran on an Amazon VM running Ubuntu (I don't have any bare-metal Linux boxes).

Initially I used runtime.LockOSThread() for my event loop, which yielded these results:

```
goos: linux
goarch: amd64
pkg: github.com/johnsiilver/iouringfile
BenchmarkReadFile/OS_1KiB-2            108146      11022 ns/op
BenchmarkReadFile/OS_16KiB-2            72103      16457 ns/op
BenchmarkReadFile/OS_32KiB-2            51840      23956 ns/op
BenchmarkReadFile/OS_64KiB-2            33656      36336 ns/op
BenchmarkReadFile/OS_128KiB-2           18320      66011 ns/op
BenchmarkReadFile/OS_512KiB-2            5438     239014 ns/op
BenchmarkReadFile/OS_1MiB-2              4478     277964 ns/op
BenchmarkReadFile/OS_10MiB-2              326    3560091 ns/op
BenchmarkReadFile/OS_100MiB-2              24   45336144 ns/op
BenchmarkReadFile/OS_1GiB-2                 2  533741636 ns/op
BenchmarkReadFile/IOURING1KiB-2         20305      50978 ns/op  <- Almost 5x slower
BenchmarkReadFile/IOURING_16KiB-2       19050      59990 ns/op
BenchmarkReadFile/IOURING_32KiB-2       18846      67469 ns/op
BenchmarkReadFile/IOURING_64KiB-2       14314      81480 ns/op
BenchmarkReadFile/IOURING_128KiB-2      10000     162159 ns/op
BenchmarkReadFile/IOURING_512KiB-2       2743     456180 ns/op
BenchmarkReadFile/IOURING_1MiB-2         1564     757464 ns/op
BenchmarkReadFile/IOURING_10MiB-2         278    4013378 ns/op
BenchmarkReadFile/IOURING_100MiB-2         25   42532022 ns/op
BenchmarkReadFile/IOURING_1GiB-2            2  627497596 ns/op  <- Still slower, but at least not 5x
```

Results improved after removing the lock:

```
goos: linux
goarch: amd64
pkg: github.com/johnsiilver/iouringfile
cpu: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
BenchmarkReadFile/OS_1KiB-2            104364      11190 ns/op
BenchmarkReadFile/OS_16KiB-2            70800      16552 ns/op
BenchmarkReadFile/OS_32KiB-2            48690      24940 ns/op
BenchmarkReadFile/OS_64KiB-2            29878      37414 ns/op
BenchmarkReadFile/OS_128KiB-2           17282      72963 ns/op
BenchmarkReadFile/OS_512KiB-2            4719     227736 ns/op
BenchmarkReadFile/OS_1MiB-2              3877     285898 ns/op
BenchmarkReadFile/OS_10MiB-2              326    3557621 ns/op
BenchmarkReadFile/OS_100MiB-2              24   44912433 ns/op
BenchmarkReadFile/OS_1GiB-2                 2  552437373 ns/op
BenchmarkReadFile/IOURING1KiB-2         56310      21982 ns/op  <- Almost 2x slower
BenchmarkReadFile/IOURING_16KiB-2       43414      28685 ns/op
BenchmarkReadFile/IOURING_32KiB-2       34561      34932 ns/op
BenchmarkReadFile/IOURING_64KiB-2       24840      45259 ns/op
BenchmarkReadFile/IOURING_128KiB-2      17257      84012 ns/op
BenchmarkReadFile/IOURING_512KiB-2       2972     375026 ns/op
BenchmarkReadFile/IOURING_1MiB-2
--- FAIL: BenchmarkReadFile/IOURING_1MiB-2
    benchmark_test.go:148: bad file descriptor
BenchmarkReadFile/IOURING_10MiB-2         273    4144841 ns/op
BenchmarkReadFile/IOURING_100MiB-2         25   42300929 ns/op
BenchmarkReadFile/IOURING_1GiB-2            2  632643157 ns/op
```

NOTE: On this one run I got a bad file descriptor error, which I suspect is a bug in iouring-go, but it hasn't happened again so I can't confirm it. It is also just as likely I did something silly.

I removed the event loop and pre-allocation of return channels and just did the iouring calls inside ReadFile itself, which yielded worse results:

```
goos: linux
goarch: amd64
pkg: github.com/johnsiilver/iouringfile
cpu: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
BenchmarkReadFile/OS_1KiB-2            105922      11093 ns/op
BenchmarkReadFile/OS_16KiB-2            73557      16656 ns/op
BenchmarkReadFile/OS_32KiB-2            53299      22797 ns/op
BenchmarkReadFile/OS_64KiB-2            36957      36118 ns/op
BenchmarkReadFile/OS_128KiB-2           19062      66646 ns/op
BenchmarkReadFile/OS_512KiB-2            5625     234702 ns/op
BenchmarkReadFile/OS_1MiB-2              3667     318104 ns/op
BenchmarkReadFile/OS_10MiB-2              325    3617478 ns/op
BenchmarkReadFile/OS_100MiB-2              25   43581482 ns/op
BenchmarkReadFile/OS_1GiB-2                 2  552909954 ns/op
BenchmarkReadFile/IOURING1KiB-2         11348      97162 ns/op  <- 9x slower, so by far my worst idea
BenchmarkReadFile/IOURING_16KiB-2       10000     111489 ns/op
BenchmarkReadFile/IOURING_32KiB-2       10104     109561 ns/op
BenchmarkReadFile/IOURING_64KiB-2        6470     186679 ns/op
BenchmarkReadFile/IOURING_128KiB-2       7903     220638 ns/op
BenchmarkReadFile/IOURING_512KiB-2       2329     614333 ns/op
BenchmarkReadFile/IOURING_1MiB-2         1212     896992 ns/op
BenchmarkReadFile/IOURING_10MiB-2         255    4715659 ns/op
BenchmarkReadFile/IOURING_100MiB-2         25   43412745 ns/op
BenchmarkReadFile/IOURING_1GiB-2            2  631509550 ns/op
PASS
ok github.com/johnsiilver/iouringfile 47.412s
```

So I went back to the previous version and measured allocations:

```
goos: linux
goarch: amd64
pkg: github.com/johnsiilver/iouringfile
cpu: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
BenchmarkReadFile/OS_1KiB-2            106183      11050 ns/op        1560 B/op   6 allocs/op
BenchmarkReadFile/OS_16KiB-2            76435      15818 ns/op       18840 B/op   6 allocs/op
BenchmarkReadFile/OS_32KiB-2            54651      23067 ns/op       41368 B/op   6 allocs/op
BenchmarkReadFile/OS_64KiB-2            37386      32897 ns/op       74136 B/op   6 allocs/op
BenchmarkReadFile/OS_128KiB-2           20600      67160 ns/op      139672 B/op   6 allocs/op
BenchmarkReadFile/OS_512KiB-2            5833     213830 ns/op      532888 B/op   6 allocs/op
BenchmarkReadFile/OS_1MiB-2              3462     339948 ns/op     1057176 B/op   6 allocs/op
BenchmarkReadFile/OS_10MiB-2              332    3739868 ns/op    10494360 B/op   6 allocs/op
BenchmarkReadFile/OS_100MiB-2              24   44693581 ns/op   104866200 B/op   6 allocs/op
BenchmarkReadFile/OS_1GiB-2                 2  556463702 ns/op  1073750424 B/op   6 allocs/op
BenchmarkReadFile/IOURING1KiB-2         54560      21808 ns/op        1904 B/op  12 allocs/op
BenchmarkReadFile/IOURING_16KiB-2       39284      30109 ns/op       17264 B/op  12 allocs/op
BenchmarkReadFile/IOURING_32KiB-2       32830      34880 ns/op       33648 B/op  12 allocs/op
BenchmarkReadFile/IOURING_64KiB-2       24984      45831 ns/op       66416 B/op  12 allocs/op
BenchmarkReadFile/IOURING_128KiB-2      15892      70437 ns/op      131952 B/op  12 allocs/op
BenchmarkReadFile/IOURING_512KiB-2       5031     278818 ns/op      525169 B/op  12 allocs/op
BenchmarkReadFile/IOURING_1MiB-2         1681     725118 ns/op     1049459 B/op  12 allocs/op
BenchmarkReadFile/IOURING_10MiB-2         313    3600860 ns/op    10486647 B/op  12 allocs/op
BenchmarkReadFile/IOURING_100MiB-2         25   43821402 ns/op   104858486 B/op  12 allocs/op
BenchmarkReadFile/IOURING_1GiB-2            2  629058212 ns/op  1073742704 B/op  12 allocs/op
```

Byte allocations are slightly lower on the io_uring side, but nothing to write home about once you get past 1KiB.

I noticed that another open issue states that nothing in the examples directory runs faster than the equivalent system tools. Since those tools are written in C and have been optimized for years, io_uring may not be the deciding factor there.

But in this case the comparison is Go (epoll) versus Go (io_uring).

If you spot an obvious mistake or a knowledge gap on my part here, I'd love to hear it. And I completely understand if you are no longer interested in this or don't have time to work on it; if so, I'll be glad to go away :)

Cheers and thanks for making your implementation available and open source!

eatonphil commented 1 year ago

I am seeing similarly slower writes using this library compared to Go's built-in WriteAt/pwrite support.

I've got a benchmark that just writes 100,000 4KB chunks to disk. io_uring can get close, but not as good as simply having 10 worker goroutines each writing out 4KB chunks at a time.

Source: https://github.com/eatonphil/go-io-playground

```
┌───────────────────────────────────────────┬───────────────────────┬────────────┐
│                  method                   │       avg_time        │ throughput │
│                  varchar                  │        varchar        │  varchar   │
├───────────────────────────────────────────┼───────────────────────┼────────────┤
│ 10_goroutines_io_uring_pwrite_1_entry     │ 0.1633701s            │ 2.5GB/s    │
│ 10_goroutines_io_uring_pwrite_100_entries │ 0.0394018s            │ 10.4GB/s   │
│ buf                                       │ 0.257824s             │ 1.5GB/s    │
│ 10_goroutines_pwrite                      │ 0.025590199999999997s │ 16.0GB/s   │
│ nobuf                                     │ 0.2549947s            │ 1.6GB/s    │
└───────────────────────────────────────────┴───────────────────────┴────────────┘
```
eatonphil commented 1 year ago

I'm no longer able to reproduce this. I'm currently getting:

```
┌───────────────────────────────────────────┬──────────────────────┬────────────┐
│                  method                   │       avg_time       │ throughput │
│                  varchar                  │       varchar        │  varchar   │
├───────────────────────────────────────────┼──────────────────────┼────────────┤
│ 10_goroutines_io_uring_pwrite_1_entry     │ 0.5154280000000001s  │ 794.7MB/s  │
│ blocking                                  │ 0.2821061s           │ 1.4GB/s    │
│ 10_goroutines_pwrite                      │ 0.32600510000000005s │ 1.2GB/s    │
│ 10_goroutines_io_uring_pwrite_512_entries │ 0.35256730000000003s │ 1.1GB/s    │
└───────────────────────────────────────────┴──────────────────────┴────────────┘
```

io_uring is still slower than even blocking IO. I'll keep looking into it.

I was able to find some more bugs in my code while porting this to Zig.

eatonphil commented 1 year ago

I ported my Go benchmark to Zig and I get basically the same results in both languages.

So I don't think this is the fault of this library. Not that OP was necessarily saying it was, either.