DataDog / glommio

Glommio is a thread-per-core crate that makes writing highly parallel asynchronous applications easier for rustaceans.

What's the problem with performance, implementation or usage? #641

Open · TheLudlows opened this issue 4 months ago

TheLudlows commented 4 months ago

https://github.com/bytedance/monoio/blob/master/docs/en/benchmark.md

As described in the test.

[benchmark results image]

glommer commented 4 months ago

Likely implementation. For example, I have made the explicit decision of keeping a hash of the completion entries instead of just unsafely casting their addresses.

That's a decision I don't regret, since early io_uring code was full of issues that essentially led to wrong addresses being added there, completions disappearing, etc. I am sure it's less of an issue now, but I have never done the work to methodically go chase performance issues.
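
To illustrate the trade-off: below is a minimal sketch, not glommio's actual code (`Reactor`, `submit`, and `complete` are hypothetical names), of a reactor that keeps completions in a map keyed by `user_data`, with the unsafe pointer-casting alternative shown only in a comment.

```rust
use std::collections::HashMap;
use std::task::Waker;

// Hypothetical reactor state: io_uring gets a small integer key as
// `user_data`, and completions are looked up in this map.
struct Reactor {
    next_id: u64,
    pending: HashMap<u64, Waker>,
}

impl Reactor {
    // Register an in-flight operation; the returned id is stored in the
    // SQE's user_data field.
    fn submit(&mut self, waker: Waker) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.pending.insert(id, waker);
        id
    }

    // Handle a CQE. A stale or corrupted user_data simply misses the map
    // and becomes a no-op. The faster unsafe alternative casts user_data
    // straight back to a pointer, e.g.
    //   let op = unsafe { &mut *(user_data as *mut Op) };
    // which skips the hash lookup but turns a bad CQE into memory
    // corruption.
    fn complete(&mut self, user_data: u64) {
        if let Some(waker) = self.pending.remove(&user_data) {
            waker.wake();
        }
    }
}
```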

TheLudlows commented 4 months ago

It looks like it's almost 30% worse; is there any way to improve it?

vlovich commented 3 months ago

Discussed previously in #554. Probably just needs someone to run things under a profiler to figure it out. I know that monoio relies on nightly features (e.g. fast thread locals), and that could also be contributing to the performance difference (it no longer requires nightly, but the benchmarks are run against nightly with that feature on).
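
For context on the thread-local point, here is a rough sketch of the difference, assuming it refers to the nightly `#[thread_local]` attribute (the counter and function names are made up):

```rust
use std::cell::Cell;

// Stable Rust: the `thread_local!` macro goes through a lazy-init check
// and an accessor closure on every access.
thread_local! {
    static OPS: Cell<u64> = Cell::new(0);
}

fn bump() -> u64 {
    OPS.with(|c| {
        c.set(c.get() + 1);
        c.get()
    })
}

// Nightly Rust (crate root needs `#![feature(thread_local)]`): the
// attribute lowers to a direct TLS slot, so the hot path has no
// lazy-initialization branch:
//
// #[thread_local]
// static OPS: Cell<u64> = Cell::new(0);
//
// fn bump() -> u64 {
//     OPS.set(OPS.get() + 1);
//     OPS.get()
// }
```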

Would be interesting to hear from @ihciah if he can give a high-level guess as to whether he intentionally did something differently with monoio to get higher perf.

bryandmc commented 3 months ago

Their implementation also uses fastpoll, which is much better than the poll+read sequence we do, if I am remembering all this correctly. Honestly, the way to resolve this now is probably to redo sockets with all the new io_uring features that have become available, which I would take to include buffer select, buffer rings, fastpoll, and zero-copy where possible. Also worth checking the Semaphore implementation, as mentioned in #554.
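
To make the fastpoll point concrete, here is a minimal sketch using the `io-uring` and `libc` crates (an assumption on my part; this is not glommio's internal API). On kernels advertising `IORING_FEAT_FAST_POLL`, a `Recv` can be submitted directly and the kernel arms an internal poll if the socket is not yet ready, whereas the older pattern needs a `PollAdd` completion before the read is even issued:

```rust
use io_uring::{opcode, types, IoUring};
use std::os::unix::io::RawFd;

// Submit one receive on `fd`, picking the strategy based on kernel support.
fn submit_recv(ring: &mut IoUring, fd: RawFd, buf: &mut [u8]) -> std::io::Result<()> {
    if ring.params().is_feature_fast_poll() {
        // One SQE: if the socket isn't ready, the kernel polls it
        // internally and runs the recv when data arrives (fastpoll).
        let recv = opcode::Recv::new(types::Fd(fd), buf.as_mut_ptr(), buf.len() as u32)
            .build()
            .user_data(1);
        unsafe { ring.submission().push(&recv).expect("queue full") };
    } else {
        // Two SQEs across two round trips: arm POLLIN first, and only
        // push the actual Recv after the poll completion fires.
        let poll = opcode::PollAdd::new(types::Fd(fd), libc::POLLIN as u32)
            .build()
            .user_data(2);
        unsafe { ring.submission().push(&poll).expect("queue full") };
        // ...the Recv SQE would be pushed from the completion handler.
    }
    ring.submit()?;
    Ok(())
}
```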

vlovich commented 3 months ago

@bryandmc do you know if this impacts disk I/O performance at all or if this is just issues in the net stack?

bryandmc commented 3 months ago

@vlovich from an io_uring perspective, we already do most (if not all?) of the things that ensure fast disk reads. Additional performance could be obtained through profiling, etc., but unlike the net stack I don't think there are features we have left on the table that we aren't currently using. Because of that, I would defer to @glommer's explanation, which is simply that it hasn't been optimized at all. There are probably some easy performance wins for anyone with a little time and a profiler.