Closed by marcus-pousette 8 months ago
Some obvious things I see right away
1).
This should instead be something like
Turns out that the below is not faster:

```ts
// At the top of the file
const allocUnsafeFn = (): ((len: number) => Uint8Array) => {
  if (globalThis.Buffer != null) {
    return globalThis.Buffer.allocUnsafe
  }
  return (len) => new Uint8Array(len)
}
const allocUnsafe = allocUnsafeFn()

// Later
const frame = allocUnsafe(HEADER_LENGTH)
```
2).
In the decoder, are we using the data length available in the header to allocate the right size for the Uint8ArrayList? It seems to currently only be used as a break condition.
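For illustration, a minimal sketch of what header-driven preallocation could look like, assuming the decoded header exposes a `length` field; `FrameHeader` and `readFrame` are hypothetical names for this example, not the actual yamux API:

```typescript
// Hypothetical sketch: allocate the data buffer once, using the length
// advertised in the frame header, instead of growing a list incrementally.
interface FrameHeader {
  length: number // data length advertised by the sender
}

function readFrame (header: FrameHeader, chunks: Iterable<Uint8Array>): Uint8Array {
  // allocate exactly once, using the size we already know from the header
  const data = new Uint8Array(header.length)
  let offset = 0
  for (const chunk of chunks) {
    const take = Math.min(chunk.byteLength, header.length - offset)
    data.set(chunk.subarray(0, take), offset)
    offset += take
    if (offset === header.length) break
  }
  return data
}

const frame = readFrame({ length: 5 }, [Uint8Array.from([1, 2, 3]), Uint8Array.from([4, 5, 6])])
// frame now contains [1, 2, 3, 4, 5]
```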
With PR #43 one of the most obvious things was addressed.
Still left: `stringifyHeader` in https://github.com/ChainSafe/js-libp2p-yamux/blob/4007dc4454a5d05df04254ca81e270376ea06f18/src/muxer.ts#L379

@marcus-pousette thanks for creating this issue; afaik yamux should have no performance penalties.
@marcopolo can you confirm? Additionally, once libp2p/test-plans has a benchmarking setup for js-libp2p, we should be able to observe this discrepancy between mplex and yamux, right?
I see, then it is in the implementation details I guess.
There is a benchmark setup in this lib, and the output also shows that there is a discrepancy
I am running the benchmark from master and get this
```
✔ yamux send and receive 1 0.0625KB chunks 15954.82 ops/s 62.67700 us/op x0.943 11068 runs 1.37 s
✔ yamux send and receive 1 1KB chunks 15674.96 ops/s 63.79600 us/op x0.991 6594 runs 0.820 s
✔ yamux send and receive 1 64KB chunks 12712.78 ops/s 78.66100 us/op x0.987 8240 runs 1.14 s
✔ yamux send and receive 1 1024KB chunks 2628.176 ops/s 380.4920 us/op x0.971 1667 runs 1.14 s
✔ yamux send and receive 1000 0.0625KB chunks 126.6361 ops/s 7.896644 ms/op x0.978 40 runs 0.831 s
✔ yamux send and receive 1000 1KB chunks 119.9003 ops/s 8.340262 ms/op x0.959 51 runs 0.940 s
✔ yamux send and receive 1000 64KB chunks 37.21731 ops/s 26.86922 ms/op x0.994 21 runs 1.08 s
✔ yamux send and receive 1000 1024KB chunks 3.227550 ops/s 309.8325 ms/op x1.002 6 runs 2.46 s
✔ mplex send and receive 1 0.0625KB chunks 13847.92 ops/s 72.21300 us/op x1.046 8967 runs 1.19 s
✔ mplex send and receive 1 1KB chunks 14239.74 ops/s 70.22600 us/op x1.045 8129 runs 1.03 s
✔ mplex send and receive 1 64KB chunks 12067.97 ops/s 82.86400 us/op x0.987 7406 runs 1.04 s
✔ mplex send and receive 1 1024KB chunks 3554.330 ops/s 281.3470 us/op x1.014 2787 runs 1.24 s
✔ mplex send and receive 1000 0.0625KB chunks 312.3817 ops/s 3.201212 ms/op x1.004 255 runs 1.34 s
✔ mplex send and receive 1000 1KB chunks 270.3823 ops/s 3.698467 ms/op x0.954 274 runs 1.54 s
✔ mplex send and receive 1000 64KB chunks 48.38998 ops/s 20.66543 ms/op x0.856 42 runs 1.39 s
✔ mplex send and receive 1000 1024KB chunks 3.216166 ops/s 310.9292 ms/op x0.851 4 runs 1.82 s
```
There is almost a 2x difference between yamux and mplex for some of the runs here.
I tried running the VSCode profiler on the benchmark tasks, but sadly the results are very opaque and I failed to get any real insights.
Yamux has historically been slower than mplex but it’s not had the same amount of profiling applied so there is almost certainly some low hanging fruit to be had.
@marcus-pousette thank you for looking into this
Yep! There should be some low-hanging fruit here.
I have not found anything critical yet, but...
It "feels" like `readHeader` and `readBytes` could be improved if the `consume` method returned the sliced header instead. Right now both `this.buffer.slice(...)` and `this.buffer.consume(...)` perform an equivalent iteration over the underlying list to do their work. Something like a `this.buffer.splice` would be interesting to see.
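As a rough illustration of that `splice` idea, the sketch below returns the first `length` bytes and drops them from the list in a single pass; `bufs` stands in for the Uint8ArrayList's internal chunk array, and the function is hypothetical, not part of the library:

```typescript
// Illustrative sketch: read and consume in one pass, instead of
// iterating once for slice() and again for consume().
function splice (bufs: Uint8Array[], length: number): Uint8Array {
  const out = new Uint8Array(length)
  let offset = 0
  while (offset < length && bufs.length > 0) {
    const head = bufs[0]
    const take = Math.min(head.byteLength, length - offset)
    out.set(head.subarray(0, take), offset)
    offset += take
    if (take === head.byteLength) {
      bufs.shift() // chunk fully consumed
    } else {
      bufs[0] = head.subarray(take) // keep the remainder
    }
  }
  return out
}

const list = [Uint8Array.from([1, 2, 3, 4])]
const header = splice(list, 2)
// header contains [1, 2]; list now holds a single chunk [3, 4]
```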
For the Uint8ArrayList-related things, we are doing a lot of `append`, `sublist` and `consume`, where the Uint8ArrayList (`this.buffer`) usually only contains one Uint8Array element. If you add a special-case implementation to Uint8ArrayList that performs better when N = 1, you can get roughly a 20% performance gain on `append`. I have not checked the other operations yet, but I assume `consume` will be faster as well, since we currently do a `shift()`:
https://github.com/achingbrain/uint8arraylist/blob/0adda9ad78c3db75bee73a63acce5aee8c5f0f76/src/index.ts#L167

The Uint8ArrayList could (perhaps) perform better on `consume` if the underlying list were a linked list, so that we don't need to do `shift()` at all.
In yamux we perhaps don't need to construct and pass arguments as objects for private functions that are invoked often.
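To illustrate the point, the two hypothetical variants below do the same work, but the object-taking one allocates a short-lived options object on every call; the names are made up for this example, not yamux internals:

```typescript
// Allocates a fresh object per call, which churns the young generation
// in hot paths.
function sendFrameObj (opts: { streamId: number, flag: number }): number {
  return opts.streamId | opts.flag
}

// Positional arguments: no per-call allocation.
function sendFrame (streamId: number, flag: number): number {
  return streamId | flag
}
```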
Logging does not seem to have too much effect on performance, overall.
Posting benchmark from running it today
codec benchmark
```
✔ frame header - encodeFrameHeader 8130081 ops/s 123.0000 ns/op - 1131753 runs 0.404 s
✔ frame header - encodeFrameHeaderNaive 2118644 ops/s 472.0000 ns/op - 2241054 runs 1.62 s
✔ frame header decodeHeader 9345794 ops/s 107.0000 ns/op - 1306895 runs 0.505 s
✔ frame header decodeHeaderNaive 2325581 ops/s 430.0000 ns/op - 2834015 runs 2.02 s
```
comparison benchmark
```
✔ yamux send and receive 1 0.0625KB chunks 13620.27 ops/s 73.42000 us/op - 9929 runs 1.41 s
✔ yamux send and receive 1 1KB chunks 14097.21 ops/s 70.93600 us/op - 6689 runs 0.924 s
✔ yamux send and receive 1 64KB chunks 11133.63 ops/s 89.81800 us/op - 5943 runs 0.950 s
✔ yamux send and receive 1 1024KB chunks 2604.350 ops/s 383.9730 us/op - 1658 runs 1.14 s
✔ yamux send and receive 1000 0.0625KB chunks 123.3899 ops/s 8.104389 ms/op - 40 runs 0.835 s
✔ yamux send and receive 1000 1KB chunks 112.0145 ops/s 8.927416 ms/op - 25 runs 0.735 s
✔ yamux send and receive 1000 64KB chunks 34.73366 ops/s 28.79052 ms/op - 53 runs 2.04 s
✔ yamux send and receive 1000 1024KB chunks 2.772760 ops/s 360.6515 ms/op - 10 runs 4.24 s
✔ mplex send and receive 1 0.0625KB chunks 14712.59 ops/s 67.96900 us/op - 9807 runs 1.22 s
✔ mplex send and receive 1 1KB chunks 14861.71 ops/s 67.28700 us/op - 12154 runs 1.32 s
✔ mplex send and receive 1 64KB chunks 13063.70 ops/s 76.54800 us/op - 6321 runs 0.820 s
✔ mplex send and receive 1 1024KB chunks 3554.229 ops/s 281.3550 us/op - 4373 runs 1.74 s
✔ mplex send and receive 1000 0.0625KB chunks 230.5917 ops/s 4.336670 ms/op - 189 runs 1.34 s
✔ mplex send and receive 1000 1KB chunks 206.2928 ops/s 4.847479 ms/op - 338 runs 2.16 s
✔ mplex send and receive 1000 64KB chunks 47.07421 ops/s 21.24305 ms/op - 21 runs 0.969 s
✔ mplex send and receive 1000 1024KB chunks 3.041689 ops/s 328.7647 ms/op - 10 runs 3.95 s
```
I still see a big difference in performance for many of the cases, perhaps especially for "send and receive 1000 0.0625KB".
The mplex implementation is using a "buffer pool" for its encoder, while this implementation is not. I wonder whether that could play a part here: there is a lot of allocation/deallocation of memory that only lives for a short amount of time.
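A rough sketch of what such a buffer pool could look like for the encoder, loosely modelled on Node's `Buffer.allocUnsafe` pooling; the size and names here are illustrative, not mplex's actual implementation:

```typescript
// Sketch: hand out views into a preallocated slab for short-lived frame
// buffers, so most frames cost no allocation at all.
const POOL_SIZE = 10 * 1024

let pool = new Uint8Array(POOL_SIZE)
let poolOffset = 0

function allocFromPool (len: number): Uint8Array {
  if (len > POOL_SIZE) {
    // too big for the slab: fall back to a dedicated allocation
    return new Uint8Array(len)
  }
  if (poolOffset + len > POOL_SIZE) {
    // slab exhausted: start a new one (outstanding views keep the old
    // slab alive until they are garbage collected)
    pool = new Uint8Array(POOL_SIZE)
    poolOffset = 0
  }
  const buf = pool.subarray(poolOffset, poolOffset + len)
  poolOffset += len
  return buf
}
```

The trade-off is that a small view can pin a whole slab in memory, which is why Node only pools allocations below a size threshold.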
We've done some real-world benchmarking which has led to some performance improvements and the results are that js-libp2p has the fastest streaming performance of the libp2p implementations tested - https://observablehq.com/@libp2p-workspace/performance-dashboard
I'm going to close this as yamux has closed the performance gap and has back pressure so is vastly preferable to mplex.
Please re-open if you're still seeing a serious degradation.
I am considering replacing all my dependencies on `@libp2p/mplex` with this yamux implementation in my project. However, I am experiencing a significant performance loss with this change. In my benchmark, where I do replication work between multiple nodes, I am able to do around 1200 tps with `@libp2p/mplex` but only around 700 with yamux.
I have not deep-dived into the details of mplex and yamux, but is it technically possible for a yamux implementation to reach the same performance as mplex? Or is the performance loss I am seeing the cost of the features yamux brings to the table?