Closed by marcus-pousette 8 months ago
Some obvious things I see right away
1).
This should instead be something like
Turns out that the below is not faster:

```ts
// At the top of the file
const allocUnsafeFn = (): ((len: number) => Uint8Array) => {
  if (globalThis.Buffer != null) {
    return globalThis.Buffer.allocUnsafe
  }
  return (len) => new Uint8Array(len)
}
const allocUnsafe = allocUnsafeFn()

// Later
const frame = allocUnsafe(HEADER_LENGTH)
```
2).
In the decoder, are we using the data length available in the header to allocate the right size for the Uint8ArrayList? It seems to currently only be used as a break condition.
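For illustration, a minimal sketch of what header-driven preallocation could look like, assuming the decoded header exposes a `length` field; `FrameHeader` and `readFrame` are hypothetical names for this example, not the actual yamux API:

```typescript
// Hypothetical sketch: allocate the data buffer once, using the length
// advertised in the frame header, instead of growing a list incrementally.
interface FrameHeader {
  length: number // data length advertised by the sender
}

function readFrame (header: FrameHeader, chunks: Iterable<Uint8Array>): Uint8Array {
  // allocate exactly once, using the size we already know from the header
  const data = new Uint8Array(header.length)
  let offset = 0
  for (const chunk of chunks) {
    const take = Math.min(chunk.byteLength, header.length - offset)
    data.set(chunk.subarray(0, take), offset)
    offset += take
    if (offset === header.length) break
  }
  return data
}

const frame = readFrame({ length: 5 }, [Uint8Array.from([1, 2, 3]), Uint8Array.from([4, 5, 6])])
// frame now contains [1, 2, 3, 4, 5]
```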
With PR #43 one of the most obvious things was addressed.
Still left: `stringifyHeader` in https://github.com/ChainSafe/js-libp2p-yamux/blob/4007dc4454a5d05df04254ca81e270376ea06f18/src/muxer.ts#L379

@marcus-pousette thanks for creating this issue; afaik yamux should have no performance penalties.
@marcopolo can you confirm? Additionally, once libp2p/test-plans has a benchmarking setup for js-libp2p, we should be able to observe this discrepancy between mplex and yamux, right?
I see, then it is in the implementation details I guess.
There is a benchmark setup in this lib, and the output also shows that there is a discrepancy
I am running the benchmark from master and get this
```
✔ yamux send and receive 1 0.0625KB chunks 15954.82 ops/s 62.67700 us/op x0.943 11068 runs 1.37 s
✔ yamux send and receive 1 1KB chunks 15674.96 ops/s 63.79600 us/op x0.991 6594 runs 0.820 s
✔ yamux send and receive 1 64KB chunks 12712.78 ops/s 78.66100 us/op x0.987 8240 runs 1.14 s
✔ yamux send and receive 1 1024KB chunks 2628.176 ops/s 380.4920 us/op x0.971 1667 runs 1.14 s
✔ yamux send and receive 1000 0.0625KB chunks 126.6361 ops/s 7.896644 ms/op x0.978 40 runs 0.831 s
✔ yamux send and receive 1000 1KB chunks 119.9003 ops/s 8.340262 ms/op x0.959 51 runs 0.940 s
✔ yamux send and receive 1000 64KB chunks 37.21731 ops/s 26.86922 ms/op x0.994 21 runs 1.08 s
✔ yamux send and receive 1000 1024KB chunks 3.227550 ops/s 309.8325 ms/op x1.002 6 runs 2.46 s
✔ mplex send and receive 1 0.0625KB chunks 13847.92 ops/s 72.21300 us/op x1.046 8967 runs 1.19 s
✔ mplex send and receive 1 1KB chunks 14239.74 ops/s 70.22600 us/op x1.045 8129 runs 1.03 s
✔ mplex send and receive 1 64KB chunks 12067.97 ops/s 82.86400 us/op x0.987 7406 runs 1.04 s
✔ mplex send and receive 1 1024KB chunks 3554.330 ops/s 281.3470 us/op x1.014 2787 runs 1.24 s
✔ mplex send and receive 1000 0.0625KB chunks 312.3817 ops/s 3.201212 ms/op x1.004 255 runs 1.34 s
✔ mplex send and receive 1000 1KB chunks 270.3823 ops/s 3.698467 ms/op x0.954 274 runs 1.54 s
✔ mplex send and receive 1000 64KB chunks 48.38998 ops/s 20.66543 ms/op x0.856 42 runs 1.39 s
✔ mplex send and receive 1000 1024KB chunks 3.216166 ops/s 310.9292 ms/op x0.851 4 runs 1.82 s
```
There is almost a 2x difference between yamux and mplex for some of the runs here.
I tried running the VSCode profiler on the benchmark tasks, but sadly the results are very opaque and I failed to get any real insights.
Yamux has historically been slower than mplex but it’s not had the same amount of profiling applied so there is almost certainly some low hanging fruit to be had.
@marcus-pousette thank you for looking into this
Yep! There should be some low-hanging fruit here.
I have not found anything critical yet, but...
It "feels" like `readHeader` and `readBytes` could be improved if the `consume` method returned the sliced header instead. Right now both `this.buffer.slice(...)` and `this.buffer.consume(...)` perform an equivalent iteration over the underlying list to do their work. Something like a `this.buffer.splice` would be interesting to see.
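As a rough illustration of that `splice` idea, the sketch below returns the first `length` bytes and drops them from the list in a single pass; `bufs` stands in for the Uint8ArrayList's internal chunk array, and the function is hypothetical, not part of the library:

```typescript
// Illustrative sketch: read and consume in one pass, instead of
// iterating once for slice() and again for consume().
function splice (bufs: Uint8Array[], length: number): Uint8Array {
  const out = new Uint8Array(length)
  let offset = 0
  while (offset < length && bufs.length > 0) {
    const head = bufs[0]
    const take = Math.min(head.byteLength, length - offset)
    out.set(head.subarray(0, take), offset)
    offset += take
    if (take === head.byteLength) {
      bufs.shift() // chunk fully consumed
    } else {
      bufs[0] = head.subarray(take) // keep the remainder
    }
  }
  return out
}

const list = [Uint8Array.from([1, 2, 3, 4])]
const header = splice(list, 2)
// header contains [1, 2]; list now holds a single chunk [3, 4]
```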
For the Uint8ArrayList-related things, we are doing a lot of `append`, `sublist` and `consume`, where the Uint8ArrayList (`this.buffer`) usually only contains one Uint8Array element. If you add a special-case implementation to Uint8ArrayList that performs better when N = 1, you can get roughly a 20% performance gain on `append`. I have not checked the other operations yet, but I assume `consume` will be faster as well, since we currently do a `shift()`:
https://github.com/achingbrain/uint8arraylist/blob/0adda9ad78c3db75bee73a63acce5aee8c5f0f76/src/index.ts#L167

The Uint8ArrayList could (perhaps) perform better on `consume` if the underlying list were a linked list, so that we don't need to do `shift()` at all.
In yamux we perhaps don't need to construct and pass arguments as objects for private functions that are invoked often.
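To illustrate the point, the two hypothetical variants below do the same work, but the object-taking one allocates a short-lived options object on every call; the names are made up for this example, not yamux internals:

```typescript
// Allocates a fresh object per call, which churns the young generation
// in hot paths.
function sendFrameObj (opts: { streamId: number, flag: number }): number {
  return opts.streamId | opts.flag
}

// Positional arguments: no per-call allocation.
function sendFrame (streamId: number, flag: number): number {
  return streamId | flag
}
```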
Logging does not seem to have too much effect on performance, overall.
Posting benchmark from running it today
codec benchmark
```
✔ frame header - encodeFrameHeader 8130081 ops/s 123.0000 ns/op - 1131753 runs 0.404 s
✔ frame header - encodeFrameHeaderNaive 2118644 ops/s 472.0000 ns/op - 2241054 runs 1.62 s
✔ frame header decodeHeader 9345794 ops/s 107.0000 ns/op - 1306895 runs 0.505 s
✔ frame header decodeHeaderNaive 2325581 ops/s 430.0000 ns/op - 2834015 runs 2.02 s
```
comparison benchmark
```
✔ yamux send and receive 1 0.0625KB chunks 13620.27 ops/s 73.42000 us/op - 9929 runs 1.41 s
✔ yamux send and receive 1 1KB chunks 14097.21 ops/s 70.93600 us/op - 6689 runs 0.924 s
✔ yamux send and receive 1 64KB chunks 11133.63 ops/s 89.81800 us/op - 5943 runs 0.950 s
✔ yamux send and receive 1 1024KB chunks 2604.350 ops/s 383.9730 us/op - 1658 runs 1.14 s
✔ yamux send and receive 1000 0.0625KB chunks 123.3899 ops/s 8.104389 ms/op - 40 runs 0.835 s
✔ yamux send and receive 1000 1KB chunks 112.0145 ops/s 8.927416 ms/op - 25 runs 0.735 s
✔ yamux send and receive 1000 64KB chunks 34.73366 ops/s 28.79052 ms/op - 53 runs 2.04 s
✔ yamux send and receive 1000 1024KB chunks 2.772760 ops/s 360.6515 ms/op - 10 runs 4.24 s
✔ mplex send and receive 1 0.0625KB chunks 14712.59 ops/s 67.96900 us/op - 9807 runs 1.22 s
✔ mplex send and receive 1 1KB chunks 14861.71 ops/s 67.28700 us/op - 12154 runs 1.32 s
✔ mplex send and receive 1 64KB chunks 13063.70 ops/s 76.54800 us/op - 6321 runs 0.820 s
✔ mplex send and receive 1 1024KB chunks 3554.229 ops/s 281.3550 us/op - 4373 runs 1.74 s
✔ mplex send and receive 1000 0.0625KB chunks 230.5917 ops/s 4.336670 ms/op - 189 runs 1.34 s
✔ mplex send and receive 1000 1KB chunks 206.2928 ops/s 4.847479 ms/op - 338 runs 2.16 s
✔ mplex send and receive 1000 64KB chunks 47.07421 ops/s 21.24305 ms/op - 21 runs 0.969 s
✔ mplex send and receive 1000 1024KB chunks 3.041689 ops/s 328.7647 ms/op - 10 runs 3.95 s
```
I still see a big difference in performance for many of the cases, perhaps especially for "send and receive 1000 0.0625KB".
The mplex implementation is using a "buffer pool" for its encoder, while this implementation is not. I wonder whether that could play a part here: there is a lot of allocation/deallocation of memory that only lives for a short amount of time.
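A rough sketch of what such a buffer pool could look like for the encoder, loosely modelled on Node's `Buffer.allocUnsafe` pooling; the size and names here are illustrative, not mplex's actual implementation:

```typescript
// Sketch: hand out views into a preallocated slab for short-lived frame
// buffers, so most frames cost no allocation at all.
const POOL_SIZE = 10 * 1024

let pool = new Uint8Array(POOL_SIZE)
let poolOffset = 0

function allocFromPool (len: number): Uint8Array {
  if (len > POOL_SIZE) {
    // too big for the slab: fall back to a dedicated allocation
    return new Uint8Array(len)
  }
  if (poolOffset + len > POOL_SIZE) {
    // slab exhausted: start a new one (outstanding views keep the old
    // slab alive until they are garbage collected)
    pool = new Uint8Array(POOL_SIZE)
    poolOffset = 0
  }
  const buf = pool.subarray(poolOffset, poolOffset + len)
  poolOffset += len
  return buf
}
```

The trade-off is that a small view can pin a whole slab in memory, which is why Node only pools allocations below a size threshold.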
We've done some real-world benchmarking which has led to some performance improvements and the results are that js-libp2p has the fastest streaming performance of the libp2p implementations tested - https://observablehq.com/@libp2p-workspace/performance-dashboard
I'm going to close this as yamux has closed the performance gap and has back pressure so is vastly preferable to mplex.
Please re-open if you're still seeing a serious degradation.
I am considering replacing all my dependencies on `@libp2p/mplex` with this yamux implementation in my project. However, I am experiencing a significant performance loss with this change. In my benchmark, where I do replication work between multiple nodes, I am able to do around 1200 tps with `@libp2p/mplex` but only around 700 with yamux.
I have not deep-dived into the details of mplex and yamux, but is it technically possible for a yamux implementation to reach the same performance as mplex? Or is the performance loss I am seeing the cost of the features yamux brings to the table?