Open akarnokd opened 4 years ago
I did a different implementation but the degradation isn't gone, just reduced:
With the new code organization, the performance is slightly worse at P=1 and P=6 and somewhat better at higher Ps. The others are likely within the noise limit.
I'm starting to think the underlying issue is that one thread simply can't drive that many rails that fast, so the round-robin dispatching results in a high volume of scheduling activity (Java Flight Recorder also hints at this).
If I implement batch dispatching, the scheduling overhead appears to be mostly eliminated:
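To illustrate the difference between per-item round-robin dispatching and batch dispatching, here is a minimal sketch. It is not RxJava's actual implementation: the `Rail` class, the `dispatch` method and the batch size of 128 are all hypothetical, and the code only counts scheduling actions (tasks handed to a rail) rather than measuring throughput.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class DispatchSketch {

    /** Hypothetical stand-in for a rail: one worker thread plus counters. */
    static final class Rail {
        final ExecutorService worker = Executors.newSingleThreadExecutor();
        final AtomicLong tasks = new AtomicLong(); // scheduling actions seen by this rail
        final AtomicLong items = new AtomicLong(); // items processed by this rail

        void submit(List<Integer> chunk) {
            tasks.incrementAndGet();
            worker.execute(() -> items.addAndGet(chunk.size()));
        }
    }

    /**
     * A single dispatcher thread hands chunks of batchSize items to the rails
     * in round-robin order. batchSize = 1 is plain per-item round-robin.
     * Returns {tasksSubmitted, itemsProcessed}.
     */
    static long[] dispatch(int itemCount, int parallelism, int batchSize) {
        Rail[] rails = new Rail[parallelism];
        for (int i = 0; i < parallelism; i++) {
            rails[i] = new Rail();
        }

        int railIndex = 0;
        for (int start = 0; start < itemCount; start += batchSize) {
            int end = Math.min(start + batchSize, itemCount);
            List<Integer> chunk = new ArrayList<>(end - start);
            for (int v = start; v < end; v++) {
                chunk.add(v);
            }
            rails[railIndex].submit(chunk);
            railIndex = (railIndex + 1) % parallelism;
        }

        long tasks = 0;
        long items = 0;
        for (Rail r : rails) {
            r.worker.shutdown(); // already submitted chunks still run
            try {
                if (!r.worker.awaitTermination(10, TimeUnit.SECONDS)) {
                    throw new IllegalStateException("rail did not finish in time");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            tasks += r.tasks.get();
            items += r.items.get();
        }
        return new long[] { tasks, items };
    }

    public static void main(String[] args) {
        long[] perItem = dispatch(12_000, 12, 1);
        long[] batched = dispatch(12_000, 12, 128);
        System.out.println("per-item: tasks=" + perItem[0] + " items=" + perItem[1]);
        System.out.println("batched:  tasks=" + batched[0] + " items=" + batched[1]);
    }
}
```

With 12,000 items on 12 rails, the per-item variant submits 12,000 tasks while the batched variant submits 94 (12,000 / 128, rounded up) for the same number of processed items. Whether that reduction translates into the throughput gain seen in the real benchmark depends on the actual operator internals, which this sketch does not model.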
You have to consider a lot of aspects when making parallel calls.
If one request wants to make 10 parallel calls and your server supports only 12 threads, what about the second request? It will wait for the first request to release its threads.
You also have to check whether all 12 threads are actually allocated to your program.
etc...
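The thread-availability point can be sketched with a plain fixed-size pool. This is only an illustration under assumed numbers (12 pool threads, 2 requests, 10 blocking calls each); the class and method names are hypothetical and it does not reflect how RxJava's schedulers or the benchmark are set up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolContentionSketch {

    /**
     * Submits requests * callsPerRequest blocking calls to a fixed pool and
     * returns how many of them are still waiting in the queue for a thread.
     */
    static int queuedCalls(int poolThreads, int callsPerRequest, int requests) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolThreads, poolThreads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int r = 0; r < requests; r++) {
                for (int c = 0; c < callsPerRequest; c++) {
                    // Each call holds its thread until release is counted down.
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                }
            }
            // Calls beyond the pool size could not get a thread and are queued.
            return pool.getQueue().size();
        } finally {
            release.countDown(); // let the blocked calls finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("queued calls: " + queuedCalls(12, 10, 2));
    }
}
```

With 12 threads and two requests of 10 calls each, the first request takes 10 threads, the second gets the remaining 2, and its other 8 calls sit in the queue until threads are released.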
For some reason, the parallel Scrabble benchmark performs poorly when the parallelism level is 10+, for example, on my i7 8700 CPU (6 cores/12 threads):
However, my older i7 4770K processor (4 cores/8 threads) shows no such performance degradation.
Neither does the reactive-streams-commons implementation (the parent of RxJava's parallel implementation) with parallelism=12.
Correction: The Rsc benchmark was pinned to 8 threads and actually shows a similar inefficiency with 10+.