OxygenFramework / Oxygen.jl

💨 A breath of fresh air for programming web apps in Julia
https://oxygenframework.github.io/Oxygen.jl/
MIT License

feature/mulithreading-changes #168

Closed ndortega closed 4 months ago

ndortega commented 4 months ago

Found a 10-15% performance gain by removing the current queue + multithreading approach and leaning harder on the scheduler built into Julia, using a cooperative multitasking strategy.

Additionally, the small amount of code needed to get this working allowed me to completely remove the StreamUtil module, some state variables, and several type definitions in favor of a single function, which makes the code much easier to maintain.
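The single-function approach described above can be sketched roughly as follows (a minimal illustration, not the actual Oxygen internals; `parallel_handler` is a hypothetical name): instead of a queue plus dedicated worker threads, each request is handled by a lightweight task and Julia's scheduler cooperatively multiplexes those tasks across threads.

```julia
# Hypothetical sketch of a cooperative-multitasking request handler:
# each incoming request is spawned as a task; Julia's scheduler decides
# which thread runs it, so no explicit queue or worker pool is needed.
function parallel_handler(handler)
    return req -> fetch(Threads.@spawn handler(req))
end

# Toy usage with a stand-in "handler"
app = parallel_handler(uppercase)
app("hello")
```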

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 98.11321% with 1 line in your changes missing coverage. Please review.

Project coverage is 99.20%. Comparing base (73a8f0a) to head (40700e8). Report is 1 commit behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| src/core.jl | 97.43% | 1 Missing :warning: |

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #168      +/-   ##
==========================================
+ Coverage   97.49%   99.20%   +1.71%
==========================================
  Files          20       19       -1
  Lines        1157     1138      -19
==========================================
+ Hits         1128     1129       +1
+ Misses         29        9      -20
==========================================
```

:umbrella: View full report in Codecov by Sentry.

ndortega commented 4 months ago

Hi @JanisErdmanis,

I have tested both approaches locally on my laptop with Julia 1.6 and 1.10 and found either the same or better performance with these changes. The results almost feel too good to be true, which makes me a bit hesitant.

What do you think?

JanisErdmanis commented 4 months ago

Hi @ndortega

Sorry, I did not see the message. I am unfamiliar with the parallelism internals and don't yet use them; thus, I can't comment on the performance. But I am happy that there is such a feature and may find it helpful in the future. For now, I can only offer some ideas to consider:

I am rather busy and can't help with coding.

EDIT: I think I now grasp how parallelism is implemented. I find it concerning that there is shared mutable state between threads, introduced by the metrics history object. A simple fix would be to create a history object for each thread and then aggregate them into a single history object, which can be shown on the metrics page. But perhaps that is unnecessary for CircularDeque.
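The per-thread-history idea above could look something like this sketch (all names here are hypothetical, not Oxygen's actual API): each thread appends only to its own buffer, so the hot path needs no lock, and the metrics page aggregates the buffers on demand.

```julia
# Hypothetical per-thread history: one buffer per thread, no shared
# mutable state on the request path.
struct ThreadedHistory
    per_thread::Vector{Vector{Float64}}   # one buffer per thread
end

ThreadedHistory() = ThreadedHistory([Float64[] for _ in 1:Threads.nthreads()])

# Each thread writes only to its own slot, so no lock is required here
record!(h::ThreadedHistory, latency::Real) =
    push!(h.per_thread[Threads.threadid()], latency)

# Aggregate all per-thread buffers into one view for the metrics page
aggregate(h::ThreadedHistory) = reduce(vcat, h.per_thread)
```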

ndortega commented 4 months ago

Hi @JanisErdmanis,

I'm glad you brought up that point about the shared History object. I've used a ReentrantLock to make writing to and reading from the history object thread-safe. That being said, I'll need to look into memoizing the metrics outputs, since these calculations can take a while when there are hundreds of thousands of requests and we call the metrics endpoint on an interval.
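Schematically, the lock-guarded shared history might look like this (a sketch under assumed names; `LockedHistory`, `record!`, and `snapshot` are illustrative, not Oxygen's actual types):

```julia
# Hypothetical lock-guarded shared history: all reads and writes go
# through the same ReentrantLock, making concurrent access safe.
struct LockedHistory
    lk::ReentrantLock
    entries::Vector{Float64}
end
LockedHistory() = LockedHistory(ReentrantLock(), Float64[])

function record!(h::LockedHistory, latency::Real)
    lock(h.lk) do
        push!(h.entries, latency)
    end
end

function snapshot(h::LockedHistory)
    lock(h.lk) do
        copy(h.entries)   # reads take the same lock as writes
    end
end
```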

And I agree with your second point: with these latest changes the functions are basically identical to one another and could be merged into a single one.

JanisErdmanis commented 4 months ago

> That being said, I'll need to look into memoizing the metrics outputs, since these calculations can take a while when there are hundreds of thousands of requests and we call the metrics endpoint on an interval

You could look into implementing a function `merge!(history, thread_1_history)`, which perhaps can be made fast. I don't think there should be concern about making the metrics dashboard fast yet, as this could complicate matters.
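One possible shape for the suggested merge function (a hypothetical helper, shown on plain vectors for simplicity): fold a thread-local history into the shared one with a single bulk append, then reset the thread-local buffer.

```julia
# Hypothetical merge helper: one amortized bulk copy per thread instead
# of per-request synchronization.
function merge_history!(dest::Vector{Float64}, thread_history::Vector{Float64})
    append!(dest, thread_history)   # single bulk append
    empty!(thread_history)          # buffer is ready for reuse
    return dest
end
```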

An alternative to using a ReentrantLock, which may imply switching between threads, is to run the metrics on a single thread only and do the parallelization after that. Right now your handler is schematically `parallelizer(metrics(router))`, but it could be beneficial to rewrite it as `metrics(parallelizer(router))`: then you would no longer need a ReentrantLock, which should be faster as it does not need synchronization between tasks.
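The `metrics(parallelizer(router))` ordering above can be sketched as follows (names and middleware shape are hypothetical): metrics runs only on the accepting thread, so the history vector is never touched concurrently and needs no lock.

```julia
# Hypothetical middleware composition: the metrics wrapper stays on a
# single thread, while the inner handler fans out to tasks.
function metrics_middleware(handler, history::Vector{Float64})
    return function (req)
        t0 = time()
        resp = handler(req)
        push!(history, time() - t0)   # single-threaded append: no lock
        return resp
    end
end

parallelizer(handler) = req -> fetch(Threads.@spawn handler(req))

# metrics(parallelizer(router)) ordering from the comment above
history = Float64[]
app = metrics_middleware(parallelizer(req -> "ok"), history)
```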

ndortega commented 3 months ago

@JanisErdmanis

I liked the idea of dropping the ReentrantLock and writing it like `metrics(parallelizer(router))`. I refactored the code to not use any locks and only excluded the docs and metrics middleware from getting wrapped with the parallel handler.

Now that the main thread is used exclusively for the docs & metrics middleware, I had to make the metrics endpoint asynchronous in order to prevent the app from halting when viewing the metrics under heavy load. Even after these changes I was surprised to see a 2x slowdown in RPS on my laptop between this approach and the ReentrantLock approach.

If you're curious and want to view this code, let me know and I'll push that branch.

JanisErdmanis commented 3 months ago

> I was surprised to see a 2x slowdown

This is unexpected. I wonder if it could be due to tasks also being scheduled on the main thread, which could slow it down. Push the code to the branch; I may have a look.

ndortega commented 3 months ago

Sure thing, here's the branch I was working out of: https://github.com/OxygenFramework/Oxygen.jl/tree/feature/multi-threading-experimental-changes

JanisErdmanis commented 3 months ago

Yeah, I suspect that the performance drop is due to tasks being spawned on the primary thread. The additional computation in the metrics middleware may have affected some fairness policy in the scheduler, so the main thread ends up hosting other threads' tasks.

You may find it interesting to explore ThreadPools.jl.

EDIT: The issue is that the metrics middleware waits for the response (`response = handler(req)`) and is thus blocked from accepting new requests.

ndortega commented 3 months ago

Exactly. It won't be worth making this refactor until I speed up the metrics calculations, which shouldn't be too difficult since the helper utils are all very similar.

Out of the 7 metrics returned:

Possible Next Steps:

JanisErdmanis commented 3 months ago

Perhaps you could also consider using a channel where requests are put by the main thread and then processed by individual threads, each returning an HTTPResponse object on another channel that is then routed back to users; though I don't know how to do the latter. The benefit of this approach is that it delegates task scheduling to Julia. But then again, the throughput depends on a single history-recording object and on acquiring a lock for the channel...
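A rough sketch of that channel idea (all names hypothetical, and the response-routing part is exactly the uncertain piece): the main thread puts each request on a shared channel; worker tasks take requests, run the handler, and send the response back on a per-request reply channel.

```julia
# Hypothetical channel-based dispatch: a shared request channel fans work
# out to worker tasks; a per-request reply channel routes the result back.
function serve_with_channels(handler; nworkers = Threads.nthreads())
    requests = Channel{Tuple{Any,Channel}}(Inf)
    for _ in 1:nworkers
        Threads.@spawn for (req, reply) in requests
            put!(reply, handler(req))   # route the response back
        end
    end
    return function (req)
        reply = Channel(1)
        put!(requests, (req, reply))
        return take!(reply)
    end
end
```

Note that the shared `requests` channel is itself lock-synchronized internally, which is the throughput caveat mentioned above.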

I think the design decision here needs to be supported by benchmarks. One way is to check how many requests per second it takes to saturate HTTP without metrics and with metrics. If the saturation point with metrics is sufficient, then the history can be shared by every thread and used together with a lock, as it is now. Otherwise, it needs to be put within each system thread's event loop to improve throughput.