NCAS-CMS / PyActiveStorage

Python implementation of Active Storage
2 stars 2 forks source link

Parallelise chunk processing #141

Closed markgoddard closed 1 year ago

markgoddard commented 1 year ago

Previously, each storage chunk was processed sequentially. This is functional, but not performant, particularly when using Reductionist (S3), due to the sequential round trip latencies of requests. This means we are not taking advantage of the concurrency provided by Reductionist.

This change switches to use a threadpool executor to process chunks. This is unlikely to be as performant as asyncio, but still provides a significant improvement without requiring an async runtime.

Some basic throttling of requests is provided by using a threadpool of a configurable size. This works well when using a single Active object, but may need refinement if multiple Active objects are in use at the same time to avoid overloading the Reductionist server.

In order to benefit from connection reuse, we switch to using a requests Session object that is shared by all threads in the pool.

This change also adds support for specifying a CA certificate to use when making requests to the Reductionist API.

Closes #121

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage is 98.03% of modified lines.

Files Changed Coverage
activestorage/reductionist.py 90.00%
activestorage/active.py 100.00%
activestorage/config.py 100.00%

:loudspeaker: Thoughts on this report? Let us know!.

markgoddard commented 1 year ago

@markgoddard this looks great, very many thanks! 🍺 A couple points:

  • I am approving but can we delay the merge by just a few days, just had a call with @kchasapis and the guys at IME/RED are starting to understand the mechanics and use of Active - Konstantinos is now forking the repo and would probably appreciate it if there were no new arguments and functionality of Active that he'd have to make use of on RED; so I'll merge next week, should be good by then

No problem for me to wait here. I can always use the branch if necessary.

  • would it make sense to start thinking of a configuration that contains, amongst others, max_threads?

The configuration of PyActiveStorage does need improvement. The current config.py method isn't really a good long-term solution since it's part of the Python module. This probably isn't the right place to discuss it though.

  • I've used the minio_start call we have in main rather than yours in here (the one in main is still yours haha) - lemme know if the one here was better, so I can revert to it 👍

No problem - whatever works.

valeriupredoi commented 1 year ago

awesome, and yes, configuration needs discussion (even more so after RED integration), not here. Not a worry, won't dillydally anymore, will merge on Monday :+1:

valeriupredoi commented 1 year ago

cheers @markgoddard :beer: