Closed AetherUnbound closed 2 years ago
I'm thinking about how to tackle this one today because it affects WordPress/openverse#1703. After taking a look at some of the code, I think we have 3 approaches:
init
functionWith this approach, we would still initialize the DelayedRequester
and <Media>Store
at the top of the file, but none of the actual initialization would occur at that point. We would then call a .init()
function on each within the main
function in the provider_api_scripts
module. That would allow us to continue to reference the requester & store as module-level variables within the API script while still deferring any heavy lifting until runtime. Tests would be affected by this, so we may need to add a fixture to the tests which automatically init
s the requester & store.
main
Here we would move the instantiation into the main
function and out of the module scope. This would require significant changes to the function signatures (as they now have to pass along structures which were previously accessible at the module level) and the tests along with them. This approach seems the most "conventional".
Almost every one of our API scripts initializes a DelayedRequester
and a <Media>Store
(if not several). There's also a fair amount of similar machinery among the API scripts that could be collapsed/refactored as part of this effort. All of the shared/common pieces would go into a ProviderAPIRetreiver
(or something) class definition that would be sub-classed and overridden as necessary per-provider. This would be the largest undertaking of the three. It also has the risk of obfuscating some of the code - while code reduction is generally considered good, accessibility to community contributions is still a high priority for this piece of the project. By moving some pieces into a common class definition, we might make it harder for contributors to understand what's going on and thus how to contribute a new provider.
The reason I bring this up is that we're likely going to be adding a new Requester
class as part of WordPress/openverse#1703 that has even more overhead on initialization, which we'd want to avoid doing at DAG parse time.
Of the three, which do you think is best? Have I missed any considerations as part of this?
CC @zackkrida @obulat @krysal
I think I'm inclined to actually go the object route - for something like WordPress/openverse#1697, it would significantly reduce the number of changes we needed to make if all providers had shared functionality inherited from a base class.
Sorry I meant to reply earlier, but lost this issue. We have discussed converting the provider API scripts into objects: https://github.com/WordPress/openverse-catalog/pull/163#discussion_r701687519 I think it would be a great solution for this problem, and, provided that we give clear instructions and a template, would not make it significantly harder for new contributors to understand.
Incidentally, a solution is also described here: https://make.wordpress.org/openverse/handbook/openverse-handbook/openverse-catalog/community-involvement/#technical-changes-making-contribution-easier
Current Situation
Related to WordPress/openverse-catalog#199 - we have a number of objects (
DelayedRequester
,MediaStore
, etc.) that are created at import time rather than runtime. This has a negative performance impact on DAG processing time. The DAG parsing does not need to have these object available, and initializing them every time the parsing is done can unnecessarily increase parsing time. This can be seen from the logging that's done during DAG parsing:This is happening at every DAG parse interval, which by default is 30 seconds.
Suggested Improvement
Initialize these objects in the
main
functions of each provider, and pass them into other functions as necessary.Benefit
Decreased DAG processing time, CPU & memory usage.
Additional context
Implementation