
cedadev / search-futures

Future Search Architecture

BSD 2-Clause "Simplified" License


Consider using database or elasticsearch indices as prompt for scanning (rather than queues) #166

Open agstephens opened 2 years ago

agstephens commented 2 years ago

Thoughts about how to manage multi-level scanning...

Do we need a database?

Reasons to have a database:

- Jobs, statuses and claims can all be managed via state held in the DB.
- This would avoid duplication and race conditions.

Could ES be the database?

SP suggests that we could use ES queries to tell the item-generator and collection-generator what to scan next.

E.g. get me the latest 1,000 assets that need an item, then work through generating those items.
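A rough sketch of what such a query body could look like, written as a plain Python dict. The field names `item_id` and `modified_time` are assumptions made for illustration (the real mapping may mark "needs an item" differently), so treat this as a starting point rather than the actual query.

```python
def build_unprocessed_assets_query(batch_size=1000):
    """Build a search body for the newest assets that have no item yet.

    Assumptions (hypothetical, not taken from the real index mapping):
      - an asset that still needs an item has no `item_id` field
      - `modified_time` is a date field suitable for sorting
    """
    return {
        "size": batch_size,
        "query": {
            "bool": {
                # Match only documents where the (assumed) item_id field is absent.
                "must_not": [{"exists": {"field": "item_id"}}],
            }
        },
        # Newest assets first, so the generator works through the latest batch.
        "sort": [{"modified_time": {"order": "desc"}}],
    }


body = build_unprocessed_assets_query()
```

The resulting `body` could then be passed to the search call of whichever Elasticsearch client is in use, against the assets index.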

How to do claims?

AS thought that we might update the claim on a record in ES (to avoid another process claiming it).
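To make the claim idea concrete, here is a minimal in-memory sketch of a conditional ("compare and set") update: a claim only succeeds if the record has not changed since it was read. The `InMemoryStore` class and the `claimed_by` field are hypothetical stand-ins, not part of the codebase. In Elasticsearch the equivalent check would probably rely on its optimistic concurrency parameters (`if_seq_no` / `if_primary_term`), but that is an assumption about how we would implement it.

```python
import threading


class VersionConflict(Exception):
    """Raised when a record changed between being read and being updated."""


class InMemoryStore:
    """Hypothetical stand-in for an index: every write bumps a version number."""

    def __init__(self):
        self._lock = threading.Lock()
        self._docs = {}

    def put(self, doc_id, doc):
        with self._lock:
            self._docs[doc_id] = {"version": 1, "source": dict(doc)}

    def get(self, doc_id):
        with self._lock:
            entry = self._docs[doc_id]
            return entry["version"], dict(entry["source"])

    def update_if_version(self, doc_id, expected_version, changes):
        # Apply the update only if nobody else has written since our read.
        with self._lock:
            entry = self._docs[doc_id]
            if entry["version"] != expected_version:
                raise VersionConflict(doc_id)
            entry["source"].update(changes)
            entry["version"] += 1


def try_claim(store, doc_id, worker_id):
    """Return True if this worker claimed the record, False otherwise."""
    version, source = store.get(doc_id)
    if source.get("claimed_by") is not None:
        return False  # already claimed by another worker
    try:
        store.update_if_version(doc_id, version, {"claimed_by": worker_id})
    except VersionConflict:
        return False  # lost the race: someone updated the record first
    return True
```

With this pattern two workers can race for the same record and only one wins; the loser simply moves on to the next record.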

But do we need to do claims?

Maybe not. An alternative would be:

Have 1 controller at each level: asset, item, collection:
- gets batches of work to be done (based on queries)
- sends jobs to a queue
Have multiple workers at each level:
- gets next job from queue, does it
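The controller/worker split above can be sketched with the standard library alone. This uses an in-process `queue.Queue` and threads as a stand-in for the real message queue and pods; `fetch_batch` and `handle` are hypothetical callables representing "run the query" and "do the scan". Because a single controller is the only thing putting jobs on the queue, and each job is taken off by exactly one worker, no claims are needed.

```python
import queue
import threading


def controller(fetch_batch, job_queue):
    """Single controller: fetch a batch of work and send each job to the queue."""
    for job in fetch_batch():
        job_queue.put(job)


def worker(job_queue, handle, results, lock):
    """Worker: take the next job from the queue and process it, until told to stop."""
    while True:
        job = job_queue.get()
        try:
            if job is None:  # sentinel: no more work for this worker
                return
            outcome = handle(job)
            with lock:
                results.append(outcome)
        finally:
            job_queue.task_done()


def run_level(fetch_batch, handle, n_workers=3):
    """Run one level (asset, item or collection) with one controller and n workers."""
    job_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(job_queue, handle, results, lock))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    controller(fetch_batch, job_queue)
    for _ in threads:
        job_queue.put(None)  # one stop sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

For example, `run_level(lambda: ["a", "b", "c"], str.upper, n_workers=2)` processes each job exactly once, in no guaranteed order.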

agstephens commented 2 years ago

Previous concerns/questions about pods duplicating work were:

- How many workers (pods) are running, and how is that configured?
- How many processes are running in each worker?
- How many RabbitMQ messages are read/consumed at a time?
- Why are RabbitMQ messages being read by different workers?