tillprochaska opened 1 month ago
Hi,
I had a meeting today where this issue was proposed, as one of two, as something I could work on. If so, I'd be happy to. I went through the contributing guidelines, code of conduct, etc. and read that I should assign this issue to myself; however, unless I am missing something, I don't think I can.
In any case, I forked the develop branch, and have a local setup going, so I'll just take a look :) Let me know if there is anything I should be doing to follow the process correctly!
Hi @TheApeMachine, we are thankful for you working on this! As a hint: we are working on the 4.0.0 release of aleph, which changes the task queuing system from a Redis-based one to one based on RabbitMQ. So in order to future-proof any changes you make, you might want to consider targeting the release/4.0.0 branch of aleph.
@stchris Thanks for that, I would have branched off develop otherwise :)
**Is your feature request related to a problem? Please describe.**

In Aleph, a task is a single unit of background work. Aleph tracks progress of background jobs at the task level (i.e. it stores how many tasks have been processed so far and how many are still pending). If a task fails, the entire task is retried.
Cross-referencing an entire collection is implemented as a single task. This causes multiple problems:
It means that Aleph currently isn’t able to display information about the progress of the cross-referencing process (besides the fact that it is still running).
If the task fails, it is retried from the beginning, re-computing cross-referencings for entities that have already been processed. Depending on the size of the Aleph instance and the size of the collection, computing cross-referencings can take hours, sometimes even days.
Aleph uses the Elasticsearch scroll API to iterate over all entities in the collection in batches. The scroll API has a timeout for the maximum time between requesting two batches of entities. If fetching candidates and computing the similarity score for a batch of entities takes longer than the timeout, Elasticsearch will raise an error when Aleph tries to request the next batch of entities.
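The third problem can be illustrated with a toy generator. The real scroll API is HTTP-based and server-side; this sketch (hypothetical names, not Aleph code) only mimics the timeout behavior: a scroll context expires if the caller takes too long between fetching one batch and requesting the next.

```python
import time


class ScrollExpired(Exception):
    """Stands in for the error Elasticsearch raises on an expired scroll context."""


def scroll_batches(batches, timeout=1.0):
    """Yield batches, but fail if the caller pauses longer than `timeout`
    seconds between batches -- mimicking the scroll API's keep-alive window."""
    for batch in batches:
        fetched_at = time.monotonic()
        yield batch
        # The generator resumes here only when the caller requests the next
        # batch; if too much time has passed, the context is already gone.
        if time.monotonic() - fetched_at > timeout:
            raise ScrollExpired("scroll context timed out")


# If computing xref matches for a batch takes longer than the timeout,
# requesting the next batch raises ScrollExpired.
```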
**Describe the solution you'd like**

Computing cross-referencings for a collection should be split into two types of background tasks:
An initial task should use the ES scroll API to iterate over all entities in the collection in batches, similar to what happens right now. However, it shouldn’t actually compute the xref matches for the entities as part of that same task. Instead, this task should merely enqueue a separate task for every batch.
These tasks then compute the actual xref matches.
For example, if a collection contains 1000 entities and given a batch size of 500, the initial task would iterate over the collection and enqueue two batch tasks, each of which then computes the xref matches for its 500 entities.
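A minimal in-memory sketch of this fan-out pattern (all function names are hypothetical; Aleph's real task API differs):

```python
from collections import deque

BATCH_SIZE = 500


def iter_entity_id_batches(total):
    """Simulate scrolling over all entity IDs in a collection, in batches."""
    for start in range(0, total, BATCH_SIZE):
        yield list(range(start, min(start + BATCH_SIZE, total)))


def enqueue_xref_tasks(total_entities, queue):
    """Initial task: iterate over the collection and enqueue one separate
    xref task per batch, instead of computing matches inline."""
    for batch in iter_entity_id_batches(total_entities):
        # The task payload carries the entity IDs of one batch.
        queue.append(("xref_batch", batch))


queue = deque()
enqueue_xref_tasks(1000, queue)
# With 1000 entities and a batch size of 500, two batch tasks are enqueued.
```

Because each `xref_batch` task is independent, progress can be reported per batch and a failure only requires retrying that one batch.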
This has several advantages: progress can be tracked at the granularity of batches, a failed batch task can be retried without recomputing batches that were already processed, and the scroll context is only held open during the (fast) enqueueing phase rather than while matches are computed.
The disadvantage is that it adds a lot of tasks to the queue with possibly large payloads (the task payload would need to include the IDs for one batch of entities).
**Describe alternatives you've considered**

Increasing scroll timeouts: We’ve done this before, but it is only a short-term solution, as it cannot be repeated indefinitely.

Regarding the queue size concern: we’re currently migrating from Redis to RabbitMQ as the primary data store for queued tasks. In contrast to Redis, RabbitMQ stores tasks on disk, so the large number of tasks with large payloads should be less of a problem.