alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org
MIT License

FEATURE: Split xref into separate sub-tasks #3814

Open tillprochaska opened 1 month ago

tillprochaska commented 1 month ago

Is your feature request related to a problem? Please describe.

In Aleph, a task is a single unit of background work. Aleph tracks progress of background jobs at the task level (i.e., it stores how many tasks have been processed so far and how many are still pending). If a task fails, the entire task is retried.

Cross-referencing an entire collection is implemented as a single task. This causes multiple problems:

Describe the solution you'd like

Computing cross-references for a collection should be split into two types of background tasks:

For example, if a collection contains 1000 entities and given a batch size of 500:

This has several advantages:

The disadvantage is that it adds a lot of tasks to the queue with possibly large payloads (the task payload would need to include the IDs for one batch of entities).
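To make the batching arithmetic concrete, here is a minimal sketch of how entity IDs could be split into per-batch task payloads. It is only an illustration, not Aleph's implementation: the function names `iter_batches` and `build_batch_payloads` and the payload keys `collection_id` and `entity_ids` are hypothetical, and actually enqueueing the payloads is left out.

```python
from itertools import islice
from typing import Iterable, Iterator


def iter_batches(entity_ids: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `batch_size` entity IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(entity_ids)
    # islice pulls the next `batch_size` items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


def build_batch_payloads(
    collection_id: int, entity_ids: Iterable[str], batch_size: int = 500
) -> list[dict]:
    """Build one payload per batch (hypothetical payload shape)."""
    return [
        {"collection_id": collection_id, "entity_ids": batch}
        for batch in iter_batches(entity_ids, batch_size)
    ]


# 1000 entities with a batch size of 500 give two batch payloads.
payloads = build_batch_payloads(1, [f"entity-{i}" for i in range(1000)], batch_size=500)
print(len(payloads))  # prints 2
```

Each payload carries the full list of IDs for its batch, which is the payload-size cost mentioned above. A smaller batch size shrinks each payload but adds more tasks to the queue.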

Describe alternatives you've considered

Increasing scroll timeouts: We’ve done this before, but it is only a short-term solution, as it cannot be repeated indefinitely. However, we’re currently migrating from Redis to RabbitMQ as the primary data store for queued tasks. In contrast to Redis, RabbitMQ stores tasks on disk, so this should be less of a problem.

Additional context

TheApeMachine commented 1 month ago

Hi,

I had a meeting today where this issue, as one of two, was proposed as something I could work on? If so, I'd be happy to. I went through the contributing guidelines, code of conduct, etc. and read that I should assign this issue to myself, however, unless I am missing it, I don't think I can.

In any case, I forked the develop branch, and have a local setup going, so I'll just take a look :) Let me know if there is anything I should be doing to follow the process correctly!

stchris commented 1 month ago

Hi @TheApeMachine, we are thankful for you working on this! As a hint: we are working on the 4.0.0 release of aleph, which changes the task queuing system from a Redis-based one to one based on RabbitMQ. So, to future-proof any changes you make, you might want to consider targeting the release/4.0.0 branch of aleph.

TheApeMachine commented 1 month ago

@stchris Thanks for that, I would have branched off develop otherwise :)