ICIJ / datashare

A self-hosted search engine for documents.
https://datashare.icij.org
GNU Affero General Public License v3.0

feat: distribute task related messages to specific worker pools using namespacing #1385

Closed ClemDoum closed 3 weeks ago

ClemDoum commented 5 months ago

Feature description

Current behavior

Currently, task-related messages (task creation, errors, results, events) are not name-spaced. As a result, all agents (workers) receive all messages. Task messages can't be routed to specific workers, which is problematic because not every worker can execute every task.

Expected behavior

Tasks should support name-spacing so that specific messages (task creation, errors, result, events) can be routed to specific agents (workers, event logger, task manager).

Practical use cases

The name-spacing used here is very hypothetical and only chosen to illustrate examples.

Implement functional and resource isolation

Implement the same functionality in several languages: NLP workers

Name-spacing will also make it possible to have the same functional task executed by workers implemented in different languages. For instance, to perform NLP tasks we could route some tasks to Java workers for CoreNLP processing, while routing other tasks to Python workers for spaCy processing.
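
As a minimal sketch of this kind of routing, assuming the hypothetical namespaces `datashare.nlp.corenlp` and `datashare.nlp.spacy` and hypothetical pool names (all illustrative only, not an agreed naming scheme), a task manager could resolve a worker pool from the longest namespace prefix of a task name:

```python
from typing import Optional

# Hypothetical namespace -> worker pool mapping; the names are illustrative only.
POOLS = {
    "datashare.nlp.corenlp": "java-workers",
    "datashare.nlp.spacy": "python-workers",
}


def pool_for(task_name: str) -> Optional[str]:
    """Return the pool whose namespace is the longest dotted prefix of task_name."""
    parts = task_name.split(".")
    # Try the full name first, then progressively shorter prefixes.
    for end in range(len(parts), 0, -1):
        pool = POOLS.get(".".join(parts[:end]))
        if pool is not None:
            return pool
    return None
```

A task named `datashare.nlp.spacy.ner` would resolve to the Python pool here, while a task outside any known namespace resolves to `None` and would need a fallback policy.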

Prioritizing tasks

This is only a nice-to-have, but name-spacing could help distribute the same task with different priorities, e.g. datashare.some.task.high and datashare.some.task.low, and then either use queue priority (for AMQP) or assign a different number of workers to each part of the namespace.
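
One possible reading of this idea, sketched under the assumption that the priority is carried as the last segment of the task name (the suffixes and numeric values below are hypothetical):

```python
from typing import Tuple

# Hypothetical suffix -> priority mapping (higher number = served first).
PRIORITIES = {"high": 9, "low": 1}
DEFAULT_PRIORITY = 5


def split_priority(task_name: str) -> Tuple[str, int]:
    """Split a name like 'datashare.some.task.high' into (base name, priority)."""
    base, _, suffix = task_name.rpartition(".")
    if base and suffix in PRIORITIES:
        return base, PRIORITIES[suffix]
    # No recognised priority suffix: keep the name and use the default.
    return task_name, DEFAULT_PRIORITY
```

With RabbitMQ, the resulting number could be sent as the message `priority` property on a queue declared with `x-max-priority`; alternatively the suffix could simply select a queue served by more or fewer workers.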

Potential implementations

AMQP

Topic exchanges + routing keys could be used to implement name-spacing.
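
To illustrate what topic exchanges would give us, here is a broker-free sketch of AMQP topic matching, where `*` stands for exactly one word and `#` for zero or more words. The function name and the example keys are hypothetical; a real deployment would let the broker do this matching through queue bindings.

```python
from typing import List


def topic_matches(binding_key: str, routing_key: str) -> bool:
    """Mimic AMQP topic matching: '*' is exactly one word, '#' is zero or more."""

    def match(pattern: List[str], words: List[str]) -> bool:
        if not pattern:
            return not words
        head, rest = pattern[0], pattern[1:]
        if head == "#":
            # '#' may absorb any number of words, including none.
            return any(match(rest, words[i:]) for i in range(len(words) + 1))
        if not words:
            return False
        return head in ("*", words[0]) and match(rest, words[1:])

    return match(binding_key.split("."), routing_key.split("."))
```

A Python NLP worker pool could then bind its queue with something like `datashare.nlp.spacy.#`, so it only receives messages published under that namespace.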

Redis

Queue names could be leveraged to implement name-spacing.
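
A rough sketch of the queue-name approach, assuming one Redis list per namespace and a hypothetical `tasks:` key prefix. The `client` argument is expected to behave like a redis-py `Redis` instance (only `lpush` is used here); the helper names are illustrative.

```python
from typing import Any

QUEUE_PREFIX = "tasks"  # hypothetical key prefix


def queue_name(namespace: str) -> str:
    """Build the Redis list key holding the tasks of one namespace."""
    return f"{QUEUE_PREFIX}:{namespace}"


def publish(client: Any, namespace: str, payload: str) -> str:
    """Push a task payload onto its namespace queue and return the key used."""
    key = queue_name(namespace)
    client.lpush(key, payload)  # redis-py: Redis.lpush(name, *values)
    return key
```

Workers of a given pool would then block on their own key only, for example with `client.brpop([queue_name("datashare.nlp.spacy")])`, and never see tasks from other namespaces.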

Memory (TBD)

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 40 days with no activity.

ClemDoum commented 1 month ago

A first attempt/POC of namespacing has been implemented in Python, to be able to split communication between the TM and different workers coming from the same app.

The final/exact namespacing strategy is still TBD, but it is probably quite quick to implement on the Python side (probably just changing a naming/routing strategy).

bamthomas commented 3 weeks ago

Let's see how it performs with the actual filtering on the consumer side. To be reopened for optimisation (if tasks end up in dead-letter queues).