ICIJ / datashare

A self-hosted search engine for documents.
https://datashare.icij.org
GNU Affero General Public License v3.0

refactor: remove data bus mechanism for NER extraction (AMQP) #1283

bamthomas closed 8 months ago

bamthomas commented 9 months ago

Is your feature request related to a problem? Please describe.

When we do NER extraction, we use a databus in pub-sub mode. The motivation at the beginning of the project was to be able to run several pipelines at the same time. It has several drawbacks:

We never use this feature: in production we run the SCAN and INDEX operations, then later extract NER entities with a single CORENLP pipeline. In local mode, we can't run them in parallel either.

Describe the solution you'd like

For the sake of simplicity and least astonishment, it may be more convenient to use a NER extraction queue (Memory/Redis/AMQP) that is used the same way as the indexing queue.

The indexing task could dequeue paths from indexing queue and when indexing is done, enqueue a Document id in the NER extraction queue.
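A minimal sketch of that hand-off, using in-memory `BlockingQueue`s in place of the Memory/Redis/AMQP queues. This is only an illustration, not datashare's actual code: the class name, the `POISON` end-of-queue marker, the `index` helper and the `doc:` id format are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ExtractionPipelineSketch {
    // Hypothetical sentinel telling a consumer that no more items will arrive.
    static final String POISON = "POISON";

    // Stand-in for real indexing: derives a made-up document id from the path.
    static String index(String path) {
        return "doc:" + path;
    }

    // Indexing task: dequeue paths, index them, enqueue the resulting document ids.
    static void indexTask(BlockingQueue<String> indexQueue,
                          BlockingQueue<String> nerQueue) throws InterruptedException {
        String path;
        while (!POISON.equals(path = indexQueue.take())) {
            nerQueue.put(index(path));
        }
        nerQueue.put(POISON); // propagate end-of-queue to the NER task
    }

    // NER task: dequeue document ids; a real task would run the NLP pipeline on each.
    static List<String> nerTask(BlockingQueue<String> nerQueue) throws InterruptedException {
        List<String> processed = new ArrayList<>();
        String docId;
        while (!POISON.equals(docId = nerQueue.take())) {
            processed.add(docId);
        }
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> indexQueue = new LinkedBlockingQueue<>();
        BlockingQueue<String> nerQueue = new LinkedBlockingQueue<>();
        indexQueue.put("/data/a.pdf");
        indexQueue.put("/data/b.txt");
        indexQueue.put(POISON);

        // Run the indexing task in its own thread; the NER task consumes concurrently.
        Thread indexer = new Thread(() -> {
            try {
                indexTask(indexQueue, nerQueue);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();
        List<String> processed = nerTask(nerQueue);
        indexer.join();
        System.out.println("NER processed: " + processed);
    }
}
```

With this shape, the NER stage needs no pub-sub channel: it only sees a queue of document ids, exactly as the indexing stage only sees a queue of paths.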

extractionPipeline drawio(4)

EDIT 2024-01-18

TODO:

mvanzalu commented 8 months ago

I think publisher has to be replaced by DocumentQueue instead of DocumentCollectionFactory, as the latter belongs to datashare-app.