Is your feature request related to a problem? Please describe.
When we do NER extraction, we use a databus in pub/sub mode. The motivation at the beginning of the project was to be able to run several pipelines at the same time. This design has several drawbacks:
- code and feature complexity (it does not follow the same design as indexing)
- since the Redis databus is not persistent, events sent while no consumer is listening are lost and have to be published again later. That is why we must use the `--resume` option when extracting NER with the CLI: it reads the index and re-sends the events on the databus once the consumer(s) are up.
- we never actually use this feature: in production we run the SCAN and INDEX operations, then later extract NER entities with only one CORENLP pipeline. In local mode we cannot run pipelines in parallel either.
Describe the solution you'd like
For the sake of simplicity and least astonishment, it would be more convenient to use a NER extraction queue (Memory/Redis/AMQP) handled the same way as the indexing queue.
The indexing task could dequeue paths from the indexing queue and, once a document is indexed, enqueue its Document id in the NER extraction queue.
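The queue chaining described above can be sketched with `java.util.concurrent.BlockingQueue`. This is a minimal illustration only: the class name, the poison-pill shutdown convention, and the way a document id is derived are all hypothetical stand-ins, not Datashare's actual API.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class IndexToNlpChain {
    // hypothetical poison pill marking the end of the indexing queue
    static final Path POISON = Paths.get("");

    // dequeue paths to index; for each indexed document, enqueue its id
    // into the NER extraction queue (persistent hand-off, no pub/sub)
    static void index(BlockingQueue<Path> indexQueue,
                      BlockingQueue<String> nlpQueue) throws InterruptedException {
        for (Path p = indexQueue.take(); !p.equals(POISON); p = indexQueue.take()) {
            String docId = "doc-" + Math.abs(p.toString().hashCode()); // stand-in for a real id
            nlpQueue.put(docId);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Path> indexQueue = new LinkedBlockingQueue<>();
        BlockingQueue<String> nlpQueue = new LinkedBlockingQueue<>();
        indexQueue.put(Paths.get("/data/doc1.pdf"));
        indexQueue.put(POISON);
        index(indexQueue, nlpQueue);
        System.out.println(nlpQueue.size()); // prints 1: one document id queued for NLP
    }
}
```

Because the id sits in a queue rather than a transient pub/sub channel, the NLP consumer can start at any time and still find its work, which removes the need for `--resume`.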
EDIT 2024-01-18
TODO:
- [x] remove the publisher from IndexTask that was only there to init stats
- [x] replace the publisher in ElasticsearchSpewer with a DocumentCollectionFactory and enqueue into the output queue
- [x] use `configure(Options)` in IndexTask instead of `withIndex(index)` and remove the `withIndex` method
- [x] remove the `if (resume(properties))` from CliApp, create another stage RESUMEIDX, and add an `if (pipeline.has(DatashareCli.Stage.RESUMEIDX))` guard like the other stages
- [x] remove the unused queue statistic from StatusResource
- [x] run manual tests with distinct stages (for example with `--stages SCAN,INDEX,NLP`)
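The new RESUMEIDX step would be guarded the same way as the existing stages. A toy sketch of that EnumSet-based stage dispatch follows; the `Stage` enum and `run` method here are illustrative only and do not reproduce `DatashareCli.Stage` exactly.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

public class StageDemo {
    // hypothetical enum mirroring the idea of DatashareCli.Stage
    enum Stage { SCAN, INDEX, RESUMEIDX, NLP }

    // each stage runs only if selected, like the if(pipeline.has(...)) guards in CliApp
    static List<String> run(EnumSet<Stage> stages) {
        List<String> ran = new ArrayList<>();
        if (stages.contains(Stage.SCAN)) ran.add("scan");
        if (stages.contains(Stage.INDEX)) ran.add("index");
        if (stages.contains(Stage.RESUMEIDX)) ran.add("resumeidx");
        if (stages.contains(Stage.NLP)) ran.add("nlp");
        return ran;
    }

    public static void main(String[] args) {
        // equivalent of --stages SCAN,INDEX,NLP
        System.out.println(run(EnumSet.of(Stage.SCAN, Stage.INDEX, Stage.NLP)));
        // prints [scan, index, nlp]
    }
}
```

Making resume-from-index its own stage keeps CliApp a flat list of uniform guards instead of a special-cased `if (resume(properties))` branch.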