Implement batch processing for NER. This change is made in the context of #1452, as batch processing is necessary for spaCy.
Notes
In this PR we chose not to implement PipelineTask and instead rely fully on the task bus to distribute the batches across workers.
Changes
datashare-api
Added
added the Searcher sort(String field, SortOrder order) method to Indexer.Searcher to sort search results, so that documents can be returned grouped by language (avoiding model reloads)
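To illustrate why sorting by language matters, here is a minimal, self-contained sketch. It does not use the datashare API: `Doc`, `countModelLoads` and `sortedByLanguage` are hypothetical names introduced for the example, and it assumes a worker keeps a single language model loaded at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByLanguageSketch {
    // Hypothetical stand-in for an indexed document: only an id and a language.
    static final class Doc {
        final String id;
        final String language;

        Doc(String id, String language) {
            this.id = id;
            this.language = language;
        }
    }

    // Counts how many times a worker would have to load a model when it
    // processes the documents in the given order, one model loaded at a time.
    static int countModelLoads(List<Doc> docs) {
        int loads = 0;
        String loaded = null;
        for (Doc doc : docs) {
            if (!doc.language.equals(loaded)) {
                loaded = doc.language;
                loads++;
            }
        }
        return loads;
    }

    // Returns a copy sorted by language. List.sort is stable, so the original
    // order is preserved among documents of the same language.
    static List<Doc> sortedByLanguage(List<Doc> docs) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparing((Doc d) -> d.language));
        return sorted;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("a", "ENGLISH"),
                new Doc("b", "FRENCH"),
                new Doc("c", "ENGLISH"),
                new Doc("d", "FRENCH"));
        System.out.println(countModelLoads(docs));                   // prints 4
        System.out.println(countModelLoads(sortedByLanguage(docs))); // prints 2
    }
}
```

With interleaved languages every document triggers a load; once sorted, there is one load per language.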
datashare-app
Added
added the CreateNlpBatchesFromIndexTask task, which scans the index for documents sorted by language. Documents are then added in batches to the BatchNlpTask queue, where workers process them to perform NER batch by batch
added the BatchNlpTask, which consumes documents in batches, fetches them from the index and performs the NLP task (NER only)
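The two tasks above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code of this PR: `Doc`, `Batch` and `enqueueBatches` are hypothetical names, a `BlockingQueue` stands in for the task bus, and the assumption that a batch never mixes languages is ours.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class NlpBatchingSketch {
    // Hypothetical stand-in for an indexed document.
    static final class Doc {
        final String id;
        final String language;

        Doc(String id, String language) {
            this.id = id;
            this.language = language;
        }
    }

    // Hypothetical batch payload: document ids only, all in one language.
    static final class Batch {
        final String language;
        final List<String> docIds;

        Batch(String language, List<String> docIds) {
            this.language = language;
            this.docIds = docIds;
        }
    }

    // Producer side: walks documents already sorted by language and enqueues
    // batches of at most batchSize ids. Returns the number of batches created.
    static int enqueueBatches(List<Doc> sortedDocs, int batchSize, BlockingQueue<Batch> queue) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int created = 0;
        String currentLanguage = null;
        List<String> currentIds = new ArrayList<>();
        for (Doc doc : sortedDocs) {
            boolean languageChanged = currentLanguage != null && !currentLanguage.equals(doc.language);
            boolean batchFull = currentIds.size() == batchSize;
            if ((languageChanged || batchFull) && !currentIds.isEmpty()) {
                queue.add(new Batch(currentLanguage, currentIds));
                created++;
                currentIds = new ArrayList<>();
            }
            currentLanguage = doc.language;
            currentIds.add(doc.id);
        }
        if (!currentIds.isEmpty()) {
            queue.add(new Batch(currentLanguage, currentIds));
            created++;
        }
        return created;
    }

    public static void main(String[] args) {
        List<Doc> sortedDocs = Arrays.asList(
                new Doc("a", "ENGLISH"), new Doc("b", "ENGLISH"), new Doc("c", "ENGLISH"),
                new Doc("d", "FRENCH"), new Doc("e", "FRENCH"));
        BlockingQueue<Batch> queue = new LinkedBlockingQueue<>();
        enqueueBatches(sortedDocs, 2, queue);

        // Consumer side: a real worker would fetch the documents from the
        // index and run NER on them; here we only print each batch.
        Batch batch;
        while ((batch = queue.poll()) != null) {
            System.out.println(batch.language + " " + batch.docIds);
        }
    }
}
```

Passing only ids in the batch keeps queue messages small, which matches the description of the batch task fetching documents from the index itself.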
TODO
PR description