Change of mind!
The above proposition did not take into account:
The new solution (shipped with v0.10.0 #202) fixes this by introducing the LazyDocsCollection, which records operations lazily and handles document conversion (i.e., tokenization and more) at the same level as pipeline components.
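For reference, end-to-end usage with the shipped lazy API looks roughly like the sketch below. The helper and parameter names (`edsnlp.data.from_pandas`, `map_pipeline`, `set_processing`, `to_pandas`, the `"omop"` converter) are indicative; see the v0.10.0 docs for the exact API.

```python
import pandas as pd
import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.sentences")

df = pd.DataFrame({"note_id": [1], "note_text": ["Le patient est diabétique."]})

docs = edsnlp.data.from_pandas(df, converter="omop")   # lazy: records the source
docs = docs.map_pipeline(nlp)                          # lazy: records the ops to run
docs = docs.set_processing(backend="multiprocessing")  # parallelization, decoupled
result = docs.to_pandas(converter="omop")              # triggers the actual execution
```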
Feature type
Following a brainstorming session with @Thomzoy, we'd like to refactor the parallelization utilities to decouple the type of collection (iterators, lists, pandas dataframes, spark dataframes, hive tables, etc.) from the type of parallelization (no parallelization, multi-CPU, GPU, distributed computing via spark).
Description
Collection types
Most of the processing with edsnlp is done on lists and on pandas or spark dataframes (to the best of our knowledge), so we feel it's necessary to handle these cases natively.
The following changes will be made:
In addition to iterables of `spacy.tokens.Doc` objects, we add two parameters to `nlp.__call__` and `nlp.pipe` to replace (eventually) the parameters (`additional_spans`, `extensions`, `context`, `results_extractor`):

- `to_doc`: (Any -> spacy.tokens.Doc)
- `from_doc`: (spacy.tokens.Doc -> Any)

It's up to the user to convert the input into an accepted format, for example polars to pandas, or polars to an iterator of dictionaries.
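As a sketch, here is how these two converters could look for record-style inputs. The final `nlp.pipe(..., to_doc=..., from_doc=...)` call shows the proposed signature, which does not exist yet:

```python
import spacy
from spacy.tokens import Doc

if not Doc.has_extension("note_id"):
    Doc.set_extension("note_id", default=None)

nlp = spacy.blank("fr")

def to_doc(row: dict) -> Doc:
    # One input record -> Doc; this is where tokenization happens
    doc = nlp.make_doc(row["note_text"])
    doc._.note_id = row["note_id"]
    return doc

def from_doc(doc: Doc) -> dict:
    # Processed Doc -> plain record, ready to go back into a dataframe
    return {"note_id": doc._.note_id, "n_ents": len(doc.ents)}

rows = [{"note_id": 1, "note_text": "Le patient est diabétique."}]

# Proposed call (to_doc / from_doc are not accepted by nlp.pipe today)
results = nlp.pipe(rows, to_doc=to_doc, from_doc=from_doc)
```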
Acceleration / parallelization mode
We plan to manage acceleration across several processes, on one or more GPUs, or in a distributed way via spark.
a new "method" parameter will receive an acceleration object (dict / custom object?) containing the acceleration type:
This will in turn call a specific function depending on the method. We can probably infer the parallelization method automatically from the type of input collection and the computational resources available.
Pseudo implementation
This is open to discussion:
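Here is one possible shape for it, as a minimal sketch; every name below (`BACKENDS`, `register_backend`, `infer_method`) is hypothetical.

```python
from typing import Any, Callable, Dict, Iterable

# Hypothetical registry: each parallelization backend is registered under a name
BACKENDS: Dict[str, Callable] = {}

def register_backend(name: str) -> Callable:
    def decorator(fn: Callable) -> Callable:
        BACKENDS[name] = fn
        return fn
    return decorator

@register_backend("simple")
def simple_backend(nlp, docs: Iterable[Any]) -> Iterable[Any]:
    # No parallelization: process docs one by one in the current process
    return (nlp(doc) for doc in docs)

# "multiprocessing", "gpu" and "spark" backends would be registered the same way.

def infer_method(data: Any) -> str:
    # Pick a backend from the collection type (and, later, available resources)
    if type(data).__module__.startswith("pyspark"):
        return "spark"
    return "simple"

def pipe(nlp, data: Iterable[Any], *, to_doc=None, from_doc=None, method=None):
    # Decoupling: the collection is normalized via to_doc / from_doc, while
    # the "method" only selects how the work is distributed.
    if method is None:
        method = infer_method(data)
    docs = (to_doc(x) for x in data) if to_doc is not None else data
    results = BACKENDS[method](nlp, docs)
    return (from_doc(d) for d in results) if from_doc is not None else results
```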