gandersen101 / spaczz

Fuzzy matching and more functionality for spaCy.
MIT License

Add pattern after adding to spacy pipeline taking long time and memory #74

Closed. Rahul-Chittora closed this issue 1 year ago.

Rahul-Chittora commented 1 year ago

I am trying to add 1 million patterns. When I add them to a blank spaCy model:

    import spacy
    from spaczz.pipeline import SpaczzRuler

    nlp = spacy.blank('en')
    spaczz_ruler = SpaczzRuler(nlp)
    spaczz_ruler = nlp.add_pipe("spaczz_ruler")  # spaCy v3 syntax
    spaczz_ruler.add_patterns(patterns)

it takes 8 GB of RAM and inference time is around 28 seconds.

If I instead add the SpaczzRuler to my current NER pipeline using

    spaczz_ruler = nlp.add_pipe("spaczz_ruler", before="ner")  # spaCy v3 syntax

it uses far more RAM and time, and fails even on a machine with 32 GB of RAM. The patterns look like this:

    patterns = [
        {
            "label": "NAME",
            "pattern": "Grant Andersen",
            "type": "fuzzy",
            "kwargs": {"min_r2": 90},
        }
    ]

gandersen101 commented 1 year ago

Yes, I would imagine trying to match over 1 million patterns would take a lot of compute resources and time. spaczz is nowhere near as efficient as spaCy, and I do not currently have the time or resources to significantly improve spaczz's performance.
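
For anyone trying to see how the cost grows with the number of patterns, here is a minimal sketch that feeds patterns to a ruler in batches and records the time and traced memory after each batch. It uses only the standard library. The helper names `batched` and `profile_add_patterns` and the `batch_size` parameter are hypothetical and not part of spaczz or spaCy. Note that `tracemalloc` only sees memory allocated through Python's allocator, so the numbers should be treated as a lower bound rather than the process's real footprint.

```python
import time
import tracemalloc
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def batched(items: Sequence[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def profile_add_patterns(
    add_patterns: Callable[[List[Dict]], None],
    patterns: Sequence[Dict],
    batch_size: int = 10_000,
) -> List[Tuple[int, float, int]]:
    """Call `add_patterns` batch by batch (hypothetical helper).

    Returns one tuple per batch:
    (patterns added so far, seconds spent on this batch, traced bytes in use).
    """
    stats = []
    added = 0
    tracemalloc.start()
    try:
        for batch in batched(patterns, batch_size):
            started = time.perf_counter()
            add_patterns(batch)
            elapsed = time.perf_counter() - started
            current_bytes, _peak_bytes = tracemalloc.get_traced_memory()
            added += len(batch)
            stats.append((added, elapsed, current_bytes))
    finally:
        # Always stop tracing so later code is not slowed down.
        tracemalloc.stop()
    return stats
```

With a real pipeline, the callable would be `spaczz_ruler.add_patterns` from the report above, for example `profile_add_patterns(spaczz_ruler.add_patterns, patterns)`. Starting with a small slice of the 1 million patterns should show whether time and memory grow roughly linearly before committing to the full set.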