Normalizer pipeline inefficient

aphp / edsnlp

Modular, fast NLP framework, compatible with Pytorch and spaCy, offering tailored support for French clinical notes.

BSD 3-Clause "New" or "Revised" License

Hi @OlivierHassanaly!

The doc.text attribute always contains the original text of the document. To use the results of the eds.normalizer pipeline, you should use edsnlp.utils.doc_to_text.get_text, as shown here: http://aphp.github.io/edsnlp/latest/pipes/core/normalizer/#usage

import edsnlp
from edsnlp.utils.doc_to_text import get_text

config = dict(
    lowercase=True,
    accents=True,
    quotes=False,
    spaces=False,
    pollution=True,
)

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.normalizer", config=config)

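# The run of N/B/W characters is pollution that the normalizer marks as excluded
text = "Pneumopathie à NBNbWbWbNbWbNBNbNbWbW `coronavirus'"

doc = nlp(text)

# doc.text is left untouched; get_text skips the excluded tokens instead
print(get_text(doc, attr="TEXT", ignore_excluded=True))
# Out: Pneumopathie à `coronavirus'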

Normalizer pipeline inefficient #330

How to reproduce the bug