aphp / edsnlp

Modular, fast NLP framework, compatible with Pytorch and spaCy, offering tailored support for French clinical notes.
https://aphp.github.io/edsnlp/
BSD 3-Clause "New" or "Revised" License

Normalizer pipeline inefficient #330

OlivierHassanaly closed 1 week ago

OlivierHassanaly commented 2 weeks ago

It seems that the eds.normalizer pipe has no effect on the text.

I am using edsnlp version 0.13.1.

How to reproduce the bug

import edsnlp

config = dict(
    lowercase=True,
    accents=True,
    quotes=False,
    spaces=False,
    pollution=True,
)

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.normalizer", config=config)

text = "Pneumopathie à NBNbWbWbNbWbNBNbNbWbW `coronavirus'"

doc = nlp(text)

print(doc.text)

The printed text is identical to the input.

percevalw commented 2 weeks ago

Hi @OlivierHassanaly!

doc.text always contains the original text of the document. To use the results of the eds.normalizer pipeline, call edsnlp.utils.doc_to_text.get_text instead, as shown here: http://aphp.github.io/edsnlp/latest/pipes/core/normalizer/#usage

import edsnlp
from edsnlp.utils.doc_to_text import get_text

config = dict(
    lowercase=True,
    accents=True,
    quotes=False,
    spaces=False,
    pollution=True,
)

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.normalizer", config=config)

text = "Pneumopathie à NBNbWbWbNbWbNBNbNbWbW `coronavirus'"

doc = nlp(text)

print(get_text(doc, attr='TEXT', ignore_excluded=True))
# Out: Pneumopathie à `coronavirus'
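
To make the distinction between doc.text and get_text concrete, here is a toy, standard-library-only sketch of the general idea: the original text is never modified, the normalized form and an "excluded" flag are stored per token, and a helper rebuilds a string from whichever attribute you ask for. This is not the edsnlp implementation. Token, toy_normalize and toy_get_text are hypothetical names made up for this illustration, and whitespace tokenization with an explicit pollution set is a simplifying assumption.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Token:
    text: str       # original surface form, never modified
    norm: str       # normalized form, stored next to the original
    excluded: bool  # True if the token was flagged as pollution


def strip_accents(s: str) -> str:
    # Decompose characters (NFD) and drop the combining marks (category "Mn").
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def toy_normalize(text: str, pollution: frozenset = frozenset()) -> list:
    # Hypothetical stand-in for a normalizer: annotate tokens, change nothing.
    return [
        Token(text=w, norm=strip_accents(w.lower()), excluded=w in pollution)
        for w in text.split()
    ]


def toy_get_text(tokens: list, attr: str = "text", ignore_excluded: bool = False) -> str:
    # Rebuild a string from the requested attribute, optionally skipping
    # the tokens that were flagged as excluded.
    kept = [t for t in tokens if not (ignore_excluded and t.excluded)]
    return " ".join(getattr(t, attr) for t in kept)


tokens = toy_normalize(
    "Pneumopathie à NBNbWbWbNbWbNBNbNbWbW coronavirus",
    pollution=frozenset({"NBNbWbWbNbWbNBNbNbWbW"}),
)

# The original text is still fully recoverable.
print(toy_get_text(tokens))
# The normalized view is only visible when it is asked for explicitly.
print(toy_get_text(tokens, attr="norm", ignore_excluded=True))
```

In this toy model, the first print gives back the input unchanged, which mirrors what print(doc.text) shows, while the second one returns the lowercased, accent-free text without the flagged token.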