chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Compute character index mapping back to the text before `preprocess.normalize_whitespace` #121

Open betatim opened 7 years ago

betatim commented 7 years ago

Currently I have the following process:

  1. user provides some text possibly containing double white space, newlines, etc
  2. apply preprocess.normalize_whitespace
  3. use the NER from spacy
  4. highlight found entities in the unnormalized text

However, doing (4) is kind of hard, because the character offsets (`doc = nlp(text); doc.ents[0].start_char`) refer to the normalized text, not the original. Any bright ideas for how to map the offsets back to the original string? It would be nice not to have to reformat the text the user typed in ("Hey user, we reformatted your text for you, you better like it!").
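
To make the question concrete, here is a minimal sketch of one way such a mapping could be recorded while normalizing. It rests on assumptions: the normalizer below is a simplified stand-in that collapses every whitespace run to a single space (it is not textacy's `preprocess.normalize_whitespace`, whose exact rules may differ), and `normalize_with_offsets` and `to_original_span` are hypothetical names. The span passed in would come from spaCy's `ent.start_char` and `ent.end_char`.

```python
import re


def normalize_with_offsets(text):
    """Collapse each whitespace run to one space and record source offsets.

    Simplified stand-in for a whitespace normalizer (an assumption, not
    textacy's implementation). Returns (normalized, offsets), where
    offsets[i] is the index in `text` of the character (or the start of
    the whitespace run) that produced normalized[i].
    """
    chars, offsets = [], []
    for match in re.finditer(r"\s+|\S", text):
        piece = match.group()
        chars.append(" " if piece.isspace() else piece)
        offsets.append(match.start())
    return "".join(chars), offsets


def to_original_span(offsets, start, end):
    """Map a non-empty half-open [start, end) span in the normalized
    text back to a half-open span in the original text."""
    return offsets[start], offsets[end - 1] + 1


text = "Due  on 07\n Feb 2017."
normalized, offsets = normalize_with_offsets(text)

# Pretend the NER returned this span of the normalized text.
start = normalized.index("07 Feb 2017")
end = start + len("07 Feb 2017")

orig_start, orig_end = to_original_span(offsets, start, end)
print(repr(text[orig_start:orig_end]))
```

The mapped span covers the entity as the user typed it, including the newline and extra space, so the highlight can be drawn on the unmodified input.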

bdewilde commented 7 years ago

Hi @betatim, I understand the problem, although I don't know of a "good" way to solve it. The preprocessing functions are destructive and one-way, so not a lot of thought has been given to recovering the changes. Basic question: Do you need to normalize the white space before using spacy's NER? It seems like weird spacing shouldn't affect the model's performance, in which case, I'd just skip the normalization.

The only solution that comes to mind is iterating over the resulting entities and re-locating them in the original text, a process which can be made more efficient than the simplest implementation but not, like, great.
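
A rough sketch of that re-locating idea, under the assumption that normalization only changed whitespace: each entity's text is turned into a regex in which any whitespace run matches any other, and the search resumes from the end of the previous hit so repeated mentions resolve left to right. `relocate` and `relocate_all` are hypothetical helper names, and the entity strings stand in for `ent.text` values from spaCy, given in document order.

```python
import re


def relocate(original, entity_text, search_from=0):
    """Find entity_text in original, treating whitespace runs as equivalent.

    Returns a half-open (start, end) span in `original`, or None.
    """
    pattern = r"\s+".join(re.escape(tok) for tok in entity_text.split())
    match = re.compile(pattern).search(original, search_from)
    return match.span() if match else None


def relocate_all(original, entity_texts):
    """Re-locate entities given in document order, scanning forward only."""
    spans, cursor = [], 0
    for entity_text in entity_texts:
        span = relocate(original, entity_text, cursor)
        spans.append(span)
        if span is not None:
            cursor = span[1]  # never rescan text before the last hit
    return spans


original = "Meet  Ann on 07\n Feb 2017, then Ann  again."
spans = relocate_all(original, ["Ann", "07 Feb 2017", "Ann"])
for span in spans:
    print(span, repr(original[span[0]:span[1]]))
```

The forward-only cursor is what keeps this cheaper than searching the whole text for every entity, but it depends on the entities arriving in order and on normalization not having altered anything besides whitespace.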

This reminds me of annotating, say, keyterms visually in a PDF document while using the extracted/processed text in the analysis. It's definitely a thing I've seen done. (Unfortunately, my google-fu failed me — I couldn't find a concrete example.) Might be worth trying to track down...

betatim commented 7 years ago

> Do you need to normalize the white space before using spacy's NER? It seems like weird spacing shouldn't affect the model's performance, in which case, I'd just skip the normalization.

It seems to help with things like "07\n Feb 2017" being found as a single DATE rather than as a CARDINAL and a DATE.

Was hoping you had found a nice way to transport things back. Will think about whether we can solve it by tweaking the UI a bit.

Will see if I can find something on the PDFs.