explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
30.21k stars 4.4k forks

Spaces impacting tag/pos #13680

Open lsmith77 opened 2 weeks ago

lsmith77 commented 2 weeks ago

How to reproduce the behaviour

Notice the double space in front of `sourire` in the first case vs. the single space in the second case. Both cases keep a double space in front of `et`.

`Les publics avec un  sourire chaleureux  et`


https://demos.explosion.ai/displacy?text=Les%20publics%20avec%20un%20%20sourire%20chaleureux%20%20et&model=fr_core_news_sm

vs.

`Les publics avec un sourire chaleureux  et`


https://demos.explosion.ai/displacy?text=Les%20publics%20avec%20un%20sourire%20chaleureux%20%20et&model=fr_core_news_sm

Your Environment

Semi-related: Any guidance on how to modify the tokenizer so that a double space would be placed into `whitespace_` (i.e. `  `) and not lead to a `SPACE` token? I did take note of https://github.com/explosion/spaCy/issues/1707, though putting the additional spaces into `whitespace_` seems more logical to me.
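
For reference, one possible workaround that avoids touching the tokenizer at all is to collapse runs of spaces before the text reaches the pipeline, while keeping a map back to the original character offsets. This is only a sketch: `collapse_spaces` is a hypothetical helper and not part of spaCy. It assumes that dropping the extra spaces is acceptable for the downstream task, and it has not been checked whether the normalized text yields the expected tag/POS values.

```python
def collapse_spaces(text):
    """Collapse each run of spaces into a single space.

    Returns the normalized text and a list that maps every character
    index in the normalized text back to its index in the original text.
    Hypothetical helper for illustration; not a spaCy API.
    """
    normalized_chars = []
    offsets = []
    previous_was_space = False
    for index, char in enumerate(text):
        if char == " " and previous_was_space:
            # Skip every space after the first one in a run.
            continue
        normalized_chars.append(char)
        offsets.append(index)
        previous_was_space = char == " "
    return "".join(normalized_chars), offsets


original = "Les publics avec un  sourire chaleureux  et"
normalized, offsets = collapse_spaces(original)

# Locate a word in the normalized text and map it back to the original.
start = normalized.index("sourire")
original_start = offsets[start]
```

The normalized string could then be passed to the pipeline instead of the raw text, and `offsets` used to translate token character positions (for example `token.idx`) back to positions in the original input.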

Research

a) Maybe related: https://github.com/explosion/spaCy/issues/621
b) Semi-related: https://stephantul.github.io/spacy/2019/05/01/tokenizationspacy/
c) Semi-related: https://github.com/explosion/spaCy/discussions/9978

smal8 commented 2 days ago

Maybe we could use infixes or suffixes?