spaCy Version Used: v3.5 (displaCy demo), but the behaviour also occurs in v3.7
Environment Information:
How to reproduce the behaviour
Notice the double space in front of `sourire` in the first case vs. the single space in the second case:
`Les publics avec un  sourire chaleureux  et`
https://demos.explosion.ai/displacy?text=Les%20publics%20avec%20un%20%20sourire%20chaleureux%20%20et&model=fr_core_news_sm
vs.
`Les publics avec un sourire chaleureux  et`
https://demos.explosion.ai/displacy?text=Les%20publics%20avec%20un%20sourire%20chaleureux%20%20et&model=fr_core_news_sm
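The tokenization side of the two cases can also be inspected locally. The following is a minimal sketch, assuming a blank French pipeline (`spacy.blank("fr")`, tokenizer only) as a stand-in for `fr_core_news_sm`, so it shows where the whitespace tokens appear but not the parse rendered in the demo links. The helper name `describe` is hypothetical.

```python
import spacy

# Assumption: a blank French pipeline (tokenizer only) stands in for
# fr_core_news_sm so that no model download is needed.
nlp = spacy.blank("fr")

double_space = "Les publics avec un  sourire chaleureux  et"
single_space = "Les publics avec un sourire chaleureux  et"


def describe(text):
    """Return (text, whitespace_, is_space) for every token in `text`."""
    return [(t.text, t.whitespace_, t.is_space) for t in nlp(text)]


for label, text in [("double", double_space), ("single", single_space)]:
    print(label)
    for tok_text, trailing, is_space in describe(text):
        # repr() makes whitespace-only tokens and trailing spaces visible.
        print("   ", repr(tok_text), repr(trailing), is_space)
```

Swapping `spacy.blank("fr")` for `spacy.load("fr_core_news_sm")` (if the model is installed) would additionally allow comparing `token.dep_` and `token.head` between the two inputs.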
Your Environment
Semi-related: Any guidance on how to modify the tokenizer so that double spaces would be placed into `whitespace_` (i.e. `  `) and not lead to a `SPACE` token? I did take note of https://github.com/explosion/spaCy/issues/1707, though putting the additional spaces into `whitespace_` seems more logical to me.
Research
a) Maybe related: https://github.com/explosion/spaCy/issues/621
b) Semi-related: https://stephantul.github.io/spacy/2019/05/01/tokenizationspacy/
c) Semi-related: https://github.com/explosion/spaCy/discussions/9978
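A possible stopgap for the semi-related whitespace question, which does not change the tokenizer itself: collapse runs of spaces before the text reaches the pipeline and keep a map back to the original character offsets. This is only a sketch under that assumption; the helper name `collapse_spaces` is hypothetical and is not part of spaCy.

```python
def collapse_spaces(text):
    """Collapse runs of spaces to a single space.

    Returns the collapsed text and a list `offsets` where offsets[i] is the
    index in the original `text` of character i of the collapsed text.
    """
    collapsed = []
    offsets = []
    previous_was_space = False
    for index, char in enumerate(text):
        if char == " " and previous_was_space:
            continue  # drop every space after the first one in a run
        collapsed.append(char)
        offsets.append(index)
        previous_was_space = char == " "
    return "".join(collapsed), offsets


original = "Les publics avec un  sourire chaleureux  et"
clean, offsets = collapse_spaces(original)

# Map a position in the cleaned text back to the original text.
start = clean.index("sourire")
print(repr(clean))
print(start, "->", offsets[start])
```

The cleaned string can then be passed to the pipeline, and `offsets` can translate token character positions back to the original input where that matters.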