cassidylaidlaw opened this issue 4 years ago (Open)
Thanks for the report!
There are a number of changes coming soon in spaCy v3 that make the tokenizer more consistent, in particular for special cases that contain prefix/suffix/infix punctuation, which don't work consistently in v2. However, checking special cases in the parts split off by infixes isn't one of them.
Checking for special cases around infixes would be relatively simple to add, but I'd need to check whether it slows the tokenizer down too much. If it is a lot slower, we could consider adding an option that enables more thorough special-case handling, which would be off by default.
It turned out that this had too much of an effect on the existing tokenizer settings, which were designed without special-case checking around infixes. It might be possible to add an option to the tokenizer to allow this, but we're wary of adding even more options, so for now we're putting the idea on hold.
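For readers following along, here is a schematic pure-Python sketch of where that check would go. It loosely follows the affix-splitting algorithm described in spaCy's docs; the regexes and special cases below are made-up stand-ins, not spaCy's real rules or implementation:

```python
import re

# Hypothetical, simplified stand-ins for spaCy's affix rules.
SPECIAL_CASES = {"can't": ["ca", "n't"], "won't": ["wo", "n't"]}
PREFIX_RE = re.compile(r"^[($]")
SUFFIX_RE = re.compile(r"[).]$")
INFIX_RE = re.compile(r"--|/")

def tokenize_chunk(chunk):
    tokens, suffixes = [], []
    while chunk:
        # Special cases are re-checked after each prefix/suffix strip...
        if chunk in SPECIAL_CASES:
            return tokens + SPECIAL_CASES[chunk] + suffixes
        if m := PREFIX_RE.search(chunk):
            tokens.append(m.group())
            chunk = chunk[m.end():]
        elif m := SUFFIX_RE.search(chunk):
            suffixes.insert(0, m.group())
            chunk = chunk[:m.start()]
        else:
            break
    # ...but the pieces produced by the infix split are emitted as-is:
    # neither special cases nor prefixes/suffixes are checked again here,
    # which is the gap discussed in this issue.
    pos = 0
    for m in INFIX_RE.finditer(chunk):
        if m.start() > pos:
            tokens.append(chunk[pos:m.start()])
        tokens.append(m.group())
        pos = m.end()
    if pos < len(chunk):
        tokens.append(chunk[pos:])
    return tokens + suffixes

def tokenize(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(tokenize_chunk(chunk))
    return tokens

print(tokenize("I can't stay"))        # ['I', 'ca', "n't", 'stay']
print(tokenize("I know--can't stay"))  # ['I', 'know', '--', "can't", 'stay']
```

Re-running the special-case lookup (and affix stripping) on each infix-split piece is the change being discussed; the performance concern above is the extra per-piece work this adds to the hot loop.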
For some reason the tokenizer is built around whitespace, so it's unable to split strings that contain no spaces without issues.
I ran into the issue with the string "($10/$20)". By default it's tokenized as
(, $, 10/$20, )
and with "/" added to the infixes it's
(, $, 10, /, $20, )
simply because infixes are not powerful enough to re-run the prefix/suffix handling on the split pieces.
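A minimal sketch of what I believe the comment above is doing (assuming a blank English pipeline; the extra infix pattern is my own addition, not a default rule):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")
print([t.text for t in nlp.tokenizer("($10/$20)")])
# default infixes: ['(', '$', '10/$20', ')']
# ("/" is only an infix between letters by default)

# Add "/" as an unconditional infix.
infixes = list(nlp.Defaults.infixes) + [r"/"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
print([t.text for t in nlp.tokenizer("($10/$20)")])
# now: ['(', '$', '10', '/', '$20', ')'] -- "$20" stays fused,
# because prefix handling is not re-applied to the pieces
# produced by the infix split.
```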
How to reproduce the behaviour
I would expect the two sentences below to be tokenized the same way. However, in the second, the special cases for "won't" and "can't" do not work.
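A hypothetical minimal pair in this spirit (the example sentences are mine, assuming a blank English pipeline):

```python
import spacy

nlp = spacy.blank("en")

# Whitespace-delimited chunk: the special case fires.
print([t.text for t in nlp("I can't stay.")])
# expected: ['I', 'ca', "n't", 'stay', '.']

# Same word, but only exposed by the "--" infix split: the special
# case is never checked on the split-off piece.
print([t.text for t in nlp("I know--can't stay.")])
# expected: ['I', 'know', '--', "can't", 'stay', '.']
```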
Your Environment