explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Tokenizer special cases do not work around infix punctuation #5598

Open cassidylaidlaw opened 4 years ago

cassidylaidlaw commented 4 years ago

How to reproduce the behaviour

I would expect the two sentences below to be tokenized the same way. However, in the second, the special cases for "won't" and "can't" do not work.

>>> import en_core_web_sm
>>> nlp = en_core_web_sm.load()
>>> [token.text for token in nlp("I can't / won't tolerate that.")]
['I', 'ca', "n't", '/', 'wo', "n't", 'tolerate', 'that', '.']
>>> [token.text for token in nlp("I can't/won't tolerate that.")] 
['I', "can't", '/', "won't", 'tolerate', 'that', '.']
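Until this is handled inside the tokenizer, one workaround is to pad infix punctuation with spaces before tokenizing, so the contraction special cases see each part on its own. A minimal sketch (the `pad_slashes` helper is hypothetical, and `spacy.blank("en")` is used here so no trained model is needed — the tokenizer rules are the same):

```python
import re
import spacy

nlp = spacy.blank("en")  # blank English pipeline; tokenizer rules only

def pad_slashes(text):
    # Insert spaces around "/" between non-space characters so the
    # "can't"/"won't" special cases can apply to each side.
    return re.sub(r"(?<=\S)/(?=\S)", " / ", text)

doc = nlp(pad_slashes("I can't/won't tolerate that."))
print([t.text for t in doc])
# ['I', 'ca', "n't", '/', 'wo', "n't", 'tolerate', 'that', '.']
```

Note that this changes character offsets relative to the original text, so it's only suitable when you don't need `Doc` offsets to align with the raw input.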


adrianeboyd commented 4 years ago

Thanks for the report!

There are a number of changes coming soon for spaCy v3 that make the tokenizer more consistent, in particular for special cases containing prefix/suffix/infix punctuation, which don't work consistently in v2 — but applying special cases to the parts split off by infixes isn't one of them.

Checking for special cases around infixes is relatively simple to add, but I'd need to check whether it slows the tokenizer down too much. If it is a lot slower, I think we can consider adding an option that enables more thorough special case handling, which would be off by default.
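For anyone debugging which rule produced each token, `tokenizer.explain()` mirrors the tokenization algorithm and labels each token with the rule that matched (e.g. special case vs. infix), which makes it easy to see that the contraction special case never fires on the infix-split parts. A quick sketch, again using a blank English pipeline:

```python
import spacy

nlp = spacy.blank("en")

# explain() returns (rule_label, token_text) pairs, one per token,
# showing which tokenizer rule produced each piece.
for label, text in nlp.tokenizer.explain("can't/won't"):
    print(label, text)
```

Here "can't" and "won't" come through as plain tokens rather than being split by the `SPECIAL` contraction rules, which is the behaviour reported above.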

adrianeboyd commented 4 years ago

It turned out that this has too much of an effect on the existing tokenizer settings, which were designed without this infix special case checking. It might be possible to add an option to the tokenizer to allow this, but we're wary of adding even more options, so for now we're going to put the idea on hold.

veonua commented 3 years ago

For some reason the tokenizer is built around spaces, so it can't split such strings cleanly. I ran into this with the string "($10/$20)": by default it's tokenized as

(, $, 10/$20, )

With "/" added to the infixes it's (, $, 10, /, $20, )

just because infixes aren't powerful enough to re-run the suffix/prefix behaviour on the split parts.
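The customization described above can be sketched as follows — adding "/" to the infix patterns via `spacy.util.compile_infix_regex` (a documented way to customize the tokenizer). As noted, "$20" still comes out as a single token, because prefix/suffix handling is not re-applied to substrings produced by an infix split:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")
print([t.text for t in nlp("($10/$20)")])  # by default "10/$20" is not split

# Also treat "/" between any characters as an infix.
infixes = list(nlp.Defaults.infixes) + [r"/"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# "/" is now split off, but "$20" stays whole: the "$" prefix rule is
# not re-run on the pieces produced by the infix split.
print([t.text for t in nlp("($10/$20)")])
```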