token.idx gives incorrect index depends on some other tokens in doc

koren-v commented 4 years ago

Hi, I found a bug: when a token such as "2 999" or "4 500" contains whitespace, token.idx gives an incorrect position for some of the other tokens in the doc:

>>> import stanza
>>> from spacy_stanza import StanzaLanguage
>>> stanza.download('ru')
>>> snlp = stanza.Pipeline(lang="ru")
>>> nlp = StanzaLanguage(snlp)
>>> text = "ведь первый месяц это минимум 2 999 рублей за устройство, плюс 4 500 на фурнитуру."
>>> doc = nlp(text)
>>> for token in doc:
...     print(f"{token.text + ' '*(10-len(token.text))}  token.idx: {token.idx}  text.index(token.text): {text.index(token.text)}")
... 
ведь        token.idx: 0  text.index(token.text): 0
первый      token.idx: 5  text.index(token.text): 5
месяц       token.idx: 12  text.index(token.text): 12
это         token.idx: 18  text.index(token.text): 18
минимум     token.idx: 22  text.index(token.text): 22
2 999       token.idx: 30  text.index(token.text): 30
рублей      token.idx: 36  text.index(token.text): 36
за          token.idx: 43  text.index(token.text): 43
устройство  token.idx: 46  text.index(token.text): 46
,           token.idx: 57  text.index(token.text): 56     <-- the mismatch starts here
плюс        token.idx: 59  text.index(token.text): 58
4 500       token.idx: 64  text.index(token.text): 63
на          token.idx: 70  text.index(token.text): 69
фурнитуру   token.idx: 73  text.index(token.text): 72
.           token.idx: 83  text.index(token.text): 81

When the numbers contain no spaces, the indices are correct:

>>> text = "ведь первый месяц это минимум 2999 рублей за устройство, плюс 4500 на фурнитуру."
>>> doc = nlp(text)
>>> for token in doc:
...     print(f"{token.text + ' '*(10-len(token.text))}  token.idx: {token.idx}  text.index(token.text): {text.index(token.text)}")
... 
ведь        token.idx: 0  text.index(token.text): 0
первый      token.idx: 5  text.index(token.text): 5
месяц       token.idx: 12  text.index(token.text): 12
это         token.idx: 18  text.index(token.text): 18
минимум     token.idx: 22  text.index(token.text): 22
2999        token.idx: 30  text.index(token.text): 30
рублей      token.idx: 35  text.index(token.text): 35
за          token.idx: 42  text.index(token.text): 42
устройство  token.idx: 45  text.index(token.text): 45
,           token.idx: 55  text.index(token.text): 55
плюс        token.idx: 57  text.index(token.text): 57
4500        token.idx: 62  text.index(token.text): 62
на          token.idx: 67  text.index(token.text): 67
фурнитуру   token.idx: 70  text.index(token.text): 70
.           token.idx: 79  text.index(token.text): 79
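
The comparison above relies on text.index(token.text), which returns the first occurrence of a string and would mislead if a token were repeated in the sentence. A slightly more robust check is to slice the original text at the reported offset and compare it with the token text. The following is a minimal sketch that runs without stanza; the helper name `find_offset_mismatches` is hypothetical, and the `(text, idx)` pairs are copied from the output of the first example rather than recomputed:

```python
def find_offset_mismatches(text, tokens):
    """Return (token_text, idx, actual_slice) for every token whose
    reported offset does not point at its own text in `text`.

    `tokens` is an iterable of (token_text, idx) pairs. With a spaCy Doc
    it could be built as [(t.text, t.idx) for t in doc].
    """
    mismatches = []
    for token_text, idx in tokens:
        actual = text[idx:idx + len(token_text)]
        if actual != token_text:
            mismatches.append((token_text, idx, actual))
    return mismatches


text = "ведь первый месяц это минимум 2 999 рублей за устройство, плюс 4 500 на фурнитуру."

# Offsets as reported by token.idx in the first example (tail of the sentence).
reported = [
    ("устройство", 46),
    (",", 57),
    ("плюс", 59),
    ("4 500", 64),
    ("на", 70),
    ("фурнитуру", 73),
    (".", 83),
]

mismatches = find_offset_mismatches(text, reported)
for token_text, idx, actual in mismatches:
    print(f"{token_text!r} reported at {idx}, but the text there is {actual!r}")
```

With these values, every token from the comma onward is flagged, while "устройство" at offset 46 passes the check.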

adrianeboyd commented 4 years ago

Can you try this with v0.2.3 (released just a few days ago)? It includes some bug fixes related to whitespace processing, and when I tried it with your example, the problem appeared to be fixed.

koren-v commented 4 years ago

@adrianeboyd Yes, everything is correct in the new version.

explosion / spacy-stanza

token.idx gives incorrect index depends on some other tokens in doc #42