explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

Error: tokenizer #59

Closed: gudgyo closed this issue 3 years ago

gudgyo commented 3 years ago

Versions: spacy-stanza 0.2.4, stanza 1.1.1

Description: the tokenizer throws an error on the following string: "?\n". How to reproduce:

import stanza
from spacy_stanza import StanzaLanguage
# tried on 5 different languages with same result
snlp = stanza.Pipeline(lang="en", processors='tokenize')
nlp = StanzaLanguage(snlp)
nlp('?\n')

Update: any character followed by a newline '\n' and no other character produces the same error, e.g.:

nlp("example\n")     -> error
nlp("example2\n ")   -> error
nlp("example\nend")  -> runs

Update 2: a character followed by two spaces also produces the same error, while for some reason strings consisting only of whitespace still run, e.g.:

nlp("example  ")   -> error
nlp("example2  ")  -> error
nlp("\n ")         -> runs
nlp("\t ")         -> runs
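To summarize, the observed outcomes so far (as reported above with spacy-stanza 0.2.4; "error" is the IndexError shown below) are:

```python
# Collected pass/fail cases from the report; "error" = IndexError in the
# spacy-stanza tokenizer, "ok" = runs without raising.
reported = {
    "example\n": "error",    # single trailing newline
    "example2\n ": "error",  # newline plus trailing space
    "example\nend": "ok",    # newline followed by more text
    "example  ": "error",    # two trailing spaces
    "example2  ": "error",   # two trailing spaces
    "\n ": "ok",             # whitespace-only input
    "\t ": "ok",             # whitespace-only input
}
```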

Error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-67-48883c9275df> in <module>
----> 1 nlp('?\n')

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
    439                 Errors.E088.format(length=len(text), max_length=self.max_length)
    440             )
--> 441         doc = self.make_doc(text)
    442         if component_cfg is None:
    443             component_cfg = {}

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/spacy_stanza/language.py in make_doc(self, text)
     65         these will be mapped to token vectors.
     66         """
---> 67         doc = self.tokenizer(text)
     68         if self.svecs is not None:
     69             doc.user_token_hooks["vector"] = self.token_vector

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/spacy_stanza/language.py in __call__(self, text)
    165         offset = 0
    166         for i, word in enumerate(words):
--> 167             if word.isspace() and word != snlp_tokens[i + offset].text:
    168                 # insert a space token
    169                 pos.append(self.vocab.strings.add("SPACE"))

IndexError: list index out of range
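As a self-contained sketch of what the traceback suggests (simplified stand-ins, not the real spaCy/Stanza objects): the tokenizer walks the spaCy-side word list, which includes a token for the trailing whitespace, but the Stanza token list has no corresponding entry, so the lookup runs past the end of the list.

```python
# Minimal stand-in for the alignment loop at line 167 of spacy_stanza/language.py.
class Tok:
    def __init__(self, text):
        self.text = text

words = ["?", "\n"]        # spaCy-side word list for "?\n", whitespace included
snlp_tokens = [Tok("?")]   # Stanza produced only one token for the same text
offset = 0

caught = False
try:
    for i, word in enumerate(words):
        # mirrors: if word.isspace() and word != snlp_tokens[i + offset].text
        if word.isspace() and word != snlp_tokens[i + offset].text:
            pass  # the real code inserts a SPACE token here
except IndexError:
    caught = True  # at i=1, snlp_tokens[1] does not exist
```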

Update 3: this particular string produces an error (language Bulgarian, GPU enabled, all processors used): "Думи и срички: Горско училище ......................9 Буквен етап • "

Error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-70-9dbeed0db463> in <module>
----> 1 nlp('Думи и срички: Горско училище ......................9 Буквен етап • ')

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
    439                 Errors.E088.format(length=len(text), max_length=self.max_length)
    440             )
--> 441         doc = self.make_doc(text)
    442         if component_cfg is None:
    443             component_cfg = {}

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/spacy_stanza/language.py in make_doc(self, text)
     65         these will be mapped to token vectors.
     66         """
---> 67         doc = self.tokenizer(text)
     68         if self.svecs is not None:
     69             doc.user_token_hooks["vector"] = self.token_vector

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/spacy_stanza/language.py in __call__(self, text)
    139             return Doc(self.vocab, words=[text], spaces=[False])
    140 
--> 141         snlp_doc = self.snlp(text)
    142         text = snlp_doc.text
    143         snlp_tokens, snlp_heads = self.get_tokens_with_heads(snlp_doc)

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/stanza/pipeline/core.py in __call__(self, doc)
    164         assert any([isinstance(doc, str), isinstance(doc, list),
    165                     isinstance(doc, Document)]), 'input should be either str, list or Document'
--> 166         doc = self.process(doc)
    167         return doc
    168 

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/stanza/pipeline/core.py in process(self, doc)
    158         for processor_name in PIPELINE_NAMES:
    159             if self.processors.get(processor_name):
--> 160                 doc = self.processors[processor_name].process(doc)
    161         return doc
    162 

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/stanza/pipeline/depparse_processor.py in process(self, document)
     46         # build dependencies based on predictions
     47         for sentence in batch.doc.sentences:
---> 48             sentence.build_dependencies()
     49         return batch.doc

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/stanza/models/common/doc.py in build_dependencies(self)
    479                 # id is index in words list + 1
    480                 head = self.words[word.head - 1]
--> 481                 assert(word.head == head.id)
    482             self.dependencies.append((head, word.deprel, word))
    483 

AssertionError: 

However, after deleting a single dot from the string, we get the following warning instead of the error:

/home/gudmongyorgy/miniconda3/envs/gudgyo/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: Due to multiword token expansion or an alignment issue, the original text has been replaced by space-separated expanded tokens.
  """Entry point for launching an IPython kernel.
/home/gudmongyorgy/miniconda3/envs/gudgyo/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Думи', 'и', 'срички', ':', 'Горско', 'училище', '......', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '...9', 'Буквен', 'етап', '•']
Entities: []
  """Entry point for launching an IPython kernel.

With the tokenized output: "Думи и срички : Горско училище ...... . . . . . . . . . . ...9 Буквен етап •"
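For what it's worth, that output is exactly the word list from the warning joined with single spaces, which is why the original character offsets no longer map back to the text:

```python
# Word list copied from the warning above; space-joining it reproduces the
# replacement text, so all original spacing information is lost.
words = ['Думи', 'и', 'срички', ':', 'Горско', 'училище', '......',
         '.', '.', '.', '.', '.', '.', '.', '.', '.', '.',
         '...9', 'Буквен', 'етап', '•']
replaced = " ".join(words)
```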

adrianeboyd commented 3 years ago

Thanks for the report! The trailing whitespace issue is definitely a bug that we can fix easily.

The Bulgarian error looks like a separate issue within stanza, though? You could try with plain stanza to see if you get that error with the same text?
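A hypothetical check along those lines, bypassing spacy-stanza entirely (assumes the Bulgarian models have already been fetched with stanza.download("bg"); use_gpu matches the report):

```python
# Sketch of a plain-stanza reproduction; stanza is imported lazily so the
# helper can be defined even where the models are not installed.
def repro_plain_stanza(text, lang="bg"):
    import stanza  # requires stanza.download(lang) to have been run once
    nlp = stanza.Pipeline(lang=lang, use_gpu=True)
    return nlp(text)

# repro_plain_stanza('Думи и срички: Горско училище ......................9 Буквен етап • ')
```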

gudgyo commented 3 years ago

Yes, the Bulgarian error remains with plain stanza and seems to be language-specific. Note: the trailing-whitespace issue occurs when the string ends with at least two whitespace characters.