clarinsi / classla

CLASSLA Fork of the Official Stanford NLP Python Library for Many Human Languages
https://www.clarin.si/info/k-centre/

Tokenization bug #22

Closed: markoferme closed this issue 2 years ago

markoferme commented 2 years ago

When trying to use classla on texts generated by converting other formats (PDF, DOCX, ...), an error is thrown:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/classla/pipeline/core.py", line 167, in __call__
    doc = self.process(doc)
  File "/usr/local/lib/python3.7/site-packages/classla/pipeline/core.py", line 161, in process
    doc = self.processors[processor_name].process(doc)
  File "/usr/local/lib/python3.7/site-packages/classla/pipeline/tokenize_processor.py", line 87, in process
    return doc.Document(document, raw_text, metasentences=metadocument)
  File "/usr/local/lib/python3.7/site-packages/classla/models/common/doc.py", line 80, in __init__
    self._process_sentences(sentences, metasentences=metasentences)
  File "/usr/local/lib/python3.7/site-packages/classla/models/common/doc.py", line 147, in _process_sentences
    self.sentences.append(Sentence(tokens, doc=self, metadata=metadata))
  File "/usr/local/lib/python3.7/site-packages/classla/models/common/doc.py", line 352, in __init__
    self._process_tokens(tokens)
  File "/usr/local/lib/python3.7/site-packages/classla/models/common/doc.py", line 379, in _process_tokens
    is_complete_words = (len(self.words) >= len(self.tokens)) and (len(self.words) == self.words[-1].id)
IndexError: list index out of range

The error occurs when the tokenizer (obeliks in my case) returns a sentence that contains no words.
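For reference, the failing check from the traceback can be reduced to the snippet below: with an empty sentence both the word and token lists are empty, the first clause evaluates to True (0 >= 0), and the words[-1] lookup on the empty list then raises (a minimal standalone reproduction, not classla code):

>>> words, tokens = [], []
>>> (len(words) >= len(tokens)) and (len(words) == words[-1].id)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range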

Steps to reproduce the behavior:

>>> import classla
>>> classla.download('sl')                            
>>> nlp = classla.Pipeline('sl')                      
>>> doc = nlp("This is some text\n!\nAnd some more text\n")

A fix would probably be to check whether the tokenizer returns any words at all and to skip such sentences, as sketched below.
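A minimal sketch of that idea, assuming the tokenizer output is a list of sentences where each sentence is a list of tokens (the helper name and the filtering location are hypothetical, not actual classla code):

def drop_empty_sentences(sentences):
    # Keep only sentences that contain at least one token, so the
    # Document constructor never sees an empty word list.
    return [sentence for sentence in sentences if sentence]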

lkrsnik commented 2 years ago

This problem arises because classla version 1.0.2 is not adapted to obeliks tokenizer versions 1.1.0 and above.
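To confirm you are on the affected combination, you can check the installed versions with standard pip commands:

pip show classla
pip show obeliks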

Solution: downgrade the obeliks library to version 1.0.6:

pip uninstall obeliks
pip install obeliks==1.0.6
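If your environment is built from a requirements file, the same pin can be recorded there until the fix is released (standard pip requirements syntax, not something classla ships):

obeliks==1.0.6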

The upcoming release of classla will support the latest obeliks version as well.