argosopentech / argos-translate

Open-source offline translation library written in Python
https://www.argosopentech.com
MIT License

ValueError when tokenizing some inputs using `Vietnamese → English` #216

Open · AutumnSun1996 opened this issue 2 years ago

AutumnSun1996 commented 2 years ago

A simple example:

from argostranslate.translate import get_installed_languages

languages_list = get_installed_languages()
languages = {l.code: l for l in languages_list}

trans = languages['vi'].get_translation(languages['en'])

text = 'thuc luc di em trai <@!12345>'
res = trans.translate(text)

output:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    res = trans.translate(text)
  File "/home/username/.local/lib/python3.7/site-packages/argostranslate/translate.py", line 52, in translate
    return self.hypotheses(input_text, num_hypotheses=1)[0].value
  File "/home/username/.local/lib/python3.7/site-packages/argostranslate/translate.py", line 275, in hypotheses
    paragraph, num_hypotheses
  File "/home/username/.local/lib/python3.7/site-packages/argostranslate/translate.py", line 160, in hypotheses
    self.pkg, paragraph, self.translator, num_hypotheses
  File "/home/username/.local/lib/python3.7/site-packages/argostranslate/translate.py", line 385, in apply_packaged_translation
    stanza_sbd = stanza_pipeline(input_text)
  File "/home/username/.local/lib/python3.7/site-packages/stanza/pipeline/core.py", line 166, in __call__
    doc = self.process(doc)
  File "/home/username/.local/lib/python3.7/site-packages/stanza/pipeline/core.py", line 160, in process
    doc = self.processors[processor_name].process(doc)
  File "/home/username/.local/lib/python3.7/site-packages/stanza/pipeline/tokenize_processor.py", line 88, in process
    no_ssplit=self.config.get('no_ssplit', False))
  File "/home/username/.local/lib/python3.7/site-packages/stanza/models/tokenize/utils.py", line 165, in output_predictions
    st0 = text.index(part, char_offset) - char_offset
ValueError: substring not found

The bug only occurs for vi->en, so it is likely related to the model used by Stanza for this language pair.
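
To help narrow it down, Stanza's Vietnamese tokenizer can be run directly on the failing input. This is only a sketch: it downloads the stock Stanza model, which may differ from the model packaged with Argos Translate, so it approximates rather than exactly reproduces the failing code path.

import stanza

# Download the Vietnamese models and load only the tokenize processor
# (the stock Stanza model, which may differ from the bundled one)
stanza.download('vi')
pipeline = stanza.Pipeline('vi', processors='tokenize')

text = 'thuc luc di em trai <@!12345>'
doc = pipeline(text)  # may raise the same ValueError if the bug is in Stanza's tokenizer
for sentence in doc.sentences:
    print(sentence.text)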

PJ-Finlay commented 2 years ago

Hmm, not sure. It does look like something with Stanza, though. Do you know what types of inputs cause the issue? Also, you can run with export DEBUG=1 to see the sentence boundary detection output.
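
For reference, the same flag can also be set from Python before the library is imported. This is just a sketch and assumes the DEBUG environment variable is read when argostranslate is first imported:

import os

# Assumption: argostranslate checks the DEBUG environment variable at import time,
# so it must be set before the import below.
os.environ["DEBUG"] = "1"

from argostranslate.translate import get_installed_languages

languages = {l.code: l for l in get_installed_languages()}
trans = languages['vi'].get_translation(languages['en'])
print(trans.translate('thuc luc di em trai <@!12345>'))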

yonilevineafs commented 2 years ago

Did anyone ever figure out a solution to this? I'm running into the same issue with vi->en. @AutumnSun1996

dingedi commented 2 years ago

Could you run it with export DEBUG=1 and post the output? @yonilevineafs

PJ-Finlay commented 2 years ago

I reproduced this; it looks like an issue with Vietnamese sentence boundary detection.

The root cause could be a bug in Stanza, or the Stanza model may have been mispackaged somehow.

File "/home/argosopentech/git/translate/env/lib/python3.8/site-packages/argostranslategui/gui.py", line 39, in run
    translated_text = self.translation_function()
  File "/home/argosopentech/git/translate/argostranslate/translate.py", line 52, in translate
    return self.hypotheses(input_text, num_hypotheses=1)[0].value
  File "/home/argosopentech/git/translate/argostranslate/translate.py", line 274, in hypotheses
    translated_paragraph = self.underlying.hypotheses(
  File "/home/argosopentech/git/translate/argostranslate/translate.py", line 159, in hypotheses
    apply_packaged_translation(
  File "/home/argosopentech/git/translate/argostranslate/translate.py", line 388, in apply_packaged_translation
    stanza_sbd = stanza_pipeline(input_text)
  File "/home/argosopentech/git/translate/env/lib/python3.8/site-packages/stanza/pipeline/core.py", line 166, in __call__
    doc = self.process(doc)
  File "/home/argosopentech/git/translate/env/lib/python3.8/site-packages/stanza/pipeline/core.py", line 160, in process
    doc = self.processors[processor_name].process(doc)
  File "/home/argosopentech/git/translate/env/lib/python3.8/site-packages/stanza/pipeline/tokenize_processor.py", line 85, in process
    _, _, _, document = output_predictions(None, self.trainer, batches, self.vocab, None,
  File "/home/argosopentech/git/translate/env/lib/python3.8/site-packages/stanza/models/tokenize/utils.py", line 163, in output_predictions
    st0 = text.index(part, char_offset) - char_offset
ValueError: substring not found
Aborted
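
Until the root cause in Stanza (or the packaged model) is fixed, a possible client-side workaround is to catch the ValueError and retry with the offending tokens stripped. This is only a sketch, not a fix; the regex for Discord-style mentions like <@!12345> is a guess based on the failing input above.

import re
from argostranslate.translate import get_installed_languages

languages = {l.code: l for l in get_installed_languages()}
trans = languages['vi'].get_translation(languages['en'])

def safe_translate(text):
    """Translate text, retrying with mention-like tokens removed if tokenization fails."""
    try:
        return trans.translate(text)
    except ValueError:
        # Hypothetical fallback: strip Discord-style mentions such as <@!12345>,
        # which appear in the failing input, and try again.
        cleaned = re.sub(r'<@!?\d+>', '', text).strip()
        return trans.translate(cleaned)

print(safe_translate('thuc luc di em trai <@!12345>'))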