CoEDL / elpis

🙊 software for creating speech recognition models.
https://elpis.readthedocs.io/en/latest/
Apache License 2.0

Fix adding extra corpus text files breaking the dataset process #94

Closed benfoley closed 4 years ago

benfoley commented 4 years ago

Adding extra corpus text files in the dataset stage broke Elpis. The error was sre_constants.error: unterminated character set at position 0, raised from:

  File "/elpis/elpis/engines/common/input/clean_json.py", line 204, in deal_with_punctuation
    new_text: str = re.sub(rf"[{pattern_to_explode_by}]", " ", text)

This commit only attempts the re.sub if there is a string to build the match pattern from.
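A minimal sketch of that kind of guard, not the actual Elpis code: the function name explode_punctuation and its signature are hypothetical, and the re.escape call is an extra assumption added here so that characters such as ] or ^ in the punctuation string are treated literally.

```python
import re


def explode_punctuation(text: str, pattern_to_explode_by: str = "") -> str:
    """Replace every character listed in pattern_to_explode_by with a space.

    Hypothetical helper for illustration; not the function in clean_json.py.
    """
    if not pattern_to_explode_by:
        # No punctuation supplied: skip re.sub, because "[]" is not a valid pattern.
        return text
    # re.escape (an assumption of this sketch) keeps ']', '^', '-' and '\' literal
    # inside the character class.
    return re.sub(rf"[{re.escape(pattern_to_explode_by)}]", " ", text)
```

With this guard, an empty punctuation string returns the text unchanged instead of raising.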

benfoley commented 4 years ago

The latest commit #08a6ffb adds a condition so the punctuation regex is only attempted when there is a string of punctuation marks to build the pattern from. It seems that when that string is empty, the pattern becomes the empty character set [], which is not a valid regex, so re.sub raises the error. We seem to have lost the default set of punctuation to strip. It might have to be declared in the UI part of the engine selector. Will look at that separately. This PR will at least prevent it from breaking.
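The failure mode can be reproduced in isolation. This is a small sketch using only the standard re module; the helper name compiles and the sample punctuation strings are made up for illustration.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # re.error is the exception reported as sre_constants.error in older tracebacks.
        return False


empty_punctuation = ""
some_punctuation = ",."

# An empty string yields the pattern "[]", an unterminated character set.
empty_class_ok = compiles(rf"[{empty_punctuation}]")
# A non-empty string yields a valid character class such as "[,.]".
non_empty_class_ok = compiles(rf"[{some_punctuation}]")
```

Here empty_class_ok is False and non_empty_class_ok is True, which is why checking for a non-empty string before calling re.sub avoids the error.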

benfoley commented 4 years ago

This is failing because the punctuation isn't being cleaned. I'm about to submit another PR that deals with that.