Corpora parsing should have each sentence on its own line

IBM / Train-Custom-Speech-Model

Create a custom Watson Speech to Text model using specialized domain data

https://developer.ibm.com/patterns/customize-and-continuously-train-your-own-watson-speech-service/

Apache License 2.0


Corpora parsing should have each sentence on its own line #50

Closed: pvaneck closed this issue 5 years ago

pvaneck commented 5 years ago

The Watson STT documentation specifies:

"Include each sentence of the corpus on its own line, and terminate each line with a carriage return. Including multiple sentences on the same line can degrade accuracy."

The data parser should ensure that the output it produces from every .txt file has each sentence on its own line.

https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-corporaWords#prepareCorpus
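
As a rough illustration of the requested behavior, here is a minimal Python sketch of a post-processing step that rewrites text so each sentence sits on its own line. The function name `split_into_sentence_lines` and the `line_ending` parameter are hypothetical and not taken from this repository's parser. The regex boundary rule is a naive assumption: it splits after `.`, `!`, or `?` followed by whitespace, so it will mis-split abbreviations such as "Dr. Smith". The default `"\n"` terminator is also an assumption; the documentation quoted above says "carriage return", so adjust it if the service turns out to need something else.

```python
import re

# Naive boundary rule (an assumption): a sentence ends at '.', '!' or '?'
# when that character is followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_into_sentence_lines(text: str, line_ending: str = "\n") -> str:
    """Return `text` rewritten so that each sentence is on its own line.

    Hypothetical helper for illustration only; not the repository's parser.
    """
    # Collapse every run of whitespace, including existing line breaks,
    # so that sentences wrapped across lines are joined back together.
    normalized = " ".join(text.split())
    if not normalized:
        return ""
    sentences = _SENTENCE_BOUNDARY.split(normalized)
    # Terminate every line, including the last one.
    return "".join(sentence + line_ending for sentence in sentences if sentence)


if __name__ == "__main__":
    sample = "The patient is stable. Vitals are normal!  Any questions?\nNone."
    print(split_into_sentence_lines(sample), end="")
```

Running the script prints the four sentences of `sample` on four separate lines. A production parser would probably want a tokenizer that understands abbreviations and decimal numbers instead of this regex.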