Tutorial file contains duplicate content

ebu / benchmarkstt

Open Source AI Benchmarking toolkit for benchmarking speech to text services

MIT License

54 stars, 8 forks

Tutorial file contains duplicate content #128

Closed EyalLavi closed 4 years ago

EyalLavi commented 4 years ago

The Kaldi hypothesis file has the transcript duplicated. As a result, the WER comes out greater than 1.
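To illustrate why a duplicated hypothesis inflates the score, here is a minimal sketch of a generic word-level WER (edit distance divided by reference length). It is not benchmarkstt's own implementation, which may tokenize or align differently; the function name `word_error_rate` and the sample sentences are made up for this example.

```python
def word_error_rate(reference, hypothesis):
    """Generic WER: word-level edit distance / number of reference words.

    Illustrative only; not taken from benchmarkstt.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences, not the tutorial's transcript.
reference = "the quick brown fox"
hypothesis = "the quick brown dog"

print(word_error_rate(reference, hypothesis))
# Same hypothesis written twice, as in the duplicated file.
print(word_error_rate(reference, hypothesis + " " + hypothesis))
```

Every word of the second copy counts as an insertion, so the duplicate alone adds roughly 1.0 to the WER, and any genuine recognition error then pushes it above 1.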

EyalLavi commented 4 years ago

The root cause is the regex in the tutorial: benchmarkstt-tools normalization --inputfile qt_kaldi.json --outputfile qt_kaldi_hypothesis.txt --regex '^.*"text":"([^"]+)".*' '\1'

EyalLavi commented 4 years ago

On further investigation, it looks like the regex does work, but the file in the repo has the content duplicated.
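A quick way to check the regex in isolation is to apply the same pattern and replacement with Python's `re` module. This is only a sketch: the sample line is hypothetical (the real qt_kaldi.json is not reproduced here), and benchmarkstt's regex normalizer may apply the pattern with different flags or line handling.

```python
import re

# Pattern and replacement copied from the tutorial command.
PATTERN = re.compile(r'^.*"text":"([^"]+)".*')
REPLACEMENT = r'\1'

# Hypothetical single-line input standing in for the Kaldi JSON output.
sample = '{"status":"ok","text":"hello world","id":1}'


def extract_text(line):
    """Replace the whole line with the captured value of the "text" field."""
    return PATTERN.sub(REPLACEMENT, line)


print(extract_text(sample))
```

On this input the captured text appears once in the result, which is consistent with the duplication coming from the file in the repo and not from the regex.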