egorsmkv / ukrainian-tts-datasets

🇺🇦 Open Source Ukrainian Text-to-Speech datasets
Apache License 2.0

Audio files cut off? #1

Open egorsmkv opened 1 month ago

egorsmkv commented 1 month ago

Some audio files appear to have been cut off. For example: accept/64926.ogg in the tetiana dataset (original text is "Уве́чері при ля́мпі ми сиді́ли в кімна́ті вчи́теля і розмовля́ли.").

I'm working on a tool that flags files whose audio might not match the text, assuming each speaker maintains a fairly consistent speaking rate. I'll post more here when I have results, but I was curious whether anyone else has seen this.

egorsmkv commented 1 month ago

It was done faster than I expected! Here are the files flagged for each speaker: ukrainian-tts_filtered.json.zip

Here is how the tool works (source code here):

  1. Punctuation is removed from the text.
  2. Silence is removed from the audio (using Silero VAD).
  3. The speaking rate is calculated as len(text) / duration_sec for each audio file (characters per second).
  4. Outliers are found in a manner similar to this article: basically, any file with a speaking rate far above or below the speaker's average is flagged.
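
For anyone who wants to try the idea without the full tool, here is a minimal sketch of steps 1, 3, and 4. It is not the tool's actual source. It assumes the speech duration of each file has already been measured after silence removal (step 2 is not shown), and it uses a plain z-score cutoff as a stand-in for the outlier rule, which may differ from the one in the linked article. The function names and the `z_threshold` parameter are hypothetical.

```python
import statistics
import unicodedata


def strip_punctuation(text: str) -> str:
    """Step 1: drop every character in a Unicode punctuation category (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def speaking_rate(text: str, speech_sec: float) -> float:
    """Step 3: characters per second of speech.

    speech_sec is assumed to be the duration AFTER silence removal.
    """
    if speech_sec <= 0:
        raise ValueError("speech duration must be positive")
    return len(strip_punctuation(text)) / speech_sec


def flag_outliers(samples, z_threshold=3.0):
    """Step 4 (assumed rule): flag files whose rate is far from the mean.

    samples: iterable of (name, text, speech_sec) for ONE speaker.
    Returns the names whose z-score exceeds z_threshold in absolute value.
    """
    samples = list(samples)
    rates = [speaking_rate(text, sec) for _, text, sec in samples]
    if len(rates) < 2:
        return []  # not enough data to estimate a spread
    mean = statistics.mean(rates)
    stdev = statistics.stdev(rates)
    if stdev == 0:
        return []  # every file has the same rate, nothing to flag
    return [
        name
        for (name, _, _), rate in zip(samples, rates)
        if abs(rate - mean) / stdev > z_threshold
    ]


if __name__ == "__main__":
    # Hypothetical data: 20 consistent clips plus one that is far too short
    # for its text, as a cut-off recording would be.
    clips = [(f"ok/{i}.ogg", "a" * 30, 2.0) for i in range(20)]
    clips.append(("cut/short.ogg", "a" * 30, 0.5))
    print(flag_outliers(clips))
```

Running it per speaker matters, since the rule only makes sense if the mean and spread come from a single voice. A z-score cutoff is also sensitive to the outliers themselves, so a median-based rule may be a better fit for small or noisy datasets.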

Based on the tool, there were:

Should still be enough data to train with, but it might be good for humans to review those files :)