Audio files cut off?

egorsmkv / ukrainian-tts-datasets

🇺🇦 Open Source Ukrainian Text-to-Speech datasets

Apache License 2.0

It was done faster than I expected! Here are the files flagged for each speaker: ukrainian-tts_filtered.json.zip

Here is the way this tool works (source code here):

- Punctuation is removed from the text
- Silence is removed from the audio (using Silero VAD)
- The speaking rate is calculated as len(text) / duration_sec for each audio file (characters per second)
- Outliers are found in a similar manner to this article: basically any file with a speaking rate far above or below the average
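
The steps above could be sketched roughly like this in Python. This is not the tool's actual source: the function names are made up for illustration, `duration_sec` is assumed to already be the speech duration after silence removal (the VAD step is not shown), and the z-score cutoff is an assumed stand-in for whatever outlier rule the linked article uses.

```python
import statistics
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def speaking_rate(text: str, duration_sec: float) -> float:
    """Characters per second: len(text) / duration_sec, punctuation removed.

    duration_sec is assumed to be measured after silence was trimmed.
    """
    return len(strip_punctuation(text)) / duration_sec


def find_outliers(rates: dict[str, float], z_threshold: float = 3.0) -> list[str]:
    """Return file names whose rate is far above/below the average.

    Assumption: "far" means more than z_threshold population standard
    deviations from the mean; the real tool may use a different rule.
    """
    values = list(rates.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # every file has the same rate, nothing to flag
    return [
        name for name, rate in rates.items()
        if abs(rate - mean) / stdev > z_threshold
    ]
```

A file whose audio was cut off keeps its full transcript but has a short duration, so its characters-per-second value comes out unusually high, which is why this kind of check can catch truncated recordings.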

Based on the tool, there were:

Should still be enough data to train with, but it might be good for humans to review those files :)
