SpeechColab / GigaSpeech2

An evolving, large-scale and multi-domain ASR corpus for low-resource languages with automated crawling, transcription and refinement
Apache License 2.0

thresholds #4

Closed wwfcnu closed 3 months ago

wwfcnu commented 3 months ago

Hello, in the Multi-dimensional Filtering step, what thresholds did you use for Language Confidence Filtering and Audio Duration Filtering?

yfyeung commented 3 months ago

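This depends on the language and the data. For Thai we used a Language Confidence threshold of 0.95 and an audio duration range of 2-30. To pick values for your own data, check the quality of what each threshold setting selects; at the time, we gave ChatGPT-4 100 samples from each threshold interval to evaluate.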
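A minimal sketch of the filtering and per-interval sampling described above, not the project's actual code. The `Segment` class, the field names, the function names, and the interval edges are all hypothetical. It assumes durations are in seconds and that both bounds are inclusive, neither of which is stated in the thread.

```python
import random
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    lang_conf: float  # language-ID confidence in [0, 1]
    duration: float   # assumed to be seconds


def keep(seg, min_conf=0.95, min_dur=2.0, max_dur=30.0):
    """Apply the Thai settings quoted above (inclusive bounds are an assumption)."""
    return seg.lang_conf >= min_conf and min_dur <= seg.duration <= max_dur


def sample_per_interval(segments, edges, n=100, seed=0):
    """Draw up to n segments from each confidence interval for manual or LLM review.

    Intervals are [edges[i], edges[i+1]); the last interval also includes its upper edge.
    """
    rng = random.Random(seed)
    samples = {}
    for lo, hi in zip(edges, edges[1:]):
        is_last = hi == edges[-1]
        bucket = [
            s for s in segments
            if lo <= s.lang_conf < hi or (is_last and s.lang_conf == hi)
        ]
        samples[(lo, hi)] = rng.sample(bucket, min(n, len(bucket)))
    return samples
```

Reviewing the samples from each interval (for example `edges=[0.80, 0.90, 0.95, 1.0]`, chosen here only for illustration) shows where quality drops off, which is one way to settle on a cutoff for a new language.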

yfyeung commented 3 months ago

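That said, Thai utterances in the 20-30 range are quite sparse and make up only a small share of the training set, so you can drop them as well.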
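Before dropping a duration range, it may help to measure how much data it actually holds. The helper below is a hypothetical sketch, not part of the GigaSpeech2 pipeline; it assumes durations are in seconds and treats both bounds as inclusive.

```python
def duration_share(durations, lo, hi):
    """Return (share of utterances, share of total audio time) with lo <= d <= hi."""
    total_time = sum(durations)
    if not durations or total_time == 0:
        return 0.0, 0.0
    in_range = [d for d in durations if lo <= d <= hi]
    return len(in_range) / len(durations), sum(in_range) / total_time
```

If both shares for the 20-30 range are small, tightening the upper bound to 20 costs little training data.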