falabrasil / speech-datasets

🗣️🇧🇷 Transcribed audio datasets in Brazilian Portuguese

[datasets] voxforge: train dur greater than overall #8

Open cassiotbatista opened 1 year ago

cassiotbatista commented 1 year ago

train.list has lots of duplicate entries: almost half of its lines are repeats.

$ wc -l datasets/voxforge/train.list 
8633 datasets/voxforge/train.list

$ sort -u datasets/voxforge/train.list | wc -l
4571

Is the problem in src/split/split_voxforge.sh?
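
To narrow this down, it may help to see which entries repeat and how often. The following is only a sketch, assuming train.list holds one entry per line; `list_dups` and `count_surplus` are hypothetical helper names, not part of the repo. With the counts above, the surplus would be 8633 - 4571 = 4062 lines.

```shell
# Hypothetical helper: print every line that occurs more than once,
# prefixed by its occurrence count, most frequent first.
list_dups() {
    sort "$1" | uniq -c | awk '$1 > 1' | sort -rn
}

# Hypothetical helper: number of redundant lines (total minus unique).
count_surplus() {
    total=$(wc -l < "$1")
    unique=$(sort -u "$1" | wc -l)
    echo $((total - unique))
}

# Usage (assumes the path from the issue exists locally):
#   list_dups datasets/voxforge/train.list | head
#   count_surplus datasets/voxforge/train.list
```

If most duplicated entries show a count of exactly 2, that would hint at the split script appending the same file list twice rather than at duplicates in the source data.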

cassiotbatista commented 1 year ago

The number of words in the stats is also inflated, which means the src/stats.sh output for voxforge should be disregarded, unlike that of the other FB datasets.

This makes sense, because transcripts in VF are held in a PROMPTS file, while in FB they are held in *.txt files.
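
A rough sketch of how the two layouts could be counted separately is below. It assumes each PROMPTS line starts with an utterance ID followed by the transcript, and that FB transcripts are plain-text *.txt files somewhere under a directory. Both function names are hypothetical, and this is not how src/stats.sh is known to work.

```shell
# Hypothetical helper: count transcript words in a PROMPTS file,
# skipping the first field of each line (assumed to be the utterance ID).
count_words_prompts() {
    awk 'NF > 1 { n += NF - 1 } END { print n + 0 }' "$1"
}

# Hypothetical helper: count words across all *.txt transcripts under a directory.
count_words_txt() {
    find "$1" -name '*.txt' -exec cat {} + | awk '{ n += NF } END { print n + 0 }'
}

# Usage (both paths are placeholders):
#   count_words_prompts path/to/PROMPTS
#   count_words_txt path/to/fb_dataset
```

If the stats script counts every whitespace-separated token in PROMPTS, the ID column alone would add one extra "word" per utterance, which is one possible source of the inflation.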