falabrasil / kaldi-br

☕🇧🇷 Scripts for Kaldi in Brazilian Portuguese
MIT License

[fb-falabrasil] #utts to subset for mono, tri-deltas, tri-lda, tri-sat #8

Closed cassiotbatista closed 2 years ago

cassiotbatista commented 2 years ago

I don't think this matters for the DNN at the end of the day, but I took a rough average of the numbers from the librispeech and aspire recipes. This is just for logging / reporting purposes.

One thing I didn't really take care of was watching for the selection of westpoint's utts among the shortest ones for monophone training, as I believe that dataset might dominate the subset because it's full of word-pieces. Just something to keep in mind.

| recipe        | #utts | hours | mono | tri-deltas | tri-lda | tri-sat  |
|---------------|-------|-------|------|------------|---------|----------|
| librispeech   | 29k   | 100h  | 2k   | 5k         | 10k     | full set |
| aspire        | 1.6M  | ?     | 10k  | 30k        | 100k    | full set |
| fb-falabrasil | 650k  | ?     | 5k   | 10k        | 30k     | full set |
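For reference, a minimal sketch of how those subset sizes would translate into Kaldi commands with `utils/subset_data_dir.sh` (only the sizes come from the table; the data dir names are assumptions, and the commands are just printed here rather than run):

```shell
# Sketch of the fb-falabrasil subset selection per GMM stage.
# utils/subset_data_dir.sh is Kaldi's standard subsetting tool; the
# data dir names below are hypothetical, only the sizes follow the table.
train=data/train
mono_subset=5000      # mono: 5k
deltas_subset=10000   # tri-deltas: 10k
lda_subset=30000      # tri-lda: 30k; tri-sat then uses the full set

# --shortest picks the N shortest utts (fastest to align), which is
# exactly where westpoint's word-piece utts could dominate.
cmd_mono="utils/subset_data_dir.sh --shortest $train $mono_subset data/train_5k_short"
cmd_deltas="utils/subset_data_dir.sh $train $deltas_subset data/train_10k"
cmd_lda="utils/subset_data_dir.sh $train $lda_subset data/train_30k"
printf '%s\n' "$cmd_mono" "$cmd_deltas" "$cmd_lda"
```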
cassiotbatista commented 2 years ago

An additional note: some recipes use all the data right from monophone training, but I don't believe that's very helpful, especially for datasets like lapsstory, in which the utts are unusually long (>= 30s).

When training on it individually, the beam has to be scaled up to force-align lapsstory because of its long utts.
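A hedged sketch of that wider-beam alignment: `steps/align_si.sh` does expose `--beam` and `--retry-beam` (Kaldi defaults are 10 / 40), but the values and directory names below are illustrative assumptions, not the recipe's actual settings, and the command is only printed here:

```shell
# Widen the beams so force-alignment doesn't fail on lapsstory's
# unusually long (>= 30s) utterances. Values are illustrative;
# Kaldi's defaults for align_si.sh are --beam 10 --retry-beam 40.
beam=20
retry_beam=80
cmd="steps/align_si.sh --beam $beam --retry-beam $retry_beam data/lapsstory data/lang exp/tri3b exp/tri3b_ali_lapsstory"
echo "$cmd"
```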