awni / speech

A PyTorch Implementation of End-to-End Models for Speech-to-Text
Apache License 2.0

A question about the TRAINING SET used in timit script #22

Closed: wolverineq closed this issue 6 years ago

wolverineq commented 6 years ago

Normally we use the standard 462-speaker data as the training set, while this timit example uses 556-speaker data (including some data from the full test set) in train.json. Although the WER results in this repo look promising, are the methods used here really convincing or comparable?

awni commented 6 years ago

Yup, the intention was to mimic the kaldi setup using the core-test set as the "test" set. The training and dev set might be split in a different way but should contain the same number of speakers.

wolverineq commented 6 years ago

I have followed your repo for a long time and appreciate your work very much, but I still wonder: why not split the timit data in the same way that others do? I tried splitting it in the more common way (462 speakers from timit/train as the train set, 50 speakers from timit/test as the dev set, and 24 speakers from timit/test as the core test set). The result was 19.9% WER for ctc_config.json in examples/timit, which is not very good performance. If you provided baseline results with the more common data split, it would be very helpful for people who are interested in this repo and want to do extension work (that's why I see this issue as important). Thank you again for your work here!
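For concreteness, here is a minimal sketch of that conventional split. It assumes the standard TIMIT directory layout, `timit/{train,test}/<dialect>/<speaker>/<utt>.wav` (file extensions may be uppercase in some distributions), and the 50-speaker dev and 24-speaker core-test lists, which are published with the corpus, are passed in rather than hard-coded:

```python
from pathlib import Path

def collect_by_speaker(root):
    """Map speaker ID (the parent directory name) to its wav files."""
    by_speaker = {}
    for wav in Path(root).rglob("*.wav"):  # use "*.WAV" if files are uppercase
        by_speaker.setdefault(wav.parent.name.lower(), []).append(wav)
    return by_speaker

def standard_split(timit_root, dev_speakers, core_test_speakers):
    """462-speaker train, 50-speaker dev, 24-speaker core test."""
    train = collect_by_speaker(Path(timit_root) / "train")  # all 462 train speakers
    test = collect_by_speaker(Path(timit_root) / "test")    # full 168-speaker test set
    dev = {s: u for s, u in test.items() if s in dev_speakers}
    core_test = {s: u for s, u in test.items() if s in core_test_speakers}
    return train, dev, core_test
```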

awni commented 6 years ago

Hi, I checked the Kaldi setup and you are correct. The training set there uses only 462 speakers. I'm using the same test set though. I've updated the code to generate the same training set. The dev set is the same size in terms of speakers and utterances, but may not contain the exact same speakers. The test set is identical.

I will need to rerun experiments to have an apples-to-apples comparison.
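As a sanity check after regenerating the splits, something like the following can confirm the speaker and utterance counts. This is a rough sketch: it assumes the json files are line-delimited, that each entry stores an audio path under a key like `"audio"` (a hypothetical key; adjust to the actual schema), and that the speaker ID is the path's parent directory, as in TIMIT's layout. Expected counts are 462 train speakers, 50 dev speakers, and 24 core-test speakers.

```python
import json
from pathlib import Path

def split_stats(json_path, audio_key="audio"):
    """Count unique speakers and utterances in a line-delimited json split."""
    speakers, utterances = set(), 0
    with open(json_path) as f:
        for line in f:
            entry = json.loads(line)
            # TIMIT-style paths put the speaker ID in the parent directory.
            speakers.add(Path(entry[audio_key]).parent.name)
            utterances += 1
    return len(speakers), utterances

for name in ("train.json", "dev.json", "test.json"):
    n_spk, n_utt = split_stats(name)
    print(f"{name}: {n_spk} speakers, {n_utt} utterances")
```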

wolverineq commented 6 years ago

Thanks for your attention to this! However, it may be better if you use exactly the same train/dev/test sets as others do, for convenience of comparison...

awni commented 6 years ago

The training set is exactly the same, the dev set is a bit different. I'll look into making it the same, but this is lower priority at the moment.


stefan-falk commented 3 years ago

> I have followed your repo for a long time and appreciate your work very much, but I still wonder: why not split the timit data in the same way that others do? I tried splitting it in the more common way (462 speakers from timit/train as the train set, 50 speakers from timit/test as the dev set, and 24 speakers from timit/test as the core test set). The result was 19.9% WER for ctc_config.json in examples/timit, which is not very good performance. If you provided baseline results with the more common data split, it would be very helpful for people who are interested in this repo and want to do extension work (that's why I see this issue as important). Thank you again for your work here!

I've been looking for "the right way" to split TIMIT myself. Is it common practice to use 50 speakers from the original test set as the dev set?