Closed luomingshuang closed 2 years ago
Great idea, but we may have to clean/scale up the bilingual datasets before we try this. GigaSpeech is 10k hours and LibriSpeech is 1k hours, both with relatively clean labels. For English and Chinese lip reading, the datasets are usually only ~200 hours and ~100 hours, with labels of varying quality. Do you have any VSR datasets in mind that would be ideal for this attempt?
I'm just offering an idea. -_-. Whichever datasets you decide to use, please consider them carefully. BTW, please release a large Chinese sentence-level lip-reading dataset; I can't wait to see it. Then you could use it together with an English lip-reading dataset for this idea.
Okay, I will keep you posted.
In the above picture, LibriSpeech and GigaSpeech are both English ASR datasets. There is a recipe for this in icefall, and training on both datasets this way improves the results. So my idea is that we could use the same structure for multilingual lip reading: replace the two English ASR datasets with a Chinese lip-reading dataset and an English lip-reading dataset. Maybe someone could give it a try.
This structure is a bit similar to our previous work on SBL in BMVC-2020. We may release SBL in the near future. Thanks for this question, but it doesn't seem very relevant to the work we have released on this GitHub repository, so I am closing it. Any advice would be appreciated, and we can discuss things that haven't been done yet but may be worth doing elsewhere at any time. ^_^
Em, actually, it differs from SBL. This structure uses a common encoder, but two independent decoders and joiners (while SBL uses a common decoder for Chinese and English). I think this can help each language keep its own language rules. Just a thought; I'm not sure if it's scientific.
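To make the idea concrete, here is a minimal sketch (not code from icefall or SBL) of a transducer-style model with one shared encoder and a separate decoder + joiner per language. All module choices and sizes here are illustrative assumptions, e.g. an LSTM encoder over lip-movement features and embedding-based prediction networks:

```python
import torch
import torch.nn as nn

class MultilingualTransducer(nn.Module):
    """Sketch: shared encoder, per-language decoders and joiners."""

    def __init__(self, feat_dim=80, enc_dim=256, dec_dim=256,
                 zh_vocab=4000, en_vocab=500):
        super().__init__()
        # Shared encoder: consumes visual features from both languages.
        self.encoder = nn.LSTM(feat_dim, enc_dim, num_layers=2,
                               batch_first=True)
        # Independent decoders (prediction networks), one per language.
        self.zh_decoder = nn.Embedding(zh_vocab, dec_dim)
        self.en_decoder = nn.Embedding(en_vocab, dec_dim)
        # Independent joiners, one per language.
        self.zh_joiner = nn.Linear(enc_dim + dec_dim, zh_vocab)
        self.en_joiner = nn.Linear(enc_dim + dec_dim, en_vocab)

    def forward(self, feats, tokens, lang):
        enc_out, _ = self.encoder(feats)            # (B, T, enc_dim)
        if lang == "zh":
            dec_out = self.zh_decoder(tokens)       # (B, U, dec_dim)
            joiner = self.zh_joiner
        else:
            dec_out = self.en_decoder(tokens)
            joiner = self.en_joiner
        # Combine every (t, u) pair, as in a transducer joiner.
        t = enc_out.unsqueeze(2).expand(-1, -1, dec_out.size(1), -1)
        u = dec_out.unsqueeze(1).expand(-1, enc_out.size(1), -1, -1)
        return joiner(torch.cat([t, u], dim=-1))    # (B, T, U, vocab)

model = MultilingualTransducer()
feats = torch.randn(2, 50, 80)  # 2 clips, 50 frames of visual features
zh_logits = model(feats, torch.randint(0, 4000, (2, 7)), lang="zh")
en_logits = model(feats, torch.randint(0, 500, (2, 9)), lang="en")
print(zh_logits.shape)  # torch.Size([2, 50, 7, 4000])
print(en_logits.shape)  # torch.Size([2, 50, 9, 500])
```

Because only the encoder is shared, gradients from both datasets shape the visual representation, while each language's decoder and joiner are free to model its own token statistics.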