VIPL-Audio-Visual-Speech-Understanding / AVSU-VIPL

Collection of works from VIPL-AVSU

Maybe someone can use this structure for multilingual (Mandarin and English) lip reading. #2

Closed luomingshuang closed 2 years ago

luomingshuang commented 2 years ago

[image: the icefall multi-dataset recipe, with a shared encoder trained on LibriSpeech and GigaSpeech]

In the picture above, LibriSpeech and GigaSpeech are both English ASR datasets. There is a recipe for this in icefall, and the results improve with this approach. So I have an idea: we could use this structure for multilingual (Mandarin and English) lip reading, replacing the two English ASR datasets with a Chinese lip-reading dataset and an English lip-reading dataset. Maybe someone can give it a try.

sailordiary commented 2 years ago

Great idea, but we may have to clean up and scale up the bilingual datasets before trying this. GigaSpeech is 10k hours and LibriSpeech is 1k hours, with relatively clean labels. For English and Chinese lip reading, the datasets are usually only ~200 hours and ~100 hours respectively, and of varying label quality. Do you have any VSR datasets in mind that would be ideal for this attempt?

luomingshuang commented 2 years ago

I'm just offering an idea. -_- When you decide which datasets to use, please consider them carefully. BTW, please release a large Chinese sentence-level lip-reading dataset; I can't wait to see it. Then you could use it together with an English lip-reading dataset for this idea.

sailordiary commented 2 years ago

Okay, I will keep you posted.

yshnny commented 2 years ago


This structure is somewhat similar to our previous work, SBL, in BMVC 2020. We may release SBL in the near future. Thanks for this question, but it doesn't seem very relevant to the work we have released on this GitHub repository, so I am closing it. Any advice is appreciated, and we are happy to discuss ideas that haven't been tried yet but may be worth pursuing, in other places, at any time. ^_^

luomingshuang commented 2 years ago

Em, actually, it differs from SBL. This structure uses a common encoder with two independent decoders and joiners, while SBL uses a common decoder for both Chinese and English. I think this can help each language keep its own linguistic rules. Just a thought; I'm not sure whether it's sound.
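
To make the distinction concrete, here is a minimal PyTorch sketch of the proposed structure: one shared encoder, with a separate decoder and joiner per language, routed by a language tag. All module choices, names, and sizes here are illustrative assumptions for the sake of the example, not code from icefall or any released recipe.

```python
# Hypothetical sketch: shared encoder + per-language decoders/joiners.
# Vocab sizes, dimensions, and module types are illustrative only.
import torch
import torch.nn as nn


class BilingualTransducer(nn.Module):
    """One common encoder; independent decoder and joiner per language."""

    def __init__(self, feat_dim=80, enc_dim=256, vocab_sizes=None):
        super().__init__()
        if vocab_sizes is None:
            vocab_sizes = {"zh": 4000, "en": 500}  # e.g. chars vs. BPE units
        # Shared across languages (lip/audio features in, states out).
        self.encoder = nn.LSTM(feat_dim, enc_dim, num_layers=2, batch_first=True)
        # Language-specific prediction networks and joiners.
        self.decoders = nn.ModuleDict(
            {lang: nn.Embedding(v, enc_dim) for lang, v in vocab_sizes.items()}
        )
        self.joiners = nn.ModuleDict(
            {lang: nn.Linear(enc_dim, v) for lang, v in vocab_sizes.items()}
        )

    def forward(self, feats, tokens, lang):
        enc_out, _ = self.encoder(feats)        # (B, T, enc_dim)
        dec_out = self.decoders[lang](tokens)   # (B, U, enc_dim)
        # Simple additive join over all (frame, token) pairs: (B, T, U, enc_dim)
        joint = enc_out.unsqueeze(2) + dec_out.unsqueeze(1)
        return self.joiners[lang](torch.tanh(joint))  # (B, T, U, vocab)


model = BilingualTransducer()
feats = torch.randn(2, 50, 80)  # batch of 2, 50 frames, 80-dim features
zh_logits = model(feats, torch.randint(0, 4000, (2, 10)), "zh")
en_logits = model(feats, torch.randint(0, 500, (2, 8)), "en")
```

Each batch would carry a language tag, so only that language's decoder and joiner receive gradients, while the encoder is updated by both; this is one way the shared representation could be trained while each language keeps its own output modeling.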