Hi, thanks for the great work.
When I test the pre-trained multi-speaker model on the LRW test set, I get STOI and ESTOI values similar to those quoted in the paper, but the best WER I can achieve is 79.6%, compared to the 34.2% reported in the paper.
Could you specify the steps you used to achieve the 34.2% WER with Google ASR? Do you crop the synthesised word, and do you use a specific Google ASR model/configuration? Do you use the entire LRW test set or just a subset?
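For context, here is a minimal sketch of the evaluation loop I am currently using (the `samples` list and file paths are placeholders, and I am assuming 16 kHz mono LINEAR16 WAV output with the default `en-US` Google ASR model). Since each LRW clip contains a single word, I score WER as the fraction of clips whose top transcript does not match the ground-truth label:

```python
from google.cloud import speech

client = speech.SpeechClient()

def transcribe(wav_path: str) -> str:
    """Send one synthesised clip to Google ASR and return the top transcript."""
    with open(wav_path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,  # assumes 16 kHz mono WAV output
        language_code="en-US",    # default model, no extra configuration
    )
    response = client.recognize(config=config, audio=audio)
    if not response.results:
        return ""  # no transcript returned -> counted as an error
    return response.results[0].alternatives[0].transcript

# samples: list of (synthesised_wav_path, ground_truth_word) pairs -- placeholder
errors = 0
for wav_path, label in samples:
    hyp = transcribe(wav_path).strip().lower()
    errors += int(hyp != label.lower())  # single-word clips: WER reduces to word accuracy
wer = 100.0 * errors / len(samples)
print(f"WER: {wer:.1f}%")
```

If your pipeline differs from this (e.g. a different encoding, model variant, or a phrase-hint/vocabulary restriction), that could explain the gap.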
It would be great to know the exact steps so that future research can make fair comparisons.
Thanks