facebookresearch / VisualVoice

Audio-Visual Speech Separation with Cross-Modal Consistency

Speech enhancement evaluation #29

Open syl4356 opened 9 months ago

syl4356 commented 9 months ago

Hello, thanks for your great work.

I've been trying to reproduce the enhancement performance on the VoxCeleb2 test set, but the numbers from the provided pre-trained model were much lower than those reported in the paper. (I used evaluateSeparation.py from the main directory to compute the metrics; a sketch of my scoring setup is below.)
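Roughly, my scoring loop looks like the sketch below. The file paths are placeholders for my local data, and I'm assuming mir_eval's BSS-eval metrics, which I believe matches what evaluateSeparation.py computes; please correct me if the official script does something different.

```python
import numpy as np
import soundfile as sf
import mir_eval

# Placeholder paths: ground-truth clean speech and the model's outputs
# for one two-speaker test mixture.
ref1, sr = sf.read('gt/speaker1.wav')
ref2, _ = sf.read('gt/speaker2.wav')
est1, _ = sf.read('output/speaker1_separated.wav')
est2, _ = sf.read('output/speaker2_separated.wav')

# Trim everything to a common length before scoring.
n = min(len(ref1), len(ref2), len(est1), len(est2))
refs = np.stack([ref1[:n], ref2[:n]])
ests = np.stack([est1[:n], est2[:n]])

# BSS-eval: signal-to-distortion, -interference, and -artifact ratios.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(refs, ests)
print(f'SDR={sdr.mean():.2f}  SIR={sir.mean():.2f}  SAR={sar.mean():.2f}')
```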

And when I ran test_synthetic_script.sh, the outputs sounded poor to my ears. Listening to the mixture (audio_mixed.wav), the off-screen noise was much louder than the voices, so I suspect this mixing condition may simply be too hard for the model.

I have two questions regarding this.

  1. Is the pre-trained model in the av-enhancement directory your best model for speech enhancement, as opposed to separation?
  2. Was your evaluation done on mixtures of two speech signals plus off-screen noise added with weight 1 (my understanding is sketched below)? Isn't it too difficult for the model to separate and enhance at the same time?
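For clarity, this is what I understand the mixing in test_synthetic_script.sh to amount to; the paths are placeholders and the weight-1 noise term is my reading of the script, so please correct me if I've misread it.

```python
import numpy as np
import soundfile as sf

# Placeholder paths: two on-screen speech tracks and one off-screen noise track.
s1, sr = sf.read('speech1.wav')
s2, _ = sf.read('speech2.wav')
noise, _ = sf.read('noise.wav')

# Sum the trimmed tracks; the noise is added with weight 1.0 (my reading).
n = min(len(s1), len(s2), len(noise))
mixture = s1[:n] + s2[:n] + 1.0 * noise[:n]
sf.write('audio_mixed.wav', mixture, sr)
```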

Thanks in advance.