Open kaiw7 opened 1 year ago
hi,
I don't know much about speech enhancement. It seems to me that an MAE model has a better chance of success. See Appendix C.2 of the AudioMAE paper.
-Yuan
Hello @kaiw7 and Yuan,
I hate to cut in and share findings from my recent paper, but there's one thing I can share: the typical 16x16 patch might span too long a duration to capture speech content. A 20 ms patch (as in typical speech models) worked best for me. Please find details at https arxiv.org/pdf/2305.14079.pdf (Please copy and paste to complete the URL in your browser; I tried not to auto-link from here.) (This paper specializes a similar SSL audio ViT for speech tasks, though it does not cover speech enhancement.)
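As a rough illustration of the duration argument, here is a minimal sketch that converts a patch's width in time frames into milliseconds. It assumes a 10 ms hop between spectrogram frames, which is not stated in this thread; the function name `patch_duration_ms` is hypothetical.

```python
def patch_duration_ms(time_frames: int, hop_ms: float = 10.0) -> float:
    """Return the time span covered by one patch, in milliseconds.

    hop_ms is an ASSUMPTION (10 ms between spectrogram frames);
    change it to match the actual front end.
    """
    if time_frames <= 0 or hop_ms <= 0:
        raise ValueError("time_frames and hop_ms must be positive")
    return time_frames * hop_ms


# A patch that is 16 frames wide vs. one that is 2 frames wide.
wide_patch = patch_duration_ms(16)
narrow_patch = patch_duration_ms(2)
print(f"16-frame patch: {wide_patch} ms, 2-frame patch: {narrow_patch} ms")
```

Under that assumed hop, a 2-frame patch corresponds to the 20 ms mentioned above, while a 16-frame patch covers eight times as long a span.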
P.S. Yuan, your new paper "Listen, Think, and Understand" is very very interesting! https://arxiv.org/abs/2305.10790
hi @daisukelab,
Thanks so much for adding this!! Are you referring to Table 3 of the M2D-S paper? If so, I totally agree with your point that 80f x 2t is a more appropriate patch shape than 16x16 for speech, and it is consistent with our experiment in Table 4 of the SSAST paper (in short, frame-like patches are better for speech tasks, while 16x16 patches are better for general audio tasks).
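To make the difference between the two patch shapes concrete, here is a small sketch that counts how many patches each shape produces along frequency and time. The spectrogram size (80 mel bins by 96 frames) and the helper name `patch_grid` are assumptions chosen for illustration, not values confirmed in this thread.

```python
def patch_grid(n_mels: int, n_frames: int, patch_f: int, patch_t: int) -> tuple[int, int]:
    """Return (patches along frequency, patches along time) for non-overlapping patches."""
    if n_mels % patch_f or n_frames % patch_t:
        raise ValueError("spectrogram size must be divisible by the patch size")
    return n_mels // patch_f, n_frames // patch_t


# ASSUMED input size for illustration: 80 mel bins x 96 time frames.
N_MELS, N_FRAMES = 80, 96

square = patch_grid(N_MELS, N_FRAMES, patch_f=16, patch_t=16)
frame_like = patch_grid(N_MELS, N_FRAMES, patch_f=80, patch_t=2)

print("16x16 grid (freq, time):", square)
print("80f x 2t grid (freq, time):", frame_like)
```

With these assumed dimensions, the 80f x 2t shape yields a single patch along frequency, so every token covers the full frequency range of a short time slice. That is what makes it frame-like. The 16x16 shape instead splits both axes into coarser blocks.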
And orthogonal to that, MAE for speech enhancement might be an interesting topic.
And thanks so much for your kind words about LTU. The LTU repo contains an interactive demo that you can play with (and also see its limitations).
-Yuan
Hi @YuanGongND,
Thank you for your valuable comment! I found that I totally missed that you had already discussed this in Section 3.8 (Comparing Patch-based and Frame-based AST) of the SSAST paper! In addition, you had already compared against speech models in Table 5. Fortunately, I haven't finished my camera-ready version for Interspeech; I'm hoping to mention what has been done in the SSAST paper. I'll try.
And I'd love to check out the LTU demo!
hi @daisukelab,
Thanks so much and congrats on your Interspeech paper.
I didn't mean to ask you to add SSAST to the M2D-S paper; I just wanted to say that if two groups independently find the same result (on different models), it is likely to be valid.
-Yuan
Hi Dr. Gong, could I ask whether the AST model can be used for the speech enhancement task? In particular, at test time, waveforms of different lengths will be fed into the trained model, so the positional encoding needs to be applied to inputs of different lengths.
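For context on the variable-length concern, one approach commonly used with ViT-style models is to interpolate the learned positional embeddings along the time axis so they match the new number of time patches. The sketch below shows that idea in plain Python. It is an assumption for illustration only, not something confirmed for AST in this thread, and `resize_time_pos_embed` is a hypothetical helper.

```python
def resize_time_pos_embed(pos_embed: list[list[float]], new_len: int) -> list[list[float]]:
    """Linearly interpolate a [T, D] positional embedding table to [new_len, D].

    ASSUMPTION: embeddings are indexed by time patch only; a real model may
    also have a frequency axis and special tokens that need separate handling.
    """
    old_len = len(pos_embed)
    if old_len == 0 or new_len <= 0:
        raise ValueError("need at least one embedding and a positive target length")
    if old_len == 1 or new_len == 1:
        # Nothing to interpolate between: repeat the first embedding.
        return [list(pos_embed[0]) for _ in range(new_len)]

    scale = (old_len - 1) / (new_len - 1)  # maps new index -> old coordinate
    resized = []
    for i in range(new_len):
        x = i * scale
        lo = min(int(x), old_len - 1)
        hi = min(lo + 1, old_len - 1)
        w = x - lo  # weight of the right-hand neighbour
        resized.append([(1 - w) * a + w * b for a, b in zip(pos_embed[lo], pos_embed[hi])])
    return resized


# Toy table with 2 time positions and 2 dimensions, stretched to 3 positions.
toy = [[0.0, 0.0], [2.0, 4.0]]
print(resize_time_pos_embed(toy, 3))
```

Whether interpolated embeddings work well for enhancement, as opposed to truncating or padding inputs to the training length, would need to be checked empirically.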