This is interesting work, and the task it aims at is as exciting to me as SAM.
But I am not familiar with audio research, so I have some questions about this work.
Firstly, I checked the dataset, and it does not seem very complete for "sound separation" or "separate anything in audio".
In fact, I tried some samples for separating vocals from songs: whether I used the query "Human Sounds" or "Vocal", the model could not isolate the voice, even from a very slow and simple "guitar playing and singing" sample. Conversely, when I queried "acoustic guitar", the output still contained obvious vocals (a sketch of what I tried is below).
Am I misunderstanding the scope, i.e., do "songs" fall outside the music that this work is meant to cover?
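For concreteness, here is a minimal sketch of the experiment I mean. The `load_separation_model` and `separate` names are hypothetical placeholders for whatever the released inference interface actually is, not the project's real API:

```python
import soundfile as sf

# Hypothetical interface (placeholder names, not the project's actual API).
model = load_separation_model("checkpoint.ckpt")

# A slow, simple mixture: one acoustic guitar plus one singer.
mixture, sr = sf.read("guitar_and_singing.wav")

# Neither query isolated the voice for me.
for query in ["Human Sounds", "Vocal"]:
    vocals = model.separate(mixture, text_query=query)
    sf.write(f"vocals_{query.replace(' ', '_')}.wav", vocals, sr)

# The reverse query: the separated guitar track still contained obvious vocals.
guitar = model.separate(mixture, text_query="acoustic guitar")
sf.write("guitar_only.wav", guitar, sr)
```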
Secondly, I would like to ask why this is called a foundation model. The implied claim seems to be that multimodal input, or accepting multiple types of inputs, makes something a foundation model, but I do not see what it actually provides for "downstream tasks". Can someone offer some insight?