This is interesting work, and the task it aims at is as exciting to me as SAM.
But I am not familiar with audio research, so I have some questions about this work.
Firstly, I checked the dataset, and it does not seem very complete for "sound separation" or "separate anything in audio".
In fact, I tried some samples for separating vocals from songs: whether I used the query "Human Sounds" or "Vocal", the model could not isolate the voice, even from a very slow and simple "guitar playing and singing" sample. Conversely, when I queried "acoustic guitar", the output still contained obvious vocals (a sketch of what I tried is below).
Am I misunderstanding the scope, i.e., do "songs" fall outside the music that this work is meant to cover?
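For concreteness, here is a minimal sketch of the experiment I mean. The `load_separation_model` and `separate` names are hypothetical placeholders for whatever the released inference interface actually is, not the project's real API:

```python
import soundfile as sf

# Hypothetical interface (placeholder names, not the project's actual API).
model = load_separation_model("checkpoint.ckpt")

# A slow, simple mixture: one acoustic guitar plus one singer.
mixture, sr = sf.read("guitar_and_singing.wav")

# Neither query isolated the voice for me.
for query in ["Human Sounds", "Vocal"]:
    vocals = model.separate(mixture, text_query=query)
    sf.write(f"vocals_{query.replace(' ', '_')}.wav", vocals, sr)

# The reverse query: the separated guitar track still contained obvious vocals.
guitar = model.separate(mixture, text_query="acoustic guitar")
sf.write("guitar_only.wav", guitar, sr)
```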
Secondly, I would like to ask why this is called a foundation model. The implied claim seems to be that multimodal input, or accepting multiple types of inputs, makes something a foundation model, but I do not see what it actually provides for "downstream tasks". Can someone offer some insight?