Audio-AGI / AudioSep

Official implementation of "Separate Anything You Describe"
https://audio-agi.github.io/Separate-Anything-You-Describe/
MIT License

What is the scope of "Anything"? #34

Open Starlento opened 1 year ago

Starlento commented 1 year ago

It is an interesting work, and the task it aims at is as exciting to me as SAM. However, I am not familiar with audio research, so I have some questions about this work.

Firstly, I checked the dataset, and it does not seem complete for "sound separation" or "separate anything in audio". I actually tried some samples for "separate vocals from songs": no matter whether I used "Human Sounds" or "Vocal" as the query, the model could not separate the vocals, even from a very slow and simple "guitar playing and singing" sample. Conversely, when I queried "acoustic guitar", the output contained obvious vocals. Am I misunderstanding the scope, i.e., do "songs" fall outside the music this work covers?

Secondly, I would like to ask why this is called a foundation model. It seems to be treated as "multimodal or multiple types of inputs = foundation model", but I do not see what it provides for "downstream tasks". Can someone offer some insight?