VITA-MLLM / VITA

✨✨VITA: Towards Open-Source Interactive Omni Multimodal LLM

Missing citation #6

Closed ddlBoJack closed 1 month ago

ddlBoJack commented 1 month ago

Congratulations to your team on the progress you've made in integrating full-duplex speech capabilities into an interactive MLLM in this work, VITA. As I was reading your article, I came across the statement in the abstract: "To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM." While it's commendable that your team achieved this capability within the open-source MLLM domain, we believe that this description might inadvertently overlook some prior work in this area.

Our team has recently published a paper titled Language Model Can Listen While Speaking (https://arxiv.org/abs/2408.02622), where we formalized the audio duplex problem and proposed a solution. We would greatly appreciate it if you could revise the description in your article to acknowledge our work and provide a citation where appropriate. Recognizing prior contributions not only upholds academic integrity but also honors the efforts of all researchers working on advancing this field.

Thank you for your attention to this matter. We are looking forward to your reply.

BradyFU commented 1 month ago

Hi, I have seen your good work and we will cite it; we learned a lot from it. But I am not sure why you are calling this a "Biased Claim". Our work focuses on MLLM (vision + LLM + more), which does not seem to conflict with your work (audio + language) at the moment. If there is any similar work, you are welcome to point it out and we will correct our claim as soon as possible. For now, I hope you can remove "Biased Claim" from the issue title. Or do you have more suggestions? You are welcome to discuss. Thanks.

ddlBoJack commented 1 month ago

Thank you for your response and for agreeing to cite our work. We appreciate your acknowledgment and are glad that our research has been helpful to your team. Regarding the title, I have updated it to remove the term "Biased Claim."

I agree that our work focuses primarily on audio+language models, while your research expands into the realm of MLLM, integrating vision and other modalities. Nevertheless, given the overlap in the core concept of full-duplex interaction, I believe a citation would provide valuable context for readers in both fields.

Thank you again for your understanding and cooperation.

ddlBoJack commented 1 month ago

Regarding the first interactive, duplex MLLM with both speech and vision modalities, I know of another work (https://aubrey-ao.github.io/BodyOfHer/). The author released some demos and a tech report (https://arxiv.org/abs/2408.02879), which you may want to take a look at.

BradyFU commented 1 month ago

That is interesting! I am thinking of narrowing our claim to "the first-ever open-source interactive omni multimodal LLM", instead of "the first to exploit non-awakening interaction and audio interrupt in MLLM".

ddlBoJack commented 1 month ago

Hi, is there any progress?

BradyFU commented 1 month ago

Hi, we have received some feedback and will submit updates by the end of this week.

ifsheldon commented 1 month ago

@ddlBoJack Hi! I saw this issue and followed the link to your paper. Your research is also impressive. Do you have any plans to open-source your work, or at least release your model?

ddlBoJack commented 1 month ago

Thank you for your interest in our work. There are other reproduction codebases available online, and it is not hard to reproduce yourself. We do plan to open source it, but since this is a university-industry collaboration project, that depends on a number of other factors. So stay tuned.