Sreyan88 / GAMA

Code for the paper: GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
https://sreyan88.github.io/gamaaudio/
Apache License 2.0

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities


This is the official implementation of our paper GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities.

Updates 🚨

Demo

We host two Hugging Face Spaces, generously supported by Hugging Face 🤗, for GAMA and GAMA-IT. Feel free to try our models here:

[![GAMA](https://img.shields.io/badge/%F0%9F%A4%97%20GAMA-Online_Demo-orange)](https://huggingface.co/spaces/sonalkum/GAMA)     [![GAMA](https://img.shields.io/badge/%F0%9F%A4%97%20GAMA%20IT-Online_Demo-black)](https://huggingface.co/spaces/sonalkum/GAMA-IT) 

Resources

All resources required for GAMA and GAMA-IT can be found in this drive. The files are described in the respective sections below. We also share some additional CLAP checkpoints (to be used with this repository) to promote research in this space. These CLAP checkpoints are trained on 2M+ audio-caption pairs with large batch sizes on H100s.

Setup 🏋️

conda create -n gama python=3.10
conda activate gama
pip install -r requirements.txt
pip install -e hf-dev-train/transformers-main
pip install -e peft-main

Training 🏃‍♂️

When preparing audio files, make sure they all use the same sampling rate of 16 kHz.
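As a quick sanity check before training, the sampling rate of each file can be read from its header. The sketch below is not part of the GAMA code base: the function name `find_mismatched_wavs` and the directory layout are assumptions made for illustration, and the standard-library `wave` module only reads uncompressed PCM WAV files, so other formats would need a different reader.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 16_000  # 16 kHz, the sampling rate required for training


def find_mismatched_wavs(audio_dir, expected_rate=EXPECTED_RATE):
    """Return (path, rate) pairs for WAV files whose rate differs from expected_rate.

    Hypothetical helper: searches audio_dir recursively for *.wav files.
    """
    mismatched = []
    for path in sorted(Path(audio_dir).rglob("*.wav")):
        # Only the header is needed, so no audio frames are decoded here.
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            mismatched.append((str(path), rate))
    return mismatched
```

Any file reported by this check needs to be resampled before use, for example with `ffmpeg -i in.wav -ar 16000 out.wav`.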

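The dataset is a JSON file containing a list of dicts in the following format (the `%` annotations are explanatory only and must not appear in the actual file, since JSON does not allow comments):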

[
 {
  "audio_id": "path_to_audio_file",
  "instruction": "Question",
  "dataset": "dataset_name", % (optional)
  "task": "type_of_task", % question type (optional)
  "output": "correct_answer"
 },
  ...
]
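The format above can be produced and checked with a few lines of Python. This is a minimal sketch rather than code from the repository: the helper names `validate_entries` and `write_dataset`, the example path, and the example question and answer are all hypothetical, and treating `audio_id`, `instruction`, and `output` as required is an assumption based on the other two fields being marked optional.

```python
import json

# Assumed required keys: the fields not marked "(optional)" in the format above.
REQUIRED_KEYS = {"audio_id", "instruction", "output"}


def validate_entries(entries):
    """Raise ValueError if the data is not a list of dicts with the required keys."""
    if not isinstance(entries, list):
        raise ValueError("top-level JSON value must be a list")
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not a dict")
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"entry {index} is missing keys: {sorted(missing)}")


def write_dataset(entries, json_path):
    """Validate the entries, then write them as a JSON list of dicts."""
    validate_entries(entries)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=1, ensure_ascii=False)


# Hypothetical example entry; replace the values with your own data.
example_entries = [
    {
        "audio_id": "/data/audio/example.wav",
        "instruction": "What is happening in the audio?",
        "dataset": "my_dataset",  # optional
        "task": "open-ended question",  # optional
        "output": "A dog barks while cars pass by.",
    }
]
```

Calling `write_dataset(example_entries, "train.json")` would write a file in the expected layout; the file name is likewise only an example.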

Use the following commands to train the model:

conda activate gama
cd train_script
# run finetuning on the data to train GAMA
./stage1.sh # set the `--base_model` arg to the path of Llama-2-7b-chat-hf-qformer
./stage2.sh # set the checkpoint from stage 1 training
./stage3.sh # set the checkpoint from stage 2 training
./stage4.sh # set the checkpoint from stage 3 training
# to instruction tune GAMA
./stage5.sh # set the checkpoint from stage 4 training

To run inference with GAMA or instruction-tune it on your own dataset, we provide the stage 4 and stage 5 checkpoints here.


Inference of GAMA 🔖

To run inference with GAMA/GAMA-IT on the CompA-R benchmark, change the model path in gama_inf.py (line 215) and run:

python gama_inf.py

Evaluation

To evaluate GAMA, we use the evaluation scheme employed by LTU; the evaluation scripts can be found here.


Note: The current code of GAMA does not include the implementation of soft-prompt. The code for soft-prompt (and its related checkpoints) will be released after the paper is accepted. However, the stage 5 checkpoint released currently performs almost as well as with soft-prompt.


Acknowledgement 🌻

We would like to thank the authors of LTU for open-sourcing their code, which inspired our work.

Citation

@misc{ghosh2024gamalargeaudiolanguagemodel,
      title={GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities}, 
      author={Sreyan Ghosh and Sonal Kumar and Ashish Seth and Chandra Kiran Reddy Evuru and Utkarsh Tyagi and S Sakshi and Oriol Nieto and Ramani Duraiswami and Dinesh Manocha},
      year={2024},
      eprint={2406.11768},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2406.11768}, 
}