
PyTorch Implementation of Audio Flamingo

Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, Bryan Catanzaro

[Demo website] [Demo video] [ICML poster]

This repo contains the PyTorch implementation of Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities (ICML 2024). Audio Flamingo is a novel audio-understanding language model with strong audio understanding abilities, the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and strong multi-turn dialogue abilities.

We introduce a series of training techniques, architecture design, and data strategies to enhance our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting new state-of-the-art benchmarks.

Code Structure

Within each folder, the structure is closely based on the Open Flamingo repo (commit a05dcba). Each folder is self-contained, and we expect no cross dependencies between these folders.

Preparation

Running the Code

We refer to foundation/README.md, chat/README.md, and inference/README.md for specific instructions on training the foundation model, training the chat model, and running inference, as they require different setups. We used 8 A100 GPUs to train our models.

Checkpoints

Downstream applications

References

The main training and inference code within each folder (foundation/, chat/, inference/), including train/, src/, data/, and configs/, is modified from Open Flamingo (commit a05dcba) (MIT license), which borrows from flamingo-pytorch (MIT license), flamingo-mini (MIT license), and open_clip (MIT license). src/helpers.py also includes self-attention implementations based on attention-is-all-you-need-pytorch (MIT license), which borrows from OpenNMT-py (MIT license). Our code also relies on LAION-AI/CLAP (CC0-1.0 license) and microsoft/CLAP (MIT license). In chat/data/prepare_each_dataset.py, the filtering keywords are based on the LLARK paper (CC-BY-4.0 license) and the LTU paper (CC-BY-4.0 license).
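For reference, the self-attention mechanism mentioned above is standard scaled dot-product attention. The following is a minimal, illustrative NumPy sketch of a single attention head, not the repo's actual implementation in src/helpers.py (function and variable names here are our own):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention: softmax(QK^T / sqrt(d)) V."""
    # Project the input into queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    # Scaled dot-product scores: one row of scores per query position.
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(d)
    # Row-wise softmax (shifted by the max for numerical stability)
    # turns scores into attention weights that sum to 1 per query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weight-averaged mixture of the value vectors.
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))  # (sequence length, feature dim)
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 8): same sequence length and feature dim as the input
```

Production implementations add multiple heads, masking, and learned output projections on top of this core computation.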

License

The code in this repo is released under the MIT License.

Citation

@article{kong2024audio,
  title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
  author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2402.01831},
  year={2024}
}