NVIDIA / audio-flamingo

PyTorch implementation of Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities.

MIT License

163 stars 9 forks source link

release of prefix, prompt, pre/post-processing script for full reproduction #9

Open jasonppy opened 1 month ago

jasonppy commented 1 month ago

Hi Zhifeng,

Thank you so much for your help!

This issue is related to https://github.com/NVIDIA/audio-flamingo/issues/5, https://github.com/NVIDIA/audio-flamingo/issues/6, https://github.com/NVIDIA/audio-flamingo/issues/7, https://github.com/NVIDIA/audio-flamingo/issues/8. I would like to have a centralized page that address the reproduction issue so that other researchers don't have to check different issues for different datasets, considering there are 41 different datasets used for training and evaluation.

Below is my reproduction results (I loaded your open sourced weights for foundation model, and follow the deterministic decoding params as mentioned in https://github.com/NVIDIA/audio-flamingo/issues/5).

Table 2:	Task type	metric	Prior SotA	AF reported	AF reproduced
Clotho-v2	CAP	CIDEr	0.441	0.465	0.461
ClothoAQA unanimous	CLS	Acc	74.9	86.9	86.7
ClothoAQA non-binary	CLS	Acc	29.1	49.5	46.6
ClothoAQA numerical	CLS	Acc	26.2	36.4	35.4
MusicAVQA audio-only	CLS	Acc	72.1	71.6	75.6
CochlScene	CLS	Acc	91.6	83.0	82.2
NonSpeech7k	CLS	Acc	79.0	85.1	83.9
FSD50k	CLS	F1 approx	65.6	69.7	69.6
NS instrument	CLS	Acc	78.8	77.1	48.3
NS quality	CLS	F1	46.3	66.7	0
NS source	CLS	Acc	60.1	78.7	47.7

Table 3:	Task type	metric	Prior SotA	AF reported	AF reproduced
AudioCaps	CAP	CIDEr	0.281	0.502	0.469
CREMA-D	CLS	Acc	18.5	26.5	27.9
Ravdess	CLS	Acc	21.7	20.9	25.6
US8K	CLS	Acc	71.9	75.0	73.7
GTZAN	CLS	Acc	71.0	67.9	60.5
Medley-solos-DB	CLS	Acc	61.3	92.7	85.3

and below is the full list of prefix and prompt I used for each dataset:

path2prompt = {
            "clotho_v2": {
                'prefix': 'The task is audio captioning.',
                'prompt': 'Describe the sound in a sentence.'
            },
            "clotho_aqa": {
                'prefix': 'The task is audio question answering.',
                'prompt': f'Please answer this question: {question}\nOptions:\nyes.\nno.' if answer in ['yes', 'no'] else f'Please answer this question: {question}' # if a is yes or no, then provide options, otherwise no options provided
            },
            "musicavqa": {
                'prefix': 'The task is audio visual question answering.',
                'prompt': f'Please answer this question: {question}\nOptions:\nyes.\nno.' if answer in ['yes', 'no'] else f'Please answer this question: {question}' # if a is yes or no, then provide options, otherwise no options provided
            },
            "cochlscene": {
                'prefix': 'The task is scene classification.',
                'prompt': 'classify this sound.\nOPTIONS:\n - bus.\n - cafe.\n - car.\n - crowdedindoor.\n - elevator.\n - kitchen.\n - park.\n - residentialarea.\n - restaurant.\n - restroom.\n - street.\n - subway.\n - subwaystation.'
            },
            "nonspeech7k": {
                'prefix': 'The task is event classification.',
                'prompt': f"classify this sound." + "\nOPTIONS:\n - {}.".format('.\n - '.join(['cough', 'breath', 'screaming', 'laugh', 'sneeze', 'yawn', 'crying']))
            },
            "fsd50": {
                'prefix': 'The task is event classification.',
                'prompt': 'describe this sound in the order from specific to general.'
            },
            "audiocaps":{
                'prefix': 'The task is audio captioning.',
                'prompt': 'Describe the sound in a sentence.'
            },
            "crema-d": {
                'prefix': 'The task is emotion classification.',
                'prompt': 'what is the emotion of this speech?' + "\nOPTIONS:\n - {}.".format('.\n - '.join(['sad', 'fearful', 'neutral', 'disgusted', 'angry', 'happy']))
            },
            "ravdess":{
                'prefix': 'The task is emotion classification.',
                'prompt': 'what is the emotion of this speech?' + "\nOPTIONS:\n - {}.".format('.\n - '.join(['sad', 'fearful', 'calm', 'neutral', 'disgusted', 'angry', 'happy', 'surprised']))
            },
            "us8k":{
                'prefix': 'The task is event classification.',
                'prompt': "classify this sound" + "\nOPTIONS:\n - {}.".format('.\n - '.join(['air conditioner', 'car horn', 'children playing', 'dog bark', 'drilling', 'engine idling', 'gun shot', 'jackhammer', 'siren', 'street music']))
            },
            "gtzan":{
                'prefix':'The task is genre classification.',
                'prompt': "what is the genre of this music?" + "\nOPTIONS:\n - {}.".format('.\n - '.join(['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz', 'metal', 'pop', 'reggae', 'rock']))
            },
            "medley-solos-db": {
                'prefix': 'This task is instrument classification.',
                'prompt': 'what is the instrument of this music?' + "\nOPTIONS:\n - {}.".format('.\n - '.join(['clarinet', 'flute', 'distorted electric guitar', 'trumpet', 'violin', 'piano', 'female singer', 'tenor saxophone']))
            }
        }

excluding NS because they are mentioned in https://github.com/NVIDIA/audio-flamingo/issues/8.

The numbers underlined are the ones that I think are due to reasons beyond stochasticity in hardware.

Regarding NS dataset, as you've mentioned https://github.com/NVIDIA/audio-flamingo/issues/8, the prefix should be "The task is music information retrieval", and prompt is "the music note is". My question is, are they just for instrument and source? How about quality?

Regarding ClothoAQA non-binary, you mentioned here using from nltk.stem import PorterStemmer to handle typo. However, I think one thing that contributed to results mismatch is that my non-binary testset contains 945 QAs, while your testset contains 932 QAs. as mentioned here.

While I can try and error on what prefix, prompt, and processing steps to use for evaluation, it's not possible for training.

It would be tremendously helpful if you could open-source the prefix, prompt for each dataset (training and evaluation), and the pre/post-processing scripts.

Thanks for your time, efforts, and the brilliant paper!

Best, Puyuan

zhifengkongnv commented 1 month ago

To make some modifications:

For nonspeech7K, change "classify this sound" to "this is a sound of".
For NSynth, the prompt is "the task is music information retrieval." and the prompt is "this music note is".

For close-ended classification tasks (i.e. all candidate labels are given and there is one correct answer)

we first note that the order of the labels matters. In my code, the labels are in the alphabetical order. However, this may not be optimal as shown from your results on ravdess and crema-d.

Similar to Pengi's section 4.2 -- Text-matching, we apply an additional similarity-based text matching to the labels if the output is not any of the candidate labels.

from sentence_transformers import SentenceTransformer, util
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
options_embeddings = embedding_model.encode(all_options, convert_to_tensor=True)

We do some preprocessing as below.


for punc in [".", ",", ";"]:
ground_truth = ground_truth.replace(punc, "")
output = output.replace(punc, "")

gt_word = ps.stem(ground_truth.strip().lower()) output = ps.stem(output.strip().lower())

If output does not match ground_truth, then we compute similary and select the most similar label as prediction

gt_word_embedding = embedding_model.encode([gt_word], convert_to_tensor=True) prediction_embedding = embedding_model.encode([output], convert_to_tensor=True) similarity_pred_gt = util.pytorch_cos_sim(gt_word_embedding, prediction_embedding)[0, 0] similarity_pred_allopt = util.pytorch_cos_sim(options_embeddings, prediction_embedding)[:, 0] if similarity_pred_gt == max(similarity_pred_allopt):

prediction is correct

jasonppy commented 1 month ago

Thanks!

Regarding the answer order, I followed this https://github.com/NVIDIA/audio-flamingo/blob/main/foundation/data/data.py#L372, perhaps this avoided training/inference mismatch issue

jasonppy commented 1 month ago

regarding Sonyc-ust, for each audio file, multiple workers will annotate it and their answer usually differs from each other, for example:

{'split': 'validate', 'path': './sonyc-ust/validate/00_000066.wav', 'presence_labels': ['jackhammer', 'stationary music', 'reverse beeper', 'dog barking whining', 'small medium rotating saw', 'amplified speech', 'small sounding engine', 'chainsaw', 'medium sounding engine', 'mobile music', 'person or small group talking', 'person or small group shouting', 'hoe ram', 'car horn', 'non machinery impact', 'large sounding engine', 'large crowd', 'large rotating saw', 'rock drill', 'siren', 'car alarm', 'ice cream truck', 'pile driver']}
{'split': 'validate', 'path': './sonyc-ust/validate/00_000066.wav', 'presence_labels': ['large sounding engine']}
{'split': 'validate', 'path': './sonyc-ust/validate/00_000066.wav', 'presence_labels': ['small sounding engine']}

Did you just use all data? or there was some selection procedure?

Thanks!

jasonppy commented 1 month ago

Regarding JL-Corpus, there are sources (i.e. folders that contains audio) - JL(wav+txt) and "Perception test material on Qualtrics", seems like that the former is the entire corpus and the latter is those verified by other people, which folder do you used?

Thanks!

jasonppy commented 1 month ago

when constructing the training dataset for AQA, there are 3 templates here:

        "question: <question>"
        "<question>"
        "please answer this question: <question>"

which one should I use for different datasets? Also if we choose "question: ", augment_AQA will replace it with "please answer this question: " here . I also notice that in inference code uses "Please answer this question: " for clothoaqa with the first letter capitalized here any reasons for the differences? Why don't we just keep using "please answer this question: " for all AQA in both training and inference, or even just "" (as it's shorter)

Thanks!

zhifengkongnv commented 1 month ago

Regarding sonyc-ust, make it test split rather than the validation split, because validation split is ddp dataloader.

Regarding jl-corpus, we used an internal processed set so I'm not quite clear.

Regarding AQA, this is because templates have been changing over development of the model. We find the last one works the best.

jasonppy commented 1 month ago

Regarding sonyc-ust, make it test split rather than the validation split, because validation split is ddp dataloader.

Regarding jl-corpus, we used an internal processed set so I'm not quite clear.

Regarding AQA, this is because templates have been changing over development of the model. We find the last one works the best.

regarding sonyc-ust, I'm not able to understand what do you mean - for each question, 3 workers answered it and that produced 3 usually different answers, are they all used in training, or is there some processing approach?

Regarding AQA, by "templates have been changing over development", have you used "please answer this question: " as the prefix for all the AQA training data as well (including OpenAQA)?

zhifengkongnv commented 1 month ago

As you indicated you used validation subset of sonyc-ust. It'll call the DDP dataloader. https://github.com/NVIDIA/audio-flamingo/blob/main/foundation/data/data.py#L1098

You can solve it by making it the test split when you create the dataloader. https://github.com/NVIDIA/audio-flamingo/blob/main/foundation/inference/inference.py#L65

For OpenAQA we used prefix provided in their datasets without modification.