NVIDIA / audio-flamingo

PyTorch implementation of Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities.
MIT License

release of prefix, prompt, pre/post-processing script for full reproduction #9

Open · jasonppy opened this issue 1 month ago

jasonppy commented 1 month ago

Hi Zhifeng,

Thank you so much for your help!

This issue is related to https://github.com/NVIDIA/audio-flamingo/issues/5, https://github.com/NVIDIA/audio-flamingo/issues/6, https://github.com/NVIDIA/audio-flamingo/issues/7, and https://github.com/NVIDIA/audio-flamingo/issues/8. I would like to have a centralized page that addresses the reproduction issues, so that other researchers don't have to check different issues for different datasets, considering there are 41 different datasets used for training and evaluation.

Below are my reproduction results (I loaded your open-sourced weights for the foundation model and followed the deterministic decoding params mentioned in https://github.com/NVIDIA/audio-flamingo/issues/5).
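
To be concrete about "deterministic decoding": I disabled sampling and used greedy decoding, i.e. HF-style generation kwargs along these lines (the values below are placeholders on my side; the exact settings are the ones discussed in issue #5):

    # placeholder generation settings for deterministic decoding (values are mine, not from the repo)
    generation_kwargs = dict(
        do_sample=False,     # disable sampling so decoding is deterministic
        num_beams=1,         # greedy decoding
        max_new_tokens=256,  # placeholder cap on output length
    )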

Table 2:

| Task | Type | Metric | Prior SotA | AF reported | AF reproduced |
|---|---|---|---|---|---|
| Clotho-v2 | CAP | CIDEr | 0.441 | 0.465 | 0.461 |
| ClothoAQA unanimous | CLS | Acc | 74.9 | 86.9 | 86.7 |
| ClothoAQA non-binary | CLS | Acc | 29.1 | 49.5 | 46.6 |
| ClothoAQA numerical | CLS | Acc | 26.2 | 36.4 | 35.4 |
| MusicAVQA audio-only | CLS | Acc | 72.1 | 71.6 | 75.6 |
| CochlScene | CLS | Acc | 91.6 | 83.0 | 82.2 |
| NonSpeech7k | CLS | Acc | 79.0 | 85.1 | 83.9 |
| FSD50k | CLS | F1 | ~65.6 | 69.7 | 69.6 |
| NS instrument | CLS | Acc | 78.8 | 77.1 | 48.3 |
| NS quality | CLS | F1 | 46.3 | 66.7 | 0 |
| NS source | CLS | Acc | 60.1 | 78.7 | 47.7 |

Table 3:

| Task | Type | Metric | Prior SotA | AF reported | AF reproduced |
|---|---|---|---|---|---|
| AudioCaps | CAP | CIDEr | 0.281 | 0.502 | 0.469 |
| CREMA-D | CLS | Acc | 18.5 | 26.5 | 27.9 |
| Ravdess | CLS | Acc | 21.7 | 20.9 | 25.6 |
| US8K | CLS | Acc | 71.9 | 75.0 | 73.7 |
| GTZAN | CLS | Acc | 71.0 | 67.9 | 60.5 |
| Medley-solos-DB | CLS | Acc | 61.3 | 92.7 | 85.3 |

and below is the full list of prefixes and prompts I used for each dataset:

path2prompt = {
            "clotho_v2": {
                'prefix': 'The task is audio captioning.',
                'prompt': 'Describe the sound in a sentence.'
            },
            "clotho_aqa": {
                'prefix': 'The task is audio question answering.',
                'prompt': f'Please answer this question: {question}\nOptions:\nyes.\nno.' if answer in ['yes', 'no'] else f'Please answer this question: {question}' # if a is yes or no, then provide options, otherwise no options provided
            },
            "musicavqa": {
                'prefix': 'The task is audio visual question answering.',
                'prompt': f'Please answer this question: {question}\nOptions:\nyes.\nno.' if answer in ['yes', 'no'] else f'Please answer this question: {question}' # if a is yes or no, then provide options, otherwise no options provided
            },
            "cochlscene": {
                'prefix': 'The task is scene classification.',
                'prompt': 'classify this sound.\nOPTIONS:\n - bus.\n - cafe.\n - car.\n - crowdedindoor.\n - elevator.\n - kitchen.\n - park.\n - residentialarea.\n - restaurant.\n - restroom.\n - street.\n - subway.\n - subwaystation.'
            },
            "nonspeech7k": {
                'prefix': 'The task is event classification.',
                'prompt': "classify this sound." + "\nOPTIONS:\n - {}.".format('.\n - '.join(['cough', 'breath', 'screaming', 'laugh', 'sneeze', 'yawn', 'crying']))
            },
            "fsd50": {
                'prefix': 'The task is event classification.',
                'prompt': 'describe this sound in the order from specific to general.'
            },
            "audiocaps":{
                'prefix': 'The task is audio captioning.',
                'prompt': 'Describe the sound in a sentence.'
            },
            "crema-d": {
                'prefix': 'The task is emotion classification.',
                'prompt': 'what is the emotion of this speech?' + "\nOPTIONS:\n - {}.".format('.\n - '.join(['sad', 'fearful', 'neutral', 'disgusted', 'angry', 'happy']))
            },
            "ravdess":{
                'prefix': 'The task is emotion classification.',
                'prompt': 'what is the emotion of this speech?' + "\nOPTIONS:\n - {}.".format('.\n - '.join(['sad', 'fearful', 'calm', 'neutral', 'disgusted', 'angry', 'happy', 'surprised']))
            },
            "us8k":{
                'prefix': 'The task is event classification.',
                'prompt': "classify this sound" + "\nOPTIONS:\n - {}.".format('.\n - '.join(['air conditioner', 'car horn', 'children playing', 'dog bark', 'drilling', 'engine idling', 'gun shot', 'jackhammer', 'siren', 'street music']))
            },
            "gtzan":{
                'prefix':'The task is genre classification.',
                'prompt': "what is the genre of this music?" + "\nOPTIONS:\n - {}.".format('.\n - '.join(['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz', 'metal', 'pop', 'reggae', 'rock']))
            },
            "medley-solos-db": {
                'prefix': 'This task is instrument classification.',
                'prompt': 'what is the instrument of this music?' + "\nOPTIONS:\n - {}.".format('.\n - '.join(['clarinet', 'flute', 'distorted electric guitar', 'trumpet', 'violin', 'piano', 'female singer', 'tenor saxophone']))
            }
        }
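
Since the ClothoAQA / MusicAVQA entries above are conditional on the example's answer, here is the small piece of glue code I use to resolve them (my own helper; `question` and `answer` are the per-example QA fields):

    # my own helper: build the final AQA prompt for one example
    def build_aqa_prompt(question: str, answer: str) -> str:
        # if the answer is yes/no, provide the two options; otherwise no options
        if answer in ['yes', 'no']:
            return f'Please answer this question: {question}\nOptions:\nyes.\nno.'
        return f'Please answer this question: {question}'

    prefix = 'The task is audio question answering.'
    prompt = build_aqa_prompt('Is there water flowing?', 'yes')  # illustrative example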

The NS prompts are excluded from the list above because they are discussed in https://github.com/NVIDIA/audio-flamingo/issues/8.

The results with large gaps from the reported numbers are the ones that I think are off for reasons beyond hardware-level stochasticity.

Regarding the NS dataset, as you mentioned in https://github.com/NVIDIA/audio-flamingo/issues/8, the prefix should be "The task is music information retrieval" and the prompt "the music note is". My question is: are these just for instrument and source? What about quality?

Regarding ClothoAQA non-binary, you mentioned here using `from nltk.stem import PorterStemmer` to handle typos. However, I think one thing that contributed to the mismatch is that my non-binary test set contains 945 QAs, while your test set contains 932 QAs, as mentioned here.

While I can use trial and error to figure out which prefix, prompt, and processing steps to use for evaluation, that is not possible for training.

It would be tremendously helpful if you could open-source the prefix and prompt for each dataset (training and evaluation), as well as the pre/post-processing scripts.

Thanks for your time, efforts, and the brilliant paper!

Best, Puyuan

zhifengkongnv commented 1 month ago

A few modifications to the evaluation procedure:

For close-ended classification tasks (i.e., all candidate labels are given and there is exactly one correct answer), we first stem and lowercase both the ground truth and the model output:

    # ps is an nltk PorterStemmer
    gt_word = ps.stem(ground_truth.strip().lower())
    output = ps.stem(output.strip().lower())

If the output does not exactly match the ground truth, we compute embedding similarities and take the most similar candidate label as the prediction:

    gt_word_embedding = embedding_model.encode([gt_word], convert_to_tensor=True)
    prediction_embedding = embedding_model.encode([output], convert_to_tensor=True)
    similarity_pred_gt = util.pytorch_cos_sim(gt_word_embedding, prediction_embedding)[0, 0]
    similarity_pred_allopt = util.pytorch_cos_sim(options_embeddings, prediction_embedding)[:, 0]
    if similarity_pred_gt == max(similarity_pred_allopt):
        # the ground truth is the closest candidate, so the prediction is counted as correct
        ...

jasonppy commented 1 month ago

Thanks!
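
Just to confirm my setup for the matching step: I'm assuming `ps` is an nltk `PorterStemmer`, `embedding_model` / `util` come from sentence-transformers, and `options_embeddings` are the encoded candidate labels, i.e. something like this (the model name is only a placeholder on my side, please correct me if you used a different one):

    from nltk.stem import PorterStemmer
    from sentence_transformers import SentenceTransformer, util

    ps = PorterStemmer()
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # placeholder model choice
    # candidate_labels: the label list of the dataset (name is illustrative),
    # normalized the same way as the ground truth
    options_embeddings = embedding_model.encode(
        [ps.stem(label.strip().lower()) for label in candidate_labels],
        convert_to_tensor=True,
    )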

Regarding the answer order, I followed https://github.com/NVIDIA/audio-flamingo/blob/main/foundation/data/data.py#L372; perhaps this avoided a training/inference mismatch.

jasonppy commented 1 month ago

Regarding Sonyc-ust: each audio file is annotated by multiple workers, and their answers usually differ from each other. For example:

{'split': 'validate', 'path': './sonyc-ust/validate/00_000066.wav', 'presence_labels': ['jackhammer', 'stationary music', 'reverse beeper', 'dog barking whining', 'small medium rotating saw', 'amplified speech', 'small sounding engine', 'chainsaw', 'medium sounding engine', 'mobile music', 'person or small group talking', 'person or small group shouting', 'hoe ram', 'car horn', 'non machinery impact', 'large sounding engine', 'large crowd', 'large rotating saw', 'rock drill', 'siren', 'car alarm', 'ice cream truck', 'pile driver']}
{'split': 'validate', 'path': './sonyc-ust/validate/00_000066.wav', 'presence_labels': ['large sounding engine']}
{'split': 'validate', 'path': './sonyc-ust/validate/00_000066.wav', 'presence_labels': ['small sounding engine']}

Did you just use all of the annotations, or was there some selection procedure?
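
For example, one aggregation I could imagine (purely a guess on my side) is taking the union of all annotators' presence labels per file, roughly:

    # purely a guess: merge all annotators' label sets per audio file
    from collections import defaultdict

    merged = defaultdict(set)
    for item in annotations:  # `annotations`: the list of dicts shown above (name is illustrative)
        merged[item['path']].update(item['presence_labels'])

    # merged['./sonyc-ust/validate/00_000066.wav'] would then be the union of the
    # three annotators' labels

Is that what you did, or something else?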

Thanks!

jasonppy commented 1 month ago

Regarding JL-Corpus, there are two sources (i.e., folders that contain audio): "JL(wav+txt)" and "Perception test material on Qualtrics". It seems that the former is the entire corpus and the latter contains the clips verified by other people. Which folder did you use?

Thanks!

jasonppy commented 1 month ago

When constructing the training dataset for AQA, there are 3 templates here:

        "question: <question>"
        "<question>"
        "please answer this question: <question>"

Which one should I use for which datasets? Also, if we choose "question: <question>", augment_AQA will replace it with "please answer this question: <question>" here. I also noticed that the inference code uses "Please answer this question: " (with the first letter capitalized) for ClothoAQA here. Are there any reasons for these differences? Why not just keep using "please answer this question: " for all AQA data in both training and inference, or even just "<question>" (as it's shorter)?
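
Just to double check my reading of the replacement, I understand augment_AQA as doing roughly this (my paraphrase, not the actual code):

    # my paraphrase of the replacement described above, not the actual augment_AQA code
    def normalize_aqa_prompt(prompt: str) -> str:
        if prompt.startswith("question: "):
            return "please answer this question: " + prompt[len("question: "):]
        return prompt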

Thanks!

zhifengkongnv commented 1 month ago

Regarding sonyc-ust, use the test split rather than the validation split, because the validation split goes through the DDP dataloader.

Regarding jl-corpus, we used an internally processed set, so I'm not quite sure.

Regarding AQA, this is because the templates changed over the course of the model's development. We found that the last one works best.

jasonppy commented 1 month ago

> Regarding sonyc-ust, use the test split rather than the validation split, because the validation split goes through the DDP dataloader.
>
> Regarding jl-corpus, we used an internally processed set, so I'm not quite sure.
>
> Regarding AQA, this is because the templates changed over the course of the model's development. We found that the last one works best.

Regarding sonyc-ust, I'm not able to understand what you mean: for each audio file, 3 workers annotated it, which produced 3 usually different answers. Are they all used in training, or is there some processing approach?

Regarding AQA, by "templates have been changing over development", did you use "please answer this question: " as the question prefix for all the AQA training data as well (including OpenAQA)?

zhifengkongnv commented 1 month ago

As you indicated, you used the validation subset of sonyc-ust, which will call the DDP dataloader: https://github.com/NVIDIA/audio-flamingo/blob/main/foundation/data/data.py#L1098

You can solve this by using the test split when you create the dataloader: https://github.com/NVIDIA/audio-flamingo/blob/main/foundation/inference/inference.py#L65

For OpenAQA, we used the prefixes provided in their dataset without modification.