Open jasonppy opened 1 month ago
and below is the full list of prefix and prompt I used for each dataset:
```python
path2prompt = {
    "clotho_v2": {
        'prefix': 'The task is audio captioning.',
        'prompt': 'Describe the sound in a sentence.'
    },
    "clotho_aqa": {
        'prefix': 'The task is audio question answering.',
        # if the answer is yes or no, provide options; otherwise no options are provided
        'prompt': f'Please answer this question: {question}\nOptions:\nyes.\nno.' if answer in ['yes', 'no'] else f'Please answer this question: {question}'
    },
    "musicavqa": {
        'prefix': 'The task is audio visual question answering.',
        # if the answer is yes or no, provide options; otherwise no options are provided
        'prompt': f'Please answer this question: {question}\nOptions:\nyes.\nno.' if answer in ['yes', 'no'] else f'Please answer this question: {question}'
    },
    "cochlscene": {
        'prefix': 'The task is scene classification.',
        'prompt': 'classify this sound.\nOPTIONS:\n - bus.\n - cafe.\n - car.\n - crowdedindoor.\n - elevator.\n - kitchen.\n - park.\n - residentialarea.\n - restaurant.\n - restroom.\n - street.\n - subway.\n - subwaystation.'
    },
    "nonspeech7k": {
        'prefix': 'The task is event classification.',
        'prompt': "classify this sound." + "\nOPTIONS:\n - {}.".format('.\n - '.join(['cough', 'breath', 'screaming', 'laugh', 'sneeze', 'yawn', 'crying']))
    },
    "fsd50": {
        'prefix': 'The task is event classification.',
        'prompt': 'describe this sound in the order from specific to general.'
    },
    "audiocaps": {
        'prefix': 'The task is audio captioning.',
        'prompt': 'Describe the sound in a sentence.'
    },
    "crema-d": {
        'prefix': 'The task is emotion classification.',
        'prompt': 'what is the emotion of this speech?' + "\nOPTIONS:\n - {}.".format('.\n - '.join(['sad', 'fearful', 'neutral', 'disgusted', 'angry', 'happy']))
    },
    "ravdess": {
        'prefix': 'The task is emotion classification.',
        'prompt': 'what is the emotion of this speech?' + "\nOPTIONS:\n - {}.".format('.\n - '.join(['sad', 'fearful', 'calm', 'neutral', 'disgusted', 'angry', 'happy', 'surprised']))
    },
    "us8k": {
        'prefix': 'The task is event classification.',
        'prompt': "classify this sound" + "\nOPTIONS:\n - {}.".format('.\n - '.join(['air conditioner', 'car horn', 'children playing', 'dog bark', 'drilling', 'engine idling', 'gun shot', 'jackhammer', 'siren', 'street music']))
    },
    "gtzan": {
        'prefix': 'The task is genre classification.',
        'prompt': "what is the genre of this music?" + "\nOPTIONS:\n - {}.".format('.\n - '.join(['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz', 'metal', 'pop', 'reggae', 'rock']))
    },
    "medley-solos-db": {
        'prefix': 'This task is instrument classification.',
        'prompt': 'what is the instrument of this music?' + "\nOPTIONS:\n - {}.".format('.\n - '.join(['clarinet', 'flute', 'distorted electric guitar', 'trumpet', 'violin', 'piano', 'female singer', 'tenor saxophone']))
    }
}
```
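For illustration, a minimal sketch of how a (prefix, prompt) pair could be resolved into a single model input string. Note the assumptions: `build_input` is a hypothetical helper (not from the repo), and joining the prefix and prompt with a single space is my guess, not the confirmed Audio Flamingo format.

```python
def build_input(dataset, question=None, answer=None):
    # Hypothetical helper: resolves the per-dataset prefix/prompt templates
    # above into one input string. The space-joining is an assumption.
    if dataset == "clotho_v2":
        prefix = "The task is audio captioning."
        prompt = "Describe the sound in a sentence."
    elif dataset == "clotho_aqa":
        prefix = "The task is audio question answering."
        if answer in ("yes", "no"):
            # Binary answers get explicit options, per the table above.
            prompt = f"Please answer this question: {question}\nOptions:\nyes.\nno."
        else:
            prompt = f"Please answer this question: {question}"
    else:
        raise ValueError(f"unknown dataset: {dataset}")
    return prefix + " " + prompt

print(build_input("clotho_aqa", question="Is it raining?", answer="yes"))
```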
excluding NS because they are mentioned in https://github.com/NVIDIA/audio-flamingo/issues/8.
The numbers underlined are the ones that I think are due to reasons beyond stochasticity in hardware.
Regarding the NS dataset, as you've mentioned in https://github.com/NVIDIA/audio-flamingo/issues/8, the prefix should be "The task is music information retrieval" and the prompt is "the music note is". My question is: are these just for instrument and source? What about quality?
Regarding ClothoAQA non-binary, you mentioned here using `from nltk.stem import PorterStemmer`
to handle typos. However, I think one thing that contributed to the results mismatch is that my non-binary test set contains 945 QAs, while your test set contains 932 QAs, as mentioned here.
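As a small illustration of what the stemming step buys you (this is standard NLTK behavior, not repo-specific code): inflected variants are mapped to a shared root, so a model output of "birds" still matches a ground-truth answer of "bird".

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# The Porter stemmer strips common suffixes to a shared root form.
print(ps.stem("birds"))    # -> "bird"
print(ps.stem("singing"))  # -> "sing"
```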
While I can use trial and error to find the right prefix, prompt, and processing steps for evaluation, that is not possible for training.
It would be tremendously helpful if you could open-source the prefix, prompt for each dataset (training and evaluation), and the pre/post-processing scripts.
Thanks for your time, efforts, and the brilliant paper!
Best, Puyuan
To make some modifications:
For close-ended classification tasks (i.e. all candidate labels are given and there is one correct answer)
```python
from sentence_transformers import SentenceTransformer, util

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
options_embeddings = embedding_model.encode(all_options, convert_to_tensor=True)
```
We do some preprocessing as below.
```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for punc in [".", ",", ";"]:
    ground_truth = ground_truth.replace(punc, "")
    output = output.replace(punc, "")
gt_word = ps.stem(ground_truth.strip().lower())
output = ps.stem(output.strip().lower())
```
If the output does not match the ground truth, we then compute similarity and select the most similar label as the prediction:

```python
gt_word_embedding = embedding_model.encode([gt_word], convert_to_tensor=True)
prediction_embedding = embedding_model.encode([output], convert_to_tensor=True)
similarity_pred_gt = util.pytorch_cos_sim(gt_word_embedding, prediction_embedding)[0, 0]
similarity_pred_allopt = util.pytorch_cos_sim(options_embeddings, prediction_embedding)[:, 0]
if similarity_pred_gt == max(similarity_pred_allopt):
    ...  # the prediction is counted as correct
```
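Putting the preprocessing and the embedding fallback together, here is a self-contained sketch of the close-ended scoring logic. The `similarity` argument is an abstraction I introduced so that any scorer can be plugged in; in the thread it is cosine similarity over all-MiniLM-L6-v2 sentence embeddings. `normalize` and `is_correct` are my own names, not from the repo.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def normalize(text):
    # Mirror the preprocessing above: strip punctuation, lowercase, stem.
    for punc in [".", ",", ";"]:
        text = text.replace(punc, "")
    return ps.stem(text.strip().lower())

def is_correct(output, ground_truth, all_options, similarity):
    # `similarity(a, b)` returns a scalar score; the thread uses cosine
    # similarity between sentence-transformer embeddings.
    # Assumes ground_truth appears in all_options.
    out = normalize(output)
    if out == normalize(ground_truth):
        return True
    # Fallback: the prediction counts as correct only if the ground-truth
    # label is the option most similar to the model output.
    scores = {opt: similarity(normalize(opt), out) for opt in all_options}
    return scores[ground_truth] == max(scores.values())
```

With sentence-transformers, `similarity` would wrap `util.pytorch_cos_sim` over `embedding_model.encode(...)` outputs, as in the snippet above.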
Thanks!
Regarding the answer order, I followed this: https://github.com/NVIDIA/audio-flamingo/blob/main/foundation/data/data.py#L372, which perhaps avoided a training/inference mismatch issue.
Regarding sonyc-ust, each audio file is annotated by multiple workers, and their answers usually differ from each other; for example:
{'split': 'validate', 'path': './sonyc-ust/validate/00_000066.wav', 'presence_labels': ['jackhammer', 'stationary music', 'reverse beeper', 'dog barking whining', 'small medium rotating saw', 'amplified speech', 'small sounding engine', 'chainsaw', 'medium sounding engine', 'mobile music', 'person or small group talking', 'person or small group shouting', 'hoe ram', 'car horn', 'non machinery impact', 'large sounding engine', 'large crowd', 'large rotating saw', 'rock drill', 'siren', 'car alarm', 'ice cream truck', 'pile driver']}
{'split': 'validate', 'path': './sonyc-ust/validate/00_000066.wav', 'presence_labels': ['large sounding engine']}
{'split': 'validate', 'path': './sonyc-ust/validate/00_000066.wav', 'presence_labels': ['small sounding engine']}
Did you just use all data? or there was some selection procedure?
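Since the thread leaves open how (or whether) the multiple annotations are merged, here is one plausible, purely hypothetical aggregation: a simple majority vote over workers. `majority_labels` and `min_votes` are my own names, not from the repo, and this is not necessarily what the authors did.

```python
from collections import Counter

def majority_labels(annotations, min_votes=2):
    # Hypothetical merge: keep a presence label only if at least
    # `min_votes` workers reported it for this clip.
    votes = Counter(label for ann in annotations for label in ann)
    return sorted(label for label, n in votes.items() if n >= min_votes)

# Toy version of the three annotations shown above for 00_000066.wav.
anns = [
    ["large sounding engine", "jackhammer"],
    ["large sounding engine"],
    ["small sounding engine"],
]
print(majority_labels(anns))  # -> ['large sounding engine']
```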
Thanks!
Regarding JL-Corpus, there are two sources (i.e. folders that contain audio): JL(wav+txt) and "Perception test material on Qualtrics". It seems that the former is the entire corpus and the latter is the part verified by other people; which folder did you use?
Thanks!
When constructing the training dataset for AQA, there are 3 templates here:
"question: <question>"
"<question>"
"please answer this question: <question>"
which one should I use for different datasets? Also if we choose "question:
Thanks!
Regarding sonyc-ust, make it the test split rather than the validation split, because the validation split calls the DDP dataloader.
Regarding jl-corpus, we used an internally processed set, so I'm not quite clear.
Regarding AQA, this is because the templates changed over the development of the model. We find the last one works best.
Regarding sonyc-ust, I'm not able to understand what you mean. For each audio file, three workers annotated it, producing three usually different answers; are they all used in training, or is there some processing step?
Regarding AQA, by "templates have been changing over development", did you use "please answer this question: " as the prefix for all the AQA training data as well (including OpenAQA)?
As you indicated, you used the validation subset of sonyc-ust, which will call the DDP dataloader: https://github.com/NVIDIA/audio-flamingo/blob/main/foundation/data/data.py#L1098
You can solve it by making it the test split when you create the dataloader. https://github.com/NVIDIA/audio-flamingo/blob/main/foundation/inference/inference.py#L65
For OpenAQA we used prefix provided in their datasets without modification.
Hi Zhifeng,
Thank you so much for your help!
This issue is related to https://github.com/NVIDIA/audio-flamingo/issues/5, https://github.com/NVIDIA/audio-flamingo/issues/6, https://github.com/NVIDIA/audio-flamingo/issues/7, and https://github.com/NVIDIA/audio-flamingo/issues/8. I would like to have a centralized page that addresses the reproduction issues, so that other researchers don't have to check a different issue for each dataset, given that 41 different datasets are used for training and evaluation.
Below are my reproduction results (I loaded your open-sourced weights for the foundation model and followed the deterministic decoding params mentioned in https://github.com/NVIDIA/audio-flamingo/issues/5).