X-LANCE / SLAM-LLM

Speech, Language, Audio, Music Processing with Large Language Model

Example Dataset for Inference with DrCaps_Zeroshot_Audio_Captioning #169

Closed javanasse closed 4 days ago

javanasse commented 1 week ago

Can the developers provide an example JSONL file for running inference on unlabeled audio using DrCaps_Zeroshot_Audio_Captioning?

It appears that the dataset JSONL must have this form:

{"source": "/path/to/a_file.wav", "key": "", "target": "", "text": "", "similar_captions": ""}

but the content of each field is not clear to me. What should populate "target", "text", and "similar_captions"?

Thank you!

ddlBoJack commented 1 week ago

Please refer to #170

Andreas-Xi commented 1 week ago

Hi, thanks for following our work. We have uploaded example inference data for AudioCaps and Clotho in examples/drcap_zeroshot_aac/data_examples/; feel free to check it out. As for the fields: "target" is the ground-truth caption, and "text" is the caption fed to the CLAP text encoder during training. In the latest version, "text" and "target" are the same; we previously experimented with replacing certain words in the ground-truth captions to improve model robustness, which is why both fields exist. "similar_captions" are captions similar to "target" (i.e., the ground-truth captions) and are used to perform RAG.
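
For illustration, a labeled entry might look something like this (the path, key, and captions are placeholders; the files in examples/drcap_zeroshot_aac/data_examples/ show the actual format, including how multiple similar captions are joined):

{"source": "/data/audiocaps/test/example_clip.wav", "key": "example_clip", "target": "a dog barks while cars pass by in the distance", "text": "a dog barks while cars pass by in the distance", "similar_captions": "a dog barking near a busy road; traffic noise with a dog barking"}

Here "text" duplicates "target", and "similar_captions" holds the retrieved captions used for RAG.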

javanasse commented 6 days ago

Thanks for your timely response. Is it possible to infer the caption for an audio file when "text" and "target" are unknown? If I have misunderstood, please correct me.

Andreas-Xi commented 6 days ago

Yes, as long as you have the audio source and "similar_captions", it is possible to perform inference.
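
For example, an inference-only entry for unlabeled audio might look something like this, with "target" and "text" left empty (the path, key, and retrieved captions are placeholders; please follow the formatting in examples/drcap_zeroshot_aac/data_examples/ for "similar_captions"):

{"source": "/path/to/unlabeled_audio.wav", "key": "unlabeled_0001", "target": "", "text": "", "similar_captions": "a dog barking near a busy road; traffic noise with a dog barking"}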