Closed: javanasse closed this issue 4 days ago
Please refer to #170
Hi, thanks for following our work. We have uploaded example inference data for AudioCaps and Clotho in examples/drcap_zeroshot_aac/data_examples/. Feel free to check it out. In each entry, "target" is the ground-truth caption and "text" is the caption fed to the CLAP text encoder during training; "text" and "target" are the same in the latest version. We previously ran experiments that replaced certain words in the ground-truth captions to enhance model robustness, which is why there are both "text" and "target" fields. "similar_captions" are captions similar to "target" (i.e. the GT caption) and are used to perform RAG.
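For reference, one record in those JSONL files would look roughly like the sketch below. The field names follow the description above, but the path and captions are made-up placeholders, and the exact formatting (e.g. whether "similar_captions" is a list or a single joined string) should be checked against the files in data_examples/.

```python
# Sketch of a single JSONL record, based on the field descriptions above.
# All values are illustrative placeholders, not taken from the real data files.
import json

record = {
    "audio_source": "/path/to/audio_clip.wav",                      # path to the audio file
    "target": "a man speaks while birds chirp in the background",   # ground-truth caption
    "text": "a man speaks while birds chirp in the background",     # caption fed to the CLAP text encoder (same as "target" in the latest version)
    "similar_captions": [                                           # captions similar to the GT caption, used for RAG
        "a person talks as birds sing nearby",
        "birds chirping while a man is speaking",
    ],
}

print(json.dumps(record))  # one line of the JSONL file
```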
Thanks for your timely response. Is it possible to infer the caption for an audio file when "text" and "target" are unknown? If I have misunderstood, please correct me.
Yes, as long as you have audio_source and similar_captions, it is possible to perform inference.
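So for unlabeled audio you can assemble the inference JSONL yourself. Below is a minimal sketch; it assumes "target" and "text" can simply be left empty (they are unknown at inference time) and that you have already retrieved similar captions for each clip. Match the exact field formatting to the files in data_examples/.

```python
# Sketch: build an inference-only JSONL for unlabeled audio.
# Assumes only "audio_source" and "similar_captions" carry real content,
# per the reply above; "target"/"text" are left as empty placeholders.
import json

audio_files = ["/path/to/clip_001.wav", "/path/to/clip_002.wav"]  # your unlabeled audio
retrieved = {  # captions retrieved for each clip, used for RAG
    "/path/to/clip_001.wav": ["a dog barks repeatedly", "a dog is barking outdoors"],
    "/path/to/clip_002.wav": ["rain falls on a hard surface", "heavy rain is pouring down"],
}

with open("inference_data.jsonl", "w") as f:
    for path in audio_files:
        entry = {
            "audio_source": path,
            "target": "",                        # unknown for unlabeled audio
            "text": "",                          # unknown for unlabeled audio
            "similar_captions": retrieved[path],
        }
        f.write(json.dumps(entry) + "\n")
```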
Can the developers provide an example JSONL file for running inference on unlabeled audio using DrCaps_Zeroshot_Audio_Captioning?
It appears that the dataset JSONL must have a particular form, but the content for each field is not clear to me. What should populate "target", "text", and "similar_captions"? Thank you!