Open julian-carpenter opened 9 months ago
Hi Julian, thanks for the message. The main differences between the released set and the full set described in the paper:
After adding AudioCaps and processing, this should be the right set to approximately recreate the training. I don't yet have a clean script ready to download and process these, but it is something I hope to share soon!
Congratulations on this very interesting and nice paper.
I was just looking through your released AnimalSpeak CSV file at HuggingFace. The file contains only 894284 entries, but in the paper, you mention 1.1M audio-caption pairs. Could you maybe clarify where the discrepancy comes from?
Will you release a script that downloads all the snippets and processes everything so that it can be used to recreate the results from your paper? That would be a great help. Thank you very much.