LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal

On the K2C Augmentation #58

Closed DaehanKim closed 1 year ago

DaehanKim commented 1 year ago

Hi, Authors. I wonder which prompt you used for keyword-to-caption generation with the T5 models. (I guess it's something like "generate a caption with given keywords. Keywords: K1, K2, K3, ..., Caption:" but I'm not sure.) Also, you mentioned in section 2.3 that you could increase the amount of data to 2.63M, but I didn't get how you actually augmented the number of samples. Did you sample captions multiple times with the T5 model, or did you use both the original label sequences (label1, label2, ..., label_n as a text caption) and the T5-generated captions together?

Thank you again for your previous answers and this codebase!

lukewys commented 1 year ago

Hi DaehanKim,

Thanks for your interest. For the keyword-to-caption model, we are using https://github.com/gagan3012/keytotext off the shelf.
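
For reference, a minimal sketch of calling keytotext off the shelf, following its README; the `k2t-base` model name, the generation parameters, and the keywords below are only illustrative, not necessarily the exact settings used for the dataset:

```python
from keytotext import pipeline

# Load an off-the-shelf keyword-to-text model (model name is illustrative).
nlp = pipeline("k2t-base")

# Keywords for one audio clip (illustrative, AudioSet-style labels).
keywords = ["dog", "bark", "park"]

# Generation settings passed through to the underlying T5 model.
params = {"do_sample": True, "num_beams": 4, "no_repeat_ngram_size": 3, "early_stopping": True}

caption = nlp(keywords, **params)
print(caption)  # e.g. "A dog is barking in the park."
```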

Regarding increasing the amount of data to 2.63M: we mean that we increase the number of audio samples paired with actual natural-language, prompt-like text to 2.63M. Without keyword-to-caption, we only have template-based text for the ~2M samples in AudioSet.

Cheers, Yusong

DaehanKim commented 1 year ago

Thank you for the reply. I also found that flan-t5-3B works well too, if proper instructions and few-shot examples are given.
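
A rough sketch of the kind of few-shot prompting I mean, using Hugging Face transformers; the instruction, the shots, and the `google/flan-t5-xl` checkpoint (the ~3B variant) are just examples, not a prescribed setup:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the ~3B flan-t5 checkpoint (checkpoint name is an example).
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

# Instruction plus a couple of hand-written shots, then the query keywords.
prompt = (
    "Write a natural one-sentence caption using the given keywords.\n"
    "Keywords: dog, bark, park. Caption: A dog is barking loudly in the park.\n"
    "Keywords: rain, window, night. Caption: Rain taps against the window at night.\n"
    "Keywords: engine, car, accelerate. Caption:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```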