Closed DaehanKim closed 1 year ago
Hi DaehanKim,
Thanks for your interest. For the keyword-to-caption model, we are using https://github.com/gagan3012/keytotext off the shelf.
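For reference, keytotext can be called roughly like this (a minimal sketch based on that repo's README; the model choice and generation parameters here are illustrative, not the exact settings we used):

```python
# Sketch of off-the-shelf keyword-to-caption generation with keytotext.
# pip install keytotext
from keytotext import pipeline

# Load a pretrained keyword-to-text model (model name is illustrative).
nlp = pipeline("k2t-base")

# AudioSet-style labels in, a natural-language sentence out.
keywords = ["dog", "bark", "park"]
caption = nlp(keywords, do_sample=True, num_beams=4,
              no_repeat_ngram_size=3, early_stopping=True)
print(caption)
```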
Regarding "increase the number of data to 2.63M": we mean that we increase the number of audio samples paired with actual, prompt-like natural-language text to 2.63M. Without keyword-to-caption augmentation, we only have template-based text for the ~2M samples in AudioSet.
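To make the counting concrete, here is a minimal sketch of the two text sources (the template wording and the function names are assumptions for illustration, not the repo's exact code):

```python
# Sketch: every AudioSet clip has template-based text built from its labels;
# keyword-to-caption generation adds a prompt-like caption on top of that.
def template_text(labels):
    # e.g. ["dog", "bark"] -> "The sound of dog, bark"  (template wording assumed)
    return "The sound of " + ", ".join(labels)

def text_pairs(sample, k2t_model):
    labels = sample["labels"]
    texts = [template_text(labels)]   # always available for the ~2M clips
    texts.append(k2t_model(labels))   # K2T caption: natural-language text
    return [(sample["audio"], t) for t in texts]
```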
Cheers, Yusong
Thank you for the reply.
I also found that flan-t5-3B works well, given proper instructions and a few shots.
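In case it helps others, a minimal sketch of the few-shot setup I mean (mapping flan-t5-3B to `google/flan-t5-xl`; the prompt wording and examples are illustrative):

```python
# Sketch: few-shot keyword-to-caption with Flan-T5 via Hugging Face transformers.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")  # flan-t5-3B
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

# Instruction plus a couple of in-context shots, then the actual keywords.
prompt = (
    "Generate a caption from the given keywords.\n"
    "Keywords: dog, bark, park. Caption: A dog is barking in a park.\n"
    "Keywords: rain, window, night. Caption: Rain taps against a window at night.\n"
    "Keywords: car, engine, start. Caption:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```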
Hi, authors. I wonder which prompt you used for keyword-to-caption generation with the T5 models. (I guess it's something like "Generate a caption from the given keywords. Keywords: K1, K2, K3, ..., Caption:", but I'm not sure.) Also, you mention in Section 2.3 that you could increase the number of data to 2.63M, but I didn't get how you actually augmented the number of samples. Did you sample captions multiple times with the T5 model, or did you use both the original label sequences (label1, label2, ..., label_n as a text caption) and the T5-generated captions together?
Thank you again for your previous answers and this codebase!