Closed Franklin905 closed 6 months ago
Hi,
They are trained on a dataset that contains AudioSet, but we report the metrics on AudioCaps and Clotho. To train on AudioSet with multiple labels, we simply treat each label-audio pair as a positive pair here. There are better ways to handle the multi-label problem. See, for example, https://arxiv.org/abs/2204.03610
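For illustration only, here is a minimal sketch of what "treat each label-audio pair as a positive pair" could look like. This is not the repository's training code. The helper names (`expand_pairs`, `symmetric_contrastive_loss`), the "The sound of ..." text template, and the temperature value are all assumptions made for the example, and the embeddings are assumed to be L2-normalized already.

```python
import math

def expand_pairs(clips):
    """Turn (audio_id, [labels]) clips into one (audio_id, text) pair per label.

    The text template below is a hypothetical choice for this sketch.
    """
    pairs = []
    for audio_id, labels in clips:
        for label in labels:
            pairs.append((audio_id, f"The sound of {label}"))
    return pairs

def _log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the audio-text similarity matrix.

    Row i of audio_emb and row i of text_emb form the positive pair;
    every other row in the batch is treated as a negative.
    """
    n = len(audio_emb)
    logits = [
        [sum(a * t for a, t in zip(av, tv)) / temperature for tv in text_emb]
        for av in audio_emb
    ]
    # Audio-to-text direction: the correct text for audio i is text i.
    a2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2a = -sum(_log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (a2t + t2a)

# One clip with two labels becomes two independent training pairs.
pairs = expand_pairs([("clip_a", ["dog", "bark"]), ("clip_b", ["rain"])])

# Toy embeddings: matched pairs give a lower loss than mismatched ones.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

One consequence of this simple scheme: if two pairs expanded from the same clip end up in the same batch, the clip's other label is counted as a negative even though it is correct, which may be part of why better multi-label treatments exist.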
Got it! Thanks for your reply.
Are the models '630k-audioset-fusion-best.pt' and '630k-audioset-best.pt' trained and evaluated on AudioSet? If so, how are they trained or evaluated on AudioSet? Because videos in AudioSet contain multiple labels, I'm unsure how to calculate the contrastive loss on AudioSet videos to train CLAP.