I would like to know the exact splits of AudioSet and VGGSound that were used to train CLAP. Many audio datasets for downstream tasks were collected from these two large-scale datasets, so if their test data were seen during the pre-training stage, the evaluation results would be unconvincing.
During evaluation, we manually remove any examples that were already seen in the pre-training stage. For example, when testing on ESC-50, we removed all clips that overlap with Freesound and AudioSet.
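A minimal sketch of what this kind of overlap filtering could look like for the Freesound part of ESC-50. It is not the script used for the reported results. It assumes ESC-50's public filename convention `{fold}-{clip_id}-{take}-{target}.wav`, where `clip_id` is the Freesound ID, and it assumes the IDs seen in pre-training are available as a set of strings. The names `freesound_id_from_esc50`, `drop_seen_clips`, and `pretraining_ids` are hypothetical.

```python
from typing import Iterable, List, Set


def freesound_id_from_esc50(filename: str) -> str:
    """Return the Freesound clip ID encoded in an ESC-50 filename.

    Assumes the naming scheme {fold}-{clip_id}-{take}-{target}.wav.
    """
    stem = filename.rsplit(".", 1)[0]
    parts = stem.split("-")
    if len(parts) != 4:
        raise ValueError(f"unexpected ESC-50 filename: {filename!r}")
    return parts[1]


def drop_seen_clips(eval_filenames: Iterable[str],
                    pretraining_ids: Set[str]) -> List[str]:
    """Keep only evaluation clips whose source ID was not seen in pre-training.

    `pretraining_ids` is a hypothetical set of Freesound IDs collected
    from the pre-training data.
    """
    return [
        name for name in eval_filenames
        if freesound_id_from_esc50(name) not in pretraining_ids
    ]


if __name__ == "__main__":
    files = ["1-100032-A-0.wav", "2-100786-A-1.wav", "5-9032-A-0.wav"]
    seen = {"100786"}  # illustrative value only
    print(drop_seen_clips(files, seen))
```

The same idea would apply to AudioSet-derived test sets by matching on the YouTube video ID instead of the Freesound ID.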