Hello, thanks for your great work. I have a few questions that came up while reading your paper and implementing the work. In the paper you state that "you trained your model on a dataset of approximately 750,000 videos sampled from AudioSet." As we know, AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.
So is the AudioSet data you used for training made up of these human-labeled 10-second clips drawn from YouTube videos?
Can we download the full videos using the YouTube IDs provided in AudioSet and split them into video clips for training? We have actually trained several models on such a dataset, but some of them always predict the label "1" (aligned), while others always predict "0" (not aligned).
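For reference, this is roughly the download/trimming pipeline we have in mind (a minimal sketch, assuming the standard AudioSet segments CSV format and that yt-dlp is available; the CSV name, output paths, and format flags are placeholders and may need adjusting):

```python
# Sketch: fetch the labeled 10-second AudioSet segments as short video clips.
# Assumes unbalanced_train_segments.csv from the AudioSet download page and
# yt-dlp on PATH; some videos are private or deleted, so failures are skipped.
import csv
import subprocess
from pathlib import Path

SEGMENTS_CSV = "unbalanced_train_segments.csv"  # placeholder path
OUT_DIR = Path("audioset_clips")
OUT_DIR.mkdir(exist_ok=True)

with open(SEGMENTS_CSV) as f:
    reader = csv.reader(f, skipinitialspace=True)
    for row in reader:
        # Skip the comment/header lines that start with '#'.
        if not row or row[0].startswith("#"):
            continue
        ytid, start, end = row[0], float(row[1]), float(row[2])
        out_path = OUT_DIR / f"{ytid}_{int(start)}.mp4"
        if out_path.exists():
            continue
        cmd = [
            "yt-dlp",
            f"https://www.youtube.com/watch?v={ytid}",
            "--download-sections", f"*{start}-{end}",  # only the labeled window
            "-f", "mp4",
            "-o", str(out_path),
        ]
        subprocess.run(cmd, check=False)
```

Is this the kind of data preparation you used, or did you sample clips differently from the full videos?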