LoieSun / Auto-ACD

code for A Large-scale Dataset for Audio-Language Representation Learning
Creative Commons Zero v1.0 Universal

About the dataset #1

Closed: yudongliu97 closed this issue 9 months ago

yudongliu97 commented 1 year ago

Thank you for releasing the dataset! Could you show me an example of how to get the YouTube link from a YouTube ID in your dataset, such as oPRC3zComfA_000117?

ac-alpha commented 1 year ago

@linyifan123456 I am not an author of this paper, but based on the dataset samples provided on the dataset website, the youtube_id can be split into two parts separated by the underscore (_) character.

Example: oPRC3zComfA_000117 splits into oPRC3zComfA, which gives the source video URL https://www.youtube.com/watch?v=oPRC3zComfA, and 000117, which gives a start time of 117 seconds, i.e. the 1:57 timestamp.
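A minimal sketch of the split described above. The function names are hypothetical, and two things are assumptions not stated in the thread: that video IDs are always 11 characters long, and that the `t` query parameter is used to jump to the start time. Because a YouTube ID can itself contain an underscore, the sketch takes the first 11 characters instead of splitting at the first underscore.

```python
from typing import Optional, Tuple

# Assumption: YouTube video IDs are 11 characters long.
YOUTUBE_ID_LENGTH = 11


def split_sample_name(name: str) -> Tuple[str, Optional[int]]:
    """Split a sample name into (video_id, start_seconds or None)."""
    video_id = name[:YOUTUBE_ID_LENGTH]
    rest = name[YOUTUBE_ID_LENGTH:]
    if rest == "":
        # No start-time suffix present.
        return video_id, None
    if rest.startswith("_") and rest[1:].isdigit():
        # Suffix such as "_000117" is read as a start time in seconds.
        return video_id, int(rest[1:])
    raise ValueError(f"unexpected sample name format: {name!r}")


def youtube_url(name: str) -> str:
    """Build a watch URL, adding a start offset when one is present."""
    video_id, start = split_sample_name(name)
    url = f"https://www.youtube.com/watch?v={video_id}"
    if start is not None:
        url += f"&t={start}s"
    return url


print(youtube_url("oPRC3zComfA_000117"))
```

Names without a start-time suffix are returned with `None` as the start time, so the same helper covers both kinds of sample mentioned in this comment.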

However, some samples do not have the second part (the start time), which is strange. I will ask the authors to clarify that.

LoieSun commented 1 year ago

Thank you for your interest in our dataset. As @ac-alpha mentioned, the number following the underscore indicates the video's start time. Our dataset is based on AudioSet and VGGSound; the samples marked with a start time are sourced from VGGSound, whereas the remaining samples are from AudioSet. For additional download options, please refer to the homepages of AudioSet and VGGSound.

Sreyan88 commented 11 months ago

Hi @LoieSun ,

I tried matching the names in the dataset you provided with the file names from the original AudioSet, but I found only 53k matches. Am I missing something here?

LoieSun commented 11 months ago

hi @Sreyan88 , I would like to know more details about this problem if possible. It is worth noting that in Auto-ACD, the filenames of samples sourced from AudioSet are not marked with a start time.

JunZhan2000 commented 11 months ago

Hello, very good work! I want to use this dataset in my own work. To download it, do I have to crawl the audio from YouTube and align it myself? If possible, could you provide a direct download link?

LoieSun commented 11 months ago

@junzhan18 hi, thank you for your interest. For the text, you can download it from this link. For the audio, you can download the original AudioSet and VGGSound and separate the audio from the video.
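One way to "separate the audio from the video" is with ffmpeg, sketched here as a helper that only builds the command line. This is an assumption, not the authors' stated pipeline: the function name, the 10-second default clip length, the mono 16 kHz output, and the file names are all illustrative choices, and ffmpeg must be installed separately.

```python
from typing import List


def ffmpeg_extract_audio_cmd(
    video_path: str,
    audio_path: str,
    start_seconds: float = 0.0,
    duration_seconds: float = 10.0,  # assumption: 10-second clips
    sample_rate: int = 16000,        # assumption: 16 kHz output
) -> List[str]:
    """Build an ffmpeg command that cuts one clip and keeps only its audio."""
    return [
        "ffmpeg",
        "-ss", str(start_seconds),    # seek to the clip start in the input
        "-t", str(duration_seconds),  # limit how much of the input is read
        "-i", video_path,
        "-vn",                        # drop the video stream
        "-ac", "1",                   # downmix to mono
        "-ar", str(sample_rate),      # resample the audio
        audio_path,
    ]


# Hypothetical file names; pass the list to subprocess.run(cmd, check=True) to execute.
cmd = ffmpeg_extract_audio_cmd("oPRC3zComfA.mp4", "oPRC3zComfA_000117.wav", start_seconds=117)
print(" ".join(cmd))
```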

Sreyan88 commented 11 months ago

Hi @LoieSun, thank you so much for your reply. I am trying to match these filenames but am only able to get 53k matches. Any help would be highly appreciated, and excellent work again!

JunZhan2000 commented 11 months ago

AudioSet only provides YouTube links and extracted features. If I need the original audio, do I still need to crawl it myself?

JunZhan2000 commented 11 months ago

Hello, I am downloading this dataset, but I also ran into the problem that the names do not match AudioSet. If it is convenient, could we get in touch over WeChat or email? I cannot find your email address.

JunZhan2000 commented 11 months ago

I downloaded Auto-ACD from Hugging Face (http://huggingface.co/datasets/Loie/Auto-ACD/viewer/default/train?p=19218) and the AudioSet metadata from http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/unbalanced_train_segments.csv, but the YouTube IDs in the two do not fully match.

Sreyan88 commented 11 months ago

Hi @junzhan18 , I am facing the same problem. Please lmk if you were able to solve it.

JunZhan2000 commented 11 months ago

@LoieSun VGGSound's data matches, but AudioSet's doesn't.

LoieSun commented 10 months ago

@junzhan18 @Sreyan88 hi, for data from AudioSet, we named the files using the template 'Y' + YouTube ID. For example, 'Yae9Icdo66YA' indicates the YouTube video with ID 'ae9Icdo66YA', and you should be able to find it in unbalanced_train_segments.csv.
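Taken together with the earlier start-time convention, the naming scheme could be decoded as in the sketch below. The function and pattern names are hypothetical, and the patterns rest on assumptions: 11-character video IDs, and a start-time suffix of exactly six digits as in oPRC3zComfA_000117.

```python
import re
from typing import Optional, Tuple

# Assumption: 11-character ID followed by "_" and a six-digit start time.
VGGSOUND_NAME = re.compile(r"^([A-Za-z0-9_-]{11})_(\d{6})$")
# 'Y' + YouTube ID, as described for samples sourced from AudioSet.
AUDIOSET_NAME = re.compile(r"^Y([A-Za-z0-9_-]{11})$")


def identify_source(name: str) -> Tuple[str, str, Optional[int]]:
    """Return (source, youtube_id, start_seconds) for an Auto-ACD sample name."""
    match = VGGSOUND_NAME.match(name)
    if match:
        return "VGGSound", match.group(1), int(match.group(2))
    match = AUDIOSET_NAME.match(name)
    if match:
        return "AudioSet", match.group(1), None
    # Anything else is left untouched so it can be inspected by hand.
    return "unknown", name, None


print(identify_source("Yae9Icdo66YA"))
print(identify_source("oPRC3zComfA_000117"))
```

Returning "unknown" instead of raising keeps a bulk pass over the dataset going while still flagging names that fit neither pattern.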

misc:
- My WeChat: Loie142857
- E-mail address: loiesun411@gmail.com

Sreyan88 commented 10 months ago

Hi @LoieSun ,

Thank you for your response. However, I still can't find a match. For example, for YcxmYcIGBZDs (in Auto-ACD), I can't find cxmYcIGBZDs in unbalanced_train_segments.csv.

Am I missing something?

Sreyan88 commented 10 months ago

For the example you gave above, I can't even find 'ae9Icdo66YA' in unbalanced_train_segments.csv!

JunZhan2000 commented 10 months ago

Hi, you can use this code to load the AudioSet metadata; it works fine:

import pandas as pd
audioset_metadata = "path/to/unbalanced_train_segments.csv"
audioset = pd.read_csv(audioset_metadata, usecols=[0])
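For checking matches without pandas, a standard-library sketch is shown below. It assumes the segments file starts with comment lines beginning with `#` and that each data row begins with the YouTube ID, followed by comma-and-space separated fields with a quoted label list. The function name is hypothetical, and the sample rows are made up purely for illustration.

```python
import csv
import io
from typing import Iterable, Set


def read_ytids(lines: Iterable[str]) -> Set[str]:
    """Collect the first column (YouTube ID) of every non-comment row."""
    data_lines = (line for line in lines if not line.startswith("#"))
    # skipinitialspace lets the quoted label field after ", " parse as one field.
    reader = csv.reader(data_lines, skipinitialspace=True)
    return {row[0] for row in reader if row}


# Made-up rows in the assumed layout; a real run would pass an open file instead.
sample = io.StringIO(
    "# made-up header line\n"
    "# YTID, start_seconds, end_seconds, positive_labels\n"
    'AAAAAAAAAAA, 30.000, 40.000, "/m/aaa,/m/bbb"\n'
    'BBBBBBBBBBB, 0.000, 10.000, "/m/ccc"\n'
)
ytids = read_ytids(sample)

# Strip the leading 'Y' from Auto-ACD names before looking them up.
names = ["YAAAAAAAAAAA", "YCCCCCCCCCCC"]
matched = [name for name in names if name[1:] in ytids]
print(f"{len(matched)} of {len(names)} names matched")
```

Working with a set of IDs keeps each lookup constant-time, which matters when comparing a large list of names against the full metadata file.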