XinhaoMei / WavCaps

This repository contains metadata for the WavCaps dataset and code for downstream tasks.

Freesound JSON files #4

Closed MorenoLaQuatra closed 1 year ago

MorenoLaQuatra commented 1 year ago

Thank you so much for providing such valuable resources for the audio community. The WavCaps dataset can be incredibly helpful for training and evaluating models.

I was wondering whether it would be possible to upload the JSON files for the Freesound dataset to the WavCaps GitHub repository, similar to the files provided for AudioSet, BBC, and SoundBible. Access to the JSON metadata through the repository could make these resources even more useful for researchers.

Thank you again for your contributions to advancing this important area of research.

XinhaoMei commented 1 year ago

Hi,

Thanks for your interest! You can download it through Google Drive (https://drive.google.com/drive/folders/1h9P4_qiNVZR-PIZrL5Ow0v62S8C4ygyo). We also provide the waveforms whose durations are less than 2 seconds (222935 audio clips)!

Cheers!

MorenoLaQuatra commented 1 year ago

Thank you for the prompt response and for providing the link to the Freesound dataset!

If I may, I have an additional question regarding the AudioSet JSON file. The "id" key seems to refer to the YouTube video ID, and "duration" refers to the length of the audio sample. However, I noticed that there is no "start" time specified in the file. Am I missing something, or is this information not included in the JSON metadata? Should we join with the AudioSet metadata to get the audio files?

Thank you again for your help!

XinhaoMei commented 1 year ago

Hi,

We generated the captions using the metadata provided in the AudioSet strongly-labelled subset, so you can directly use the start time in the AudioSet metadata. Because we use the waveforms provided in PANNs (https://github.com/qiuqiangkong/audioset_tagging_cnn), a 'Y' is prepended to the 'id'. You can use the id to retrieve the metadata.

Cheers!
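
For readers following along, the id convention described above can be sketched as a small helper. This is only an illustration: the function name `wavcaps_id_to_ytid` is hypothetical and not part of the WavCaps code, and it assumes every AudioSet entry in the JSON has the form `Y<ytid>.wav`, as in the example quoted later in this thread.

```python
def wavcaps_id_to_ytid(wavcaps_id: str) -> str:
    """Recover the YouTube id from a WavCaps AudioSet 'id' such as 'Y9-twZb7XCAg.wav'.

    Hypothetical helper: assumes a leading 'Y' and an optional '.wav' suffix.
    """
    # Drop the file extension if it is present.
    stem = wavcaps_id[:-4] if wavcaps_id.endswith(".wav") else wavcaps_id
    if not stem.startswith("Y"):
        raise ValueError(f"expected a leading 'Y' in {wavcaps_id!r}")
    # Remove only the first character, since the YouTube id itself may contain 'Y'.
    return stem[1:]


ytid = wavcaps_id_to_ytid("Y9-twZb7XCAg.wav")
```

The resulting `ytid` can then be matched against the AudioSet metadata.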

MorenoLaQuatra commented 1 year ago

Thank you for the feedback. I tried to join the information contained in the AudioSet strongly labeled split with the JSON files provided in the repo. However, it seems that each entry in WavCaps has multiple entries in the strongly labeled split. For example:

In the JSON file:

{"id": "Y9-twZb7XCAg.wav", "caption": "Scraping and thumping noises occur.", "audio": "wav_path", "duration": 10.00065625}

In the AudioSet metadata:

9-twZb7XCAg_30000   0.725   0.953   /t/dd00099
9-twZb7XCAg_30000   1.228   1.339   /t/dd00099
9-twZb7XCAg_30000   1.512   1.827   /m/07qv4k0
9-twZb7XCAg_30000   1.913   2.008   /t/dd00099
9-twZb7XCAg_30000   2.094   2.402   /m/07qv4k0
9-twZb7XCAg_30000   3.024   3.346   /m/07qv4k0
9-twZb7XCAg_30000   3.331   3.543   /t/dd00099
9-twZb7XCAg_30000   3.764   4.315   /m/07qv4k0
9-twZb7XCAg_30000   3.787   4.520   /t/dd00099
9-twZb7XCAg_30000   4.630   5.071   /m/07qv4k0
9-twZb7XCAg_30000   5.063   5.150   /m/07qjznt
9-twZb7XCAg_30000   5.220   5.858   /m/07qnq_y
9-twZb7XCAg_30000   5.583   5.953   /m/07s02z0
9-twZb7XCAg_30000   5.882   6.220   /m/07qv4k0
9-twZb7XCAg_30000   6.606   6.850   /m/07qv4k0
9-twZb7XCAg_30000   7.047   7.323   /m/07qv4k0
9-twZb7XCAg_30000   7.535   7.827   /m/07qv4k0
9-twZb7XCAg_30000   7.969   8.189   /m/07qv4k0
9-twZb7XCAg_30000   9.150   9.425   /m/07qv4k0
9-twZb7XCAg_30000   9.622   9.992   /m/07qv4k0

Is there any way to get the start-end time? Is it always 0-10 seconds?

XinhaoMei commented 1 year ago

Hello,

The first column is the clip_id, in the format ytid_starttimems, where ytid is the parent YouTube id and starttimems is the start, in milliseconds, of the 10-second clip that was annotated within the video's soundtrack.

In our version, we prepend a 'Y' to the ytid. For example, in this case the ytid is 9-twZb7XCAg and the start time is 30000, so the file name in our JSON file is Y9-twZb7XCAg.wav.
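
A minimal sketch of that mapping, for illustration only. The helper names are hypothetical. The split is done on the last underscore because a YouTube id can itself contain underscores (for example `4PPmyY_-YrA`).

```python
from typing import Tuple


def parse_segment_id(segment_id: str) -> Tuple[str, int]:
    """Split 'ytid_starttimems' into (ytid, start time in milliseconds)."""
    # rsplit with maxsplit=1 keeps any underscores inside the YouTube id intact.
    ytid, start_ms = segment_id.rsplit("_", 1)
    return ytid, int(start_ms)


def to_wavcaps_name(segment_id: str) -> str:
    """Build the file name used in the WavCaps JSON: 'Y' + ytid + '.wav'."""
    ytid, _ = parse_segment_id(segment_id)
    return f"Y{ytid}.wav"


example = parse_segment_id("9-twZb7XCAg_30000")   # ("9-twZb7XCAg", 30000)
name = to_wavcaps_name("9-twZb7XCAg_30000")       # "Y9-twZb7XCAg.wav"
```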

MorenoLaQuatra commented 1 year ago
segment_id      start_time_seconds      end_time_seconds        label
b0RFKhbpFJA_30000       0.000   10.000  /m/03m9d0z
b0RFKhbpFJA_30000       4.753   5.720   /m/05zppz
b0RFKhbpFJA_30000       0.000   10.000  /m/07pjwq1
b0RFKhbpFJA_30000       6.899   7.010   /m/07qjznt
b0RFKhbpFJA_30000       8.534   9.156   /t/dd00092
NQNTnl0zaqU_70000       0.000   0.103   /m/07rdhzs
NQNTnl0zaqU_70000       0.233   0.443   /m/07rdhzs
NQNTnl0zaqU_70000       0.542   0.785   /m/07rdhzs
NQNTnl0zaqU_70000       0.940   1.208   /m/07rdhzs
NQNTnl0zaqU_70000       1.200   2.183   /m/024dl
NQNTnl0zaqU_70000       1.947   4.246   /m/0_ksk
NQNTnl0zaqU_70000       3.539   5.464   /m/01b82r
NQNTnl0zaqU_70000       4.944   6.951   /m/0284vy3
NQNTnl0zaqU_70000       6.975   7.869   /m/0c1dj
NQNTnl0zaqU_70000       7.999   8.933   /m/0c1dj
NQNTnl0zaqU_70000       9.063   10.000  /m/0c1dj
4PPmyY_-YrA_30000       0.000   10.000  /m/03wvsk
4PPmyY_-YrA_30000       7.983   8.161   /m/07qjznt
LvNUyQ3xuAQ_0   0.596   5.677   /m/015jpf

Actually, this is the head of the audioset_train_strong.tsv file, which I think is what confused me. Just to clarify: the number XX (id_XX) in AudioSet is the start time in ms, and we take the duration specified in the WavCaps JSON file to get the final sample. Is that correct?

XinhaoMei commented 1 year ago

You could just take 10 seconds for all audio clips. The duration in our file is based on our version. You can overwrite it for your downloaded version.

I will try to upload our version to Google Drive by the end of tomorrow!
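
Putting the pieces of this thread together, one way to derive a download window per clip is sketched below. It is not the authors' tooling: `clip_windows` and `CLIP_SECONDS` are hypothetical names, and the code assumes the TSV is tab-delimited with the `segment_id` header shown above. Every row for a segment is an event inside the same clip, so the rows collapse to one window per file: the start offset from the segment id plus a fixed 10 seconds.

```python
import csv
import io
from typing import Dict, Tuple

CLIP_SECONDS = 10.0  # fixed clip length, as suggested in the reply above


def clip_windows(tsv_text: str) -> Dict[str, Tuple[float, float]]:
    """Map each WavCaps-style file name to (start_s, end_s) within the source video."""
    windows: Dict[str, Tuple[float, float]] = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # The event times in the other columns are relative to the clip and are
        # not needed here; only the offset encoded in segment_id matters.
        ytid, start_ms = row["segment_id"].rsplit("_", 1)
        start_s = int(start_ms) / 1000.0
        windows[f"Y{ytid}.wav"] = (start_s, start_s + CLIP_SECONDS)
    return windows


# A few rows copied from the excerpt above, rewritten with explicit tabs.
SAMPLE_TSV = (
    "segment_id\tstart_time_seconds\tend_time_seconds\tlabel\n"
    "b0RFKhbpFJA_30000\t0.000\t10.000\t/m/03m9d0z\n"
    "b0RFKhbpFJA_30000\t4.753\t5.720\t/m/05zppz\n"
    "LvNUyQ3xuAQ_0\t0.596\t5.677\t/m/015jpf\n"
)

windows = clip_windows(SAMPLE_TSV)
```

Here `windows` holds one entry per clip, keyed by the file name used in the WavCaps JSON, which makes the join with the captions a plain dictionary lookup.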

MorenoLaQuatra commented 1 year ago

Thanks, I'm actually trying to implement a standalone downloader.

MorenoLaQuatra commented 1 year ago

I think I solved all the issues with AudioSet. About Freesound, I would like to ask for a clarification: are the audio files provided on Google Drive the ones below 2 seconds (as previously stated) or below 2 minutes (as stated in the README)?

XinhaoMei commented 1 year ago

Hi, I am sorry for my typo before. They are below 2 minutes!

MorenoLaQuatra commented 1 year ago

That's great! Thank you again for your efforts.