NovaFrost / SHS100K

metadata for SHS100K

Datasets #2

Open AndyGuo1 opened 7 months ago

AndyGuo1 commented 7 months ago

Hello author, do I need to download every song locally from its URL and then extract the CQT features of each one? Is there any other way to get the CQT features for all the songs?

BastienDZ commented 2 months ago

Since many YouTube links have gone dead, we intend to release an updated version of the SHS100K dataset this summer. So our question to you is: are there particular aspects we need to take into account? Representativeness, for example? What can be improved compared with the 2017 dataset? Any input is more than welcome!

NovaFrost commented 1 month ago

> Since many YouTube links have gone dead, we intend to release an updated version of the SHS100K dataset this summer. So our question to you is: are there particular aspects we need to take into account? Representativeness, for example? What can be improved compared with the 2017 dataset? Any input is more than welcome!

I just looked through the SecondHandSongs site, and it has much more data than when I crawled it in 2017. I think you could re-crawl the site and get a larger dataset. Also, the 2017 dataset does not have much metadata (only the performer, title, and YouTube URL are provided). Maybe this time you could use the site's APIs to collect more metadata.

NovaFrost commented 1 month ago

> Hello author, do I need to download every song locally from its URL and then extract the CQT features of each one? Is there any other way to get the CQT features for all the songs?

Hi. You could download CQT features from
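If the features have to be computed locally instead, the sketch below shows one way to batch the extraction over already-downloaded audio. It is only an illustration under stated assumptions: it relies on the third-party `librosa` and `numpy` packages, the helper names (`feature_path`, `pending_files`, `extract_cqt`, `extract_all`) and the list of audio extensions are hypothetical, and the CQT parameters are librosa's defaults, not necessarily the settings used for the released SHS100K features.

```python
from pathlib import Path

# Assumed set of audio formats; adjust to whatever the downloader produced.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a"}


def feature_path(audio_path, out_dir):
    """Map e.g. songs/abc.mp3 to <out_dir>/abc.npy."""
    return Path(out_dir) / (Path(audio_path).stem + ".npy")


def pending_files(audio_dir, out_dir):
    """Return audio files that do not yet have a saved feature file."""
    files = sorted(
        p for p in Path(audio_dir).iterdir()
        if p.suffix.lower() in AUDIO_EXTENSIONS
    )
    return [p for p in files if not feature_path(p, out_dir).exists()]


def extract_cqt(audio_path, sr=22050, hop_length=512,
                n_bins=84, bins_per_octave=12):
    """Return the CQT magnitude of one file as (n_bins, n_frames) float32.

    The parameter values are librosa defaults, used here as assumptions.
    """
    import numpy as np   # third-party, imported lazily
    import librosa       # third-party, imported lazily

    y, _ = librosa.load(str(audio_path), sr=sr, mono=True)
    spec = librosa.cqt(y, sr=sr, hop_length=hop_length,
                       n_bins=n_bins, bins_per_octave=bins_per_octave)
    return np.abs(spec).astype(np.float32)


def extract_all(audio_dir, out_dir):
    """Extract and save features for every file that is still pending."""
    todo = pending_files(audio_dir, out_dir)
    if not todo:
        return []
    import numpy as np
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for audio in todo:
        target = feature_path(audio, out_dir)
        np.save(target, extract_cqt(audio))
        written.append(target)
    return written
```

Calling `extract_all("audio", "cqt")` would write one `.npy` file per song and skip songs that already have one, so an interrupted run can be resumed.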

BastienDZ commented 2 weeks ago

> Since many YouTube links have gone dead, we intend to release an updated version of the SHS100K dataset this summer. So our question to you is: are there particular aspects we need to take into account? Representativeness, for example? What can be improved compared with the 2017 dataset? Any input is more than welcome!

Hi Nova, thanks for your reply! Is there any way we could get in touch directly? Feel free to reach out to us via

We'd like to chat about other aspects, such as 2DFM and HPCP extraction. Do you know whether researchers have used those? Do we need to include them again? Any insight is welcome!