AI-ON / Few-Shot-Music-Generation

Brainstorming for lyrics dataset #1

Closed sachinravi14 closed 6 years ago

sachinravi14 commented 6 years ago

Details of the requirements for the dataset can be found in the proposal. We are looking for suggestions for creating this dataset, including:

AnishShah commented 6 years ago

Hi, I just went through the papers in the reading list. Regarding the lyrics dataset, I was wondering whether we could crawl lyrics from a website.

korjani commented 6 years ago

Hi, I have been working on lyrics generation for a while. I am gathering lyrics data from the web and adding it to GitHub: https://github.com/MohMehKo/lyrics/tree/master/artist_songs. Let me know if it is a good starting point.

heaven00 commented 6 years ago

In the musicmood project (https://github.com/rasbt/musicmood) by Sebastian Raschka, the data collection is done by taking songs from the Million Song Dataset and then scraping the lyrics from LyricWikia. More details here and a demonstration here.

vadirajmkulkarni commented 6 years ago

I second @heaven00's idea. We will need to convert the tracks to MIDI format.

sagelywizard commented 6 years ago

Looks like the Lakh MIDI dataset has lyrics attached to ~23800 MIDI files, which might be useful. I haven't looked at the quality of the dataset yet though.

edit: link

sachinravi14 commented 6 years ago

@korjani, the repository you linked to looks great! Can you give details about how the dataset was created?

  1. How were the artists picked and where was the data scraped from?
  2. What are the details of the dataset? Specifically, how many different artists are there, and how many songs are there per artist on average?

vishalbhalla commented 6 years ago

These lyrics data sources could also be used as a good starting point.

  1. Lyrics for 57,650 songs, scraped from LyricsFreak.
  2. 380,000+ lyrics from MetroLyrics. This one also includes the source code, which shows how the lyrics were extracted from the website.
  3. The musiXmatch dataset, the official lyrics collection of the Million Song Dataset (MSD). However, it provides lyrics in bag-of-words form rather than as the original text, because the original lyrics are protected by copyright and musiXmatch does not have permission to redistribute them.
  4. A smaller dataset, 50 Years of Pop Music Lyrics, comprising the Year-End Hot 100 songs that Billboard has published since 1958.
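
A note on the bag-of-words point in item 3: a bag-of-words representation keeps only word counts, so it may be of limited use for lyrics generation, where word order matters. Below is a minimal sketch of why, using made-up example lines and a hypothetical `bag_of_words` helper. It is not the musiXmatch format itself, which also applies its own stemming and vocabulary.

```python
from collections import Counter


def bag_of_words(lyric: str) -> Counter:
    """Count word occurrences in a lyric line, discarding word order.

    Hypothetical helper for illustration only: it lowercases and splits
    on whitespace, with no stemming or vocabulary filtering.
    """
    return Counter(lyric.lower().split())


# Two made-up lines with the same words in a different order.
line_a = "you love me and i love you"
line_b = "i love you and you love me"

bow_a = bag_of_words(line_a)
bow_b = bag_of_words(line_b)

# The lines differ as text, but their bags of words are identical,
# so the original ordering cannot be recovered from the counts alone.
print(line_a == line_b)
print(bow_a == bow_b)
```

Because both lines map to the same counts, a model trained only on such data sees no sequence information, which is why the scraped sources in items 1, 2, and 4 are probably the better fit here.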