edublancas / song-lyrics

Exploratory analysis of 200K+ song lyrics from the Million Song Dataset
https://blancas.io/song-lyrics/
MIT License

Song lyrics project

Exploratory Data Analysis and Visualization, Columbia University, Spring 2018.

Project overview

We explored song lyrics data from the musiXmatch + Million Song Dataset to draw conclusions about trends in song lyrics and music across time and geography. We asked questions that explore different facets of the dataset and identified some interesting trends.

Deliverables

The report for this project is available here.

The interactive component, built with D3, lets you explore data points such as sentiment scores, topic scores, and similar artists for the top artists in the Million Song Dataset + musiXmatch dataset. Click here to view the interactive component.

Folder structure

Data

We use the Million Song Dataset, specifically the musiXmatch dataset, which contains lyrics data for 237,662 tracks.
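
The musiXmatch lyrics are distributed as bag-of-words counts rather than full text. The sketch below shows one way to read that format in Python. It assumes the published text layout: comment lines start with `#`, one line starting with `%` lists the vocabulary, and each track line holds the track ID, the musiXmatch track ID, and 1-based `index:count` pairs. The function names and the sample lines are hypothetical and are not taken from this repository.

```python
from typing import Dict, List, Tuple


def parse_vocabulary(line: str) -> List[str]:
    """Parse the '%'-prefixed line that lists the vocabulary words."""
    return line.lstrip("%").strip().split(",")


def parse_track(line: str, vocabulary: List[str]) -> Tuple[str, str, Dict[str, int]]:
    """Parse one track line into (track ID, musiXmatch ID, word counts)."""
    track_id, mxm_id, *pairs = line.strip().split(",")
    counts = {}
    for pair in pairs:
        index, count = pair.split(":")
        # Word indices in the file are assumed to be 1-based.
        counts[vocabulary[int(index) - 1]] = int(count)
    return track_id, mxm_id, counts


# Tiny made-up sample in the assumed format (not real data).
sample = [
    "# comment lines start with '#'",
    "%i,the,you,to",
    "TRHYPOTHETICAL01,123,1:6,2:4,4:2",
]

vocabulary: List[str] = []
tracks = []
for raw in sample:
    if raw.startswith("#") or not raw.strip():
        continue  # skip comments and blank lines
    if raw.startswith("%"):
        vocabulary = parse_vocabulary(raw)
    else:
        tracks.append(parse_track(raw, vocabulary))

print(tracks)
```

To use this on real data, replace `sample` with the lines of a downloaded musiXmatch file.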

Quickstart

git clone https://github.com/edublancas/song-lyrics
cd song-lyrics

0. Software requirements

This project requires Python 3 and R.
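
A quick way to confirm these prerequisites before installing packages is a small check like the one below. This is a minimal sketch, not part of the repository: it assumes R is exposed as an `Rscript` command and that `make` is on your `PATH`, which may differ on your system.

```python
import shutil
import sys


def missing_tools(tools):
    """Return the subset of command names that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Check the interpreter that is running this script.
python_ok = sys.version_info.major == 3

# "Rscript" and "make" are assumed command names; adjust for your system.
missing = missing_tools(["Rscript", "make"])

print("Python 3:", "ok" if python_ok else "not found")
print("Missing tools:", missing or "none")
```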

To install the required Python and R packages:

make requirements

1. Get raw data

The following command fetches all the datasets we used. It creates a new data/ folder in the current working directory; raw data is stored in data/raw.

make get_data

Note: GloVe can fail to download with wget. It is better to download it manually and put the uncompressed data in data/raw.
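
After a manual download, one way to confirm the uncompressed file is readable is to parse a few vectors. The sketch below assumes the standard GloVe text format (one token per line, followed by space-separated float components). The `load_glove` helper is hypothetical and is not the loader used in this project, and the example reads from an in-memory stand-in rather than a real file.

```python
import io
from typing import Dict, List, TextIO


def load_glove(handle: TextIO) -> Dict[str, List[float]]:
    """Read GloVe text vectors: a token followed by its float components."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# In-memory stand-in for a real file; the values are made up.
fake_file = io.StringIO("the 0.1 -0.2 0.3\nsong 0.4 0.5 -0.6\n")
vectors = load_glove(fake_file)
print(len(vectors), len(vectors["song"]))
```

For a real check, open the file you placed in data/raw with `open(path, encoding="utf-8")` and pass the handle to `load_glove`. The file name depends on which GloVe release you downloaded.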

2. Process data

This command runs all the cleaning and processing we did on the data and outputs the final datasets used in the report and the interactive component.

make bootstrap

3. Build report

Build the final report.

make report