Exploratory Data Analysis and Visualization, Columbia University, Spring 2018.
We explored song lyrics data from the musiXmatch + Million Song Dataset to identify trends in song lyrics and music across time and geography. We asked questions that explore different facets of the dataset and found some interesting trends.
The report for this project is available here.
The interactive component, built in D3, lets you explore data points such as sentiment scores, topic scores and similar artists for the top artists in the musiXmatch + Million Song Dataset. Click here to view the interactive component.
data/
- Data is dumped here, not included in the repository
interactive/
- Source code for interactive component
experiments/
- Notebooks/scripts that we used to explore the data
lib/
- R utility functions used in the project
process/
- Scripts for downloading and processing the data (Python 3)
process/pkg/
- Python package with utility functions
process/clean/
- Cleaning the raw data
process/transform/
- Code for generating various song vector representations
process/cluster/
- Clustering songs
report/
- Report files

We are using the Million Song Dataset, specifically the musiXmatch dataset, which contains lyrics data for 237,662 tracks.
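For orientation, here is a minimal sketch of reading the musiXmatch bag-of-words format. It assumes the layout the dataset is usually distributed in: `#` lines are comments, a single `%` line lists the vocabulary, and each remaining line is `track_id,mxm_track_id,index:count,...` with 1-based word indices. The function name `parse_mxm` and the sample rows are hypothetical and are not taken from this repository's cleaning scripts.

```python
from typing import Dict, Iterable, List, Tuple


def parse_mxm(lines: Iterable[str]) -> Tuple[List[str], Dict[str, Dict[str, int]]]:
    """Parse bag-of-words lines into a vocabulary and per-track word counts.

    Assumed format (not confirmed by this repository):
      '#...'  comment line
      '%w1,w2,...'  vocabulary line
      'track_id,mxm_track_id,idx:count,...'  one track, 1-based indices
    """
    vocab: List[str] = []
    tracks: Dict[str, Dict[str, int]] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("%"):
            vocab = line[1:].split(",")
            continue
        track_id, _mxm_id, *pairs = line.split(",")
        counts: Dict[str, int] = {}
        for pair in pairs:
            idx, count = pair.split(":")
            counts[vocab[int(idx) - 1]] = int(count)  # convert 1-based index
        tracks[track_id] = counts
    return vocab, tracks


# Hypothetical sample rows, made up for illustration only.
sample = [
    "# comment line",
    "%i,the,you,to",
    "TRACK0001,42,1:6,3:2",
]
vocab, tracks = parse_mxm(sample)
print(tracks)
```

With a real file, the same function could be called on an open file handle instead of the `sample` list.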
git clone https://github.com/edublancas/song-lyrics
cd song-lyrics
This project requires Python 3 and R.
To install the required Python and R packages:
make requirements
The following command fetches all the datasets we used. It creates a new data/ folder in the current working directory; raw data is stored in data/raw.
make get_data
Note: GloVe sometimes fails to download with wget. It is better to download it manually and put the uncompressed data in data/raw.
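After a manual download, a quick check that the uncompressed vectors are readable can save a failed run later. The sketch below assumes the standard GloVe text format (one token per line followed by space-separated floats). The function name `load_glove` and the file name in the comment are hypothetical; the actual file this project expects in data/raw is not stated here.

```python
from typing import Dict, Iterable, List


def load_glove(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Read GloVe-style text lines into a token -> vector mapping."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Made-up rows for illustration; real vectors have many more dimensions.
sample = ["song 0.1 -0.2 0.3", "lyric 0.0 0.5 -1.0"]
vectors = load_glove(sample)
print(len(vectors), len(vectors["song"]))

# With a real file (hypothetical name, adjust to what you downloaded):
# with open("data/raw/glove.6B.50d.txt", encoding="utf-8") as f:
#     vectors = load_glove(f)
```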
This command runs all the cleaning and processing we did on the data and outputs the final datasets used in the report and the interactive component.
make bootstrap
Build the final report.
make report