Brahex / text-mining-final-project


Text Mining Group Assignment:

Sarah de Jong, Tom Klein Tijssink, Lukas Busch


The goal of this project is to explore different machine learning approaches to generating song lyrics. We use a dataset that contains 362,237 songs. First, we use a BERT model to label each song with a positive or negative sentiment. Then, we explore a simple N-gram model, a word-based LSTM, a character-based LSTM, and a GPT-2 model to generate song lyrics. In a survey in which human respondents judged the generated lyrics, the GPT-2 model performed best.
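To make the simplest of these approaches concrete, here is a minimal sketch of a bigram (N = 2) generator. It is not the code from our notebooks: the function names, the whitespace tokenization, and the toy corpus are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def build_bigram_model(lyrics):
    """Map each word to the list of words that follow it in the corpus."""
    model = defaultdict(list)
    for song in lyrics:
        # Assumption: lowercase whitespace tokenization is good enough here.
        words = song.lower().split()
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model


def generate(model, seed_word, n_words, rng=None):
    """Sample up to n_words words, starting from seed_word."""
    rng = rng or random.Random()
    words = [seed_word]
    while len(words) < n_words:
        candidates = model.get(words[-1])
        if not candidates:
            # The last word never had a successor in the corpus: stop early.
            break
        # Repeated successors appear several times, so frequent
        # continuations are sampled more often.
        words.append(rng.choice(candidates))
    return " ".join(words)


# Hypothetical toy corpus, not taken from the dataset.
toy_corpus = ["love me tender love me true", "love me do"]
model = build_bigram_model(toy_corpus)
print(generate(model, "love", 6, random.Random(0)))
```

Conditioning on genre or sentiment could be approximated in this sketch by building one model per subset of songs, which is again an assumption and not necessarily how the notebooks do it.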

Research question

Can we create a song writing program that takes a number of words as well as a genre and sentiment (positive/negative) to generate lyrics?


For this project, we used the dataset 380000-lyrics-from-metrolyrics, because it includes genres as well as lyrics. It was originally available on Kaggle, but it is no longer available there. We retrieved the data from a project from last year. The dataset contains 362,237 songs, each with an artist, year, genre, and lyrics. It can be downloaded in CSV format.
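As a rough illustration of working with such a CSV file, the sketch below counts songs per genre using only the standard library. The column names `genre` and `lyrics` and the sample rows are assumptions; check the header of the actual file before relying on them.

```python
import csv
import io
from collections import Counter


def count_songs_per_genre(csv_file):
    """Count songs per genre, skipping rows that have no lyrics."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        # Assumed column names: "lyrics" and "genre".
        lyrics = (row.get("lyrics") or "").strip()
        if not lyrics:
            continue
        counts[row.get("genre") or "unknown"] += 1
    return counts


# Hypothetical sample rows standing in for the real file.
sample = io.StringIO(
    "song,year,artist,genre,lyrics\n"
    "a,2009,x,Pop,la la la\n"
    "b,2010,y,Rock,\n"
    "c,2011,z,Pop,na na na\n"
)
print(count_songs_per_genre(sample))
```

For the real file, pass an object opened with `open(path, newline="", encoding="utf-8")`, so that lyrics spanning several lines inside quoted fields are parsed correctly.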

A tentative list of milestones for the project

Division of the work

We met about 10 times to discuss the project and our progress. In the first week, we chose the topic together. In the second week, Lukas did the sentiment analysis of the songs, Tom looked at an N-gram RNN model, and Sarah looked at ways to evaluate the songs. In the third week, Lukas looked at the GPT-2 model, and Sarah tried to scale the N-gram RNN model to a larger number of songs; this used too many resources and is therefore not part of the final project. In the fourth and fifth weeks, Sarah looked at a basic N-gram model, and Tom and Lukas both looked at an LSTM model. In the sixth week, we cleaned up the code and repository, evaluated our generated songs with a survey, wrote the report, and prepared the presentation.


The src folder in this repository contains our work. We did not put all our code in one notebook, because keeping each model in its own notebook is much better organized. Furthermore, we started out in separate files, and merging them at the end would have meant rerunning everything (or submitting a notebook that had not been run). Our repository contains the following files:

The repository also includes our report and the slides for the presentation.

To reproduce our results, first unzip the dataset. Then run the code in sentiment_analysis/Sentiment_Analysis.ipynb, followed by sentiment_analysis/predicting_sentiments.ipynb. After that, you can run the code for each model.
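If you prefer to run the sentiment notebooks from a script instead of opening them by hand, the following sketch shows one possible way. It assumes that Jupyter with nbconvert is installed and that the script is started from the folder that contains sentiment_analysis; the helper names are hypothetical and not part of the repository.

```python
import subprocess

# Order taken from the instructions above: sentiment notebooks come first.
SENTIMENT_NOTEBOOKS = [
    "sentiment_analysis/Sentiment_Analysis.ipynb",
    "sentiment_analysis/predicting_sentiments.ipynb",
]


def build_command(notebook_path):
    """Build the nbconvert command that executes a notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        notebook_path,
    ]


def run_notebooks(notebook_paths, dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    commands = [build_command(path) for path in notebook_paths]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first notebook that fails.
            subprocess.run(command, check=True)
    return commands


commands = run_notebooks(SENTIMENT_NOTEBOOKS)
```

With `dry_run=True` the commands are only printed; call `run_notebooks(SENTIMENT_NOTEBOOKS, dry_run=False)` to actually execute them. Note that executing in place overwrites the stored notebook outputs.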