edublancas / song-lyrics

Exploratory Analysis of 200K+ song lyrics from the 1 million songs dataset
https://blancas.io/song-lyrics/
MIT License

first exploration - aaron #13

edublancas closed 6 years ago

aaronsadholz commented 6 years ago

Just took my first look at the data. Was able to get the feather files working from Eduardo's instructions, and made a couple of simple plots. One shows each artist's location on the world map and the other plots which artists have the most songs in the dataset for a specified year.
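As a rough illustration of the second plot, here is a minimal pandas sketch of counting songs per artist for one year. The column names `artist_name` and `year`, the helper `top_artists_for_year`, and the toy data are assumptions for illustration; the actual feather files may be laid out differently.

```python
import pandas as pd


def top_artists_for_year(songs: pd.DataFrame, year: int, n: int = 10) -> pd.Series:
    """Return the n artists with the most songs in the given year.

    Assumes hypothetical columns 'artist_name' and 'year'.
    """
    in_year = songs[songs["year"] == year]
    # value_counts sorts by count, largest first
    return in_year["artist_name"].value_counts().head(n)


# Toy stand-in for the real dataset (made-up rows, not project data)
songs = pd.DataFrame(
    {
        "artist_name": ["A", "A", "B", "C", "B", "A"],
        "year": [1999, 1999, 1999, 2000, 2000, 1999],
    }
)

top_1999 = top_artists_for_year(songs, 1999, n=2)
```

With matplotlib installed, `top_1999.plot.barh()` would turn the counts into a bar chart.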

I'll move on to looking at the bag of words next.

edublancas commented 6 years ago

Thanks for the update!

aaronsadholz commented 6 years ago

Just took a first look at doing some topic modeling on the data. Results are in my exploration folder.

I used 2 different data representations:

I used 2 different topic modeling techniques from sklearn:

Results at this point don't seem to clearly identify different topics, but I see a lot of room for improvement:

edublancas commented 6 years ago

Thanks for the update! Maybe the word embeddings representation can help here. I'll update you when I get that to work.

aaronsadholz commented 6 years ago

Here's an update on the topic modeling. This summary is also included in my exploration notebook.

After exploring a few basic models, here are my initial takeaways from the model I believe performed the best:

There is still a lot of room for improvement, but I think there is quite a bit we could do with these topics as a supplemental part of this project. Each word is assigned a weight for each topic, so we can score each song (and therefore each artist, location, etc.) by how prevalent each topic is. Since this is an unsupervised algorithm, we have to assign topic names to the clusters ourselves. Below are the top 10 weighted words in each topic, with the topic name I selected (I'm only listing topics that appeared clear; there are others in the notebook).

Would love to hear any feedback.

Clear Topics:

Clear Groupings that Don't Indicate a Topic:

Topics I'd say are "a bit of a stretch":

edublancas commented 6 years ago

Those are amazing results! I can't wait to see what we get when I finish the word embeddings part. What do you think, @jose-alvarado-guzman @valmikkpatel?