edublancas / song-lyrics

Exploratory Analysis of 200K+ song lyrics from the 1 million songs dataset
https://blancas.io/song-lyrics/
MIT License

Representation using word embeddings #8

Closed edublancas closed 6 years ago

edublancas commented 6 years ago

We can cluster our 5K words by representing each one as a word embedding. Once we have the groups, we can use them to change the bag-of-words representation (e.g. add up all the counts for a given group). This can help us reduce dimensionality, and we can cluster the resulting representation as well.

See this: https://nlp.stanford.edu/projects/glove/
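
A minimal sketch of the idea above, using only the standard library. The 2-D vectors, the four-word vocabulary, and the helper names `kmeans` and `collapse_counts` are made up for illustration; real GloVe vectors have 50 or more dimensions, and a library implementation of k-means would be a better choice for 5K words.

```python
import math


def kmeans(vectors, k, n_iter=20):
    """Plain k-means; returns one cluster index per vector.

    Initial centroids are simply the first k vectors (a deterministic
    toy choice, not a recommended initialization).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def collapse_counts(counts, labels, k):
    """Sum the bag-of-words counts of all words that share a cluster."""
    grouped = [0] * k
    for count, label in zip(counts, labels):
        grouped[label] += count
    return grouped


# Hypothetical toy data: four words with made-up 2-D "embeddings".
vocab = ["love", "car", "heart", "road"]
embeddings = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]

labels = kmeans(embeddings, k=2)

# Bag-of-words counts for one song, in the same order as vocab.
song_counts = [3, 1, 2, 0]
reduced = collapse_counts(song_counts, labels, k=2)
```

Here the song's four word counts shrink to two group counts, one per cluster. Applying `collapse_counts` to every row of the bag-of-words matrix gives the lower-dimensional representation.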

edublancas commented 6 years ago

I already have a function to convert the bag-of-words representation to dense vectors using word embeddings, but I found a problem: due to stemming, some words do not exist in the word embeddings data (e.g. our data has the word "someth", while the embeddings have "something"). I will work on solving this issue.

This is related to #14. I will probably write a script to subset the word embeddings data and do entity resolution.
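
One possible way to do the matching, sketched under assumptions: treat each stem as a prefix and pick the shortest embedding word that starts with it. This is a heuristic and not necessarily what the script will end up doing. The function name `resolve_stems` and the sample vocabularies are hypothetical.

```python
def resolve_stems(stems, embedding_vocab):
    """Map each stem to a word in the embedding vocabulary.

    Exact matches win. Otherwise the shortest word that starts with
    the stem is chosen (ties broken alphabetically). Stems with no
    candidate map to None.
    """
    vocab = set(embedding_vocab)
    resolved = {}
    for stem in stems:
        if stem in vocab:
            resolved[stem] = stem
            continue
        candidates = [word for word in vocab if word.startswith(stem)]
        if candidates:
            resolved[stem] = min(candidates, key=lambda w: (len(w), w))
        else:
            resolved[stem] = None
    return resolved


# Hypothetical sample data.
stems = ["someth", "love", "xyzzy"]
embedding_vocab = ["something", "somethings", "love", "lovely"]

mapping = resolve_stems(stems, embedding_vocab)
```

Prefix matching will miss stems whose spelling was changed by the stemmer, for example when a trailing "y" becomes "i". A more reliable alternative, assuming the original stemmer is known, would be to stem every word in the embedding vocabulary with that same stemmer and join on the stems.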