edublancas / song-lyrics

Exploratory Analysis of 200K+ song lyrics from the 1 million songs dataset
https://blancas.io/song-lyrics/
MIT License

Word embeddings representation #18

Closed edublancas closed 6 years ago

edublancas commented 6 years ago

I pushed the scripts that convert the bag-of-words representation to word embeddings. You can run them with these commands:

# build the word embeddings vocabulary by subsetting the GloVe dataset:
# exact matches for words in the musixmatch dataset, fuzzy matching for
# the remaining words
./process/clean/subset_embeddings data/clean/mxm_dataset.json \
    data/raw/glove.6B/glove.6B.50d.txt data/clean/embeddings_subset.json

# use word embeddings to represent songs: each song is represented as the
# sum of count * embedding vector over every word in it; run with --help
# for more info
./process/transform/word_embeddings data/clean/embeddings_subset.json \
    data/transform/mxm_embeddings.feather
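
To make "sum of count * embedding vector" concrete, here is a minimal pure-Python sketch of that representation. The function name `song_vector`, the choice to skip words without an embedding, and the toy 2-dimensional vectors are all assumptions for illustration; the actual `word_embeddings` script may differ in detail.

```python
def song_vector(word_counts, embeddings, dim):
    """Return the count-weighted sum of embedding vectors for one song.

    word_counts: dict mapping word -> number of occurrences in the song
    embeddings:  dict mapping word -> list of floats of length `dim`
    Words without an embedding are skipped (an assumption of this sketch).
    """
    total = [0.0] * dim
    for word, count in word_counts.items():
        vector = embeddings.get(word)
        if vector is None:
            continue
        for i, value in enumerate(vector):
            total[i] += count * value
    return total


# Toy 2-dimensional embeddings (made-up values, not real GloVe vectors)
toy_embeddings = {"love": [1.0, 0.5], "baby": [0.0, 2.0]}
toy_counts = {"love": 3, "baby": 1, "zzz": 4}

vec = song_vector(toy_counts, toy_embeddings, dim=2)
print(vec)  # [3.0, 3.5]
```

Because the sum is weighted by counts, longer songs get larger vectors; dividing by the total word count would give an average instead, if that is what you need downstream.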

I added some dependencies, so you need to reinstall the package:

pip install .

or just install all the dependencies manually:

pip install fuzzywuzzy python-Levenshtein pyyaml
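
fuzzywuzzy is presumably what handles the fuzzy-matching step when subsetting the embeddings. As a rough illustration of the idea only (exact matches first, fuzzy matches for the rest), the sketch below uses `difflib` from the standard library as a stand-in rather than fuzzywuzzy. The function name `match_vocabulary`, the 0.8 cutoff, and the toy words are assumptions, not the script's actual logic.

```python
import difflib


def match_vocabulary(words, embedding_words, cutoff=0.8):
    """Map each word to an embedding word: exact match first, then fuzzy.

    Words with no sufficiently close match are left out of the mapping.
    """
    available = set(embedding_words)
    candidates = sorted(available)
    mapping = {}
    for word in words:
        if word in available:
            # Exact match: the word has its own embedding
            mapping[word] = word
            continue
        # Fuzzy match: closest embedding word above the similarity cutoff
        close = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
        if close:
            mapping[word] = close[0]
    return mapping


mapping = match_vocabulary(["love", "lovin", "xqzt"], ["love", "loving", "baby"])
print(mapping)  # {'love': 'love', 'lovin': 'loving'}
```

Comparing every unmatched word against the full embedding vocabulary is slow for large vocabularies, which is one reason to save the resulting subset to a file and reuse it.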

Let me know if you have any problems running this.

valmikkpatel commented 6 years ago

Where did you download the glove dataset from?

I tried https://nlp.stanford.edu/projects/glove/ but couldn't access the website for some reason.

valmikkpatel commented 6 years ago

Ok I am done. Thanks for sharing the file.

edublancas commented 6 years ago

Was everyone else able to get the data and run the script? @aaronsadholz @jose-alvarado-guzman

aaronsadholz commented 6 years ago

I'll be working on it this afternoon. I'll let you know.

aaronsadholz commented 6 years ago

I got it working. Thanks, Eduardo.