Getting the same final dataset

edublancas commented 6 years ago

I updated the repo so we can all work on the same dataset. These are the steps to follow:

Reinstall the package (added some dependencies)
Download the new files (get_data script was updated)
Run ./bootstrap

bootstrap contains the code that was previously in the README file. The final output is four files:

1. bag_of_words.feather: our main dataset, all words (stopwords removed) + metadata
2. bag_of_words_top_1000.feather: same as 1. but only the top 1k most popular words
3. bag_of_words_top_1000_normalized.feather: same as 1. but only the top 1k most popular words, normalized
4. embeddings.feather: embeddings (dense vectors) for every song (50 dimensions)

We are all going to be using mostly file 1. Files 2, 3 and 4 are for seeing whether those smaller representations help with topic modeling, clustering and measuring similarity, so probably just @aaronsadholz and I need those. In any case, @jose-alvarado-guzman and @valmikkpatel: feel free to explore those datasets as well.
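
For anyone wondering how files 2 and 3 relate to file 1, here is a minimal sketch in plain Python. It is not the code in bootstrap: the function names are made up for illustration, songs are plain dicts instead of feather files, and the normalization (dividing each count by the song's total) is an assumption, since nothing above says which scheme is used.

```python
from collections import Counter


def top_k_words(songs, k):
    """Return the k words with the highest total count across all songs."""
    totals = Counter()
    for counts in songs:
        totals.update(counts)  # adds each song's word counts to the totals
    return [word for word, _ in totals.most_common(k)]


def restrict_to_vocab(songs, vocab):
    """Drop every word that is not in vocab from each song."""
    keep = set(vocab)
    return [{w: c for w, c in counts.items() if w in keep} for counts in songs]


def normalize(counts):
    """Assumed scheme: divide each count by the song's total count."""
    total = sum(counts.values())
    if total == 0:
        return {}  # a song with no words left stays empty
    return {w: c / total for w, c in counts.items()}


# Toy bag of words: one dict of word -> count per song (hypothetical data).
songs = [
    {"love": 3, "night": 1, "road": 1},
    {"love": 1, "night": 2},
    {"rain": 1},
]

vocab = top_k_words(songs, k=2)             # plays the role of "top 1000"
reduced = restrict_to_vocab(songs, vocab)   # analogous to file 2
normalized = [normalize(s) for s in reduced]  # analogous to file 3
```

Note that a song whose words all fall outside the top-k vocabulary ends up empty, which is worth keeping in mind when clustering on the reduced files.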

Let me know if you have any trouble running the scripts. Hopefully we can all get this done before our next meeting on Saturday.

aaronsadholz commented 6 years ago

Was able to run the scripts successfully. Thanks, @edublancas!

aaronsadholz commented 6 years ago

I don't see artist location anywhere in the updated dataset. Am I missing something?

edublancas commented 6 years ago

I probably missed something in the scripts. Will fix them now.

edublancas commented 6 years ago

Just fixed the error, thanks for letting me know. To get the location, please re-run export_track_metadata (I also updated the bootstrap script) and then the join scripts.

aaronsadholz commented 6 years ago

Is this complete? If so, I'll run my (hopefully) final topic model on the data

edublancas commented 6 years ago

I need to make some changes, working on it now

edublancas commented 6 years ago

@aaronsadholz I pushed the updated code: cleaning artist_name, artist_id and adding language.

Since computing the language takes a while, I uploaded the output to Google Drive (the one that José shared), so you only need to run the script starting on line 51.

edublancas / song-lyrics

Getting the same final dataset #24