edublancas / song-lyrics

Exploratory analysis of 200K+ song lyrics from the Million Song Dataset
https://blancas.io/song-lyrics/
MIT License

track_id/song_id handling #21

Closed by aaronsadholz 6 years ago

aaronsadholz commented 6 years ago

Looking at the unique tracks file (1,000,000 entries), we can see there are 1,000,000 unique track_ids but only 999,056 unique song_ids. This is consistent with Jose's point; however, it also means that some song_ids are shared by more than one track_id: there are 944 fewer unique song_ids than track_ids in this dataset.

I'd suggest that we remove all track_ids that share a song_id with another track. This will remove less than 0.1% of our data and eliminate any confusion on this issue.
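
For illustration, here is a minimal Python sketch of the suggestion above (the actual analysis lives in the R notebook linked below, so this is not that code). It assumes each line of the unique tracks file looks like track_id<SEP>song_id<SEP>artist<SEP>title, and it reads "remove" as dropping every track whose song_id appears more than once. The helper names and the sample rows are made up for the example.

```python
from collections import Counter


def parse_line(line, sep="<SEP>"):
    """Return (track_id, song_id) from one line of the unique tracks file.

    Assumes the layout track_id<SEP>song_id<SEP>artist<SEP>title.
    """
    track_id, song_id, *_ = line.rstrip("\n").split(sep)
    return track_id, song_id


def drop_shared_songs(pairs):
    """Keep only the (track_id, song_id) pairs whose song_id is unique."""
    pairs = list(pairs)
    song_counts = Counter(song_id for _, song_id in pairs)
    return [(t, s) for t, s in pairs if song_counts[s] == 1]


# Hypothetical sample rows, not real entries from the dataset.
sample = [
    "TR0001<SEP>SO0001<SEP>Artist A<SEP>Title A",
    "TR0002<SEP>SO0002<SEP>Artist B<SEP>Title B",
    "TR0003<SEP>SO0002<SEP>Artist B<SEP>Title B (live)",
]

pairs = [parse_line(line) for line in sample]
n_tracks = len({t for t, _ in pairs})  # unique track_ids
n_songs = len({s for _, s in pairs})   # unique song_ids
kept = drop_shared_songs(pairs)

print(n_tracks, n_songs, len(kept))
```

On the real file, the lines would come from iterating over the opened file instead of the sample list. If we would rather keep one track per song instead of dropping all of them, the filter would need to change accordingly.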

I looked into the unique track_id and song_id counts in the following R notebook: https://github.com/edublancas/song-lyrics/blob/master/experiments/aaron/180327_analysis.Rmd