Looking at the unique tracks file (1,000,000 entries), we can see there are 1,000,000 unique track_ids but only 999,056 unique song_ids (944 fewer), so some song_ids map to more than one track_id. This is consistent with Jose's point; however, it means that track_ids are vastly underreported in this dataset.
I'd suggest that we remove all track_ids that correspond to the same song. This will remove less than 0.1% of our data and eliminate all confusion on this issue.
I looked into the unique track_id and song_id counts in the following R notebook: https://github.com/edublancas/song-lyrics/blob/master/experiments/aaron/180327_analysis.Rmd
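
For anyone who wants to reproduce the check outside the notebook, here is a minimal Python sketch of the count and of the removal I'm proposing. It is not the notebook's code. It assumes each line of the unique tracks file has the layout `track_id<SEP>song_id<SEP>artist<SEP>title`, and the function names and the four sample lines are made up for illustration. It also takes one reading of the suggestion: every track whose song_id is shared with another track gets dropped.

```python
from collections import Counter

def parse_unique_tracks(lines, sep="<SEP>"):
    """Yield (track_id, song_id) pairs.

    Assumes each line looks like track_id<SEP>song_id<SEP>artist<SEP>title.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(sep)
        yield fields[0], fields[1]

def count_ids(pairs):
    """Return (unique track_id count, unique song_id count, shared song_ids)."""
    track_ids = {track_id for track_id, _ in pairs}
    song_counts = Counter(song_id for _, song_id in pairs)
    shared = {song_id for song_id, n in song_counts.items() if n > 1}
    return len(track_ids), len(song_counts), shared

def drop_shared_songs(pairs):
    """Drop every track whose song_id is used by more than one track."""
    _, _, shared = count_ids(pairs)
    return [(t, s) for t, s in pairs if s not in shared]

# Made-up sample lines (hypothetical IDs, not taken from the dataset).
sample_lines = [
    "TR0001<SEP>SO0001<SEP>Artist A<SEP>Title 1",
    "TR0002<SEP>SO0002<SEP>Artist B<SEP>Title 2",
    "TR0003<SEP>SO0002<SEP>Artist B<SEP>Title 2 (live)",
    "TR0004<SEP>SO0003<SEP>Artist C<SEP>Title 3",
]

pairs = list(parse_unique_tracks(sample_lines))
n_tracks, n_songs, shared = count_ids(pairs)
kept = drop_shared_songs(pairs)

print(n_tracks, n_songs, sorted(shared))
print(kept)
```

On the real file, `sample_lines` would be replaced by the open file handle. If we would rather keep one track per song instead of dropping all of them, `drop_shared_songs` is the only piece that needs to change.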