Open colinmorris opened 4 years ago
Hey @colinmorris, you are right. There's a problem with AZLyrics where, as you said, multiple artist listings can lead to the same song URL, which I didn't know about and which sucks. I'm going to try to open a PR that fixes this by checking whether the song URL already exists before adding a row. Thanks!
In the latest release of the dataset, there are 74 rows corresponding to Liz Phair songs. 61 of those rows are in `azlyrics_lyrics_l.csv` under the artist name "Liz Phair". 13 are in `azlyrics_lyrics_p.csv` under "Phair, Liz". There are 11 songs which appear in both files. As far as I can tell, the lyrics, song URL, and song title are identical between the two files; the only field that differs is the artist name.
I guess this is ultimately an issue of jank on the AZLyrics side, since the site's artist directory has separate listings for 'Liz Phair' and 'Phair, Liz' (which both lead to the same URL, https://www.azlyrics.com/p/phair.html). But it would be nice if the scraping pipeline handled deduplication.
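For illustration, deduplication by song URL could look roughly like the sketch below. This is not the project's actual pipeline code: the function name `dedupe_by_url` and the column name `SONG_URL` are assumptions, so swap in whatever the real header is called.

```python
from typing import Dict, Iterable, Iterator


def dedupe_by_url(
    rows: Iterable[Dict[str, str]], url_field: str = "SONG_URL"
) -> Iterator[Dict[str, str]]:
    """Yield each row whose song URL has not been seen yet.

    `url_field` is a hypothetical column name; adjust it to the real CSV header.
    """
    seen = set()
    for row in rows:
        url = row[url_field]
        if url in seen:
            # Same song reached through a second artist listing: skip it.
            continue
        seen.add(url)
        yield row
```

This keeps whichever artist name shows up first for a given URL, which may or may not be the preferred spelling.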
I did a quick analysis and found 6,513 total rows with duplicate song urls.
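A count like that can be sketched as follows. This is a reconstruction, not the exact analysis: the function name `count_rows_with_duplicate_urls` and the column name `SONG_URL` are assumptions, and it counts every row whose URL appears more than once across the rows it is given (including the first occurrence), which is only one possible reading of "rows with duplicate song urls".

```python
from collections import Counter
from typing import Dict, Iterable


def count_rows_with_duplicate_urls(
    rows: Iterable[Dict[str, str]], url_field: str = "SONG_URL"
) -> int:
    """Count rows whose song URL is shared with at least one other row.

    `url_field` is a hypothetical column name; adjust it to the real CSV header.
    """
    counts = Counter(row[url_field] for row in rows)
    # Sum the sizes of every URL group that has more than one row.
    return sum(n for n in counts.values() if n > 1)
```

To run it over the whole dataset, the rows from all of the per-letter CSV files would need to be read (for example with `csv.DictReader`) and chained together first, since the duplicates span files.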