AlbertSuarez / azlyrics-scraper

🎵 AZLyrics scraper for getting song lyrics and publishing them to Box
https://app.box.com/s/vats4n6slxtknuaxz58mxlo6ry8v04pd?sortColumn=name&sortDirection=ASC
MIT License
19 stars 7 forks

Some songs have duplicate rows (due to artist aliases?) #5

Open colinmorris opened 4 years ago

colinmorris commented 4 years ago

In the latest release of the dataset, there are 74 rows corresponding to Liz Phair songs. 61 of those rows are in azlyrics_lyrics_l.csv under the artist name "Liz Phair". 13 are in azlyrics_lyrics_p.csv under "Phair, Liz".

There are 11 songs which appear in both files. As far as I can tell, the lyrics, song URL, and song title are identical between the two files; the only field that differs is the artist name.

I guess this is ultimately an issue of jank on the AZLyrics side, since their artist directory has separate listings for 'Liz Phair' and 'Phair, Liz' (which both lead to the same URL, https://www.azlyrics.com/p/phair.html). But it would be nice if the scraping pipeline handled deduplication.

I did a quick analysis and found 6,513 total rows with duplicate song urls.
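A count like this could be reproduced roughly as follows. This is only a sketch, not the analysis actually run: the column name `SONG_URL` and both helper functions are assumptions for illustration, and it counts every row in a duplicated group (not just the extra copies), which may differ from how the figure above was computed.

```python
import csv
from collections import Counter
from typing import Dict, Iterable, Iterator


def read_rows(paths: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield every row from each CSV file as a dict keyed by header name."""
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            yield from csv.DictReader(f)


def count_duplicate_url_rows(
    rows: Iterable[Dict[str, str]],
    url_field: str = "SONG_URL",  # assumed column name; adjust to the real header
) -> int:
    """Count rows whose song URL is shared with at least one other row."""
    counts = Counter(row[url_field] for row in rows)
    return sum(n for n in counts.values() if n > 1)
```

For example, `count_duplicate_url_rows(read_rows(["azlyrics_lyrics_l.csv", "azlyrics_lyrics_p.csv"]))` would cover the two files mentioned above; the rows have to be pooled across files because the duplicates sit under different artist initials.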

AlbertSuarez commented 4 years ago

Hey @colinmorris, you are right. There's a problem on the AZLyrics side where, as you said, multiple artist listings can lead to the same song URL, which I didn't know about and which sucks. I'm gonna try to open a PR fixing this by adding a check, before appending each row, for whether the song URL already exists. Thanks!
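A minimal sketch of what such a check might look like, assuming rows are handled as dicts with a `SONG_URL` key. Both the key name and the `dedupe_by_song_url` helper are hypothetical, not the scraper's actual code.

```python
from typing import Dict, Iterable, Iterator, Set


def dedupe_by_song_url(
    rows: Iterable[Dict[str, str]],
    url_field: str = "SONG_URL",  # assumed column name, not confirmed by the repo
) -> Iterator[Dict[str, str]]:
    """Yield only the first row seen for each song URL, skipping later repeats."""
    seen: Set[str] = set()
    for row in rows:
        url = row[url_field]
        if url in seen:
            continue  # same song already written under another artist listing
        seen.add(url)
        yield row
```

Since the duplicates land in different per-letter files (`azlyrics_lyrics_l.csv` vs. `azlyrics_lyrics_p.csv`), the set of seen URLs would need to be shared across the whole run rather than reset per file. Which artist name survives depends on scrape order, so "Liz Phair" vs. "Phair, Liz" would be decided by whichever listing is visited first.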