AlbertSuarez / azlyrics-scraper

🎵 AZLyrics scraper that collects song lyrics and publishes them to Box
https://app.box.com/s/vats4n6slxtknuaxz58mxlo6ry8v04pd?sortColumn=name&sortDirection=ASC
MIT License

csv data contains malformed rows for song 6'1 #4

Open colinmorris opened 4 years ago

colinmorris commented 4 years ago

The row in azlyrics_lyrics_l.csv looks like:

"liz phair","https://www.azlyrics.com/p/phair.html","6'1"","https://www.azlyrics.com/lyrics/lizphair/61.html","i bet you fall in bed[....]"

There's an extra double quote in the song title field, which confuses the parser in Python's csv library (and probably most others). Per the CSV RFC (RFC 4180):

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example: "aaa","b""bb","ccc"
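To make the problem concrete, here is a minimal sketch using Python's standard `csv` module. The variable names `malformed` and `escaped` are illustrative, and the row is the one quoted above with the lyrics left truncated as in the report. It assumes the default dialect, which is not strict, so the reader does not raise an error on the bad row; it just splits it into the wrong columns.

```python
import csv
import io

# The row as it currently appears in the CSV (unescaped quote after 6'1).
malformed = (
    '"liz phair","https://www.azlyrics.com/p/phair.html","6\'1"",'
    '"https://www.azlyrics.com/lyrics/lizphair/61.html",'
    '"i bet you fall in bed[....]"'
)

# The same row with the inner quote doubled, as the RFC rule requires.
escaped = (
    '"liz phair","https://www.azlyrics.com/p/phair.html","6\'1""",'
    '"https://www.azlyrics.com/lyrics/lizphair/61.html",'
    '"i bet you fall in bed[....]"'
)

bad_row = next(csv.reader(io.StringIO(malformed)))
good_row = next(csv.reader(io.StringIO(escaped)))

# The malformed row does not split into the expected 5 columns,
# because the title field swallows the delimiter that follows it.
print(len(bad_row), bad_row[2])

# The escaped row parses into 5 columns and the title keeps its quote.
print(len(good_row), good_row[2])
```

Passing `strict=True` to `csv.reader` should make the malformed row raise `csv.Error` instead of being silently mis-split, which can help when scanning the files for other affected rows.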

(btw, thank you for publishing this dataset! It's sorely needed.)

AlbertSuarez commented 3 years ago

Hey @colinmorris, thanks for letting me know, and sorry for the delay. I don't know why GitHub didn't notify me about it. Regarding the issue, you are completely right. This happens because the data isn't pre-processed to skip problematic characters such as the one you mentioned ("). I'll try to submit a PR fixing this. Thanks!
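One possible shape for such a fix, as a sketch only and not the scraper's actual code: let Python's `csv.writer` do the quoting rather than building each line by hand. The helper name `write_rows` is hypothetical, and `csv.QUOTE_ALL` is an assumption chosen to match the fully quoted style of the existing files. The writer doubles any embedded double quote by default, so the title survives a round trip.

```python
import csv
import io


def write_rows(rows, out):
    """Write rows as fully quoted CSV; embedded quotes are doubled."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)


# Example row from this issue; the lyrics stay truncated as reported.
row = [
    "liz phair",
    "https://www.azlyrics.com/p/phair.html",
    "6'1\"",
    "https://www.azlyrics.com/lyrics/lizphair/61.html",
    "i bet you fall in bed[....]",
]

buffer = io.StringIO()
write_rows([row], buffer)
line = buffer.getvalue()

# Reading the written text back should return the original fields.
parsed = list(csv.reader(io.StringIO(line, newline="")))
print(line)
print(parsed == [row])
```

When writing to a real file instead of an in-memory buffer, the `csv` documentation recommends opening it with `newline=""` so the writer controls line endings.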