TUVIMEN / lyryx

A python script for downloading lyrics from multiple sources
GNU General Public License v3.0

A small question about the tool #1

Closed francqz31 closed 4 months ago

francqz31 commented 5 months ago

Hey there, amazing tool, it works really well. For some reason, though, when it downloads from Genius it doesn't get all the songs on the page. For example, https://genius.com/artists/Lil-peep/songs lists 605 songs, but running lyryx Lil-peep downloads only 125. If it could be adjusted to download from the /artists/name/songs pages, that would be really great and useful for NLP projects.

francqz31 commented 5 months ago

Edit: Yeah, it downloads from this page: https://genius.com/artists/Lil-peep/albums. The thing is that albums are sometimes not organized well on Genius, and songs are not always assigned to albums correctly. Some artists also have a lot of unreleased albums whose song lyrics exist only on the /artists/name/songs pages. Pages like https://genius.com/artists/Lil-peep/songs are better, although they have some bogus entries too.

TUVIMEN commented 5 months ago

Thanks for the notice. Getting an artist's songs from Genius has been changed to use /songs in this commit. I've also corrected the way lyrics are extracted, because highlighted text wasn't being stored, e.g. https://genius.com/Lil-peep-runaway-lyrics.

francqz31 commented 5 months ago

Thank you so much. I took a look at your scrapers, and to be honest all your scraping projects are amazing. They would be really useful for scraping data for NLP and machine learning projects. I love scraping data too :) I'm currently in the process of scraping Deezer, since their DRM protection is weak.

francqz31 commented 5 months ago

Also, you might want to add "|" to line 64, because some song names contain it and it breaks the script, am I right?

return field.strip().lower().translate(str.maketrans('','',',.()|[]{}"\'+/\\&@!?#%^*:'))

Edit: Also, when I ran lyryx Lil-peep it downloaded 573 files out of a supposed 607, which is a big leap, I should say. But is it correct? Can you try that from your end? I guess it just skipped some entries, maybe? It does still download the corrupted links and empty pages, though, the "0 kb" ones.

TUVIMEN commented 5 months ago

Arguments are split by the | character into fields, so there is never a | character inside a field. If it breaks your script, you'll have to replace it with a space or a dash before you pass the values as arguments. Empty pages on Genius, like https://genius.com/Lil-peep-and-death-plus-baby-michael-doesnt-sleep-lyrics, contain the text "Lyrics for this song have yet to be released. Please check back once the song has been released." Apparently they create lyrics pages even when they don't have the lyrics (probably for SEO). I've changed the sort method in the API so that it returns 10 more results, which is exactly how many you'll get if you manually scroll through the /songs page. 607 might be an arbitrary value, which is probably higher because of deleted lyrics.
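For anyone scripting around this, here is a minimal sketch of the behaviour described above. It assumes fields are produced by a plain split on "|" and then normalized with the expression quoted earlier in the thread; `clean_field`, `split_fields`, and `escape_pipe` are hypothetical names for illustration, not lyryx's actual functions.

```python
# Characters stripped from a field, taken from the line quoted in this thread.
PUNCTUATION = ',.()|[]{}"\'+/\\&@!?#%^*:'


def clean_field(field):
    """Normalize one field: trim, lowercase, drop punctuation."""
    return field.strip().lower().translate(str.maketrans('', '', PUNCTUATION))


def split_fields(argument):
    """Split an argument on '|' (assumed behaviour) and normalize each part."""
    return [clean_field(part) for part in argument.split('|')]


def escape_pipe(title, replacement='-'):
    """Replace '|' in a title so it is not mistaken for a field separator."""
    return title.replace('|', replacement)


if __name__ == '__main__':
    # A title containing '|' would otherwise be split into two fields.
    print(split_fields('Artist|Song | Remix'))
    print(split_fields('Artist|' + escape_pipe('Song | Remix')))
```

The second call keeps the title in a single field because the pipe is replaced before the argument is split.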

francqz31 commented 5 months ago

Yep, you were right. I just checked another artist: her discography was listed as 90 songs, but the actual count was 87, so they get the count wrong on the page. The script is perfect now.

One last thing: I would suggest supporting downloads for multiple artists, e.g. lyryx burzum lil-peep drake. It already works, but for some reason it doesn't put each artist in a separate folder.

TUVIMEN commented 5 months ago

The --directory-artist option has been added. It creates directories named in the same way as the songs, e.g. sammy-davis-jr.

francqz31 commented 4 months ago

I think that's it, the tool is complete 👍

francqz31 commented 4 months ago

@TUVIMEN One small thing: I think something is off with italics. For example, https://genius.com/Juice-wrld-10-feet-lyrics gets scraped like this:

Intro: Juice WRLD & Daniel Caesar ] You make me feel, yeah Uh, I don't wanna try, but I... So primal , and ( Woah )

The problem is that Genius uses italics markup a lot. For example, https://genius.com/Juice-wrld-x-mas-list-christmas-list-lyrics is decoded like this:

{Intro: Juice WRLD & < i>Future< /i> } {(DY Krazy< /i >)}(25726034)

TUVIMEN commented 4 months ago

Fixed it. Previously lyryx was using a simple //text() XPath for Genius, which returned an array of the text from each node. I found no way of making an exception for the br tag (which is the only thing separating the lines on Genius), so I had to reimplement the text() function with that exception. Unfortunately, to traverse both the nodes and the text of an HtmlElement, they have to be returned by the node() XPath function. I haven't found any method on this class that does that, so it runs .xpath('node()') for each element, which adds needless processing.
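To illustrate the idea of extracting text while treating br as a line break, here is a minimal sketch. It deliberately uses the standard library's html.parser instead of the XPath-based approach described above, so it is not lyryx's implementation; `LyricsText` and `text_with_breaks` are hypothetical names, and the sample markup is made up.

```python
from html.parser import HTMLParser


class LyricsText(HTMLParser):
    """Collect text content, emitting a newline for every <br> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> is the only tag that produces output; inline tags such as
        # <i> or <a> are ignored, so their text is kept without markup.
        if tag == 'br':
            self.parts.append('\n')

    def handle_data(self, data):
        self.parts.append(data)


def text_with_breaks(fragment):
    """Return the text of an HTML fragment with <br> rendered as newlines."""
    parser = LyricsText()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


if __name__ == '__main__':
    sample = ('<div>[Intro: Juice WRLD &amp; <i>Future</i>]<br>'
              'You make me <a href="#">feel</a>, yeah<br/>Uh</div>')
    print(text_with_breaks(sample))
```

Because text inside inline tags is appended in document order, italic or highlighted spans stay on the same line as the text around them, and only br starts a new line.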