geekpradd / PyLyrics

A Pythonic Implementation of lyrics.wikia.com for getting lyrics of songs

LyricWikia is not a valid community #26

Open Christo77793 opened 4 years ago

Christo77793 commented 4 years ago

I have a project that relies on this; I was hoping there would be some workaround for this.

noembryo commented 4 years ago

I don't think there can be a workaround for this..

The site was later shut down completely on September 21, 2020; it now redirects to community.fandom.com, and no data dump was generated.

I have an app (KataLib) that was using it, but now I have to find an alternative...

Musixmatch needs money for full API usage..

Looking at Genius now, but LyricsGenius needs Python 3 (I use 2.7), so it's not an option for me. There are some small scripts around for it, so I'll check those too..

Christo77793 commented 4 years ago

Hey, thank you. Wasn't aware of LyricsGenius, but I might be able to look into it and work with it. Much appreciated.

geekpradd commented 4 years ago

If this is the case then I'd have to shut down this plugin. If any of you can work on migrating the plugin to a different source, I'd be more than happy to push this upstream and onto PyPI.

noembryo commented 4 years ago

I'm trying to adapt your getLyrics method (it's the one I use) to work by scraping Genius (without an API key). It doesn't look difficult, but it has some strange behavior.. I don't get the same page every time I call the same URL!!! So no luck yet.. :o(

noembryo commented 4 years ago

This kind of works, but I can't format it better since Genius returns at least 3 different pages! One is normal HTML that uses class="lyrics" for the lyrics container, and two or more others use a class like "Lyrics__Container-sc-1ynbvzw-2" for the container. The class name changes, but that's OK with me. The main problem is that sometimes they contain all the lyrics and sometimes only part of them. I think it is because some newlines in the HTML break the tag.. :o( Anyway, here is the code I use so far:

    # needs: requests, and from bs4: BeautifulSoup, Comment
    @staticmethod
    def getLyrics(singer, song):
        BASE_URL = "https://genius.com/"
        url = "{}-{}-lyrics".format(singer, song).replace(" ", "-").capitalize()
        url = BASE_URL + url
        r = requests.get(url)
        s = BeautifulSoup(r.text, features="lxml")  # added features="lxml"

        # Get main lyrics holder
        lyrics = s.find("div", {"class": "lyrics"})
        if lyrics is None:
            lyrics = s.find("div",
                            {"class": lambda x: x and x.startswith("Lyrics__Container")})
        if lyrics is None:
            raise ValueError(
                "Song or Singer does not exist or the API does not have Lyrics")

        # Remove Scripts
        [s.extract() for s in lyrics("script")]

        # Remove Comments
        comments = lyrics.find_all(text=lambda text: isinstance(text, Comment))
        [comment.extract() for comment in comments]

        # Remove unnecessary tags
        for tag in ["span", "div", "i", "b", "a"]:
            for match in lyrics.find_all(tag):
                match.replaceWithChildren()

        # Get the output as a string and replace <br> tags with newlines
        output = lyrics.text.strip().replace("<br/>", "\n")
        # noinspection PyBroadException
        try:
            return output
        except Exception:
            return output.encode("utf-8")

Any ideas are welcomed ;o)

Edit: Keep in mind that this needs the exact artist name and title. I will update it with the search page scrape if/when the final URL works.
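
For reference, a quick usage sketch of the method above (the PyLyrics class name is an assumption taken from the original library; per the Edit note, the exact artist name and song title are required):

    # Hypothetical usage of the patched staticmethod (class name is an assumption):
    from PyLyrics import PyLyrics

    print(PyLyrics.getLyrics("Michael Jackson", "Beat It"))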

Christo77793 commented 4 years ago

Wish I could help, but I have no knowledge about web scraping.

noembryo commented 4 years ago

@Christo77793 Don't worry.. It's just a fun game for me ;o) If I make something that works decently, I'll post it here..

Christo77793 commented 4 years ago

@noembryo Thank you, much appreciated!

noembryo commented 4 years ago

Well, bad news! I did fix the scraper for the lyrics page..

    # needs: urllib, requests, and from bs4: BeautifulSoup, Comment, NavigableString
    def getLyrics(self, singer, song):
        url = "{}-{}-lyrics".format(singer, song).replace(" ", "-").capitalize()
        url = "https://genius.com/{}".format(urllib.quote(url))
        r = requests.get(url)
        output = self.parse_page4lyrics(r.text)
        # noinspection PyBroadException
        try:
            return output
        except Exception:
            return output.encode("utf-8")

    @staticmethod
    def parse_page4lyrics(page):
        s = BeautifulSoup(page, features="lxml")  # added features="lxml"

        # Get main lyrics holder
        normal_html = True
        found_lyrics = [s.find("div", {"class": "lyrics"})]
        found_lyrics = [i for i in found_lyrics if i]  # remove item if nothing found
        if not found_lyrics:
            normal_html = False
            cls = "Lyrics__Container"
            found_lyrics = s.find_all("div", {"class": lambda x: x and x.startswith(cls)})
            found_lyrics = [i for i in found_lyrics if i]  # remove item if nothing found
        if not found_lyrics:
            raise ValueError("Song or Singer does not exist or the API does not have Lyrics")

        output = []
        for lyrics in found_lyrics:
            # Remove Scripts
            [s.extract() for s in lyrics("script")]

            # Remove Comments
            comments = lyrics.find_all(text=lambda txt: isinstance(txt, Comment))
            [comment.extract() for comment in comments]

            # Remove unnecessary tags
            for tag in ["span", "div", "i", "b", "a"]:
                for match in lyrics.find_all(tag):
                    match.replaceWithChildren()

            # Get output as a string and remove non unicode characters
            # and replace <br> with newlines
            if normal_html:
                output.append(lyrics.text.strip().replace("<br/>", "\n"))
            else:
                text = lyrics.contents
                # noinspection PyCompatibility
                text = [unicode(i.string)  # Convert to normal strings
                        if type(i) == NavigableString else "<br/>" for i in text]
                # Remove more than one continuous spaces
                text = "".join([" ".join(i.split()) for i in text])
                output.append(text.replace("<br/>", "\n"))
        if len(output) == 1:
            output = output[0]
        else:
            output = "".join(output)
        # noinspection PyBroadException
        try:
            return output
        except Exception:
            return output.encode("utf-8")

... but I can't parse the search results page to get the lyrics page URL (it's generated with JavaScript). :o(

If you use "https://genius.com/Michael-jackson-beat-it-lyrics" as the URL (calling `getLyrics("Michael Jackson", "Beat It")`), you get the same results every time, even though Genius serves different pages. But without the search results, we have to know exactly what the artist's name and the song's title are, which most of the time we don't..
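
For instance, the URL line in getLyrics maps exactly that input to that address:

    # How the URL line above turns the example input into the address:
    url = "{}-{}-lyrics".format("Michael Jackson", "Beat It").replace(" ", "-").capitalize()
    # url == "Michael-jackson-beat-it-lyrics"  ->  https://genius.com/Michael-jackson-beat-it-lyrics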

Bummer! :o(

noembryo commented 4 years ago

OK, I spoke too soon.. I finally managed to sniff out the URL of the search results, and found the URL of the lyrics page!! So, to make the getLyrics method work, all you have to do is replace it with these three methods..

    # needs: urllib, requests, json, and from bs4: BeautifulSoup, Comment, NavigableString
    BASE_URL = "https://genius.com"

    def getLyrics(self, singer, song):
        query = urllib.quote_plus("{} {}".format(singer, song))
        query_url = "{}/api/search/multi?per_page=5&q={}".format(BASE_URL, query)

        url = self.get_best_result_url(query_url)
        if not url:  # Nothing found
            return
        r = requests.get(url)
        output = self.parse_page4lyrics(r.text)

        # noinspection PyBroadException
        try:
            return output
        except Exception:
            return output.encode("utf-8")

    @staticmethod
    def get_best_result_url(url):
        json_data = json.loads(requests.get(url).content)
        sections = json_data["response"]["sections"]
        try:
            best = [i for i in sections if i["type"] == "top_hit"][0]["hits"][0]
            if not best["index"] == "song":  # Rap lyrics not a song..
                return
        except IndexError:  # Nothing found
            return
        url = BASE_URL + best["result"]["path"]
        return url

    @staticmethod
    def parse_page4lyrics(page_text):
        s = BeautifulSoup(page_text, features="lxml")  # added features="lxml"

        # Get main lyrics holder
        normal_html = True
        found_lyrics = [s.find("div", {"class": "lyrics"})]
        found_lyrics = [i for i in found_lyrics if i]  # remove item if nothing found
        if not found_lyrics:
            normal_html = False
            cls = "Lyrics__Container"
            found_lyrics = s.find_all("div", {"class": lambda x: x and x.startswith(cls)})
            found_lyrics = [i for i in found_lyrics if i]  # remove item if nothing found
        if not found_lyrics:
            raise ValueError("Song or Singer does not exist "
                             "or the API does not have Lyrics")

        output = []
        for lyrics in found_lyrics:
            # Remove Scripts
            [s.extract() for s in lyrics("script")]

            # Remove Comments
            comments = lyrics.find_all(text=lambda txt: isinstance(txt, Comment))
            [comment.extract() for comment in comments]

            # Remove unnecessary tags
            for tag in ["span", "div", "i", "b", "a"]:
                for match in lyrics.find_all(tag):
                    match.replaceWithChildren()

            # Get output as a string and remove non unicode characters
            # and replace <br> with newlines
            if normal_html:
                output.append(lyrics.text.strip().replace("<br/>", "\n"))
            else:
                text = lyrics.contents
                # noinspection PyCompatibility
                text = [unicode(i.string)  # Convert to normal strings
                        if type(i) == NavigableString else "<br/>" for i in text]
                # Remove more than one continuous spaces
                text = "".join([" ".join(i.split()) for i in text])
                output.append(text.replace("<br/>", "\n"))
        if len(output) == 1:
            output = output[0]
        else:
            output = "".join(output)
        # noinspection PyBroadException
        try:
            return output
        except Exception:
            return output.encode("utf-8")
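
With the search step in place, the exact-name requirement from the earlier snippet goes away. A rough usage sketch, assuming the three methods above are patched onto this library's PyLyrics class (the class name and the instance-style call are assumptions):

    # Hypothetical usage: the Genius search endpoint resolves inexact input,
    # and getLyrics returns None when the search finds nothing.
    from PyLyrics import PyLyrics

    lyrics = PyLyrics().getLyrics("michael jackson", "beat it")
    if lyrics:
        print(lyrics)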

@geekpradd This fix is for Python 2.7.x. If you like, I could make a PR with it and add some Python 3 compatibility stuff. The rest of the script is still not working because I don't use it, but it shouldn't be that hard to fix using the JSON that get_best_result_url gets.. ;o)
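
For the Python 3 side, a minimal sketch of the kind of compatibility shim that could go with such a PR; the assumption is that only `quote_plus` and the `unicode` builtin need bridging for the methods above:

    # Minimal Python 2/3 compatibility sketch (assumption: only quote_plus and
    # the unicode builtin need bridging for the code above).
    try:  # Python 2.7
        from urllib import quote_plus
        text_type = unicode
    except ImportError:  # Python 3
        from urllib.parse import quote_plus
        text_type = str

The methods would then call `quote_plus(...)` instead of `urllib.quote_plus(...)` and `text_type(...)` instead of `unicode(...)`.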