Christo77793 opened this issue 4 years ago
I don't think there can be a workaround for this..
I have an app (KataLib) that was using it, but now I have to find an alternative...
Musixmatch requires payment for full API usage..
Looking at Genius now, but LyricsGenius needs Python 3 (I use 2.7), so it's not an option for me. There are some small scripts around for it, so I'll check them too..
Hey, thank you. Wasn't aware of LyricsGenius, but I might be able to look into it and work with it. Much appreciated.
If this is the case, then I'll have to shut down this plugin. If any of you can work on migrating the plugin to a different source, I'd be more than happy to push the change upstream and to PyPI.
I'm trying to adapt your `getLyrics` method (it's the one I use) to scrape Genius (without an API key).
It doesn't look difficult, but it shows some strange behavior..
I don't get the same page every time I request the same url!!!
So no luck yet.. :o(
This kinda works, but I can't format it better, since Genius returns at least three different pages!
One is normal HTML that uses `class="lyrics"` for the lyrics container, and two or more others use a class like `"Lyrics__Container-sc-1ynbvzw-2"` for the container.
The class name changes, but that's OK with me.
The main problem is that sometimes they contain all the lyrics and sometimes only part of them.
I think it is because of some newlines in the html that break the tag.. :o(
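You can see the flip-flopping with a quick check like this (a sketch; the URL is just an example):

```python
# Sketch: fetch the same lyrics URL a few times and report which of the
# two known container classes Genius served on each response.
import requests
from bs4 import BeautifulSoup

url = "https://genius.com/Michael-jackson-beat-it-lyrics"  # example page
for attempt in range(3):
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    if soup.find("div", {"class": "lyrics"}):
        print("attempt %d: old markup (class='lyrics')" % attempt)
    elif soup.find("div", {"class": lambda c: c and c.startswith("Lyrics__Container")}):
        print("attempt %d: new markup (Lyrics__Container...)" % attempt)
    else:
        print("attempt %d: no lyrics container found" % attempt)
```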
Anyway, here is the code I use so far:
```python
import requests
from bs4 import BeautifulSoup, Comment


@staticmethod
def getLyrics(singer, song):
    BASE_URL = "https://genius.com/"
    url = "{}-{}-lyrics".format(singer, song).replace(" ", "-").capitalize()
    url = BASE_URL + url
    r = requests.get(url)
    s = BeautifulSoup(r.text, features="lxml")  # added features="lxml"
    for string in s.strings:
        string.strip()
    # Get main lyrics holder
    lyrics = s.find("div", {"class": "lyrics"})
    if lyrics is None:
        lyrics = s.find(
            "div", {"class": lambda x: x and x.startswith("Lyrics__Container")})
    if lyrics is None:
        raise ValueError(
            "Song or Singer does not exist or the API does not have Lyrics")
    # Remove scripts
    for script in lyrics("script"):
        script.extract()
    # Remove comments
    comments = lyrics.find_all(text=lambda text: isinstance(text, Comment))
    for comment in comments:
        comment.extract()
    # Remove unnecessary tags
    for tag in ["span", "div", "i", "b", "a"]:
        for match in lyrics.find_all(tag):
            match.replaceWithChildren()
    # Get the output as a plain string
    # noinspection PyCompatibility
    output = lyrics.text.strip()
    # noinspection PyBroadException
    try:
        return output
    except Exception:
        return output.encode("utf-8")
```
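For reference, a call would look like this (a sketch; I'm assuming the method stays on the plugin's `PyLyrics` class):

```python
# Hypothetical usage -- artist and title must match Genius' URL slug exactly
print(PyLyrics.getLyrics("Michael Jackson", "Beat It"))
```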
Any ideas are welcome ;o)
Edit: Keep in mind that this needs the exact artist name and song title. I will update it with the search page scrape if/when I get the final url to work.
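Concretely, the slug is built from the artist and title like this (worked example of the `format` line above):

```python
>>> "{}-{}-lyrics".format("Michael Jackson", "Beat It").replace(" ", "-").capitalize()
'Michael-jackson-beat-it-lyrics'
```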
Wish I could help, but I have no knowledge about web scraping.
@Christo77793 Don't worry.. It's just a fun game for me ;o) If I make something that works decently, I'll post it here..
@noembryo Thank you, much appreciated!
Well, bad news! I did fix the scraper for the lyrics page..
```python
import urllib

import requests
from bs4 import BeautifulSoup, Comment, NavigableString


def getLyrics(self, singer, song):
    url = "{}-{}-lyrics".format(singer, song).replace(" ", "-").capitalize()
    url = "https://genius.com/{}".format(urllib.quote(url))
    r = requests.get(url)
    output = self.parse_page4lyrics(r.text)
    # noinspection PyBroadException
    try:
        return output
    except Exception:
        return output.encode("utf-8")


@staticmethod
def parse_page4lyrics(page):
    s = BeautifulSoup(page, features="lxml")  # added features="lxml"
    # Get main lyrics holder
    normal_html = True
    found_lyrics = [s.find("div", {"class": "lyrics"})]
    found_lyrics = [i for i in found_lyrics if i]  # remove item if nothing found
    if not found_lyrics:
        normal_html = False
        cls = "Lyrics__Container"
        found_lyrics = s.find_all("div", {"class": lambda x: x and x.startswith(cls)})
    if not found_lyrics:
        raise ValueError("Song or Singer does not exist or the API does not have Lyrics")
    output = []
    for lyrics in found_lyrics:
        # Remove scripts
        for script in lyrics("script"):
            script.extract()
        # Remove comments
        comments = lyrics.find_all(text=lambda txt: isinstance(txt, Comment))
        for comment in comments:
            comment.extract()
        # Remove unnecessary tags
        for tag in ["span", "div", "i", "b", "a"]:
            for match in lyrics.find_all(tag):
                match.replaceWithChildren()
        # Get output as a string and replace <br> with newlines
        if normal_html:
            output.append(lyrics.text.strip().replace("<br/>", "\n"))
        else:
            text = lyrics.contents
            # noinspection PyCompatibility
            text = [unicode(i.string)  # Convert to normal strings
                    if type(i) == NavigableString else "<br/>" for i in text]
            # Collapse runs of whitespace into single spaces
            text = "".join([" ".join(i.split()) for i in text])
            output.append(text.replace("<br/>", "\n"))
    if len(output) == 1:
        output = output[0]
    else:
        output = "".join(output)
    # noinspection PyBroadException
    try:
        return output
    except Exception:
        return output.encode("utf-8")
```
... but I can't parse the search results page to get the lyrics page url (it's generated with JavaScript). :o(
If you use "https://genius.com/Michael-jackson-beat-it-lyrics" as the url (calling `getLyrics("Michael Jackson", "Beat It")`), you get the same results every time, even though Genius serves different pages.
But without the search results, we have to know exactly what the artist's name and the song's title are, which most of the time we don't..
Bummer! :o(
OK, I spoke too soon..
I finally managed to sniff out the url of the search results, and found the url for the lyrics page!!
So, to make the `getLyrics` method work, all you have to do is replace it with these three methods..
BASE_URL = "https://genius.com"
def getLyrics(self, singer, song):
query = urllib.quote_plus("{} {}".format(singer, song))
query_url = "{}/api/search/multi?per_page=5&q={}".format(BASE_URL, query)
url = self.get_best_result_url(query_url)
if not url: # Nothing found
return
r = requests.get(url)
output = self.parse_page4lyrics(r.text)
# noinspection PyBroadException
try:
return output
except Exception:
return output.encode("utf-8")
@staticmethod
def get_best_result_url(url):
json_data = json.loads(requests.get(url).content)
sections = json_data["response"]["sections"]
try:
best = [i for i in sections if i["type"] == "top_hit"][0]["hits"][0]
if not best["index"] == "song": # Rap lyrics not a song..
return
except IndexError: # Nothing found
return
url = BASE_URL + best["result"]["path"]
return url
@staticmethod
def parse_page4lyrics(page_text):
s = BeautifulSoup(page_text, features="lxml") # added features="lxml"
# Get main lyrics holder
normal_html = True
found_lyrics = [s.find("div", {"class": "lyrics"})]
found_lyrics = [i for i in found_lyrics if i] # remove item if nothing found
if not found_lyrics:
normal_html = False
cls = "Lyrics__Container"
found_lyrics = s.find_all("div", {"class": lambda x: x and x.startswith(cls)})
found_lyrics = [i for i in found_lyrics if i] # remove item if nothing found
if not found_lyrics:
raise ValueError("Song or Singer does not exist "
"or the API does not have Lyrics")
output = []
for lyrics in found_lyrics:
# Remove Scripts
[s.extract() for s in lyrics("script")]
# Remove Comments
comments = lyrics.find_all(text=lambda txt: isinstance(txt, Comment))
[comment.extract() for comment in comments]
# Remove unnecessary tags
for tag in ["span", "div", "i", "b", "a"]:
for match in lyrics.find_all(tag):
match.replaceWithChildren()
# Get output as a string and remove non unicode characters
# and replace <br> with newlines
if normal_html:
output.append(lyrics.text.strip().replace("<br/>", "\n"))
else:
text = lyrics.contents
# noinspection PyCompatibility
text = [unicode(i.string) # Convert to normal strings
if type(i) == NavigableString else "<br/>" for i in text]
# Remove more than one continuous spaces
text = "".join([" ".join(i.split()) for i in text])
output.append(text.replace("<br/>", "\n"))
if len(output) == 1:
output = output[0]
else:
output = "".join(output)
# noinspection PyBroadException
try:
return output
except Exception:
return output.encode("utf-8")
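Putting it together, the flow looks like this (a sketch; `PyLyrics` as the host class is my assumption, and the JSON shape is inferred only from the fields the code above reads):

```python
# 1. getLyrics() builds a search query such as
#      https://genius.com/api/search/multi?per_page=5&q=michael+jackson+beat+it
# 2. get_best_result_url() digs through the returned JSON, which (judging
#    from the fields used above) looks roughly like:
#      {"response": {"sections": [
#          {"type": "top_hit",
#           "hits": [{"index": "song",
#                     "result": {"path": "/Michael-jackson-beat-it-lyrics"}}]}]}}
# 3. parse_page4lyrics() scrapes the lyrics page that "path" points to.
lyrics = PyLyrics().getLyrics("Michael Jackson", "Beat It")
if lyrics:
    print(lyrics)
```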
@geekpradd This fix is for Python 2.7.x. If you like, I could make a PR with it and add some Python 3 compatibility stuff.
I haven't fixed the rest of the script because I don't use it, but it shouldn't be that hard to fix using the json that `get_best_result_url` gets.. ;o)
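For the Python 3 compatibility part, these are the two spots I'd expect to need a shim (just a sketch of the idea, untested):

```python
# Possible Python 2/3 shim (an assumption about the port, not tested code):
try:
    from urllib import quote_plus  # Python 2
except ImportError:
    from urllib.parse import quote_plus  # Python 3

try:
    text_type = unicode  # Python 2: parse_page4lyrics() calls unicode()
except NameError:
    text_type = str  # Python 3: str is already unicode
```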
I have a project that relies on this; I was hoping there would be some workaround for this.