lyrics: Tolerate more tags when gathering lyrics texts in Google backend

dannn-o commented 3 years ago

Problem

I finally got beets working again under Linux Mint 20.2 with the recently-released version beets 1.5.0 (thank you to everyone involved with the beats team that helped make that long-awaited update happen!). I am a user of the Lyrics plug-in, and took the opportunity to obtain a Google API key to help improve the lyrics scraping.

I've imported a handful of albums so far, and while most of the functionality, including lyrics scraping, is working well, there are times when portions of the lyrics are being missed for certain songs. Though I continue to investigate this, the testing I have been able to do so far (combined with including the verbose (-vv) mode parameter) seems to indicate this primarily happens with AZLyrics (www.azlyrics.com).

One album I have been using that consistently reproduces the error is Molly Hatchet's first album (self-titled "Molly Hatchet"). Verbose mode of the command output shows that Google used AZLyrics for most songs, with SongLyrtics (www.songlyrics.com) being used in a few cases. For the lyrics being provided from AZLyrics, a few songs did not have any issues at all:

Gator Country
Dreams I'll Never See

However, quite a few had only a portion of the complete lyrics, as shown in Puddletag. These include:

Bounty Hunter
The Creeper
The Price You Pay
I'll Be Runing
Cheatin' Woman
Trust Your Old Friend

I haven't tried searching through source code on Github, but one thing I noticed anecdotally is that the songs having this problem seemed to have identifiers within the lyrics that used brackets ("[" and "]") to denote a subset of the lyrics, such as "[Chorus:]" or "[LEAD BREAK]". See The Price You Pay as an example. With that song, beets returned lyrics starting after the "[LEAD BREAK]" identifier:

I shot a man in Macon over a poker game,
I killed another in Atlanta just to build my fame,
...

The 2 songs on the album not encountering the lyrics problem did not have any such bracketed-identifiers in their lyrics. If I can provide any documentation in addition to the configuration file (below), please let me know what might help.

Setup

OS: Linux Mint 20.2
Python version: 3.8.10
beets version: 1.5.0
Turning off plugins made problem go away (yes/no): (N/A, since lyrics are provided by a plug-in)

My configuration (output of beet config) is:

directory: ~/Music
library: ~/data/musiclibrary.blb

id3v23: yes
plugins: lyrics fetchart zero embedart scrub

import:
    move: yes
    write: yes
    log: /home/dan/Music/import-log.txt
    timid: yes

zero:
    fields: comments day

lyrics:
    force: yes
    google_API_key: <My key here - not included to avoid potential hacking>
    sources: musixmatch google

embedart:
    maxwidth: 300
    remove_art_file: yes

sampsyo commented 3 years ago

Thanks for the detailed report!! As you probably know, the Google backend is very heuristic, in the sense that it makes lots of guesses about how pages are formatted. This page you've linked to seems like it shouldn't be that hard to scrape, however—it looks like it's formatted pretty normally.

For what it's worth, the place to start digging into the code would be this function: https://github.com/beetbox/beets/blob/aad5253c875d13c396a43c80b30ad9cd3aab0dba/beetsplug/lyrics.py#L540-L562

That's where the heuristic scraper attempts to identify what looks like lyrics on a page. Printing out the intermediate values in that function might reveal what's going wrong for this example.

dannn-o commented 3 years ago

Though I don't know Python coding (yet), I'm willing to dig into the code and try some debugging to narrow the scope of the problem. I'm not sure if you can suggest any programs that can be used to step (execute) through lines of code, display variable values, etc., but if not, I'd appreciate some guidance on the best way to print out the intermediate string values from the 'Lyrics' module.

And, though I admit I often struggle with interpreting regular expressions in Linux, my gut feeling is that maybe HTML code involving bracketed sections of the lyrics is being dropped with this: https://github.com/beetbox/beets/blob/aad5253c875d13c396a43c80b30ad9cd3aab0dba/beetsplug/lyrics.py#L535-L537

sampsyo commented 3 years ago

There certainly are debuggers that could print things out during execution—the main one to recommend is pdb. It’s pretty easy to use, overall!

But perhaps an even easier way to get started is with just plain print-debugging. If you throw in lines like print(‘html is’, html) and run the plugin, you can get a very detailed view about exactly what values are showing up.

dannn-o commented 3 years ago

Cool - sounds easy enough! I will try to carve out time during the next few days to do some trouble-shooting and will post back once I have more data. And, thanks for the 'pdb' suggestion...will likely be worth learning about if I plan to dive deeper into learning Python (or, if I can't seem to get what I'm hoping to using just "print").

dannn-o commented 3 years ago

Quick update with minor progress: I added some "print" lines before and after line 536 I suspected above, and the first lesson I learned with Python is that indentation is a bit picky. I kept getting errors like

IndentationError: unindent does not match any outer indentation level

I finally found I had to be consistent with using spaces instead of tabs! (I normally use Bluefish as my HTML editor).

Seems, though, that the complete set of lyrics was in tact both before and after the scrape/merge completed, so I think I have to look a bit further 'downstream'. Guess I'll take a fresh look tomorrow.

sampsyo commented 3 years ago

Ah yes, that is certainly a frustration when getting started editing Python! I'm glad you sorted that out.

dannn-o commented 3 years ago

Though I had limited time tonight, I'm pretty confident that the section of lyrics is being lost during execution of this line of code: https://github.com/beetbox/beets/blob/aad5253c875d13c396a43c80b30ad9cd3aab0dba/beetsplug/lyrics.py#L558

A quick Internet search seems to indicate "stripped_strings" is part of the Beautiful Soup library, but that's as far as I've got time for tonight. Thought I'd quickly post this, though, in case someone with Python experience might spot something obvious, or can suggest a more targeted next step. Otherwise, I plan to research this a bit more tomorrow.

sampsyo commented 3 years ago

Got it. What that does, for what it's worth, is take the longest text blob on the page. So I think the underlying problem here is that the logic is somehow breaking the lyrics into two shorter runs of text, and then the first one "wins" because it's longest. I suppose this means we need additional logic to somehow merge together elements that make up parts of a longer listing of lyrics, which would come earlier in this function… I wonder if @Kraymer, the original brains behind this backend, happens to be around to comment?

dannn-o commented 3 years ago

That's precisely what it seemed like when I was reviewing the data. Even now having had some time to look through the Beautiful Soup documentation, I'm still unclear on how the logic seems to expect a single text blob to be extracted at that point. Until @Kraymer has a chance to hopefully comment on this, my next planned step was going to be running through another import, and this time include one of the songs for which the lyrics processing seems to keep everything in tact. Then, make a comparison of how the 'soup' object is structured for each before the stripped_strings function is invoked.

dannn-o commented 3 years ago

So, I imported a song off of the album that doesn't encounter the problem, and as I think was expected from analysis above, the entire lyrics' text was contained within a single string. That's why the "stripped_strings" function was successful, because there was only one text blob in competition (actually, there was a 2nd string identifying the album, song, and AZLyrics.com site, which was quite small in comparison).

I analyzed lyrics for the song I've been using to reproduce the problem, and the lyrics definitely are broken up into 6 different blobs (not counting the album/song/AZlyrics string) before the "stripped_strings" function is invoked. In looking at the lyrics at the azlyrics.com web site, the points at which the lyrics separate into the 6 blobs occur with the bracketed text I mentioned in my original post (for example, where [Chorus:] or [LEAD BREAK] appear within the lyrics' text).

I think there could be 2 approaches: Either preceding code could be tweaked to try and prevent the lyrics from splitting up, or if that proves difficult, logic could be added to combine the resulting blobs into one string. I think the former option would probably be the preferred one. In this particular case, the bracketed text is surrounded by italics elements ( and ), so those tags could be the cause rather than the brackets.

I've hit my limit for tonight, but plan to next see if I can find where and how the lyrics' fracturing happens.

dannn-o commented 3 years ago

From additional testing I've done so far, it seems the code is, in fact, expecting the entire set of lyrics to be contained within one of the soup string elements by the time the "stripped_strings" function is invoked. Backing up in the code a bit, I found the fracturing of lyrics happens as a result of the "try_parse_html function that invokes the "SoupStrainer" function within it: https://github.com/beetbox/beets/blob/aad5253c875d13c396a43c80b30ad9cd3aab0dba/beetsplug/lyrics.py#L553

When comparing the HTML code for a song processed successfully against a song that has lyrics omitted, the basic structure in both cases appears the same....lyrics are in paragraph blocks in succession, separated by single spaces. However, it seems that any entries within the lyrics containing HTML tags causes the try_parse_html / SoupStrainer logic to break the following section of lyrics text into a separate string entity in the soup array. One example in the AZlyrics link I provided above is:

[Chorus:]

At this point, I have a lot of self-education to do before I think I'll be able to break things down further. I'm hoping some of these details will help someone with a bit more knowledge than I propose a possible code fix. In case it can help, I have text files showing the HTML code before the try_parse_html call, along with the resulting strings in the soup array after the call, for both a song encountering the problem and a song not encountering it.

One thing I noticed...not sure if it could be helpful or not...there appears to be different processing for the various backends (Google API, Genius, Tekstowo, etc), and I noticed a comment with code in the Genius processing that seemed like it may be dealing with a similar situation:

https://github.com/beetbox/beets/blob/aad5253c875d13c396a43c80b30ad9cd3aab0dba/beetsplug/lyrics.py#L400-L401

@sampsyo and @Kraymer - I hope this info is helpful.

sampsyo commented 3 years ago

Quite helpful; thanks! I think the next step, for anyone interested in nailing this down, will be to look specifically for tags like  and to strip them out before extracting the text chunks—since it seems clear that tags like this are still part of the main lyrics body.

sampsyo commented 3 years ago

I should probably add some meta-commentary here: the Google backend, unlike the other lyrics backends in this plugin, is necessarily very heuristic. That is, there will never be a consistent set of rules that works on all lyrics pages on the Web; all we can do is our best: iteratively improve the parser when we run across examples that need help and are fixable. But there are an infinite number of such fixes that are possible and perfection, in the end, is unattainable.

dannn-o commented 3 years ago

Point well taken. Before discovering Beets (and Linux), I had been using MediaMonkey for ID3 tag management, and it had a plug-in named Lyricator that was fairly decent. I learned then that scraping lyrics is inherently an inexact science (even when dealing with only one source, which is itself fairly rare), but I'm definitely a fan of embedded lyrics since so many MP3 players now have the capability to display them.

I'll continue assisting as I can to help with this feature and making the Google backend processing more robust. It may be a bit of an impractical request at this point in time, but I've often thought that an interface for selecting the best match of lyrics candidates when multiple sources are found, similar to what is presented when multiple candidates are found for an album search within MusicBrainz, could be a nice feature (maybe could be enabled / controlled with Lyrics options within the config file).

Kraymer commented 3 years ago

Good detective work @dannn-o Long time I wrote this code but indeed the main idea was that if you're on a lyrics website, whatever it is, you can expect the longest stream of text to correspond to the lyrics of the page you're on. So last @sampsyo comment rings true. And any potential improvement should probably be backed up by a set of unit tests on a pool of selected songs (where you add a song when it fails parsing) so you're guaranteed that your new heuristic don't fix a case while breaking another.

dannn-o commented 3 years ago

Thanks for joining in, @Kraymer . I totally understand your comments about thorough testing being needed to ensure there is no regression to other functionality. I believe I have a decent set of candidate albums that may work well for this, and am willing to help test if / when the time comes.

I don't want to veer too far off topic, but thought I'd bounce this off of you and possibly @sampsyo: I commented out my Google API key and sources entries in the config.yaml file to revert back to defaults, and noticed that the tekstowo.pl site had a high success rate of obtaining lyrics with the test albums I had been using. However, the lyrics consistently seem to have some Polish text left in at as headers / footers:

Tekst piosenki:
<lyrics for song>
Historia edycji tekstu

Another issue mentioned this in passing (#3904), but it was a secondary problem not pursued. Since, in this case, the code is dealing with only one of the backends (tekstowo), I'm wondering if the Polish header/footer text could easily be filtered out. I admit I have not pursued digging into that code, and I know it's a bit unrelated, but I'd be glad to open a separate issue for that if you think it is merited.

Just want to say thanks again to you both, and the 'cast of thousands' who help develop and support beets. It really is a terrific app.

beetbox / beets

lyrics: Tolerate more tags when gathering lyrics texts in Google backend #4043

Problem

Setup