Open anarcat opened 7 years ago
i researched bloom filters for this recently and found two main implementations: pybloom, which seems abandoned but has an active fork (that doesn't seem to have been picked up by distros, btw), and pybloomfiltermmap, a Cython-based approach using mmap which is faster by a few orders of magnitude.
i'm not sure this is such a critical path that we'd need that mmap version, but i guess this is something to consider...
a bloom filter would look something like this:
diff --git a/beetsplug/lyrics.py b/beetsplug/lyrics.py
index 57265a46..871c52b3 100644
--- a/beetsplug/lyrics.py
+++ b/beetsplug/lyrics.py
@@ -32,6 +32,8 @@
 import six
 from six.moves import urllib
 
+from pybloom import ScalableBloomFilter
+
 try:
     from bs4 import SoupStrainer, BeautifulSoup
     HAS_BEAUTIFUL_SOUP = True
@@ -641,6 +643,9 @@ def __init__(self):
         self.config['google_engine_ID'].redact = True
         self.config['genius_api_key'].redact = True
 
+        # bloom filter of non-existent lyrics
+        self.bf = ScalableBloomFilter()
+
         # State information for the ReST writer.
         # First, the current artist we're writing.
         self.artist = u'Unknown artist'
@@ -858,12 +863,17 @@ def get_lyrics(self, artist, title):
         """Fetch lyrics, trying each source in turn. Return a string or
         None if no lyrics were found.
         """
+        # if we already failed to find that pair, just skip the request
+        if artist + '-' + title in self.bf:
+            return
         for backend in self.backends:
             lyrics = backend.fetch(artist, title)
             if lyrics:
                 self._log.debug(u'got lyrics from backend: {0}',
                                 backend.__class__.__name__)
                 return _scrape_strip_cruft(lyrics, True)
+        # remember our failure
+        self.bf.add(artist + '-' + title)
 
     def append_translation(self, text, to_lang):
         import xml.etree.ElementTree as ET
it significantly speeds up the startup here, at least.
Hmm; good point. Here are a few scattered comments:
- The general idea seems right: keep a set of failed matches to avoid repeating them. A Bloom filter helps mitigate storage costs for large sets by approximating, but I'm guessing the set of failed searches wouldn't grow prohibitively large.
- Also, remember that a Bloom filter can return True for X in self.bf even when you have never said self.bf.add(X). (Of course, it is not allowed to return False when you have added the item.) In general, a Bloom filter is usually useful as a quick pre-check before proceeding to checking an exhaustive set for full precision.

yep, i definitely got the bloom filter backwards here, sorry. we should remember success and exit if the element is not in the filter.
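That one-sided error is easy to demonstrate with a from-scratch toy filter (illustrative code only, not any of the libraries discussed in this thread):

```python
import hashlib

class TinyBloom:
    """A toy Bloom filter: k hashed bit positions in an m-bit array.
    It may report false positives ("maybe present"), but never false
    negatives: anything added is always reported as present."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # derive k independent bit positions from a cryptographic hash
        for i in range(self.k):
            h = hashlib.sha256(('%d:%s' % (i, item)).encode()).digest()
            yield int.from_bytes(h[:8], 'big') % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # if any bit is unset, the item was definitely never added
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = TinyBloom()
bf.add('Nirvana-Lithium')
print('Nirvana-Lithium' in bf)  # True: added items are always found
```

A lookup on a key that was never added can still return True once enough bits collide, which is exactly why recording failures (rather than successes) in the filter can silently skip lyrics that would have been found.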
i was wondering if a set would be good enough. maybe. it's just that there are collections with a lot of songs out there... it's around 50k here. it starts to scale up quite a bit. i was wondering if we could just abuse the format of the lyrics column instead and delegate to sqlite (which does use bloom filters itself, i believe - if not, it should).
filtering on empty combinations is one thing, but there are also actual duplicates across compilations and so on that we could use.
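The "delegate to sqlite" idea could be sketched with a sentinel value in the lyrics column, so failed searches are filtered out by a plain SQL query on the next run. Table and column names here are made up for illustration; this is not beets' actual schema:

```python
import sqlite3

# sentinel meaning "searched before, no lyrics found" (an assumption:
# beets would have to reserve some value for this)
NOT_FOUND = ''

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE items (artist TEXT, title TEXT, lyrics TEXT)')
db.execute("INSERT INTO items VALUES ('A', 'Song 1', NULL)")        # never searched
db.execute("INSERT INTO items VALUES ('B', 'Song 2', ?)", (NOT_FOUND,))  # failed before
db.execute("INSERT INTO items VALUES ('C', 'Song 3', 'la la la')")  # has lyrics

# only items that were never searched need a network request
todo = db.execute('SELECT artist, title FROM items '
                  'WHERE lyrics IS NULL').fetchall()
print(todo)  # [('A', 'Song 1')]
```

This keeps the negative cache persistent across runs for free, at the cost of overloading the meaning of the lyrics column.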
Sure, beets does handle lots of large collections. But I'd hazard a guess that it's a minority of one's library that will fail to match, meaning that we wouldn't have to keep track of all 50k songs in such a set. It's worth measuring, at least! :rocket:
okay let's see here:
$ beet -l library.db stats
Tracks: 26939
Total time: 10.4 weeks
Approximate total size: 1012.3 GiB
Artists: 1614
Albums: 2123
Album artists: 793
let's bump that up to 100k songs, just for fun. let's use this benchmarking script:
#!/usr/bin/python3

import datetime
import logging
import os
import random
import string
import sys

import humanize
import psutil
from pybloom import ScalableBloomFilter, BloomFilter
from pybloomfilter import BloomFilter as BloomFilterMmap


class Timer(object):
    """track time and resources used since the object's creation"""

    def __init__(self):
        """initialize the timestamp"""
        self.stamp = datetime.datetime.now()

    def times(self):
        """return a string describing CPU usage"""
        return 'user %s system %s chlduser %s chldsystem %s' % os.times()[:4]

    def rss(self):
        process = psutil.Process(os.getpid())
        return process.memory_info().rss

    def memory(self):
        return 'RSS %s' % humanize.naturalsize(self.rss())

    def diff(self):
        """a timedelta between the creation of the object and now"""
        return datetime.datetime.now() - self.stamp

    def __str__(self):
        """return a string representing the time passed and resources used"""
        return 'elapsed: %s (%s %s)' % (str(self.diff()),
                                        self.times(),
                                        self.memory())


def randomwords(n, length=10):
    for c in range(n):
        yield ''.join(random.choice(string.ascii_lowercase)
                      for i in range(length))


def load_set(n):
    s = set()
    for word in randomwords(n):
        s.add(word)
    return s


def load_bloom(n):
    bf = ScalableBloomFilter()
    for word in randomwords(n):
        bf.add(word)
    return bf


def BloomFilterMmapWrapper():
    return BloomFilterMmap(100000, 0.1, 'filter.bloom')


def BloomFilterWrapper():
    return BloomFilter(capacity=100000, error_rate=0.1)


def run_tests(n):
    logging.basicConfig(level=logging.DEBUG, format='%(message)s')
    for tool in (set,
                 BloomFilterMmapWrapper,
                 BloomFilterWrapper,
                 ScalableBloomFilter):
        timer = Timer()
        s = tool()
        for word in randomwords(n):
            s.add(word)
        logging.info('%r loaded %d entries: %s', tool, n, timer)
        timer = Timer()
        for word in randomwords(n):
            if word in s:
                pass
        logging.info('%r %d lookups: %s', tool, n, timer)


if __name__ == '__main__':
    run_tests(int(sys.argv[1]))
and let's run it:
$ python set-tests.py 100000
<type 'set'> loaded 100000 entries: elapsed: 0:00:00.680757 (user 0.72 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 26.5 MB)
<type 'set'> 100000 lookups: elapsed: 0:00:00.664798 (user 1.39 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 27.1 MB)
<function BloomFilterMmapWrapper at 0x7f3763fa1848> loaded 100000 entries: elapsed: 0:00:00.670700 (user 2.06 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.0 MB)
<function BloomFilterMmapWrapper at 0x7f3763fa1848> 100000 lookups: elapsed: 0:00:00.657779 (user 2.72 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.0 MB)
<function BloomFilterWrapper at 0x7f3763fa18c0> loaded 100000 entries: elapsed: 0:00:01.274390 (user 3.98 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.0 MB)
<function BloomFilterWrapper at 0x7f3763fa18c0> 100000 lookups: elapsed: 0:00:01.093647 (user 5.07 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.0 MB)
<class 'pybloom.pybloom.ScalableBloomFilter'> loaded 100000 entries: elapsed: 0:00:07.461861 (user 12.52 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.3 MB)
<class 'pybloom.pybloom.ScalableBloomFilter'> 100000 lookups: elapsed: 0:00:07.427330 (user 19.91 system 0.03 chlduser 0.0 chldsystem 0.0 RSS 19.3 MB)
observations:
- the set datastructure is very efficient, especially considering it actually holds a copy of all that data (whereas bloom filters only say if the data has been seen or not)
- set and BloomFilterMmap are roughly as fast, both of them. one has to wonder if set doesn't use a bloom filter internally ;) (in fact, it uses a hashtable)
- the set datastructure uses up more memory - about double
- pybloom is about 2 times slower than the mmap version, which isn't so bad considering it is pure python
- the ScalableBloomFilter from pybloom is horribly slow! presumably this is because it copies the datastructure multiple times as it scales up, but that doesn't explain why lookups are that slow

If we go up one more order of magnitude, to 1 million songs, results are roughly similar for most tools, except the ScalableBloomFilter, which explodes to 87 seconds execution time (instead of 6 seconds for set) - still around an order of magnitude slower, however. set also uses up way more memory here as well - 128MB vs ~48MB for the bloom filters. Nothing to worry about there, I would say, although that memory usage also scales up with the song length - we have been very conservative here in guessing a "10 character" length for the test, which disproportionately favors the set structure. However, bumping up to 100 characters doesn't yield significant impacts on the set memory usage.

Anyways. All this to show that set is fast enough and there's no reason to introduce a third-party datastructure/dependency here, if we do want to have some sort of caching.
results with 100 character words and 1M entries:
$ python set-tests.py 1000000
<type 'set'> loaded 1000000 entries: elapsed: 0:00:56.268277 (user 56.18 system 0.08 chlduser 0.0 chldsystem 0.0 RSS 219.3 MB)
<type 'set'> 1000000 lookups: elapsed: 0:00:55.556960 (user 111.71 system 0.09 chlduser 0.0 chldsystem 0.0 RSS 227.4 MB)
<function BloomFilterMmapWrapper at 0x7fd285180848> loaded 1000000 entries: elapsed: 0:00:56.018806 (user 167.63 system 0.11 chlduser 0.0 chldsystem 0.0 RSS 48.9 MB)
<function BloomFilterMmapWrapper at 0x7fd285180848> 1000000 lookups: elapsed: 0:00:55.751069 (user 223.31 system 0.11 chlduser 0.0 chldsystem 0.0 RSS 48.9 MB)
<function BloomFilterWrapper at 0x7fd2851808c0> loaded 1000000 entries: elapsed: 0:01:02.873864 (user 286.16 system 0.11 chlduser 0.0 chldsystem 0.0 RSS 48.8 MB)
<function BloomFilterWrapper at 0x7fd2851808c0> 1000000 lookups: elapsed: 0:01:01.404508 (user 347.55 system 0.12 chlduser 0.0 chldsystem 0.0 RSS 48.8 MB)
<class 'pybloom.pybloom.ScalableBloomFilter'> loaded 1000000 entries: elapsed: 0:02:33.855005 (user 501.24 system 0.14 chlduser 0.0 chldsystem 0.0 RSS 52.6 MB)
<class 'pybloom.pybloom.ScalableBloomFilter'> 1000000 lookups: elapsed: 0:02:30.389974 (user 651.39 system 0.14 chlduser 0.0 chldsystem 0.0 RSS 52.6 MB)
the set memory usage does get a little prohibitive then...
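For perspective, the standard sizing formula for Bloom filters, m = -n ln(p) / (ln 2)^2, shows the filter's bit array itself stays tiny even at a million entries; the ~48MB RSS measured above is presumably mostly interpreter and library overhead:

```python
import math

def bloom_bits(n, p):
    """optimal number of bits for n items at false-positive rate p"""
    return math.ceil(-n * math.log(p) / math.log(2) ** 2)

# 1 million entries at a 10% error rate, matching the benchmark above
bits = bloom_bits(1000000, 0.1)
print(bits)              # about 4.8 million bits...
print(bits / 8 / 2**20)  # ...i.e. roughly 0.57 MiB of actual filter data
```

Note that the filter size depends only on the number of entries and the error rate, not on the key length, which is why the bloom filter RSS barely moves between the 10-character and 100-character runs while the set's memory grows with the strings it stores.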
Awesome! I’m a sucker for quantitative measurement. I would never have guessed that the scalable Bloom filter implementation would be so slow, for example… it’s hard to think of a circumstance where that would be useful.
Anyway, sounds like a normal cache with a set for unmatched artist/title pairs is the way to go. We could even still special-case the empty strings to avoid needlessly searching on first lookup.
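That design could look something like the following sketch; the names and structure here are illustrative, not the actual plugin code:

```python
class LyricsFetcher:
    """Sketch of the proposed design: a plain set as a negative cache,
    plus a special case for empty artist/title strings."""

    def __init__(self, backends):
        self.backends = backends      # callables taking (artist, title)
        self.failed = set()           # (artist, title) pairs that found nothing

    def get_lyrics(self, artist, title):
        # special-case empty strings: never worth a network request
        if not artist or not title:
            return None
        # a tuple key avoids the 'a-b' vs 'a' + '-' + 'b' ambiguity of
        # concatenated strings
        key = (artist, title)
        if key in self.failed:
            return None
        for backend in self.backends:
            lyrics = backend(artist, title)
            if lyrics:
                return lyrics
        self.failed.add(key)          # remember the failure
        return None

# quick demonstration with a fake backend that never finds anything
calls = []
def fake_backend(artist, title):
    calls.append((artist, title))
    return None

f = LyricsFetcher([fake_backend])
f.get_lyrics('A', 'T')
f.get_lyrics('A', 'T')  # cached failure: no second backend call
print(len(calls))  # 1
```

Unlike the Bloom filter version, a set can never produce a false "already failed" answer, and the empty-string check means those items never even consult the cache.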
I thought it could be a good idea to optimize lyrics searches. here, when i want to fetch all the lyrics in my collection, the first results are like this:
a few things come to mind here:
Network access is expensive, way more than disk or memory. We should limit it as much as possible, if only to avoid being blocked at the provider (e.g. #2632).
While we could try to tweak our searches to avoid certain patterns (say, empty artists or titles), it would seem even better to use a bloom filter for this purpose. It would delegate to the backends the responsibility of filtering out ridiculous requests, but we would at least only make them once. And since the resulting search for items A and B may be the same, we would also avoid sending duplicate requests to the backend.
Opinions?