Open anarcat opened 7 years ago
i researched bloom filters for this recently and found two main implementations: pybloom, which seems abandoned but has an active fork (that doesn't seem to have been picked up by distros, btw), and pybloomfiltermmap, a Cython-based approach using mmap which is faster by a few orders of magnitude.
i'm not sure this is such a critical path that we'd need that mmap version, but i guess this is something to consider...
a bloom filter would look something like this:
diff --git a/beetsplug/lyrics.py b/beetsplug/lyrics.py
index 57265a46..871c52b3 100644
--- a/beetsplug/lyrics.py
+++ b/beetsplug/lyrics.py
@@ -32,6 +32,8 @@
 import six
 from six.moves import urllib
 
+from pybloom import ScalableBloomFilter
+
 try:
     from bs4 import SoupStrainer, BeautifulSoup
     HAS_BEAUTIFUL_SOUP = True
@@ -641,6 +643,9 @@ def __init__(self):
         self.config['google_engine_ID'].redact = True
         self.config['genius_api_key'].redact = True
 
+        # bloom filter of non-existent lyrics
+        self.bf = ScalableBloomFilter()
+
         # State information for the ReST writer.
         # First, the current artist we're writing.
         self.artist = u'Unknown artist'
@@ -858,12 +863,17 @@ def get_lyrics(self, artist, title):
         """Fetch lyrics, trying each source in turn. Return a string or
         None if no lyrics were found.
         """
+        # if we already failed to find that pair, just skip the request
+        if artist + '-' + title in self.bf:
+            return
         for backend in self.backends:
             lyrics = backend.fetch(artist, title)
             if lyrics:
                 self._log.debug(u'got lyrics from backend: {0}',
                                 backend.__class__.__name__)
                 return _scrape_strip_cruft(lyrics, True)
+        # remember our failure
+        self.bf.add(artist + '-' + title)
 
     def append_translation(self, text, to_lang):
         import xml.etree.ElementTree as ET
it significantly speeds up the startup here, at least.
Hmm; good point. Here are a few scattered comments:
- The general idea seems right: keep a set of failed matches to avoid repeating them. A Bloom filter helps mitigate storage costs for large sets by approximating, but I'm guessing the set of failed searches wouldn't grow prohibitively large.
- Also, remember that a Bloom filter can return True for X in self.bf even when you have never said self.bf.add(X). (Of course, it is not allowed to return False when you have added the item.) In general, a Bloom filter is usually useful as a quick pre-check before proceeding to checking an exhaustive set for full precision.

yep, i definitely got the bloom filter backwards here, sorry. we should remember success and exit if the element is not in the filter.
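That one-sided error is easy to demonstrate with a from-scratch toy filter (illustrative code only, not any of the libraries discussed in this thread):

```python
import hashlib

class TinyBloom:
    """A toy Bloom filter: k hashed bit positions in an m-bit array.
    It may report false positives ("maybe present"), but never false
    negatives: anything added is always reported as present."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # derive k independent bit positions from a cryptographic hash
        for i in range(self.k):
            h = hashlib.sha256(('%d:%s' % (i, item)).encode()).digest()
            yield int.from_bytes(h[:8], 'big') % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # if any bit is unset, the item was definitely never added
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = TinyBloom()
bf.add('Nirvana-Lithium')
print('Nirvana-Lithium' in bf)  # True: added items are always found
```

A lookup on a key that was never added can still return True once enough bits collide, which is exactly why recording failures (rather than successes) in the filter can silently skip lyrics that would have been found.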
i was wondering if a set would be good enough. maybe. it's just that there are collections with a lot of songs out there... it's around 50k here. it starts to scale up quite a bit. i was wondering if we could just abuse the format of the lyrics column instead and delegate to sqlite (which does use bloom filters itself, i believe - if not, it should).
filtering on empty combinations is one thing, but there are also actual duplicates across compilations and so on that we could use.
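The "delegate to sqlite" idea could be sketched with a sentinel value in the lyrics column, so failed searches are filtered out by a plain SQL query on the next run. Table and column names here are made up for illustration; this is not beets' actual schema:

```python
import sqlite3

# sentinel meaning "searched before, no lyrics found" (an assumption:
# beets would have to reserve some value for this)
NOT_FOUND = ''

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE items (artist TEXT, title TEXT, lyrics TEXT)')
db.execute("INSERT INTO items VALUES ('A', 'Song 1', NULL)")        # never searched
db.execute("INSERT INTO items VALUES ('B', 'Song 2', ?)", (NOT_FOUND,))  # failed before
db.execute("INSERT INTO items VALUES ('C', 'Song 3', 'la la la')")  # has lyrics

# only items that were never searched need a network request
todo = db.execute('SELECT artist, title FROM items '
                  'WHERE lyrics IS NULL').fetchall()
print(todo)  # [('A', 'Song 1')]
```

This keeps the negative cache persistent across runs for free, at the cost of overloading the meaning of the lyrics column.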
Sure, beets does handle lots of large collections. But I'd hazard a guess that it's a minority of one's library that will fail to match, meaning that we wouldn't have to keep track of all 50k songs in such a set. It's worth measuring, at least! :rocket:
okay let's see here:
$ beet -l library.db stats
Tracks: 26939
Total time: 10.4 weeks
Approximate total size: 1012.3 GiB
Artists: 1614
Albums: 2123
Album artists: 793
let's bump that up to 100k songs, just for fun. let's use this benchmarking script:
#!/usr/bin/python3

import datetime
import logging
import os
import random
import string
import sys

import humanize
import psutil
from pybloom import ScalableBloomFilter, BloomFilter
from pybloomfilter import BloomFilter as BloomFilterMmap


class Timer(object):
    """track time and resources used since the object's creation"""

    def __init__(self):
        """initialize the timestamp"""
        self.stamp = datetime.datetime.now()

    def times(self):
        """return a string describing CPU usage"""
        return 'user %s system %s chlduser %s chldsystem %s' % os.times()[:4]

    def rss(self):
        process = psutil.Process(os.getpid())
        return process.memory_info().rss

    def memory(self):
        return 'RSS %s' % humanize.naturalsize(self.rss())

    def diff(self):
        """a timedelta between the creation of the object and now"""
        return datetime.datetime.now() - self.stamp

    def __str__(self):
        """return a string representing the time passed and resources used"""
        return 'elapsed: %s (%s %s)' % (str(self.diff()),
                                        self.times(),
                                        self.memory())


def randomwords(n, length=10):
    for c in range(n):
        yield ''.join(random.choice(string.ascii_lowercase)
                      for i in range(length))


def load_set(n):
    s = set()
    for word in randomwords(n):
        s.add(word)
    return s


def load_bloom(n):
    bf = ScalableBloomFilter()
    for word in randomwords(n):
        bf.add(word)
    return bf


def BloomFilterMmapWrapper():
    return BloomFilterMmap(100000, 0.1, 'filter.bloom')


def BloomFilterWrapper():
    return BloomFilter(capacity=100000, error_rate=0.1)


def run_tests(n):
    logging.basicConfig(level=logging.DEBUG, format='%(message)s')
    for tool in (set,
                 BloomFilterMmapWrapper,
                 BloomFilterWrapper,
                 ScalableBloomFilter):
        timer = Timer()
        s = tool()
        for word in randomwords(n):
            s.add(word)
        logging.info('%r loaded %d entries: %s', tool, n, timer)
        timer = Timer()
        for word in randomwords(n):
            if word in s:
                pass
        logging.info('%r %d lookups: %s', tool, n, timer)


if __name__ == '__main__':
    run_tests(int(sys.argv[1]))
and let's run it:
$ python set-tests.py 100000
<type 'set'> loaded 100000 entries: elapsed: 0:00:00.680757 (user 0.72 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 26.5 MB)
<type 'set'> 100000 lookups: elapsed: 0:00:00.664798 (user 1.39 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 27.1 MB)
<function BloomFilterMmapWrapper at 0x7f3763fa1848> loaded 100000 entries: elapsed: 0:00:00.670700 (user 2.06 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.0 MB)
<function BloomFilterMmapWrapper at 0x7f3763fa1848> 100000 lookups: elapsed: 0:00:00.657779 (user 2.72 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.0 MB)
<function BloomFilterWrapper at 0x7f3763fa18c0> loaded 100000 entries: elapsed: 0:00:01.274390 (user 3.98 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.0 MB)
<function BloomFilterWrapper at 0x7f3763fa18c0> 100000 lookups: elapsed: 0:00:01.093647 (user 5.07 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.0 MB)
<class 'pybloom.pybloom.ScalableBloomFilter'> loaded 100000 entries: elapsed: 0:00:07.461861 (user 12.52 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.3 MB)
<class 'pybloom.pybloom.ScalableBloomFilter'> 100000 lookups: elapsed: 0:00:07.427330 (user 19.91 system 0.03 chlduser 0.0 chldsystem 0.0 RSS 19.3 MB)
observations:
- the set datastructure is very efficient, especially considering it actually holds a copy of all that data (whereas bloom filters only say if the data has been seen or not)
- set and BloomFilterMmap are roughly as fast, both of them. one has to wonder if set doesn't use a bloom filter internally ;) (in fact, it uses a hashtable)
- the set datastructure uses up more memory - about double
- pybloom is about 2 times slower than the mmap version, which isn't so bad considering it is pure python
- the ScalableBloomFilter from pybloom is horribly slow! presumably this is because it copies the datastructure multiple times as it scales up, but that doesn't explain why lookups are that slow

If we go up one more order of magnitude, to 1 million songs, results are roughly similar for most tools, except the ScalableBloomFilter, which explodes to 87 seconds execution time (instead of 6 seconds for set) - still around an order of magnitude slower, however. set also uses up way more memory here as well - 128MB vs ~48MB for the bloom filters. Nothing to worry about there, I would say, although that memory usage also scales up with the song length - we have been very conservative here in guessing a "10 character" length for the test, which disproportionately favors the set structure. However, bumping up to 100 characters doesn't yield significant impacts on the set memory usage.

Anyways. All this to show that set is fast enough and there's no reason to introduce a third-party datastructure/dependency here, if we do want to have some sort of caching.
results with 100 character words and 1M entries:
$ python set-tests.py 1000000
<type 'set'> loaded 1000000 entries: elapsed: 0:00:56.268277 (user 56.18 system 0.08 chlduser 0.0 chldsystem 0.0 RSS 219.3 MB)
<type 'set'> 1000000 lookups: elapsed: 0:00:55.556960 (user 111.71 system 0.09 chlduser 0.0 chldsystem 0.0 RSS 227.4 MB)
<function BloomFilterMmapWrapper at 0x7fd285180848> loaded 1000000 entries: elapsed: 0:00:56.018806 (user 167.63 system 0.11 chlduser 0.0 chldsystem 0.0 RSS 48.9 MB)
<function BloomFilterMmapWrapper at 0x7fd285180848> 1000000 lookups: elapsed: 0:00:55.751069 (user 223.31 system 0.11 chlduser 0.0 chldsystem 0.0 RSS 48.9 MB)
<function BloomFilterWrapper at 0x7fd2851808c0> loaded 1000000 entries: elapsed: 0:01:02.873864 (user 286.16 system 0.11 chlduser 0.0 chldsystem 0.0 RSS 48.8 MB)
<function BloomFilterWrapper at 0x7fd2851808c0> 1000000 lookups: elapsed: 0:01:01.404508 (user 347.55 system 0.12 chlduser 0.0 chldsystem 0.0 RSS 48.8 MB)
<class 'pybloom.pybloom.ScalableBloomFilter'> loaded 1000000 entries: elapsed: 0:02:33.855005 (user 501.24 system 0.14 chlduser 0.0 chldsystem 0.0 RSS 52.6 MB)
<class 'pybloom.pybloom.ScalableBloomFilter'> 1000000 lookups: elapsed: 0:02:30.389974 (user 651.39 system 0.14 chlduser 0.0 chldsystem 0.0 RSS 52.6 MB)
the set memory usage does get a little prohibitive then...
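For perspective, the standard sizing formula for Bloom filters, m = -n ln(p) / (ln 2)^2, shows the filter's bit array itself stays tiny even at a million entries; the ~48MB RSS measured above is presumably mostly interpreter and library overhead:

```python
import math

def bloom_bits(n, p):
    """optimal number of bits for n items at false-positive rate p"""
    return math.ceil(-n * math.log(p) / math.log(2) ** 2)

# 1 million entries at a 10% error rate, matching the benchmark above
bits = bloom_bits(1000000, 0.1)
print(bits)              # about 4.8 million bits...
print(bits / 8 / 2**20)  # ...i.e. roughly 0.57 MiB of actual filter data
```

Note that the filter size depends only on the number of entries and the error rate, not on the key length, which is why the bloom filter RSS barely moves between the 10-character and 100-character runs while the set's memory grows with the strings it stores.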
Awesome! I’m a sucker for quantitative measurement. I would never have guessed that the scalable Bloom filter implementation would be so slow, for example… it’s hard to think of a circumstance where that would be useful.
Anyway, sounds like a normal cache with a set for unmatched artist/title pairs is the way to go. We could even still special-case the empty strings to avoid needlessly searching on first lookup.
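That design could look something like the following sketch; the names and structure here are illustrative, not the actual plugin code:

```python
class LyricsFetcher:
    """Sketch of the proposed design: a plain set as a negative cache,
    plus a special case for empty artist/title strings."""

    def __init__(self, backends):
        self.backends = backends      # callables taking (artist, title)
        self.failed = set()           # (artist, title) pairs that found nothing

    def get_lyrics(self, artist, title):
        # special-case empty strings: never worth a network request
        if not artist or not title:
            return None
        # a tuple key avoids the 'a-b' vs 'a' + '-' + 'b' ambiguity of
        # concatenated strings
        key = (artist, title)
        if key in self.failed:
            return None
        for backend in self.backends:
            lyrics = backend(artist, title)
            if lyrics:
                return lyrics
        self.failed.add(key)          # remember the failure
        return None

# quick demonstration with a fake backend that never finds anything
calls = []
def fake_backend(artist, title):
    calls.append((artist, title))
    return None

f = LyricsFetcher([fake_backend])
f.get_lyrics('A', 'T')
f.get_lyrics('A', 'T')  # cached failure: no second backend call
print(len(calls))  # 1
```

Unlike the Bloom filter version, a set can never produce a false "already failed" answer, and the empty-string check means those items never even consult the cache.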
I thought it could be a good idea to optimize lyrics searches. here, when i want to fetch all the lyrics in my collection, the first results are like this:
a few things come to mind here:
Network access is expensive, way more than disk or memory. We should limit it as much as possible, if only to avoid being blocked at the provider (e.g. #2632).
While we could try to tweak our searches to avoid certain patterns (say, empty artists or titles), it would seem even better to use a bloom filter for this purpose. It would delegate to the backends the responsibility of filtering out ridiculous requests, but we would at least only make them once. And since the resulting search for items A and B may be the same, we would also avoid sending duplicate requests to the backend.
Opinions?