beetbox / beets

music library manager and MusicBrainz tagger
http://beets.io/
MIT License

Performance: Common operations are slow. #2388

Open icefoxen opened 7 years ago

icefoxen commented 7 years ago

Just started using beets and I love it, but holy cats is it slow sometimes, even for very simple things like basic queries and listings. I only have ~10k songs and it's still rather a pain.

So, I'd like to make it faster, and welcome advice on how to do so. The first suspect is database calls. Compare how long it takes to list items for even a small library, vs. just reading the information from the database:

$ time beet ls | wc  
1.01user 0.12system 0:03.15elapsed 36%CPU (0avgtext+0avgdata 29912maxresident)k
0inputs+0outputs (0major+9671minor)pagefaults 0swaps
    385    2725   15198
$ time sqlite3 musiclibrary.blb 'select * from items;' | wc
0.11user 0.00system 0:00.11elapsed 97%CPU (0avgtext+0avgdata 5456maxresident)k
0inputs+0outputs (0major+697minor)pagefaults 0swaps
  11437  193901 3919008

(Also I'm not sure why my library of 385 songs has 11k items in it; that's something else to investigate.)

Some profiling suggests that this is a likely candidate:

$ python3 -m cProfile -s cumtime /usr/local/bin/beet ls | less
...
         703753 function calls (689872 primitive calls) in 3.696 seconds
   Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1159    2.377    0.002    2.377    0.002 {method 'execute' of 'sqlite3.Connection' objects}
      770    0.083    0.000    0.297    0.000 db.py:173(_awaken)
    34677    0.074    0.000    0.119    0.000 types.py:93(from_sql)

So yes, it does spend most of its time talking to the DB, and adding more items to the library seems to add 3 more sqlite3.Connection.execute() calls per library item. So, the first goal is to do more intelligent batching of database calls. If I can make ls faster, then at best it will make a lot of other things faster as well, and at worst I'll be able to start attacking other problems.

Other performance issues to look into: https://github.com/beetbox/beets/issues/2370 https://github.com/beetbox/beets/issues/2281 https://github.com/beetbox/beets/issues/1795


sampsyo commented 7 years ago

Hi! Thanks for looking into this—we could really use some help with digging into performance issues exactly like this.

I also agree that starting with beet ls, which really should be fast but isn't, is a good idea. Here, I think the issue is the way that we've designed the query and lookup process in beets.dbcore, which is simple and elegant but terrible for performance. Specifically, here's how it goes:

In each of these cases, we could make better choices about where to be lazy and where to be eager. In particular, one big win might be to fetch all the flexible attributes at once using a hefty JOIN clause—that would require more up-front work from the database but far fewer one-at-a-time, back-and-forth trips from Python to SQLite.
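
To make the JOIN idea concrete, here is a rough sketch. The table and column names (items, item_attributes, entity_id) are my reading of the beets schema and should be double-checked; this illustrates the shape of the query, not a patch.

import sqlite3
from collections import defaultdict

conn = sqlite3.connect("musiclibrary.blb")

# One round trip: every item row joined with its flexible attributes.
rows = conn.execute(
    "SELECT items.*, item_attributes.key, item_attributes.value "
    "FROM items LEFT JOIN item_attributes "
    "ON item_attributes.entity_id = items.id"
)

# Re-group in Python: the fixed columns repeat once per flexible attribute,
# so collect them by item id along with a per-item dict of flex attrs.
fixed = {}
flex = defaultdict(dict)
for row in rows:
    *item_columns, key, value = row
    item_id = item_columns[0]            # assuming id is the first column
    fixed[item_id] = item_columns
    if key is not None:                  # items with no flex attrs still appear
        flex[item_id][key] = value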

I hope that helps! I'm happy to explain other stuff in more detail too if it would be helpful. But even just doing this profiling work to find the bottlenecks in ls is really helpful, so please keep it coming!

As an aside, I recently discovered this tool while doing a little profiling work: https://github.com/jiffyclub/snakeviz

It renders a nice interactive, graphical representation of Python profiling data.
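
For reference, the usual workflow (assuming snakeviz has been installed from PyPI) is to dump the cProfile stats to a file and then open them in the browser:

$ python3 -m cProfile -o beet.prof /usr/local/bin/beet ls > /dev/null
$ snakeviz beet.prof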

icefoxen commented 7 years ago

It helps a lot! Thank you. My initial attempt was going to be to just make everything eager and see what that does. This would mean storing a lot more stuff in memory, but minimizing the number of db calls that have to be made. The main problem with this is that it makes me start to worry about memory consumption for particularly large libraries, but I'm not sure how to avoid that without trying to restructure everything to be a streaming operation, and I certainly don't want to do something that drastic. The next step would be to try to make the db calls more clever so that it only eagerly pulls out what data is actually necessary for whatever the user is trying to do, a la those JOIN statements you mentioned. This might involve building up a query piecemeal and then turning it into SQL and executing it all as one big chunk at the end.
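
As a sketch of the "build it up piecemeal, execute once" idea (all names here are hypothetical, not the existing dbcore API):

class BatchQuery:
    """Collects WHERE clauses and parameters, then emits one SQL statement."""

    def __init__(self, table):
        self.table = table
        self.clauses = []
        self.params = []

    def where(self, clause, *params):
        self.clauses.append(clause)
        self.params.extend(params)
        return self

    def to_sql(self):
        sql = "SELECT * FROM " + self.table
        if self.clauses:
            sql += " WHERE " + " AND ".join(self.clauses)
        return sql, self.params

# Hypothetical usage: accumulate conditions, then hit the database once.
# query = BatchQuery("items").where("artist = ?", "Maria Callas").where("year >= ?", 1950)
# sql, params = query.to_sql()
# rows = connection.execute(sql, params).fetchall()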

There might be a few pure-code things that can be done as well, but the database will definitely be the low-hanging fruit. I'll try out snakeviz; I've never seen it before! Thanks.

ghost commented 7 years ago

we're also missing indexes.
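
For example, something along these lines could help the per-item flexible-attribute lookups (the table and column names are my guess at the schema, so check the actual .blb file before relying on this):

import sqlite3

conn = sqlite3.connect("musiclibrary.blb")
# The flexible-attribute query filters on entity_id; IF NOT EXISTS makes
# this safe to run repeatedly.
conn.execute(
    "CREATE INDEX IF NOT EXISTS item_attributes_entity_id "
    "ON item_attributes (entity_id)"
)
conn.commit()
conn.close()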

icefoxen commented 7 years ago

Initial impressions:

Some poking has made listing items 60% faster (on my machine), mostly by taking things away. See https://github.com/icefoxen/beets/commit/d12a1d2a82fe1cda4386fa3cd948ce538358d1ef . However, it also breaks a lot of things, so no pull request yet.

sampsyo commented 7 years ago

Initial responses:

Anyway, as you say, it would be interesting to explore a version that does everything eagerly and start from there. That change could be easily confined to the way we construct Results so that client code would be totally unaware. You're right that there's a memory/performance trade-off, but I think it's worth building and measuring the actual memory overhead—and how it scales with library size—to see if it creates a meaningful problem.
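
One way to picture "confined to Results" (a hypothetical class, not the real dbcore one): keep the public iteration interface, but pay the whole database cost up front in the constructor.

class EagerResults:
    """Materializes every model object at construction time; iteration and
    len() then behave like the lazy version, so callers don't change."""

    def __init__(self, fetch_rows, make_model):
        # All database work happens here, once, instead of per item later.
        self._objects = [make_model(row) for row in fetch_rows()]

    def __iter__(self):
        return iter(self._objects)

    def __len__(self):
        return len(self._objects)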

icefoxen commented 7 years ago

Eagerly constructing Results is basically what the test change in my branch does. It does work! Haven't checked the memory performance though.

sampsyo commented 7 years ago

Got it, that makes sense. So you're saying, for example, that beet list would ideally bypass the ORMesque layer entirely—it wouldn't even bother constructing Item and Album objects at all. Instead, you would take a command like:

$ beet ls -f '$foo $bar'

and just issue the query SELECT foo, bar FROM items, and then fill the template with each returned row. No overhead required!
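
As a sketch of that extreme, using string.Template as a stand-in for beets' own template engine, and title/artist as example fields that happen to be plain item columns (flexible attributes would not work this way):

import sqlite3
from string import Template

conn = sqlite3.connect("musiclibrary.blb")
fields = ["title", "artist"]              # parsed out of the -f format string
template = Template("$title - $artist")

# One query, one pass over the rows, no Item objects constructed at all.
for row in conn.execute("SELECT {} FROM items".format(", ".join(fields))):
    print(template.substitute(dict(zip(fields, row))))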

I totally agree that this would be the ideal speed. The game, of course, is finding a way to do that—or something close to it—while maintaining abstractions. I don't have any great ideas, but here are some bad ideas:

icefoxen commented 7 years ago

Not really. Abstraction is good. But each database call has quite a lot of overhead compared to the amount of actual work done, and we do multiple database calls to fetch all the information required for each item. The cost of actually constructing the Item and Album objects is... well, not quite trivial, but very small in comparison. We need an easier way to handle Items and Albums in batches, so that we do the same work with fewer calls into the database.

Right now if you want to construct 1000 Item objects based on a query, it does 1 database query to match the query, 1000 queries to get the attributes for each item, and 1000 queries to get the flexible attributes for each item. And then if you print them out, it does at least another 1000 queries in the Formatter objects for some reason (not sure what it's doing there yet, something about album info). Each of those database queries has a lot of overhead compared to the amount of work done; they're basically selecting a single row. It is much much faster to be able to say "oh we're going to be getting all of these objects at once anyway", do two queries (one to get attributes for all the items, one to get flexible attributes; you could even combine them together into a single call with a join) and construct all your Items from that info at once.
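
Concretely, the two-query version might look roughly like this (raw sqlite3 with guessed table and column names, purely for illustration; the real change would have to live inside dbcore):

import sqlite3
from collections import defaultdict

conn = sqlite3.connect("musiclibrary.blb")

# Query 1: every matching item in one shot.
items = conn.execute("SELECT * FROM items").fetchall()
item_ids = [row[0] for row in items]      # assuming id is the first column

# Query 2: the flexible attributes for all of those items at once.
flex = defaultdict(dict)
if item_ids:
    placeholders = ", ".join("?" for _ in item_ids)
    for entity_id, key, value in conn.execute(
        "SELECT entity_id, key, value FROM item_attributes "
        "WHERE entity_id IN ({})".format(placeholders),
        item_ids,
    ):
        flex[entity_id][key] = value

# Each Item could then be built from its row plus flex[item_id]:
# two execute() calls total instead of 2N + 1.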

sampsyo commented 7 years ago

Yep, that sounds right. Maybe a useful exercise would be to hand-write the queries we wish we were executing, and then design a usable Python API around that?

RollingStar commented 7 years ago

I don't know whether this belongs in a separate issue or not. It may fall under general performance, just like this issue.

Problem

Same setup as #2446. Different issue: if a user wants to look up a giant boxset, it shouldn't take as long as it does. Is there any kind of optimization or parallelization that can be done? My CPU only has 20% use while this operation runs. What is beets actually doing when it's stuck looking up a big boxset? From my verbose log, it looks like it's calculating distance between the local files and candidate albums on the web.

ghost commented 6 years ago

I do not code, so I am not the best person to enter this discussion, but based on my experience compiling and testing stuff, I guess the main reason for beets' slowness is that it is written entirely in Python.

If beets itself were written in a compiled language (C, C++, Ada, etc.) and then made use of the existing Python libraries (py-musicbrainzngs, py-pyacoustid, etc.), it would certainly be quite a bit faster.

Comparing apples and oranges, I know, but Picard is moderately fast in general, yet slow when performing actions that depend on (more or less) the same Python libraries beets depends on.

I have been using beets for a while and I really enjoy it, but I was shocked when it took more than 36 hours to import a Maria Callas box set with about 1400 tracks. :-O

It takes about 2 seconds just to display "beet -h" here. :-)

Just my 2c.

RollingStar commented 6 years ago

That's a good idea, but porting some or most of the code to a compiled language would be a huge undertaking. Similar gains may be possible just by optimizing the Python code.

NoahTheDuke commented 5 years ago

The issue isn't with Python, the issue is with beets falling prey to the classic N+1 SELECT problem:

Right now if you want to construct 1000 Item objects based on a query, it does 1 database query to match the query, 1000 queries to get the attributes for each item, and 1000 queries to get the flexible attributes for each item.

The clever way to do such a series of queries would be to group all of the ids for each successive query and then join them together into a single WHERE statement and iterate over them in Python. This is what you meant by batching, right @icefoxen?

So if the old way was selected_albums = "select * from db.albums where album_name = $foo" followed by for song_id in selected_albums: song = "select * from db.songs where song_id = $song_id" and so on, you'd instead run selected_albums = "select * from db.albums where album_name = $foo", then song_ids = [song.id for song in selected_albums]; songs = "select * from db.songs where song_id in (" + ", ".join(song_ids) + ")", which grabs only the song_ids you actually want, but gets them all in a single call.

I don't know anything about how beets works under the hood, so I don't know if that's easily done with the Object model y'all are working with, but it's a well-trod problem and can be done.
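
One practical footnote to that, reusing the hypothetical songs/song_id names from the example above: binding the ids as parameters instead of string-joining them sidesteps quoting issues, and chunking keeps each statement under SQLite's limit on bound variables (historically 999 per statement).

import sqlite3

def fetch_songs_by_id(conn, song_ids, chunk_size=900):
    """Fetch all rows for song_ids in a handful of IN (...) queries."""
    rows = []
    for i in range(0, len(song_ids), chunk_size):
        chunk = song_ids[i:i + chunk_size]
        placeholders = ", ".join("?" for _ in chunk)
        rows.extend(conn.execute(
            "SELECT * FROM songs WHERE song_id IN ({})".format(placeholders),
            chunk,
        ))
    return rows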

icefoxen commented 5 years ago

@NoahTheDuke Yeah that's more or less the problem, as I recall it. It was just too large a problem for me to be able to handle, since it seemed to me that it would take a substantial amount of redesigning to fix.

k6Q28k5w5eWuRuuzTg9j7GrKCbjt7 commented 5 years ago

Importing a parent folder with many albums at once, versus looping over the subfolders and invoking beet import on each subfolder/album, makes an immense difference in import time. In fact, the loop is the only way I'm able to bulk-import big sets (I just use a simple bash one-liner). So I'm guessing the biggest gain will come from reflecting that kind of invocation in the code.
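
For anyone who wants to reproduce that workaround, a minimal version of such a loop (with a placeholder path) is:

$ for dir in /path/to/parent/*/ ; do beet import "$dir" ; done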

SimonTeixidor commented 5 years ago

I decided to look into this a bit today. Beets is not painfully slow for me, but still quite a bit slower than it should be.

@NoahTheDuke: I disagree that N+1 SELECT is a problem here. While it would have been a problem in a traditional server/client DBMS, it doesn't really apply to sqlite. The latency is just much lower as sqlite does not do any network communication. In fact, the documentation even encourages the N+1 pattern [1].

I imagine that the indices in the DB were created after this issue was opened, because the sqlite calls no longer take much time according to the profiler.

Running beet ls > /dev/null on my library of 14190 songs takes about 15 seconds. Profiling it shows that most of the effort is spent formatting and printing the output. As an experiment, I applied the following patch:

diff --git a/beets/ui/commands.py b/beets/ui/commands.py
index c89dbb6d..a9a6ec5c 100755
--- a/beets/ui/commands.py
+++ b/beets/ui/commands.py
@@ -1064,7 +1064,7 @@ def list_items(lib, query, album, fmt=u''):
             ui.print_(format(album, fmt))
     else:
         for item in lib.items(query):
-            ui.print_(format(item, fmt))
+            ui.print_(item.title)

 def list_func(lib, opts, args):

This brings the runtime down to 4 seconds. At this point, the output of cProfile looks like this:

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    192/1    0.013    0.000    4.908    4.908 {built-in method builtins.exec}
        1    0.000    0.000    4.908    4.908 beet:18(<module>)
        1    0.000    0.000    4.674    4.674 __init__.py:1261(main)
        1    0.000    0.000    4.674    4.674 __init__.py:1218(_raw_main)
        1    0.000    0.000    4.589    4.589 commands.py:1070(list_func)
        1    0.066    0.066    4.589    4.589 commands.py:1058(list_items)
    14192    0.101    0.000    2.165    0.000 db.py:660(_get_objects)
    14190    0.810    0.000    1.906    0.000 db.py:720(_make_model)
    14190    0.075    0.000    1.214    0.000 __init__.py:121(print_)
        6    0.000    0.000    0.845    0.141 db.py:820(query)
        2    0.000    0.000    0.844    0.422 library.py:1350(_fetch)
        2    0.000    0.000    0.844    0.422 db.py:1019(_fetch)
        1    0.000    0.000    0.844    0.844 library.py:1392(items)
    14190    0.012    0.000    0.819    0.000 __init__.py:84(_out_encoding)
    14190    0.037    0.000    0.807    0.000 __init__.py:90(_stream_encoding)

It seems like we spend 2 seconds translating db records to models (_make_model). That sounds like a lot, and might be possible to optimize as well.

I haven't yet looked into how much effort it would be to optimize these two functions (printing, and model creation), but it seems realistic to get beet ls down to under 5 seconds on a relatively large library without any major architectural changes.
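
On the printing side, a generic illustration (not beets' actual ui code) of one way to cut per-line overhead: format everything first, then encode and write in one call rather than ~14k times.

import sys

def print_lines_buffered(lines):
    # A single join + write replaces thousands of per-item print calls,
    # so per-call costs (encoding lookup, stream handling) stop dominating.
    sys.stdout.write("\n".join(lines) + "\n")

# Hypothetical usage inside list_items:
# print_lines_buffered(format(item, fmt) for item in lib.items(query))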

[1] https://sqlite.org/np1queryprob.html

sampsyo commented 5 years ago

Awesome question! This is a super useful investigation.

I am particularly interested that you found such a high cost for formatting. Some time ago, I did a somewhat ridiculous thing to try to optimize template formatting—I wrote a compiler that translates template strings into proper Python functions for evaluation. After I did so, I vaguely remember concluding that it didn't have a large impact—and I thought it was because template formatting didn't matter all that much. And perhaps it doesn't—the problem might be formatting-adjacent—but it really looks like this piece of the puzzle is worth a closer look.

SimonTeixidor commented 5 years ago

Thanks for the hint! It looks like the template is actually recompiled over and over, which you actually discovered yourself in #2030 some time ago :)

As an experiment, I applied the following patch:

diff --git a/beets/dbcore/db.py b/beets/dbcore/db.py
index 97a4a7ce..d218141e 100755
--- a/beets/dbcore/db.py
+++ b/beets/dbcore/db.py
@@ -23,6 +23,7 @@ from collections import defaultdict
 import threading
 import sqlite3
 import contextlib
+import functools

 import beets
 from beets.util.functemplate import Template
@@ -36,6 +37,10 @@ else:
     from collections.abc import Mapping

+@functools.lru_cache(maxsize=None)
+def create_template(template_str):
+    return Template(template_str)
+
 class DBAccessError(Exception):
     """The SQLite database became inaccessible.

@@ -597,7 +602,7 @@ class Model(object):
         """
         # Perform substitution.
         if isinstance(template, six.string_types):
-            template = Template(template)
+            template = create_template(template)
         return template.substitute(self.formatted(for_path),
                                    self._template_funcs())

With this, beet ls runs in around 10 seconds, down from 15. There may be room for further improvement with just local optimizations in Template, as well. I'll investigate this further another day.
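
As a side note, functools.lru_cache exposes hit/miss counters, which gives a quick sanity check that the template really is compiled only once per distinct format string (the numbers below are illustrative, not measured):

print(create_template.cache_info())
# e.g. CacheInfo(hits=14189, misses=1, maxsize=None, currsize=1)
# i.e. one compile, and every other item reuses the cached Template.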

SimonTeixidor commented 5 years ago

I investigated the templating logic further today and created PR #3258 with my findings.

sampsyo commented 5 years ago

Thanks for the hint! It looks like the template is actually recompiled over and over, which you actually discovered yourself in #2030 some time ago :)

I am shocked but not surprised at my terrible memory. :smiley: Thank you for continuing this investigation!! :rocket: