ajslater / codex

Codex is a web based comic archive browser and reader
GNU General Public License v3.0

OPDS "Newest Issues" very slow with large library with lots of metadata. #261

Closed beville closed 1 year ago

beville commented 1 year ago

Still seeing this issue too, though maybe not as bad as before. Basically, with my library the "Newest Issues" list takes about 60 seconds to load. On some clients that's long enough to cause a time-out, sometimes not. Tested with Panels, PocketBook and Foliate.

Using the same tool I mentioned in the other issue (https://github.com/beville/gen-fake-comics-lib) I re-created this with a large library (20,000 comics) that has metadata tags comparable to those in my real library. It also takes about a minute to load. The following command will create 50 publishers, 20 series per publisher, 2 volumes per series, and 10 issues per volume. It will also populate a sane amount of locations, teams, characters, credits, story arcs, and summaries.

gen-fake-comics-lib.py -t -p 50 -s 20 -v 2 -i 10 -d . -cTSa
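
As a quick sanity check on the library size, the counts from that command multiply out as follows. This is only a minimal arithmetic sketch; the variable names are illustrative and are not part of gen-fake-comics-lib.py.

```python
# Counts taken from the command above: -p 50 -s 20 -v 2 -i 10
publishers = 50
series_per_publisher = 20
volumes_per_series = 2
issues_per_volume = 10

total_series = publishers * series_per_publisher
total_volumes = total_series * volumes_per_series
total_comics = total_volumes * issues_per_volume

# 1000 series, 2000 volumes, 20000 comics
print(total_series, total_volumes, total_comics)
```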

Then I created a new library the same size as the above, but without all the ancillary metadata tags.

gen-fake-comics-lib.py -t -p 50 -s 20 -v 2 -i 10 -d . 

With this library, the "Newest Issues" loads in just a few seconds! So it seems that adding all those tags and text are causing that view fetch to be bogged down.

I haven't tried to narrow this down any further. Hopefully you'll be able to recreate this, or maybe just glean enough understanding of what might be happening from what I described.

ajslater commented 1 year ago

For sure the database query is slowed by the extra tags. The minor contributors are 'groups', e.g. publishers, imprints, series, volumes, folders. The major contributors to slowness are retrieving the many-to-many values, aka tags. A couple of versions ago I was overzealous, populating the 'categories' OPDS tag with many tags to show off lots of metadata. This led to unusable load times. So now I only populate categories with characters and story_arcs. Most libraries don't have many story_arcs tagged and most comics don't have any. Character tags, on the other hand, are usually pretty prevalent.

With a 20k library with lots of metadata, I am noticing a 1 second load time in Series View, where I pull in lots of metadata. That is different than Publisher View, where it's nearly instant and there are no joins.

60 seconds would be cause for concern, but I'm having trouble replicating anything close to that.
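
To illustrate why many-to-many tags can dominate query cost, here is a minimal sketch using Python's built-in sqlite3 module. The schema is a toy, not Codex's real models: the table names, column names, and row counts are all assumptions made for the example. It only shows how joining through a many-to-many table multiplies the rows the database has to produce.

```python
import sqlite3

# Toy schema for illustration only; these are NOT Codex's actual tables.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE comic (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE comic_tag (comic_id INTEGER, tag_id INTEGER);
    """
)

NUM_COMICS = 100
TAGS_PER_COMIC = 8

con.executemany(
    "INSERT INTO comic VALUES (?, ?)",
    [(i, f"Comic {i}") for i in range(NUM_COMICS)],
)
con.executemany(
    "INSERT INTO tag VALUES (?, ?)",
    [(i, f"Tag {i}") for i in range(TAGS_PER_COMIC)],
)
# Every comic is linked to every tag through the many-to-many table.
con.executemany(
    "INSERT INTO comic_tag VALUES (?, ?)",
    [(c, t) for c in range(NUM_COMICS) for t in range(TAGS_PER_COMIC)],
)

# No joins: one row per comic.
plain_rows = con.execute("SELECT id, name FROM comic").fetchall()

# Joining through the many-to-many table: one row per (comic, tag) pair.
joined_rows = con.execute(
    """
    SELECT comic.id, comic.name, tag.name
    FROM comic
    JOIN comic_tag ON comic_tag.comic_id = comic.id
    JOIN tag ON tag.id = comic_tag.tag_id
    """
).fetchall()

print(len(plain_rows), len(joined_rows))  # 100 rows vs. 800 rows
```

Each additional many-to-many field adds another fan-out of this kind, which is consistent with the tag-heavy libraries being the slow ones.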

beville commented 1 year ago

I wonder if it's my machine? I'm running on a relatively low-powered server NUC with 8GB RAM. (Intel(R) Celeron(R) J4005 CPU @ 2.00GHz)

Did you try using my tool to generate a test lib? The example above replicated the load times I see with my own real library. It would be great to get an apples-to-apples test.

I can play with narrowing down the performance hit based on what you just related.

I can also try with your mock_comics.py test to see if I hit some of the same problems.

But first I'll wait for your next release with the performance boosts. Maybe that will address this.

ajslater commented 1 year ago

For v1.2.8 I just turned off all categories tags. So no characters or story_arc metadata anymore. I still do have author/creator credits which can slow things down, but tell me how this goes.

beville commented 1 year ago

Well, I'm still seeing one minute load times of my library on the "Recently Added" feed. Similar for the 20K test library with metadata that I generated with my tool. Using your fake mock_comics for 20K, I'm seeing more like 30s.

I did some more ordered tests on the command line on my Celeron server like this:

time wget -q -O /dev/null "http://localhost:9810/opds/v1.2/s/0/1?orderBy=created_at&orderReverse=True"

and another on the "All Series" view as a control.

time wget -q -O /dev/null "http://localhost:9810/opds/v1.2/r/0/1?topGroup=s"
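
For anyone who would rather script these measurements than run wget by hand, here is a rough Python equivalent. It is a sketch under assumptions: the two URLs are the ones from the commands above, while the function names and the injectable `fetch` argument are made up for this example.

```python
import time
from urllib.request import urlopen

# The two feeds timed above, on a local Codex instance.
FEEDS = {
    "Recently Added": (
        "http://localhost:9810/opds/v1.2/s/0/1"
        "?orderBy=created_at&orderReverse=True"
    ),
    "All Series": "http://localhost:9810/opds/v1.2/r/0/1?topGroup=s",
}


def _http_get(url):
    """Download the whole response body, like `wget -O /dev/null`."""
    with urlopen(url) as response:
        return response.read()


def time_fetch(url, fetch=_http_get):
    """Return (elapsed_seconds, byte_count) for a single fetch of url."""
    start = time.perf_counter()
    body = fetch(url)
    elapsed = time.perf_counter() - start
    return elapsed, len(body)


def time_feeds(feeds, fetch=_http_get):
    """Time every feed once and return {name: elapsed_seconds}."""
    return {name: time_fetch(url, fetch)[0] for name, url in feeds.items()}
```

Calling `time_feeds(FEEDS)` against a running server returns one wall-clock time per feed. The `fetch` argument exists only so the timing logic can be exercised without a server.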

Using gen-fake-comics-lib.py I created two libraries, one with lots of metadata, another with just the series, volume, title, issue stuff. I also created one with mock_comics.py. For each library, I stopped Codex, wiped out the config, and started fresh. I waited until the import was complete before running the wget test.

I also tested on a much beefier desktop system. Both machines run Ubuntu, with Codex running as a Docker image.

NUC Media Server

Intel(R) Celeron(R) J4005 CPU @ 2.00GHz w/ 8GB

This is my house media server


"Real" library

(17K comics, almost all with metadata)

| Feed | Load Time |
| --- | --- |
| All Series | 0m0.865s |
| Recently Added | 0m57.656s |

mock_comics.py 20K

(20K comics created by mock_comics script)

| Feed | Load Time |
| --- | --- |
| All Series | 0m0.936s |
| Recently Added | 0m30.127s |

gen-fake-comics-lib.py 20K - lots of metadata

(20K comics created by gen-fake-comics-lib.py script)

gen-fake-comics-lib.py -p 50 -s 20 -v 2 -i 10 -d . --tree --credits --tags --summaries --arcs

| Feed | Load Time |
| --- | --- |
| All Series | 0m0.922s |
| Recently Added | 1m11.045s |

gen-fake-comics-lib.py 20K - minimal metadata

(20K comics created by gen-fake-comics-lib.py script)

gen-fake-comics-lib.py -p 50 -s 20 -v 2 -i 10 -d . --tree

| Feed | Load Time |
| --- | --- |
| All Series | 0m1.200s |
| Recently Added | 0m1.667s |

Desktop PC

Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz w/ 16GB

For this I just tried the two slowest-loading generated libraries


mock_comics.py 20K

(20K comics created by mock_comics script)

| Feed | Load Time |
| --- | --- |
| All Series | 0m0.534s |
| Recently Added | 0m9.592s |

gen-fake-comics-lib.py 20K - lots of metadata

(20K comics created by gen-fake-comics-lib.py script)

gen-fake-comics-lib.py -p 50 -s 20 -v 2 -i 10 -d . --tree --credits --tags --summaries --arcs

| Feed | Load Time |
| --- | --- |
| All Series | 0m0.666s |
| Recently Added | 0m23.360s |

ajslater commented 1 year ago

Yeah this is really bad. Thanks for the detailed writeup. It turns out that if I remove all the M2M field metadata, I get subsecond load times for acquisition pages like Recently Added. When I added authors back in, I got about 1.4 seconds on average, so I'll try that for the next release. I made myself a little time_opds.sh script for 1.2.9, so hopefully I can avoid bloating this again.

beville commented 1 year ago

Awesome, thanks! Glad I could help with this.

That seems like a reasonable solution, especially given there are almost no clients rendering that feed metadata for users. The one that might be the most interesting to add back eventually is story arc, maybe. But that could be a later optimization. It would sure be cool if the feed standard provided for some sort of second-order metadata request link per item, sort of an inspect-before-download/read query.


On a related note to the "Newest Issues" feed, I'm curious how some clients will deal with getting a full fire hose of issues in a feed. Chunky of course is trying to fetch the entire feed before rendering a list. From what I've seen, it seems Panels is rendering one page at a time, but I think it is still caching pages in the background. For a huge feed that's a lot of memory being slurped up.

An idea to consider is for Codex to add support for a feed limiting URL parameter such as "pageLimit" or "itemLimit". This would allow for certain feeds to include this limiter, and have the feed generation code not include a "Next" link once the limit has been hit. This might make sense for some issue-only feeds that probably don't need to include all issues in the library, such as "Newest Issues" (or maybe a new feed called "Recently Read" :wink:)
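
A minimal sketch of how such a limit could work when deciding whether to emit a "Next" link. This is not Codex code: the function name and its arguments are hypothetical, and `page_limit` stands in for the proposed "pageLimit" URL parameter.

```python
def next_page(page, total_pages, page_limit=None):
    """Return the page number for the "Next" link, or None to omit the link.

    page        -- the 1-based page currently being rendered
    total_pages -- how many pages the full, unlimited feed would have
    page_limit  -- hypothetical cap from a "pageLimit" URL parameter;
                   None means the feed is unlimited
    """
    last_page = total_pages if page_limit is None else min(total_pages, page_limit)
    if page >= last_page:
        return None
    return page + 1


# A 50-page feed capped at 3 pages stops offering "Next" on page 3.
for current in (1, 2, 3):
    print(current, next_page(current, total_pages=50, page_limit=3))
```

With this shape, a client that eagerly follows every "Next" link would stop after the capped number of pages, while feeds without the parameter behave as before.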

beville commented 1 year ago

v1.2.9 is better, but still quite slow for my actual library, down to about 20 seconds.

I'm guessing that your development machine is about 10 times faster than my media server, which is not super fast, but is far from anemic for what it generally does. (Huh, I wonder if anyone is trying to run Codex on a Raspberry Pi?)

Are you running an M2 Mac?

Here are the times on that server for loading the "Recently Added" feed.

| Library | v1.2.8 | v1.2.9 |
| --- | --- | --- |
| "Real" library (17K comics, almost all with metadata) | 0:57.656 | 0:19.68 |
| mock_comics.py 20K | 0:30.127 | 0:06.94 |
| gen-fake-comics-lib.py 20K - lots of metadata | 1:11.045 | 0:28.52 |
| gen-fake-comics-lib.py 20K - minimal metadata | 0:01.667 | 0:01.92 |
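
To put the change between versions in relative terms, the timings in the table above can be converted to seconds and compared. This is a small illustrative sketch; the helper name `to_seconds` is made up, and the figures are copied from the table.

```python
def to_seconds(stamp):
    """Convert an 'M:SS.fff' duration such as '1:11.045' to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


# (v1.2.8, v1.2.9) "Recently Added" load times from the table above.
timings = {
    '"Real" library': ("0:57.656", "0:19.68"),
    "mock_comics.py 20K": ("0:30.127", "0:06.94"),
    "gen-fake-comics-lib.py 20K - lots of metadata": ("1:11.045", "0:28.52"),
    "gen-fake-comics-lib.py 20K - minimal metadata": ("0:01.667", "0:01.92"),
}

for name, (old, new) in timings.items():
    ratio = to_seconds(old) / to_seconds(new)
    print(f"{name}: v1.2.8 time is {ratio:.2f}x the v1.2.9 time")
```

A ratio above 1 means v1.2.9 loaded faster; a ratio near 1 means the two versions are effectively the same for that library.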

What sort of times do you see?

Maybe a workaround for this (short of DB/query optimization) would be to have a URL param that requests dropping extra metadata (either a flag or a list) from the generated feed. That way certain feeds that return quick results could do the default richly populated thing, but slower ones, like "Recently Added", could have this param to speed them up. Possibly even a user-settable option, eventually.

ajslater commented 1 year ago

My M1 Mac is seeing 1.4s for mock_comics 20k.

I like this suggestion of tunable metadata by URL! I think I’ll make three levels, and the lowest level will drop even more joins from the DB, so it will be very spare.
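
One way the three levels might look, sketched in plain Python. Everything here is an assumption for illustration: the parameter name `metadataLevel`, the level numbers, the default, and the grouping of fields per level are not taken from Codex. The field names are the ones discussed earlier in this thread.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical levels: each one names the many-to-many fields the feed
# would be allowed to fetch. Level 0 fetches none of them.
METADATA_LEVELS = {
    0: (),
    1: ("authors",),
    2: ("authors", "characters", "story_arcs"),
}
DEFAULT_LEVEL = 2  # assumed default: the richest feed


def fields_for_url(url, param="metadataLevel"):
    """Return the tuple of extra fields to fetch for a feed URL."""
    query = parse_qs(urlparse(url).query)
    raw = query.get(param, [str(DEFAULT_LEVEL)])[0]
    try:
        level = int(raw)
    except ValueError:
        level = DEFAULT_LEVEL
    # Clamp out-of-range values to the nearest defined level.
    level = max(min(METADATA_LEVELS), min(level, max(METADATA_LEVELS)))
    return METADATA_LEVELS[level]


url = "http://localhost:9810/opds/v1.2/s/0/1?orderBy=created_at&metadataLevel=0"
print(fields_for_url(url))  # ()
```

Slow feeds such as "Recently Added" could then link to themselves with the lowest level, while fast feeds keep the default.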

People did run this off of a Pi, but I dropped Docker support for arm32 a while ago when Python cryptography stopped publishing compiled wheels for it. My build times exceeded what’s allowed on my CI service’s free tier.

ajslater commented 1 year ago

As of Codex v1.3.0 all the metadata that makes OPDS slow has been relegated to an alternate link that will show only that one comic, which should be fast.

This functions a bit like the tags/metadata screen in the web ui.
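
Roughly, a slim feed entry with an alternate link could be built like this. This is a sketch with Python's standard xml.etree.ElementTree, not the code Codex uses: the function name, the entry fields, and the example href are all hypothetical. The link type shown is the one the OPDS 1.2 specification gives for linking a partial catalog entry to its complete entry.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace


def slim_entry(comic_id, title, metadata_href):
    """Build a minimal Atom entry that defers heavy metadata to another URL."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = str(comic_id)
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    # The alternate link points at a single-comic document that can afford
    # the expensive many-to-many lookups because it covers only one comic.
    ET.SubElement(
        entry,
        f"{{{ATOM_NS}}}link",
        {
            "rel": "alternate",
            "type": "application/atom+xml;type=entry;profile=opds-catalog",
            "href": metadata_href,
        },
    )
    return entry


# Hypothetical href; the real URL layout is not shown in this thread.
entry = slim_entry(42, "Example Comic #1", "/opds/v1.2/c/42/1")
print(ET.tostring(entry, encoding="unicode"))
```

The list feed stays cheap because each entry carries only an id, a title, and the link; a client asks for the full metadata only when a user inspects one comic.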

beville commented 1 year ago

Much faster now!