fabd / kanji-koohii

A web application to help Japanese language learners remember the kanji.
https://kanji.koohii.com
GNU Affero General Public License v3.0
228 stars 21 forks source link

Better story sorting #140

Closed kieranclancy closed 6 years ago

kieranclancy commented 6 years ago

Rather than just sort by score descending, it would be nice if there were another couple of sort options.

  1. Sort by popularity (current option)

  2. Sort by most recent first

  3. Sort by "best" stories first

The motivation is simple. A story one month old which is already starred 10 times is quite possibly more useful than a story which was starred just 50 times in 5 years. Sorting should take into account the rough exposure a story has had to users versus how often it gets starred. There are often very useful corrections or points made in lower-ranked stories but they rarely ever manage to overtake stories from the site's early days.

Note that you would want to use number of stars vs some approximation of exposure. Note that stories at the bottom of the page or past the default story limit would have less exposure even if they are older. Depending on how the exposure is estimated, the sorting might be computationally expensive but only needs to be updated when a story is starred.

The most recently used sort setting could be stored for the user so they don't have to change it often. The default could remain sorting by popularity for now (the current behaviour) but perhaps later if the "best" option works well then it could become the default for new accounts.

fabd commented 6 years ago

To be honest it's very unlikely to change at this point. But I do agree the sorting can be improved so I'm leaving this one open.

Part of the issue is that the SQL queries are very slow. So if I add multiple sortings, I'd have to also cache multiple sorts per kanji, complicating the logic somewhat.

The stories table is 1GB + and I haven't found a way to make these queries fast. The queries can easily reach 500ms or more, which is really bad. One potential solution would involve saving shared stories in a completely separate table., seeing as public stories make for less than 10% of all stories saved. This however complicates logic in various places, particularly when making a list of the user's stories both shared and public.. which requires more merges.

So long story short just haven't had the heart to work on that. And until I fix that I'm unlikely to make changes to the stories sort logic unless it's a change to the default sort which is overall better and still fairly fast SQL wise.

kieranclancy commented 6 years ago

That's fair enough - perhaps with a different table design or with more caching some of those performance issues could be eliminated.

3000 kanji x 20 stories shown by default x maybe 250 characters on average = just 14 MB.

So you could have an in-memory cache of 99% of the data that's commonly requested, and you just need to do two small SQL queries I think: One to request a user's own story and keyword if relevant, and second to see if a user has starred/reported any stories with that ID to customise their own view of that data.

fabd commented 6 years ago

I've thought some kind of memory cache would drastically improve performance in some areas but I don't know much about them an I'm not sure it's available on a shared server. For example, I wasn't allowed to use partitioned tables, which was one way to improve selectivity of shared stories without having to actually create two tables in the schema.

fabd commented 6 years ago

max_heap_table_size may be up to 256 MB, which sounds good. I really need to learn how those tables work.

fabd commented 6 years ago

3000 kanji x 20 stories shown by default x maybe 250 characters on average = just 14 MB.

In MEMORY tables used fixed length rows. Stories are up to 512 UTF-8 characters, so we are easily over 1500 bytes per row.

It's not 20 stories shown. Having only the current page of the sort in memory makes no sense. We're looking at typically 10 pages of 20 rows , up to 200 shared stories per kanji.

409,000 + stories shared Average 136+ shared stories per 3000 kanji Approx 1550 bytes per row, with VARCHAR(512) story in broken utf-8 storage (not even utf8mb4)

With only 20 stories per page, we're over 93 MB already.

With all 409k + stories, to include the paging, we're looking at 633+ MB.

kieranclancy commented 6 years ago

There are two separate issues here in my mind. One is that there needs to be a good way to store the almost half million shared stories. The other is caching, which if done properly could drastically cut the load times. With caching it is not necessary to cover 100% of the data though, the goal is to speed up access to just the most commonly requested data. I would be interested to see where most of the database time is spent, but my guess based on my own usage of koohii is that 95%+ of the time users don't go past the first page of results. If this estimate is close, then caching just the first page will give significant benefits.

I don't think MySQL MEMORY tables are a good solution for the story data because they are fixed-length. When I said caching I was thinking of something more like APC(u) or opcache-based caching. You said this is on a shared server though so not sure what caching methods are available. You could still experiment with something like a separate SQLite database or even just a plain file-based cache, which if you keep small enough will stay entirely in memory. (14MB still seems within reach.) Note if you used SQLite I would try to keep it to one row per kanji, which means you would serialise the top 20 stories or whatever and put them into a single TEXT column.

The logic might be something like get_stories($kanji, $limit, $offset): ($limit == 20 && $offset == 0) -> get from cache rather than mysql (otherwise) -> get from mysql

fabd commented 6 years ago

which if you keep small enough will stay entirely in memory. (14MB still seems within reach.)

Like I said even with 20 stories per page, it's still 93+ MB.

You basically have no idea what you're talking about. You should know better than tell another developer that something takes just a couple simple queries, or give me pseudo code for something super obvious.

At this point I'm no longer interested in the conversation.