Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Going to last page is very slow when there are lots of pages #1353

Closed: trang closed this issue 5 years ago

trang commented 7 years ago

http://dev.tatoeba.org/eng/sentences/show_all_in/eng/none/none/indifferent/page:2

SELECT `Sentence`.`id`, `Sentence`.`text`, `Sentence`.`lang`, `Sentence`.`user_id`, 
`Sentence`.`hasaudio`, `Sentence`.`correctness`, `Sentence`.`script`, `User`.`id`, 
`User`.`username`, `User`.`group_id`, `User`.`level` FROM `sentences` AS `Sentence` 
LEFT JOIN `users` AS `User` ON (`Sentence`.`user_id` = `User`.`id`) 
WHERE `Sentence`.`lang` = 'eng' ORDER BY `Sentence`.`id` desc LIMIT 10, 10

// took 2 ms

http://dev.tatoeba.org/eng/sentences/show_all_in/eng/none/none/indifferent/page:50998

SELECT `Sentence`.`id`, `Sentence`.`text`, `Sentence`.`lang`, `Sentence`.`user_id`, 
`Sentence`.`hasaudio`, `Sentence`.`correctness`, `Sentence`.`script`, `User`.`id`, 
`User`.`username`, `User`.`group_id`, `User`.`level` FROM `sentences` AS `Sentence` 
LEFT JOIN `users` AS `User` ON (`Sentence`.`user_id` = `User`.`id`) 
WHERE `Sentence`.`lang` = 'eng' ORDER BY `Sentence`.`id` desc LIMIT 509970, 10

// took 3833 ms

The higher the LIMIT offset is, the worse it gets.

Earlier today the server was very slow. The load average jumped to 12-13. It seemed to be due to more people (or maybe just one person) browsing the last pages of "Browse by language".

zachleigh commented 7 years ago

@trang I'll have a look at this one. This might be helpful for us: http://stackoverflow.com/questions/4481388/why-does-mysql-higher-limit-offset-slow-the-query-down

ckjpn commented 7 years ago

Earlier today the server was very slow. The load average jumped to 12-13. It seemed to be due to more people (or maybe just one person) browsing the last pages of "Browse by language".

If jumping to the last page of "Browse by language" is something members find useful, I wonder if it would speed things up if you offered the same "sort by" options as you have at the top of lists.

"Sort by: date added | sentence id"

(screenshot: the "Sort by" options shown at the top of a list)

Here is a direct link to a list if you need to check out a long list to compare loading speed.

https://tatoeba.org/eng/sentences_lists/show/907

trang commented 7 years ago

We can't predict if it would speed things up because we can't predict how many users will still click on the last page instead of using the sort feature.

We can definitely add the sort options, as an enhancement (maybe there's already an issue for it), but the problem of going to the last page still needs to be solved.

ckjpn commented 7 years ago

You suggested that it might primarily just be one user of the site jumping to the last page.

If that's the case, you could post an announcement on the Wall asking members to sort by sentence ID instead; perhaps this would lessen the load. Of course, you would eventually want to solve the problem, but this might alleviate it in the meantime, assuming it's not too difficult to adapt the "sort by" option that you are already using elsewhere for this purpose. If it will take hours and hours to adapt the code, then maybe it's not worth your time.

ckjpn commented 7 years ago

Note that the "id" of "sentence id" should be capitalized.

zachleigh commented 7 years ago

I've been messing around with this and can't get anything to work well. I can get sentences using a late row lookup hack (https://explainextended.com/2011/02/11/late-row-lookups-innodb/) but I haven't been successful in getting the relationships that we need. Plus, I'm doing it all with raw queries, which isn't ideal.
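
For reference, the core of that hack applied to the slow query above is roughly the following (just a sketch, not what I have locally; it only covers the Sentence/User part, not the other relationships we need):

SELECT `Sentence`.`id`, `Sentence`.`text`, `Sentence`.`lang`, `Sentence`.`user_id`,
`User`.`username`, `User`.`group_id`, `User`.`level`
FROM (
    -- find the 10 ids using only the index, without fetching full rows
    -- or joining users for the half a million skipped rows
    SELECT `id` FROM `sentences`
    WHERE `lang` = 'eng'
    ORDER BY `id` DESC
    LIMIT 509970, 10
) AS `ids`
JOIN `sentences` AS `Sentence` ON `Sentence`.`id` = `ids`.`id`
LEFT JOIN `users` AS `User` ON `Sentence`.`user_id` = `User`.`id`
ORDER BY `Sentence`.`id` DESC

The inner scan is still proportional to the offset, but it no longer fetches and joins the full rows that get thrown away, which seems to be where most of the time goes.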

Anybody else have any ideas?

Until we find a better solution for this, a quick fix would be to simply limit the number of returned rows. There probably aren't too many people who go through 50,000 pages anyway.

zachleigh commented 7 years ago

I think I've got this figured out and I've got a rough, kind-of-working version locally. I'm using this for reference: https://mariadb.com/kb/en/mariadb/pagination-optimization/

A few things to think about:

ckjpn commented 7 years ago

This may be off topic, but I wonder if this and other problems might be solved if the whole site were updated to the most recent version of CakePHP.

It's now up to version 3.3 and has gone through a lot of improvements since the version that we are now using.

https://cakephp.org/

It would likely take a lot of work to convert over to the newer version, but it might be worth it.

trang commented 7 years ago

@zachleigh sounds like your implementation would be appropriate if we implemented an infinite scroll. Anyway, if you have something working locally, you could publish it to your fork (no need to create a pull request), so I can pull it on my side and try it out.

@ckjpn later versions of CakePHP do not solve this problem.

zachleigh commented 7 years ago

@trang Pushed my local version up. It's still very basic. I have not yet implemented the page button id lookups because I couldn't figure out how to override the Cake pagination stuff in the view. Because of this, the page button id values are incorrect and each time you go up a page, the assumed total count decreases. I didn't want to invest too much time on it until we talked about the issues I mentioned above.

If you want to go the infinite scroll route, then we could probably just use Cake's default pagination. We just have to make it return json instead of a view. The problem occurs when people browse the end of the pagination results. Infinite scroll doesn't allow for this so it would probably solve this problem. Anybody willing to infinite scroll through 50,000 pages should get an award.

ckjpn commented 7 years ago

As mentioned earlier (https://github.com/Tatoeba/tatoeba2/issues/1353#issuecomment-258722768), perhaps you could offer the "reverse" sort to allow people to jump to the end of the list by making it the first page.

halfdan commented 7 years ago

@ckjpn This is a good idea, but it does not solve the original problem. Slow pages put additional stress on the server and are an easy target if an attacker wants to bring the site down.

jiru commented 6 years ago

There is another, more complex way to solve this problem: https://mariadb.com/kb/en/library/pagination-optimization/ And that’s also the way to go if we implement infinite scrolling.
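
The gist of that article, applied to our query, is to remember where the previous page left off instead of using an offset. Something along these lines (the id value is only an example):

SELECT `Sentence`.`id`, `Sentence`.`text`, `Sentence`.`lang`, `Sentence`.`user_id`
FROM `sentences` AS `Sentence`
WHERE `Sentence`.`lang` = 'eng'
  AND `Sentence`.`id` < 297418  -- id of the last sentence shown on the previous page (example value)
ORDER BY `Sentence`.`id` DESC
LIMIT 10

This stays fast no matter how deep into the results you are, but it only knows "next" relative to the last row you have seen, which is why it fits infinite scrolling better than "jump to page N".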

@zachleigh Could you point me to your code please?

trang commented 6 years ago

@jiru I believe this is the branch where Zach pushed his code: https://github.com/zachleigh/tatoeba2/tree/issue-1353

jiru commented 6 years ago

I could implement infinite scrolling, that would solve the problem (while breaking all the existing links that include a page:n parameter, but these are not permanent links anyway). @trang You assigned this issue to yourself, do you have plans to implement it?

ckjpn commented 6 years ago

Does infinite scrolling mean that people won't be able to jump past some pages? Does this also mean there is a possibility that, as visitors scroll to sentences lower on the list, things might get sluggish as the memory on their devices fills up?

jiru commented 6 years ago

Does infinite scrolling mean that people won't be able to jump past some pages?

You’d have to scroll back instead. Infinite scrolling means that we’re losing the concept of pages; it’s just a matter of how many sentences are preloaded on the screen. It can be implemented with a "load more" button, or by filling the page automatically as the scrolling position goes down.

Does this also mean there is a possibility that as visitors scroll to sentences lower on the list things might get sluggish as the memory on their devices fill up?

Thank you for pointing this out. I think there are ways to get around this problem, basically by removing the upper entries as you scroll down, and eventually reloading them as you go back up.

If you’re concerned about the usability of infinite scrolling vs. traditional pages, note that this bug can also be solved by keeping the current paging system, but:

It could look a bit like this.

alanfgh commented 6 years ago

I have serious misgivings about our switching to infinite scrolling. I really like using pages as units of work in whatever task I'm doing, whether it's translating sentences, or checking sentences, or looking at search results. Otherwise, I have nothing to guide me in pacing myself. Furthermore, I often want to go to a certain page number within a multipage selection. For instance, let's say I'm going through a bunch of sentences marked "@needs native check" and I know that I've already looked at the early ones and want to look at different ones now. I can do this by going to, say, page 2, and then replacing "2" with "20" within the address in the URL bar. Then I can bookmark that. How would I do the equivalent if we had infinite scrolling?

jiru commented 6 years ago

Thank you for your insightful comment, Alan.

I really like using pages as units of work in whatever task I'm doing, whether it's translating sentences, or checking sentences, or looking at search results. Otherwise, I have nothing to guide me in pacing myself.

I totally understand. Regarding pacing, how would you compare the current paging system with an infinite scrolling that wouldn’t automatically load sentences when going down, but instead would show a "load 10 more sentences" button at the end of the page?

Furthermore, I often want to go to a certain page number within a multipage selection.

I’ll consider the case where we keep a traditional paging view with the limitations I mentioned in my previous comment. You would still have a URL to bookmark; however, instead of being /tags/show_sentences_with_tag/1207/page:20, it would look like /tags/show_sentences_with_tag/1207/min-id:297418, which is almost impossible to remember. So you would always need to bookmark it, because you wouldn’t be able to simply edit the number in such a URL.
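
To make it concrete, a min-id URL like that would roughly translate into a query along these lines (the tag table and column names here are only illustrative, not necessarily our actual schema):

SELECT `Sentence`.`id`, `Sentence`.`text`, `Sentence`.`lang`
FROM `sentences` AS `Sentence`
JOIN `tags_sentences` AS `TagsSentences`
  ON `TagsSentences`.`sentence_id` = `Sentence`.`id`
WHERE `TagsSentences`.`tag_id` = 1207
  AND `Sentence`.`id` >= 297418  -- the min-id from the URL
ORDER BY `Sentence`.`id` ASC
LIMIT 10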

It is also worth noting that in the current implementation, if the contents of the list change, going to a page N will bring you to a different place. For example, consider the page "Show all sentences in language A not directly translated in language B" that we used to have. If you add translations into language B for all the sentences of page 1, they won’t show up there anymore, so what was shown on page 2 is now on page 1. As a result, going to page 2 after having translated page 1 actually makes you miss a whole page’s worth of sentences to translate. (This is exactly what happened during the migration of password hashes by the way.) In the implementation I’m thinking about, this problem would be solved, because the contents of a given page would be based on its first sentence. So your bookmarks would be more accurate.

I know that I've already looked at the early ones and want to look at different ones now

So we could consider this problem differently: you don’t want to "go to page N", you want to "look at sentences you haven’t seen already". We could allow this by, for example, letting you "go to the page you last visited", or by keeping track of which sentences you’ve been shown or not before.

alanfgh commented 6 years ago

Regarding pacing, how would you compare the current paging system with an infinite scrolling that wouldn’t automatically load sentences when going down, but instead would show a "load 10 more sentences" button at the end of the page?

That wouldn't help me remember where I was in my task. If I start at page 1 and am at the end of page 2, I know that I've processed 2n sentences, where n is the number of sentences per page. But I don't remember how many times I've pressed the "load 10 more sentences" button. If all the sentences remain displayed, I could count them, but that would be a pain.

It is also worth noting that in the current implementation, if the contents of the list change, going to a page N will bring you to a different place.

I do realize that. However, it doesn't bother me as a limitation because it is a self-evident one. By contrast, a URL that uses "min-id" raises other questions. For instance, does it presume that the sentences in the list are sorted by ID (in ascending order)?

We could allow this by, for example, letting you "go to the page you last visited", or by keeping track of which sentences you’ve been shown or not before.

Do you really want to keep track of all the sentences I've seen for every kind of list (list of sentences with a specific tag, list of search results for a phrase, etc.)? Also, in my case, it's often not as simple as a binary "Have I seen this sentence before or not?" For instance, if I'm looking at sentences tagged "@needs native check", I often look at, say, the sentences on page 6 several times before I'm able to come up with decisions as to what to do with some of them. Then I get sick of looking at some of them and need to put them away for a while and look at the sentences on page 2 for a while.

jiru commented 5 years ago

Note: in the show_all_in page, the query used to count the number of results is also inefficient because of unnecessary left joins.

The current query:

SELECT (COUNT(*)) AS `count` FROM `sentences` `Sentences`
LEFT JOIN `users` `Users` ON `Users`.`id` = (`Sentences`.`user_id`)
WHERE `Sentences`.`lang` = 'eng';

Is much slower than:

SELECT (COUNT(*)) AS `count` FROM `sentences` `Sentences`
WHERE `Sentences`.`lang` = 'eng';

It looks like this can be easily optimized using CakePHP’s counter() query builder method or something.

jiru commented 5 years ago

@alanfgh Let’s forget about infinite scrolling and focus on the limitations of a seek-based paging system:

  1. We’d lose the ability to "jump to page N" for an arbitrary N, except maybe for the first pages.
  2. We’d only be able to jump to next/previous pages, and possibly next/previous pages further away like N+2, N+3 or more.
  3. We wouldn’t know the N of the page we’re showing any more, except maybe for the first pages.

It is, however, time-consuming to develop such a paginator. Another option is to limit the number of displayed pages to an arbitrary number like 100 or 1000. As an alternative, we could allow downloading the complete result set as CSV once we improve our export functionality.

trang commented 5 years ago

I'm in favor of limiting the number of pages to an arbitrary number.

Based on @alanfgh's comment in https://github.com/Tatoeba/tatoeba2/issues/1667#issuecomment-437530869, probably 50 pages or less would already be enough. Looking at Google's pagination, it seems they limit it to 25 pages (with 10 results per page). Our search is limited to 1000 results, so we could go with that as our default maximum number of results in general (that would be 100 pages, assuming 10 results per page).

From my observations, there is rarely (if ever) a need for a normal user to go through more than a few dozen pages. Anyone who needs more than 1000 results is usually doing something programmatic, such as computing statistics or gathering data for an app. They are not doing a regular activity such as translating, proofreading, learning or just exploring.

For the programmatic purpose, we can provide customized exports as suggested. For the regular activities, we can introduce additional filters.

For one thing, we could really make use of a date filter:

Anyway, I don't think we can solve this issue all at once. It basically requires rethinking the features and purpose of each page and providing a different way to navigate through the data.

I suggest we start with applying a limit of 1000 results on some pages only. I wrote a Wall post about it, to ask for some feedback.

If there is no negative feedback, then we can set the limit on the following pages:

There are other pages that have large pagination, but we could first see how users react to the changes above before we set this limit more globally.

trang commented 5 years ago

Closing this.

Will open separate issues for each of the other pages where large pagination is a problem and needs to be removed.