Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

'Latest contributions' page for a specific language doesn't load #284

Closed · jiru closed this issue 10 years ago

jiru commented 10 years ago

When I try to see the latest contributions for a specific language, for instance:

http://tatoeba.org/jpn/contributions/latest/fra
http://tatoeba.org/jpn/contributions/latest/eng

the page loads for some time and then displays a 504 Gateway Time-out. I've been unable to get that page to work for the last two days, and I tried at various times of the day.

Note that when you don't filter by language, the page loads reasonably fast: http://tatoeba.org/jpn/contributions/latest.

alanfgh commented 10 years ago

I saw this same behavior today.

trang commented 10 years ago

Well, it's because the query in getLastContributions() is too heavy: https://github.com/Tatoeba/tatoeba2/blob/master/app/models/contribution.php#L107

Called from here: https://github.com/Tatoeba/tatoeba2/blob/master/app/controllers/contributions_controller.php#L120

The difference between specifying a language or not is that we don't read from the same table:

$table = "contributions";
if ($lang == 'und' || empty($lang)) {
    $table = "last_contributions"; 
}

The last_contributions table is a subset of contributions that contains only the last 400 entries (or something like that). The contributions table has several million entries...

We may need to disable this feature until it's optimized.

loolmeh commented 10 years ago

The way I see it, you probably can't optimize this without some invasive changes to the schema. The real culprit (as in those other terrible queries in the browse pages for filtering/excluding by language) is the comparison across tables, in this case probably "Contribution.user_id = User.id" ("Contribution.sentence_lang = '$lang'" probably needs an index as well?). My SQL knowledge is pretty damn terrible, so I don't know if you can actually make an index across tables, but what you certainly can do is make another table that keeps these fields from both tables, add some insert trigger code to keep them in sync with the main tables, and then add a multicolumn index on the fields you want to compare (see the sketch below). Pretty ugly and very hard to maintain in the long run. Also, you're still trading insert time for select time, as with any other index, so it might not be worth it if it slows down lots of insert operations on the site.
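For illustration, a minimal sketch of that trigger-maintained table, assuming made-up names (latest_contributions_feed, lang_datetime_idx) and simplified column types:

-- Hypothetical denormalized table holding the fields we filter and join on.
CREATE TABLE latest_contributions_feed (
    contribution_id INT NOT NULL PRIMARY KEY,
    sentence_lang VARCHAR(4),
    username VARCHAR(255),
    datetime DATETIME,
    -- Multicolumn index on the comparison fields.
    INDEX lang_datetime_idx (sentence_lang, datetime)
);

-- Insert trigger keeping the new table in sync with contributions.
DELIMITER //
CREATE TRIGGER sync_latest_contributions
AFTER INSERT ON contributions
FOR EACH ROW
BEGIN
    INSERT INTO latest_contributions_feed (contribution_id, sentence_lang, username, datetime)
    SELECT NEW.id, NEW.sentence_lang, u.username, NEW.datetime
    FROM users u
    WHERE u.id = NEW.user_id;
END//
DELIMITER ;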

Btw, some sane pagination that doesn't use OFFSET would also be welcome here, as in the browse pages (see the sketch below).
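For reference, a sketch of OFFSET-free keyset pagination; the literal id is a made-up stand-in for the last id seen on the previous page:

-- Instead of OFFSET, seek past the last row of the previous page via the index.
SELECT id, sentence_id, sentence_lang, text, datetime
FROM contributions
WHERE type = 'sentence'
  AND sentence_lang = 'fra'
  AND id < 123456   -- last id from the previous page (hypothetical value)
ORDER BY id DESC
LIMIT 200;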

trang commented 10 years ago

I tried to review the query, testing contributions/latest/jpn on my local version. It seems the INNER JOIN was most of the problem.

Current query (4049 ms):

SELECT 
  `Contribution`.`text`, 
  `Contribution`.`action`, 
  `Contribution`.`id`, 
  `Contribution`.`sentence_id`, 
  `Contribution`.`datetime`, 
  `Contribution`.`sentence_lang`, 
  `User`.`username`, 
  `User`.`id` 
FROM `contributions` AS `Contribution` 
  INNER JOIN `users` AS `User` ON (`Contribution`.`user_id` = `User`.`id`) 
WHERE `Contribution`.`sentence_lang` = 'jpn' 
  AND `Contribution`.`type` = 'sentence' 
ORDER BY `Contribution`.`id` DESC 
LIMIT 200   

Query using CakePHP's ORM, which does a LEFT JOIN (493 ms):

SELECT 
  `Contribution`.`sentence_id`, 
  `Contribution`.`sentence_lang`, 
  `Contribution`.`text`, 
  `Contribution`.`datetime`, 
  `Contribution`.`action`, 
  `User`.`id`, 
  `User`.`username` 
FROM `contributions` AS `Contribution` 
  LEFT JOIN `users` AS `User` ON (`Contribution`.`user_id` = `User`.`id`) 
WHERE `type` = 'sentence' 
  AND `sentence_lang` = 'jpn' 
ORDER BY `datetime` DESC 
LIMIT 200

We'll see how it goes in prod.

jiru commented 10 years ago

yay!

loolmeh commented 10 years ago

Great! But this makes absolutely no sense. Doesn't a LEFT JOIN just return the results of an INNER JOIN, plus all the records from the left table that match the WHERE conditions even if they didn't match the join predicate? Very curious how mysql is mucking this up.

Btw, is there a multicolumn index on type and sentence_lang? Because there should be. Also, I wonder if the conditions can be applied before the join predicate, to narrow down the number of records the cross-table comparison has to be done for (one way to write that is sketched below).
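One way to express that, as a hedged sketch: filter and limit contributions in a derived table first, so only the 200 surviving rows are joined against users:

-- Narrow contributions down first, then join only those rows to users.
SELECT c.*, u.username
FROM (
    SELECT id, sentence_id, sentence_lang, text, datetime, action, user_id
    FROM contributions
    WHERE type = 'sentence'
      AND sentence_lang = 'jpn'
    ORDER BY id DESC
    LIMIT 200
) AS c
LEFT JOIN users AS u ON c.user_id = u.id
ORDER BY c.id DESC;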

loolmeh commented 10 years ago

After fidgeting with this for a couple of hours and looking at the schema carefully, I came to the following conclusions:

-- Add an index on the WHERE clauses for contributions
ALTER TABLE contributions ADD INDEX type_lang_userid_idx (type, sentence_lang, user_id);

-- Filter on contributions using rows from users
SELECT * FROM contributions WHERE type = 'sentence' AND sentence_lang = 'eng' AND user_id IN (SELECT id FROM users) ORDER BY datetime DESC LIMIT 200;

This reduced the time six-fold on my VM with some dummy data; I have no idea how it would fare on real data in production, though. Also, filtering by last_time_active or whatever on the users table should reduce the rows a bit, I suppose. Whatever criteria you use, make sure it uses an index (see the sketch below).
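A hedged sketch of that, assuming users has some activity timestamp column (the column name last_time_active, the index name, and the cutoff are all hypothetical):

-- Hypothetical: restrict the subquery to recently active users,
-- backed by an index so the filter itself stays cheap.
ALTER TABLE users ADD INDEX last_time_active_idx (last_time_active);

SELECT * FROM contributions
WHERE type = 'sentence'
  AND sentence_lang = 'eng'
  AND user_id IN (
      SELECT id FROM users
      WHERE last_time_active > NOW() - INTERVAL 7 DAY   -- arbitrary cutoff
  )
ORDER BY datetime DESC
LIMIT 200;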

loolmeh commented 10 years ago

After looking at EXPLAIN EXTENDED for these queries some more, it seems like ORDER BY ... LIMIT is doing something very funky: with the LIMIT, the plan below abandons type_lang_userid_idx and instead scans the datetime index in sort order, filtering rows as it goes:

mysql> EXPLAIN EXTENDED SELECT * FROM contributions WHERE type = 'sentence' AND sentence_lang = 'eng' AND user_id IN (SELECT id FROM users) ORDER BY datetime LIMIT 11 \G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: contributions
         type: index
possible_keys: sentence_lang,type_lang_userid_idx
          key: datetime
      key_len: 8
          ref: NULL
         rows: 22
     filtered: 447077.28
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: users
         type: unique_subquery
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: func
         rows: 1
     filtered: 100.00
        Extra: Using index
2 rows in set, 1 warning (0.00 sec)

mysql> EXPLAIN EXTENDED SELECT * FROM contributions WHERE type = 'sentence' AND sentence_lang = 'eng' AND user_id IN (SELECT id FROM users) ORDER BY datetime \G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: contributions
         type: ref
possible_keys: sentence_lang,type_lang_userid_idx
          key: type_lang_userid_idx
      key_len: 16
          ref: const,const
         rows: 98357
     filtered: 100.00
        Extra: Using where; Using filesort
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: users
         type: unique_subquery
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: func
         rows: 1
     filtered: 100.00
        Extra: Using index
2 rows in set, 1 warning (0.00 sec)

I hate mysql so, so much. Can someone just move us to a saner database like postgresql or something? Thanks. Drop the LIMIT and figure out some other way of getting rid of the rest of the results; you can thank mysql for this, and for world hunger and poverty too.

loolmeh commented 10 years ago

(SELECT * FROM contributions WHERE type = 'sentence' AND sentence_lang = 'eng') UNION ALL (SELECT * FROM contributions WHERE user_id IN (SELECT id FROM users)) ORDER BY datetime DESC LIMIT 10;

This seems to have comparable performance, but it looks pretty ugly and needs a different set of indices. EXPLAIN EXTENDED says it works through a shit ton of rows, though, so I have no idea how mysql makes this fast; I suppose it throws pixie dust at it. It doesn't seem to have the same problem as ORDER BY ... LIMIT.

loolmeh commented 10 years ago

I think I found a way to dramatically reduce the number of rows to filter on. Use the contributions table itself to find the user ids, given a date criterion or something:

SELECT * FROM contributions WHERE type = 'sentence' AND sentence_lang = 'eng' AND user_id IN (SELECT DISTINCT user_id FROM contributions WHERE datetime > '2014-07-14 10:33:52' ) ORDER BY datetime;

I think you might wanna run SELECT DISTINCT user_id FROM contributions WHERE datetime > '2014-07-14 10:33:52' alone in PHP, then use the ids in this query, and run another one to get the usernames from the users table (sketched below). mysql's optimizer seems pretty good at optimizing when exact values are provided at compile time. This should be blazing fast, considering that we only have around 10-20 users active at about the same time, I think.
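Spelled out as plain SQL, a sketch of that three-query approach (the literal ids are made-up stand-ins for whatever the first query returns):

-- Step 1: collect recently active user ids (run alone; read the result back in PHP).
SELECT DISTINCT user_id FROM contributions
WHERE datetime > '2014-07-14 10:33:52';

-- Step 2: plug the ids back in as literals so the optimizer sees constants.
SELECT * FROM contributions
WHERE type = 'sentence'
  AND sentence_lang = 'eng'
  AND user_id IN (42, 108, 3117)   -- made-up ids from step 1
ORDER BY datetime DESC
LIMIT 200;

-- Step 3: fetch the matching usernames separately.
SELECT id, username FROM users
WHERE id IN (42, 108, 3117);   -- same made-up ids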

trang commented 10 years ago

Forgot to reply to this.

There is already an index on type + sentence_lang. I don't think we need to add user_id in there, since it's not part of the WHERE. There is also an index on datetime, for the ORDER BY datetime LIMIT 200 part.

I think the LEFT JOIN thingy generated by CakePHP is fine; it's also a LEFT JOIN that should have been used from the beginning, because an INNER JOIN would actually drop results for a user that has been deleted (see the example below).
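To illustrate, a hypothetical contribution whose user row has since been deleted (the sentence id is made up):

-- With LEFT JOIN the contribution still appears, with username = NULL;
-- with INNER JOIN it would silently vanish from the results.
SELECT Contribution.text, User.username
FROM contributions AS Contribution
LEFT JOIN users AS User ON Contribution.user_id = User.id
WHERE Contribution.sentence_id = 12345;   -- made-up sentence id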

Anyway, from what I've tested in prod, it's fast enough for French and English. Not sure why Japanese takes a bit longer, but it's still decent.

loolmeh commented 10 years ago

There is also an index on datetime, for the ORDER BY datetime LIMIT 200 part.

Look at the EXPLAIN again: the indices are not used, and if you wanna keep the LEFT JOIN you at least need an index on type + sentence_lang + datetime. Try looking into FORCE INDEX and USE INDEX, benchmark this, and then look at the EXPLAIN again (a sketch is below). The LEFT JOIN without any filtering still goes through an awful lot of rows, and the ORDER BY uses filesort on the entire table before any filtering is done... The ORDER BY ... LIMIT issue is pretty weird, and I don't think we can solve it without the UNION hack, but I guess we can ignore it if mysql uses the indices properly.
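A hedged sketch of that experiment (the index name type_lang_datetime_idx is made up):

-- Add the compound index covering both the WHERE and the ORDER BY.
ALTER TABLE contributions
  ADD INDEX type_lang_datetime_idx (type, sentence_lang, datetime);

-- Force it and compare the plan against the optimizer's default choice.
EXPLAIN EXTENDED
SELECT Contribution.*, User.username
FROM contributions AS Contribution FORCE INDEX (type_lang_datetime_idx)
LEFT JOIN users AS User ON Contribution.user_id = User.id
WHERE Contribution.type = 'sentence'
  AND Contribution.sentence_lang = 'jpn'
ORDER BY Contribution.datetime DESC
LIMIT 200;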