jiru closed this issue 10 years ago
I saw this same behavior today.
Well, it's because the query in getLastContributions() is too heavy: https://github.com/Tatoeba/tatoeba2/blob/master/app/models/contribution.php#L107
Called from here: https://github.com/Tatoeba/tatoeba2/blob/master/app/controllers/contributions_controller.php#L120
The difference between specifying a language or not is that we don't read from the same table:

$table = "contributions";
if ($lang == 'und' || empty($lang)) {
    $table = "last_contributions";
}
The table `last_contributions` is a subset of `contributions` that contains only the last 400 contribution entries (or something like that). The `contributions` table has several million entries...
We may need to disable this feature until it's optimized.
The way I see it, you probably can't optimize this without some invasive changes to the schema. The real culprit (as in those other terrible queries in the browse pages for filtering/excluding by language) is the comparisons across tables, in this case probably `Contribution.user_id = User.id` (and `Contribution.sentence_lang = '$lang'` probably needs an index as well?). My SQL knowledge is pretty damn terrible, so I don't know if you can actually make an index across tables, but what you certainly can do is make another table that keeps these fields from both tables, add some insert trigger code to keep it in sync with the main tables, then add a multicolumn index on the fields you want to compare. Pretty ugly and very hard to maintain in the long run. Also, you're still trading insert time for select time, as with any other index, so it might not be worth it if it slows down lots of insert operations on the site.
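A minimal sketch of that denormalized-table idea, assuming made-up table, column, and trigger names (none of these exist in the Tatoeba schema):

```sql
-- Hypothetical table mirroring the fields compared across contributions/users
CREATE TABLE contribution_feed (
    contribution_id INT NOT NULL PRIMARY KEY,
    user_id INT NOT NULL,
    username VARCHAR(255) NOT NULL,
    sentence_lang VARCHAR(4),
    type VARCHAR(8),
    datetime DATETIME,
    -- Multicolumn index covering the filter columns and the sort column
    INDEX lang_type_dt_idx (sentence_lang, type, datetime)
);

-- Insert trigger to keep it in sync with the main tables
DELIMITER //
CREATE TRIGGER contributions_after_insert
AFTER INSERT ON contributions
FOR EACH ROW
BEGIN
    INSERT INTO contribution_feed
        (contribution_id, user_id, username, sentence_lang, type, datetime)
    SELECT NEW.id, NEW.user_id, u.username,
           NEW.sentence_lang, NEW.type, NEW.datetime
    FROM users u
    WHERE u.id = NEW.user_id;
END//
DELIMITER ;
```

As noted above, every insert into `contributions` now pays for an extra write and index maintenance, which is the trade-off being described.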
Btw, some sane pagination that doesn't use OFFSET would also be welcome here, as in the browse pages.
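For reference, the usual OFFSET-free alternative is keyset ("seek") pagination: instead of an offset, the client passes back the last id it saw, so MySQL can seek straight into the index. A sketch (`:last_seen_id` is a placeholder bound by the application):

```sql
-- Page 1: newest 200 rows
SELECT * FROM contributions
WHERE type = 'sentence' AND sentence_lang = 'jpn'
ORDER BY id DESC
LIMIT 200;

-- Next page: seek below the smallest id of the previous page,
-- rather than scanning and discarding OFFSET rows
SELECT * FROM contributions
WHERE type = 'sentence' AND sentence_lang = 'jpn'
  AND id < :last_seen_id
ORDER BY id DESC
LIMIT 200;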
Tried to review the query. Tested on `contributions/latest/jpn` on my local version. It seems the INNER JOIN was mostly the problem.
Current query (4049 ms):
SELECT
`Contribution`.`text`,
`Contribution`.`action`,
`Contribution`.`id`,
`Contribution`.`sentence_id`,
`Contribution`.`datetime`,
`Contribution`.`sentence_lang`,
`User`.`username`,
`User`.`id`
FROM `contributions` AS `Contribution`
INNER JOIN `users` AS `User` ON (`Contribution`.`user_id` = `User`.`id`)
WHERE `Contribution`.`sentence_lang` = 'jpn'
AND `Contribution`.`type` = 'sentence'
ORDER BY `Contribution`.`id` DESC
LIMIT 200
Query using CakePHP's ORM, which does a LEFT JOIN (493 ms):
SELECT
`Contribution`.`sentence_id`,
`Contribution`.`sentence_lang`,
`Contribution`.`text`,
`Contribution`.`datetime`,
`Contribution`.`action`,
`User`.`id`,
`User`.`username`
FROM `contributions` AS `Contribution`
LEFT JOIN `users` AS `User` ON (`Contribution`.`user_id` = `User`.`id`)
WHERE `type` = 'sentence'
AND `sentence_lang` = 'jpn'
ORDER BY `datetime` DESC
LIMIT 200
We'll see how it goes in prod.
yay!
Great! But this makes absolutely no sense, doesn't a left join just return the results of an inner join plus all the records from the left table that match the conditions in where even if they didn't match the join predicate? Very curious how mysql is mucking this up.
Btw is there a multicolumn index on type and sentence_lang, because there should be. Also, I wonder if the conditions can come before the join predicate to narrow down the number of records that the comparison across tables has to be done for.
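One way to try that (a sketch, not tested against the real schema): filter and limit `contributions` first inside a derived table, then join only those 200 rows against `users`, so the cross-table comparison runs on a tiny result set:

```sql
SELECT c.*, u.id, u.username
FROM (
    -- Narrow down to the 200 rows we actually want, using only one table
    SELECT * FROM contributions
    WHERE type = 'sentence' AND sentence_lang = 'jpn'
    ORDER BY id DESC
    LIMIT 200
) AS c
LEFT JOIN users AS u ON c.user_id = u.id;
```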
After fidgeting with this for a couple of hours and looking at the schema carefully, I came to the following conclusions:
-- Add an index on the WHERE clauses for contributions
ALTER TABLE contributions ADD INDEX type_lang_userid_idx (type, sentence_lang, user_id);
-- Filter on contributions using rows from users
SELECT * FROM contributions WHERE type = 'sentence' AND sentence_lang = 'eng' AND user_id IN (SELECT id FROM users) ORDER BY datetime DESC LIMIT 200;
This reduced the time about sixfold on my VM with some dummy data; I have no idea how it would fare on real data in production, though. Also, filtering on last_time_active or whatever on the users table should reduce the rows a bit, I suppose; whatever criteria you use, make sure it uses an index.
After looking at explain extended for these queries some more, it seems like ORDER BY...LIMIT is doing something very funky:
mysql> EXPLAIN EXTENDED SELECT * FROM contributions WHERE type = 'sentence' AND sentence_lang = 'eng' AND user_id IN (SELECT id FROM users) ORDER BY datetime LIMIT 11 \G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: contributions
         type: index
possible_keys: sentence_lang,type_lang_userid_idx
          key: datetime
      key_len: 8
          ref: NULL
         rows: 22
     filtered: 447077.28
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: users
         type: unique_subquery
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: func
         rows: 1
     filtered: 100.00
        Extra: Using index
2 rows in set, 1 warning (0.00 sec)
mysql> EXPLAIN EXTENDED SELECT * FROM contributions WHERE type = 'sentence' AND sentence_lang = 'eng' AND user_id IN (SELECT id FROM users) ORDER BY datetime \G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: contributions
         type: ref
possible_keys: sentence_lang,type_lang_userid_idx
          key: type_lang_userid_idx
      key_len: 16
          ref: const,const
         rows: 98357
     filtered: 100.00
        Extra: Using where; Using filesort
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: users
         type: unique_subquery
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: func
         rows: 1
     filtered: 100.00
        Extra: Using index
2 rows in set, 1 warning (0.00 sec)
I hate mysql so so much. Can someone just move us to a saner database like postgresql or something? Thanks. Drop the LIMIT and figure out some other way of getting rid of the rest of the results, you can thank mysql for this and for world hunger and poverty too.
(SELECT * FROM contributions WHERE type = 'sentence' AND sentence_lang = 'eng') UNION ALL (SELECT * FROM contributions WHERE user_id IN (SELECT id FROM users)) ORDER BY datetime DESC LIMIT 10;
This seems to have comparable performance, but it looks pretty ugly and needs a different set of indices. EXPLAIN EXTENDED says it works through a shit ton of rows, though, so I have no idea how mysql makes this fast; I suppose it throws pixie dust at it. It doesn't seem to have the same problem as ORDER BY...LIMIT.
I think I found a way to reduce the number of rows to filter on dramatically. Use the contributions table itself to find the user id given a date criteria or something:
SELECT * FROM contributions WHERE type = 'sentence' AND sentence_lang = 'eng' AND user_id IN (SELECT DISTINCT user_id FROM contributions WHERE datetime > '2014-07-14 10:33:52' ) ORDER BY datetime;
I think you might wanna run SELECT DISTINCT user_id FROM contributions WHERE datetime > '2014-07-14 10:33:52' alone in PHP, then use those ids in this query, plus another one to get the username from the users table. mysql's optimizer seems pretty good at optimizing when exact values are provided at compile time. This should be blazing fast, considering that we only have around 10-20 users active at about the same time, I think.
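A sketch of that three-step approach (the date comes from the comment above; the id lists are placeholders for whatever step 1 returns):

```sql
-- Step 1: find recently active users; results are fetched into PHP
SELECT DISTINCT user_id FROM contributions
WHERE datetime > '2014-07-14 10:33:52';

-- Step 2: inline the fetched ids as literals so the optimizer
-- sees exact values (ids below are illustrative)
SELECT * FROM contributions
WHERE type = 'sentence' AND sentence_lang = 'eng'
  AND user_id IN (3, 17, 42)
ORDER BY datetime DESC
LIMIT 200;

-- Step 3: resolve usernames for just those ids
SELECT id, username FROM users WHERE id IN (3, 17, 42);
```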
Forgot to reply to this.

There is an index on `type` + `sentence_lang` already. I don't think we need to add `user_id` in there, since it's not part of the WHERE.

There is also an index on `datetime`, for the `ORDER BY datetime LIMIT 200` part.
I think the LEFT JOIN generated by CakePHP is fine; in fact, a LEFT JOIN is what should have been used from the beginning, because an INNER JOIN would actually drop results for users that have been deleted.
Anyway, from what I've tested in prod, it's fast enough for French and English. Not sure why Japanese takes a bit longer, but it's still decent.
> There is also an index on datetime, for the ORDER BY datetime LIMIT 200 part.
Look at the EXPLAIN again: the indices are not used, and if you want to keep the LEFT JOIN, you at least need an index on type + sentence_lang + datetime. Try looking into FORCE INDEX and USE INDEX, benchmark this, then look at the EXPLAIN again. The LEFT JOIN without any filtering still goes through an awful lot of rows, and the ORDER BY uses filesort on the entire table before any filtering is done... The ORDER BY...LIMIT issue is pretty weird and I don't think we can solve it without the UNION hack, but I guess we can ignore it if mysql uses the indices properly.
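For example, a sketch of both suggestions together (the index name is made up, and the index itself does not exist yet in the schema):

```sql
-- Hypothetical composite index covering both the WHERE and the ORDER BY
ALTER TABLE contributions
    ADD INDEX type_lang_dt_idx (type, sentence_lang, datetime);

-- FORCE INDEX steers the optimizer past its ORDER BY...LIMIT heuristic
-- of picking the datetime index over the filtering index
SELECT Contribution.*, User.id, User.username
FROM contributions AS Contribution FORCE INDEX (type_lang_dt_idx)
LEFT JOIN users AS User ON Contribution.user_id = User.id
WHERE Contribution.type = 'sentence'
  AND Contribution.sentence_lang = 'jpn'
ORDER BY Contribution.datetime DESC
LIMIT 200;
```

Because `datetime` is the last column of the index after two equality predicates, the ORDER BY can in principle be satisfied from the index without a filesort.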
When I try to see the latest contributions for a specific language, for instance:
http://tatoeba.org/jpn/contributions/latest/fra http://tatoeba.org/jpn/contributions/latest/eng
the page loads for some time and displays 504 Gateway Time-out. I've been unable to get that page working during the last two days, and I tried at various times of the day. Note that when you don't filter by language, the page loads reasonably fast: http://tatoeba.org/jpn/contributions/latest.