Very bad performance of TagManager.related_for_model query using MySQL 5.0.x

GoogleCodeExporter commented 9 years ago

MySQL optimizer is easily fooled by subqueries. Given a query from
related_for_model:

        SELECT %(tag)s.id, %(tag)s.name%(count_sql)s
        FROM %(tagged_item)s INNER JOIN %(tag)s ON %(tagged_item)s.tag_id =
%(tag)s.id
        WHERE %(tagged_item)s.content_type_id = %(content_type_id)s
          AND %(tagged_item)s.object_id IN
          (
              SELECT %(tagged_item)s.object_id
              FROM %(tagged_item)s, %(tag)s
              WHERE %(tagged_item)s.content_type_id = %(content_type_id)s
                AND %(tag)s.id = %(tagged_item)s.tag_id
                AND %(tag)s.id IN (%(tag_id_placeholders)s)
              GROUP BY %(tagged_item)s.object_id
              HAVING COUNT(%(tagged_item)s.object_id) = %(tag_count)s
          )
          AND %(tag)s.id NOT IN (%(tag_id_placeholders)s)
        GROUP BY %(tag)s.id, %(tag)s.name
        %(min_count_sql)s
        ORDER BY %(tag)s.name ASC

MySQL optimizer executes inner query as dependent subquery. This results in
severe performance loss (approximately 100 times slower on tables with
~10000 items). To prevent this behavior you can modify this query a little,
by wrapping inner query in simple SELECT [1]:

        SELECT %(tag)s.id, %(tag)s.name%(count_sql)s
        FROM %(tagged_item)s INNER JOIN %(tag)s ON %(tagged_item)s.tag_id =
%(tag)s.id
        WHERE %(tagged_item)s.content_type_id = %(content_type_id)s
          AND %(tagged_item)s.object_id IN
          (
            SELECT temporary.object_id 
            FROM (
                SELECT %(tagged_item)s.object_id
                FROM %(tagged_item)s, %(tag)s
                WHERE %(tagged_item)s.content_type_id = %(content_type_id)s
                  AND %(tag)s.id = %(tagged_item)s.tag_id
                  AND %(tag)s.id IN (%(tag_id_placeholders)s)
                GROUP BY %(tagged_item)s.object_id
                HAVING COUNT(%(tagged_item)s.object_id) = %(tag_count)s
            ) AS temporary
          )
          AND %(tag)s.id NOT IN (%(tag_id_placeholders)s)
        GROUP BY %(tag)s.id, %(tag)s.name
        %(min_count_sql)s
        ORDER BY %(tag)s.name ASC

Links:
[1]
http://www.xaprb.com/blog/2006/04/30/how-to-optimize-subqueries-and-joins-in-mys
ql/

Original issue reported on code.google.com by mstepnio...@gmail.com on 9 Sep 2008 at 11:24

GoogleCodeExporter commented 9 years ago

I am a little concerned by this. The plan is to move these queries into the 
query
sets using the new queryset extra() support in django 1.0. I fear that the 
resulting
SQL will have the bad performance you describe.

Original comment by doug.nap...@gmail.com on 16 Sep 2008 at 4:50

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

I solved this problem by extracting inner query (it's not depend on main 
query). So
this function works in two queries but much faster.

I did it on my modified version of tagging-app so can't give patch, but if 
somebody
very needs it - I can try to repeat this on trunk version.

Original comment by n.le...@gmail.com on 12 Nov 2008 at 1:59

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

n.leush -- can you please post your fix/code? 

related_for_model works fine on sqlite even with 50,000 tagged items but on my 
production server, on mysql, it is unacceptably slow. I have already 
implemented mstepniowski's suggestion, which definitely sped it up so that the 
query doesn't ALWAYS timeout, but its still much too slow and on the bigger 
tags times out!

Original comment by ilias.ka...@gmail.com on 7 Aug 2010 at 9:48

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

    def related_for_model2(self, tags, model, counts=False, min_count=None, filters=None,
                          sort_by_count=False, limit=None):
        """
        Obtain a list of tags related to a given list of tags - that
        is, other tags used by items which have all the given tags.

        If ``counts`` is True, a ``count`` attribute will be added to
        each tag, indicating the number of items which have it in
        addition to the given list of tags.

        If ``min_count`` is given, only tags which have a ``count``
        greater than or equal to ``min_count`` will be returned.
        Passing a value for ``min_count`` implies ``counts=True``.
        """
        if filters is None: filters = {}

        if min_count is not None: counts = True
        tags = get_tag_list(tags)
        tag_count = len(tags)

        if tag_count == 0:
            return []

        queryset = model._default_manager.filter()
        queryset.query.add_q(Q(**filters))

        extra_joins = ' '.join(queryset.query.get_from_clause()[0][1:])
        where, params = queryset.query.where.as_sql()
        if where:
            extra_criteria = 'AND %s' % where
        else:
            extra_criteria = ''

        tagged_item_table = qn(TaggedItem._meta.db_table)
        tag_name = qn(self.model._meta.db_table)
        model_table = qn(model._meta.db_table)
        model_pk = '%s.%s' % (model_table, qn(model._meta.pk.column))

        query = """
              SELECT %(tagged_item)s.object_id
              FROM %(tagged_item)s 
                INNER JOIN %(tag)s
                  ON %(tagged_item)s.tag_id = %(tag)s.id
                INNER JOIN %(model)s
                  ON %(tagged_item)s.object_id = %(model_pk)s
                %%s
              WHERE %(tagged_item)s.content_type_id = %(content_type_id)s
                %%s
                AND %(tag)s.id IN (%(tag_id_placeholders)s)
              GROUP BY %(tagged_item)s.object_id%%s
              HAVING COUNT(%(tagged_item)s.object_id) = %(tag_count)s
            """ % {
            'tag': tag_name,
            'model': model_table,
            'model_pk': model_pk,
            'tagged_item': tagged_item_table,
            'content_type_id': ContentType.objects.get_for_model(model).pk,
            'tag_id_placeholders': ','.join(['%%s'] * tag_count),
            'tag_count': tag_count,
            }

        params_tags = [tag.pk for tag in tags]
        params.extend(params_tags)

        # TODO this is hardcode bugfix for some complicated query
        if extra_joins.find("INNER JOIN `comment_nodes`") != -1:
            extra_group = ", comment_nodes.id"
        else:
            extra_group = ""

        cursor = connection.cursor()
        cursor.execute(query % (extra_joins, extra_criteria, extra_group), params)
        object_ids = []
        for row in cursor.fetchall():
            object_ids.append(row[0])

        if len(object_ids) == 0:
            return []

        query = """
        SELECT %(tag)s.id, %(tag)s.name, %(tag)s.slug%(count_sql)s
        FROM %(tagged_item)s INNER JOIN %(tag)s ON %(tagged_item)s.tag_id = %(tag)s.id
        WHERE %(tagged_item)s.content_type_id = %(content_type_id)s
          AND %(tagged_item)s.object_id IN
          (%(object_id_placeholders)s)
          AND %(tag)s.id NOT IN (%(tag_id_placeholders)s)
        GROUP BY %(tag)s.id, %(tag)s.name
        %(min_count_sql)s
        ORDER BY %(order_sql)s
        %(limit_sql)s""" % {
            'tag': tag_name,
            'model': model_table,
            'count_sql': counts and ', COUNT(%s.object_id) as c' % tagged_item_table or '',
            'tagged_item': tagged_item_table,
            'content_type_id': ContentType.objects.get_for_model(model).pk,
            'tag_id_placeholders': ','.join(['%s'] * tag_count),
            'object_id_placeholders': ','.join(['%s'] * len(object_ids)),
            'min_count_sql': min_count is not None and ('HAVING COUNT(%s.object_id) >= %s' % tagged_item_table) or '',

            'order_sql': sort_by_count and ('c DESC') or ('%s.name ASC' % tag_name),
            'limit_sql': limit and ('LIMIT %s' % limit) or ''
        }

        params = []
        params.extend(object_ids)
        params.extend(params_tags)
        if min_count is not None:
            params.append(min_count)        

        cursor = connection.cursor()
        cursor.execute(query, params)
        related = []
        for row in cursor.fetchall():
            tag = self.model(*row[:3])
            if counts is True:
                tag.count = row[3]
            related.append(tag)

        if sort_by_count:
            sort_tags_by_name(related)

        return related

Original comment by n.le...@gmail.com on 7 Aug 2010 at 11:16

Added labels: ****
Removed labels: ****

GoogleCodeExporter commented 9 years ago

Any news on this? We're plagued by this exact problem.

Some more info, tho. After talking to a friend, he came up with an altered 
query that ran a lot faster (from 15-20 seconds to 0.2-0.4 seconds).. The 
approach creates a temporary table for the subquery, and then deletes it after 
the query.

RAW sql:
------------------------------------
CREATE TEMPORARY TABLE tmpsame (key (object_id)) SELECT object_id  FROM 
`tagging_taggeditem`, `tagging_tag` WHERE `tagging_taggeditem`.content_type_id 
= 14 AND `tagging_tag`.id = `tagging_taggeditem`.tag_id AND `tagging_tag`.id IN 
(19,12,108,131,189) GROUP BY `tagging_taggeditem`.object_id HAVING 
COUNT(`tagging_taggeditem`.object_id) = 5;

SELECT `tagging_tag`.id, `tagging_tag`.name, 
COUNT(`tagging_taggeditem`.object_id) FROM `tagging_taggeditem` INNER JOIN 
`tagging_tag` ON `tagging_taggeditem`.tag_id = `tagging_tag`.id   
INNER JOIN tmpsame ON `tagging_taggeditem`.object_id = tmpsame.object_id AND 
`tagging_tag`.id NOT IN (19,12,108,131,189) GROUP BY `tagging_tag`.id, 
`tagging_tag`.name ORDER BY `tagging_tag`.name ASC;

DROP TEMPORARY TABLE tmpsame;
------------------------------------

Of course, with this approach the temp table's name should be randomized. It 
might also be a solution to get the results of the first queries in python, and 
then pass it to the next query (and thus not needing a temp table).

Original comment by TheTerra...@gmail.com on 6 Apr 2011 at 10:58

Added labels: ****
Removed labels: ****

acangiani / django-tagging

Very bad performance of TagManager.related_for_model query using MySQL 5.0.x #160