dlamotte / django-tagging

Automatically exported from code.google.com/p/django-tagging
Other
0 stars 0 forks source link

Bad performance because of `HAVING COUNT()` #255

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Currently the `get_intersection_by_model` code does a `HAVING COUNT()`.

This regularly results in a very bad/slow query plan for large databases when 
using PostgreSQL (not sure about other databases, but it's slow there).

Example query:

    SELECT "entity_entity"."id" 
    FROM   "entity_entity", 
           "tagging_taggeditem" 
    WHERE  "tagging_taggeditem".content_type_id = 12 
           AND "tagging_taggeditem".tag_id IN ( 26, 13407, 992, 1848, 573 ) 
           AND "entity_entity"."id" = "tagging_taggeditem".object_id 
    GROUP  BY "entity_entity"."id" 
    HAVING COUNT("entity_entity"."id") = 5 

Example query plan:

     HashAggregate  (cost=33573.98..33766.62 rows=12843 width=4) (actual time=971.732..972.503 rows=2 loops=1)
       Filter: (count(entity_entity.id) = 5)
       ->  Hash Join  (cost=30450.93..33509.76 rows=12843 width=4) (actual time=872.651..947.297 rows=13565 loops=1)
             Hash Cond: (tagging_taggeditem.object_id = entity_entity.id)
             ->  Bitmap Heap Scan on tagging_taggeditem  (cost=260.84..3014.65 rows=12843 width=4) (actual time=1.931..25.149 rows=13565 loops=1)
                   Recheck Cond: (tag_id = ANY ('{26,13407,992,1848,573}'::integer[]))
                   Filter: (content_type_id = 12)
                   ->  Bitmap Index Scan on tagging_taggeditem_tag_id  (cost=0.00..257.63 rows=12843 width=0) (actual time=1.584..1.584 rows=13565 loops=1)
                         Index Cond: (tag_id = ANY ('{26,13407,992,1848,573}'::integer[]))
             ->  Hash  (cost=26392.82..26392.82 rows=303782 width=4) (actual time=870.620..870.620 rows=292905 loops=1)
                   ->  Seq Scan on entity_entity  (cost=0.00..26392.82 rows=303782 width=4) (actual time=0.007..428.755 rows=292905 loops=1)
     Total runtime: 975.058 ms

An alternative to using having is simply doing an inner join between all the 
tags:

    SELECT entity_entity.id 
    FROM   entity_entity 
           JOIN tagging_taggeditem t1 
             ON t1.object_id = entity_entity.id 
                AND t1.tag_id = 26 
                AND t1.content_type_id = 12 
           JOIN tagging_taggeditem t2 
             ON t2.object_id = entity_entity.id 
                AND t2.tag_id = 13407 
                AND t2.content_type_id = 12 
           JOIN tagging_taggeditem t3 
             ON t3.object_id = entity_entity.id 
                AND t3.tag_id = 992 
                AND t3.content_type_id = 12 
           JOIN tagging_taggeditem t4 
             ON t4.object_id = entity_entity.id 
                AND t4.tag_id = 1848 
                AND t4.content_type_id = 12 
           JOIN tagging_taggeditem t5 
             ON t5.object_id = entity_entity.id 
                AND t5.tag_id = 573 
                AND t5.content_type_id = 12 
    GROUP  BY entity_entity.id 

And the query plan:

     HashAggregate  (cost=229.27..229.28 rows=1 width=4) (actual time=2.788..2.790 rows=2 loops=1)
       ->  Nested Loop  (cost=102.17..229.26 rows=1 width=4) (actual time=2.386..2.778 rows=2 loops=1)
             ->  Nested Loop  (cost=102.17..220.92 rows=1 width=20) (actual time=2.374..2.747 rows=2 loops=1)
                   ->  Nested Loop  (cost=102.17..212.58 rows=1 width=16) (actual time=2.361..2.720 rows=2 loops=1)
                         ->  Nested Loop  (cost=102.17..204.24 rows=1 width=12) (actual time=2.345..2.688 rows=2 loops=1)
                               ->  Hash Join  (cost=102.17..195.92 rows=1 width=8) (actual time=2.326..2.649 rows=2 loops=1)
                                     Hash Cond: (t3.object_id = t5.object_id)
                                     ->  Bitmap Heap Scan on tagging_taggeditem t3  (cost=4.48..97.38 rows=25 width=4) (actual time=0.085..0.466 rows=176 loops=1)
                                           Recheck Cond: (tag_id = 992)
                                           Filter: (content_type_id = 12)
                                           ->  Bitmap Index Scan on tagging_taggeditem_tag_id  (cost=0.00..4.48 rows=25 width=0) (actual time=0.058..0.058 rows=176 loops=1)
                                                 Index Cond: (tag_id = 992)
                                     ->  Hash  (cost=97.38..97.38 rows=25 width=4) (actual time=1.934..1.934 rows=450 loops=1)
                                           ->  Bitmap Heap Scan on tagging_taggeditem t5  (cost=4.48..97.38 rows=25 width=4) (actual time=0.164..1.261 rows=450 loops=1)
                                                 Recheck Cond: (tag_id = 573)
                                                 Filter: (content_type_id = 12)
                                                 ->  Bitmap Index Scan on tagging_taggeditem_tag_id  (cost=0.00..4.48 rows=25 width=0) (actual time=0.111..0.111 rows=450 loops=1)
                                                       Index Cond: (tag_id = 573)
                               ->  Index Scan using entity_entity_pkey on entity_entity  (cost=0.00..8.31 rows=1 width=4) (actual time=0.010..0.011 rows=1 loops=2)
                                     Index Cond: (entity_entity.id = t3.object_id)
                         ->  Index Scan using tagging_taggeditem_tag_id_key on tagging_taggeditem t4  (cost=0.00..8.33 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=2)
                               Index Cond: ((t4.tag_id = 1848) AND (t4.content_type_id = 12) AND (t4.object_id = entity_entity.id))
                   ->  Index Scan using tagging_taggeditem_tag_id_key on tagging_taggeditem t2  (cost=0.00..8.33 rows=1 width=4) (actual time=0.007..0.008 rows=1 loops=2)
                         Index Cond: ((t2.tag_id = 13407) AND (t2.content_type_id = 12) AND (t2.object_id = entity_entity.id))
             ->  Index Scan using tagging_taggeditem_tag_id_key on tagging_taggeditem t1  (cost=0.00..8.33 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=2)
                   Index Cond: ((t1.tag_id = 26) AND (t1.content_type_id = 12) AND (t1.object_id = entity_entity.id))

The query is slightly more complex, but the runtime is 3ms instead of 980ms 
(average, I've tested the query 10 times)

I will write the patch if you consider implementing this.

Original issue reported on code.google.com by Wolphie on 24 Dec 2010 at 2:42