bbcarchdev / spindle

RES Linked Open Data aggregation engine
https://bbcarchdev.github.io/spindle/
Apache License 2.0

Basic collection-membership query can be very slow #92

Open nevali opened 7 years ago

nevali commented 7 years ago

Actual query executed by the Spindle module for Quilt:

SELECT "i"."id", "i"."classes", "i"."title", "i"."description", "i"."coordinates", "i"."modified" FROM "index" "i" INNER JOIN "membership" "cm" ON ("i"."id" = "cm"."id" AND "cm"."collection" = 'cbe0cdeca6b340b19ed57d7290fcceaf') WHERE "i"."score" <= 40 ORDER BY "modified" DESC LIMIT 26;

Notably, this actually returns an empty set (because nothing is a member of cbe0cdeca6b340b19ed57d7290fcceaf), but takes some time to do it. A query only on the membership table returns instantly, suggesting that the query planner is performing the join in a non-optimal order.
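For comparison, the standalone membership lookup referred to here would presumably be something along these lines (the exact statement isn't recorded in the issue):

SELECT "cm"."id" FROM "membership" "cm" WHERE "cm"."collection" = 'cbe0cdeca6b340b19ed57d7290fcceaf';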

We should investigate how best to hint to the planner that the collection-membership constraint is cheap to evaluate and highly selective, and so should be applied first when filtering rows from the joined tables.

Note that reversing the join order doesn't appear to make a material difference to query performance.
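One possible mitigation to try, sketched here and untested against this schema, is to pull the membership lookup into a CTE so that the planner evaluates it first. On the PostgreSQL versions current at the time (pre-12) a CTE is an optimisation fence; on 12 or later the same effect needs WITH ... AS MATERIALIZED.

WITH "members" AS (
    SELECT "id" FROM "membership" WHERE "collection" = 'cbe0cdeca6b340b19ed57d7290fcceaf'
)
SELECT "i"."id", "i"."classes", "i"."title", "i"."description", "i"."coordinates", "i"."modified"
FROM "index" "i"
INNER JOIN "members" "m" ON ("i"."id" = "m"."id")
WHERE "i"."score" <= 40
ORDER BY "i"."modified" DESC
LIMIT 26;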

Related to #72. Possibly the cause of some of the issues ascribed to #87.

Internal tracking: RESDATA-1093

nevali commented 7 years ago

Even after a VACUUM ANALYZE and an ANALYZE of the index and membership tables, this query is still very slow on PostgreSQL.
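For reference, the maintenance run would presumably have been of this form:

VACUUM ANALYZE "index";
VACUUM ANALYZE "membership";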

CygnusAlpha commented 7 years ago

Would an index on 'score' and 'modified' help?

https://www.postgresql.org/docs/8.3/static/indexes-ordering.html
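For reference, the kind of index being suggested would look something like this (hypothetical name). Whether it helps would need checking with EXPLAIN: because "score" <= 40 is a range predicate, PostgreSQL can't use the "modified" ordering of a ("score", "modified") index to satisfy the ORDER BY directly.

CREATE INDEX CONCURRENTLY "index_score_modified" ON "index" ("score", "modified" DESC);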

rjpwork commented 7 years ago

partir01=# explain SELECT "i"."id", "i"."classes", "i"."title", "i"."description", "i"."coordinates", "i"."modified" FROM "index" "i" INNER JOIN "membership" "cm" ON ("i"."id" = "cm"."id" AND "cm"."collection" = 'cbe0cdeca6b340b19ed57d7290fcceaf') WHERE "i"."score" <= 40 ORDER BY "modified" DESC LIMIT 26;
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Limit  (cost=30246.79..30246.86 rows=26 width=160)
   ->  Sort  (cost=30246.79..30249.88 rows=1236 width=160)
         Sort Key: i.modified DESC
         ->  Nested Loop  (cost=71.43..30211.56 rows=1236 width=160)
               ->  Bitmap Heap Scan on membership cm  (cost=71.01..10953.88 rows=2896 width=16)
                     Recheck Cond: (collection = 'cbe0cdec-a6b3-40b1-9ed5-7d7290fcceaf'::uuid)
                     ->  Bitmap Index Scan on membership_collection  (cost=0.00..70.28 rows=2896 width=0)
                           Index Cond: (collection = 'cbe0cdec-a6b3-40b1-9ed5-7d7290fcceaf'::uuid)
               ->  Index Scan using index_id on index i  (cost=0.42..6.64 rows=1 width=160)
                     Index Cond: (id = cm.id)
                     Filter: (score <= 40)
(11 rows)

partir01=# drop index membership_collection;
DROP INDEX

partir01=# explain SELECT "i"."id", "i"."classes", "i"."title", "i"."description", "i"."coordinates", "i"."modified" FROM "index" "i" INNER JOIN "membership" "cm" ON ("i"."id" = "cm"."id" AND "cm"."collection" = 'cbe0cdeca6b340b19ed57d7290fcceaf') WHERE "i"."score" <= 40 ORDER BY "modified" DESC LIMIT 26;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.99..37092.72 rows=26 width=160)
   ->  Nested Loop  (cost=0.99..1763284.90 rows=1236 width=160)
         ->  Index Scan Backward using index_modified_temp on index i  (cost=0.42..79722.21 rows=257181 width=160)
               Filter: (score <= 40)
         ->  Index Only Scan using membership_pkey on membership cm  (cost=0.56..6.54 rows=1 width=16)
               Index Cond: ((id = i.id) AND (collection = 'cbe0cdec-a6b3-40b1-9ed5-7d7290fcceaf'::uuid))
(6 rows)
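The plans above show estimated costs only; running the same statement under EXPLAIN (ANALYZE, BUFFERS) would give actual timings and buffer usage for comparing the two plans:

EXPLAIN (ANALYZE, BUFFERS) SELECT "i"."id", "i"."classes", "i"."title", "i"."description", "i"."coordinates", "i"."modified" FROM "index" "i" INNER JOIN "membership" "cm" ON ("i"."id" = "cm"."id" AND "cm"."collection" = 'cbe0cdeca6b340b19ed57d7290fcceaf') WHERE "i"."score" <= 40 ORDER BY "modified" DESC LIMIT 26;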

rjpwork commented 7 years ago

I think the basic problem here might be the 50M+ extraneous rows in membership which seem to be related to the billings collection bug.
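A quick way to confirm that hypothesis (sketch only) would be to count membership rows per collection:

SELECT "collection", COUNT(*) AS "members" FROM "membership" GROUP BY "collection" ORDER BY COUNT(*) DESC LIMIT 20;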

nevali commented 7 years ago

fifty million is not an especially big number

rjpwork commented 7 years ago

Sure, if they're intraneous. If they're extraneous, it's a big number.

nevali commented 7 years ago

yes, but the point is that querying the table at that size shouldn't be nearly as expensive an operation as it is; tidying up the collections masks the problem described in this issue, but doesn't solve it.