CenterForOpenScience / SHARE

SHARE is building a free, open, data set about research and scholarly activities across their life cycle.
http://share-research.readthedocs.io/en/latest/index.html
Apache License 2.0
100 stars 58 forks source link

optimize oai-pmh query #772

Closed aaxelb closed 4 years ago

aaxelb commented 4 years ago

sometimes, if a complex-enough query is run on a large-enough table, postgres' planner throws up its hands in defeat and falls back to a more naive execution plan than we might expect.

existing (slow) query:

SELECT
  table1.id,
  -- (...more fields)
  -- (...a bunch of aggregations with subqueries)
FROM table1
  -- (...joins)
WHERE
  -- (...conditions with subqueries)
LIMIT x

new (faster) query:

SELECT
  table1.id,
  -- (...more fields)
  -- (...a bunch of aggregations with subqueries)
FROM table1
WHERE table1.id in (
  SELECT t1.id
  FROM table1 AS t1
    -- (...joins)
  WHERE
    -- (...conditions with subqueries)
  LIMIT x
)

moving the LIMIT and complex WHERE clause into a subquery guarantees that postgres won't run those aggregations on more rows than necessary

chrisseto commented 4 years ago

Looks like you might need the limit on the outside query as well as the subquery 👀

EDIT: that's a lie

aaxelb commented 4 years ago

@jamescdavis fixed! the counts were off because the resumption token depends on the last id on the previous page and the outer query wasn't ordering by id, so we'd get duplicates across multiple pages. i updated the outer query to order by id, too.