ebmdatalab / openprescribing

A Django app providing a REST API and dashboards for the HSCIC's GP prescribing data
https://openprescribing.net
MIT License

Investigate replacing the views #306

Closed: sebbacon closed this issue 5 years ago

sebbacon commented 7 years ago

Establish if it might be possible to optimise everything to run off the main prescribing table.

For example, could the following query be made performant? Currently it uses a (pct, presentation_code) index and has to do an index scan. It would probably be more or less as efficient as anything else to have separate indexes on presentation_code, pct, practice and date; then we should consider whether to CLUSTER by anything in particular -- presumably presentation_code, because searching by chapter/section/chemical is the most common use case, and possibly by CCG as well, to keep related entities closer together (see the sketch after the query below).

Start the investigation by checking where in the code we actually hit the table -- I think it's in very few places. If so, we can fairly safely play with dropping and recreating indexes.

To make the table use less space, we could play "alignment tetris" (ref) and put all the integer and date columns together, which might save us 8 bytes per row.
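
A minimal sketch of what that reordering might look like (column names and types here are assumptions, not the real schema): fixed-width 8-byte columns first, then 4-byte ones, with variable-width text last, so Postgres inserts no alignment padding between them.

CREATE TABLE frontend_prescription_packed (
    id                bigint,            -- 8 bytes
    actual_cost       double precision,  -- 8 bytes
    quantity          double precision,  -- 8 bytes
    processing_date   date,              -- 4 bytes
    total_items       integer,           -- 4 bytes
    presentation_code varchar(15),       -- variable width, kept last
    practice_id       varchar(6),
    pct_id            varchar(3)
);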

We probably don't need presentation_name anywhere, as names can be looked up in our BNF table; this is a very long column.

We also probably don't need the SHA id, but we do need NIC.
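
If that holds, the slimming-down is straightforward (a sketch, assuming these column names; presentation names would instead be joined in from the BNF table when needed):

-- Hypothetical clean-up; column names assumed from the discussion above.
ALTER TABLE frontend_prescription DROP COLUMN presentation_name;
ALTER TABLE frontend_prescription DROP COLUMN sha_id;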

EXPLAIN ANALYZE
SELECT
  practice_id,
  processing_date,
  SUM(actual_cost) AS cost
FROM
  frontend_prescription
WHERE
  (presentation_code LIKE '0408010AEBE%'
    OR presentation_code LIKE '0408010AEDE%')
GROUP BY
  processing_date,
  practice_id;
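
For illustration, the separate-indexes-plus-CLUSTER idea might look like the following (index names are made up, and with the inheritance-style partitioning each monthly child table would need its own copies):

-- One simple index per column we filter on.
CREATE INDEX idx_rx_presentation ON frontend_prescription (presentation_code varchar_pattern_ops);
CREATE INDEX idx_rx_pct ON frontend_prescription (pct_id);
CREATE INDEX idx_rx_practice ON frontend_prescription (practice_id);
CREATE INDEX idx_rx_date ON frontend_prescription (processing_date);

-- Physically rewrite the heap in presentation_code order so chapter/section
-- scans touch adjacent pages; note CLUSTER takes an exclusive lock while it runs.
CLUSTER frontend_prescription USING idx_rx_presentation;
ANALYZE frontend_prescription;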
sebbacon commented 7 years ago

To answer where we hit the table: only in the view for Area Teams (which I think we don't use) and in the API's spending_by_practice endpoint, when the code length is 10 or more, and either a date or one or more practice ids are specified.

A typical query looks like this:

SELECT
  pr.practice_id AS row_id,
  pc.name AS row_name,
  pc.setting AS setting,
  pc.ccg_id AS ccg,
  pr.processing_date AS date,
  SUM(pr.actual_cost) AS actual_cost,
  SUM(pr.total_items) AS items,
  CAST(SUM(pr.quantity) AS bigint) AS quantity
FROM
  frontend_prescription pr
JOIN
  frontend_practice pc
ON
  pr.practice_id=pc.code
WHERE
  (pr.presentation_code LIKE '0703021Q0BB%'
    OR pr.presentation_code LIKE '0908010A0AA%' )
  AND (pr.pct_id='03Y' )
GROUP BY
  pr.practice_id,
  pc.code,
  date
ORDER BY
  date,
  pr.practice_id

This is very fast (23ms) because the (pct_id, presentation_code) index can be used.

sebbacon commented 7 years ago

The current indexes on the prescription table are as follows:

    "frontend_prescription_201205_6ea07fe3" btree (practice_id)
    "frontend_prescription_201205_by_pct" btree (presentation_code, pct_id)
    "frontend_prescription_201205_by_pct_and_presentation" btree (pct_id, presentation_code varchar_pattern_ops)
    "frontend_prescription_201205_by_prac_date_code" btree (practice_id, processing_date, presentation_code)
    "frontend_prescription_201205_by_practice" btree (presentation_code, practice_id)
    "frontend_prescription_201205_by_practice_and_code" btree (practice_id, presentation_code varchar_pattern_ops)
    "frontend_prescription_201205_idx_date_and_code" btree (processing_date, presentation_code)

The only place in the code which currently uses any of these indexes has conditions on presentation_code, pct_id, practice_id and processing_date.

My hypothesis is that we will probably get most of the benefit from simply indexing those four things separately.

So we keep only:

    "frontend_prescription_201205_6ea07fe3" btree (practice_id)

And we add:

CREATE INDEX idx_prescription_presentation ON frontend_prescription (presentation_code varchar_pattern_ops);
CREATE INDEX idx_prescription_pct ON frontend_prescription (pct_id);

Though we need to do this once for each partition.
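
Since the partitioning is inheritance-based, a loop over pg_inherits could apply both indexes to every child table (a sketch; it assumes all children inherit from frontend_prescription, and Postgres 9.5+ for IF NOT EXISTS):

-- Find every child partition via pg_inherits and index it.
DO $$
DECLARE
    part text;
BEGIN
    FOR part IN
        SELECT c.relname
        FROM pg_inherits i
        JOIN pg_class c ON c.oid = i.inhrelid
        JOIN pg_class p ON p.oid = i.inhparent
        WHERE p.relname = 'frontend_prescription'
    LOOP
        EXECUTE format(
            'CREATE INDEX IF NOT EXISTS %I ON %I (presentation_code varchar_pattern_ops)',
            part || '_by_presentation', part);
        EXECUTE format(
            'CREATE INDEX IF NOT EXISTS %I ON %I (pct_id)',
            part || '_by_pct', part);
    END LOOP;
END $$;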

"frontend_prescription_pct_id

sebbacon commented 7 years ago

The basic select above performs very well.

chemical_summary_by_ccg is an example of one of the views used in spending calculations.

EXPLAIN ANALYZE SELECT
  processing_date,
  pct_id,
  SUBSTR(presentation_code, 1, 9) AS chemical_id,
  SUM(total_items) AS items,
  SUM(actual_cost) AS cost,
  SUM(quantity) AS quantity
FROM
  frontend_prescription
WHERE presentation_code LIKE '0212000AA%'
GROUP BY
  processing_date,
  pct_id,
  chemical_id

At a chemical level this is adequately fast (3-4 seconds), but at a chapter level it's not, because so many tuples have to be visited in the heap.
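
For contrast, the view-based strategy precomputes exactly this rollup once per data load. A minimal sketch of the idea as a materialized view (the real views may be built and refreshed differently):

-- Precompute the chemical-level rollup so dashboard queries read a small,
-- already-aggregated table instead of scanning millions of heap tuples.
CREATE MATERIALIZED VIEW vw_chemical_summary_by_ccg AS
SELECT
  processing_date,
  pct_id,
  SUBSTR(presentation_code, 1, 9) AS chemical_id,
  SUM(total_items) AS items,
  SUM(actual_cost) AS cost,
  SUM(quantity) AS quantity
FROM frontend_prescription
GROUP BY processing_date, pct_id, chemical_id;

-- Refresh after each monthly import:
REFRESH MATERIALIZED VIEW vw_chemical_summary_by_ccg;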

The only way to speed something like this up without a view would be to add more RAM to the servers, enough for the whole prescribing table to be cached.

The £960/month Digital Ocean offering gives 128GB, which would be more than enough. It would be interesting to try this out (downside: it requires a restart) to assess the costs and benefits.

The main cost at the moment is:

If increasing the memory by that amount were sufficient to run everything off the main prescribing tables, it would come at a cost of about £500/month.

sebbacon commented 7 years ago

Looking at some ANALYZE output, I don't think more memory will help in at least some important cases.

Querying for chapter 0212 as above returns about 10 million rows. Even when these are fully cached it takes 27 seconds: 9 seconds retrieving data from the heap (even with everything in shared memory, because each child table takes approx 150ms), 8 seconds running the aggregate function in memory, a mystery 8 seconds doing the "Result", and so on. (Regarding the mystery time, I've asked a question here out of interest.)

Fundamentally, it appears that aggregating 10 million rows is just slow, so the current strategy is probably the best.

sebbacon commented 7 years ago

According to the DBA Stack Exchange answer, much of the 27 seconds above is avoidable (it comes from EXPLAIN ANALYZE instrumentation and from the SUBSTR function, both of which can be avoided), but the query would still take 12-15 seconds.
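
A sketch of how both overheads might be taken out, guessing at what the answer suggests: use TIMING OFF to avoid per-row instrumentation, and aggregate on the raw code first so SUBSTR runs once per group rather than once per row.

-- Measure without per-row timing overhead:
EXPLAIN (ANALYZE, TIMING OFF)
SELECT
  processing_date,
  pct_id,
  SUBSTR(presentation_code, 1, 9) AS chemical_id,
  SUM(items) AS items,
  SUM(cost) AS cost,
  SUM(quantity) AS quantity
FROM (
  -- Aggregate on the raw code first; SUBSTR then runs per group, not per row.
  SELECT
    processing_date,
    pct_id,
    presentation_code,
    SUM(total_items) AS items,
    SUM(actual_cost) AS cost,
    SUM(quantity) AS quantity
  FROM frontend_prescription
  WHERE presentation_code LIKE '0212%'
  GROUP BY processing_date, pct_id, presentation_code
) sub
GROUP BY processing_date, pct_id, chemical_id;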

It may be that reverting the partitioning by date would in turn be more performant -- we partitioned because loading data was so slow, but dropping and recreating the indexes around each load made loading much faster anyway.
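
For reference, that drop-and-recreate load pattern looks roughly like this (the file path and index name are illustrative only):

-- Hypothetical bulk-load pattern: drop the expensive index, load, rebuild.
DROP INDEX IF EXISTS idx_prescription_presentation;

COPY frontend_prescription FROM '/tmp/prescribing_201701.csv' WITH (FORMAT csv, HEADER);

CREATE INDEX idx_prescription_presentation
    ON frontend_prescription (presentation_code varchar_pattern_ops);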

Further examining the tradeoff between one massive table and the current partitioned setup would be the next step, but I'll park this for now.

sebbacon commented 6 years ago

@evansd, for the record (for when we start looking at this properly): some ad hoc analysis I did ages ago.

evansd commented 5 years ago

Obviated by the MatrixStore!