Closed sebbacon closed 5 years ago
To answer where we hit the table: only in the view for Area Teams (which I think we don't use) and in the API's spending_by_practice
endpoint, when the code length is 10 or more, and either a date or one or more practice ids are specified.
A typical query looks like this:
SELECT
    pr.practice_id AS row_id,
    pc.name AS row_name,
    pc.setting AS setting,
    pc.ccg_id AS ccg,
    pr.processing_date AS date,
    SUM(pr.actual_cost) AS actual_cost,
    SUM(pr.total_items) AS items,
    CAST(SUM(pr.quantity) AS bigint) AS quantity
FROM
    frontend_prescription pr
JOIN
    frontend_practice pc
ON
    pr.practice_id = pc.code
WHERE
    (pr.presentation_code LIKE '0703021Q0BB%'
     OR pr.presentation_code LIKE '0908010A0AA%')
    AND (pr.pct_id = '03Y')
GROUP BY
    pr.practice_id,
    pc.code,
    date
ORDER BY
    date,
    pr.practice_id
This is very fast (23ms) because the (pct, code)
index can be used.
The current indexes on the prescription table are as follows:
"frontend_prescription_201205_6ea07fe3" btree (practice_id)
"frontend_prescription_201205_by_pct" btree (presentation_code, pct_id)
"frontend_prescription_201205_by_pct_and_presentation" btree (pct_id, presentation_code varchar_pattern_ops)
"frontend_prescription_201205_by_prac_date_code" btree (practice_id, processing_date, presentation_code)
"frontend_prescription_201205_by_practice" btree (presentation_code, practice_id)
"frontend_prescription_201205_by_practice_and_code" btree (practice_id, presentation_code varchar_pattern_ops)
"frontend_prescription_201205_idx_date_and_code" btree (processing_date, presentation_code)
The only place in the code which currently uses any of these indexes applies conditions on pct_id, practice_id, processing_date and presentation_code. My hypothesis is that we will probably get most of the benefit from simply indexing those four columns separately.
So we keep only:
"frontend_prescription_201205_6ea07fe3" btree (practice_id)
And we add:
CREATE INDEX idx_prescription_presentation ON frontend_prescription(presentation_code varchar_pattern_ops)
CREATE INDEX idx_prescription_pct ON frontend_prescription(pct_id)
Though we need to do this once for each partition.
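Since the table is partitioned by month, the per-partition index creation could be scripted rather than done by hand. A sketch, assuming partition names follow the frontend_prescription_YYYYMM pattern seen in the index listing above:

```sql
-- Hypothetical loop over the monthly child tables; the name pattern is an
-- assumption based on the frontend_prescription_201205_* indexes above.
DO $$
DECLARE
    part text;
BEGIN
    FOR part IN
        SELECT tablename FROM pg_tables
        WHERE tablename ~ '^frontend_prescription_\d{6}$'
    LOOP
        EXECUTE format(
            'CREATE INDEX IF NOT EXISTS %I ON %I (presentation_code varchar_pattern_ops)',
            part || '_by_presentation', part);
        EXECUTE format(
            'CREATE INDEX IF NOT EXISTS %I ON %I (pct_id)',
            part || '_by_pct', part);
    END LOOP;
END;
$$;
```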
The basic select above performs very well.
chemical_summary_by_ccg is an example of one of the views used in spending calculations.
EXPLAIN ANALYZE SELECT
    processing_date,
    pct_id,
    SUBSTR(presentation_code, 1, 9) AS chemical_id,
    SUM(total_items) AS items,
    SUM(actual_cost) AS cost,
    SUM(quantity) AS quantity
FROM
    frontend_prescription
WHERE
    presentation_code LIKE '0212000AA%'
GROUP BY
    processing_date,
    pct_id,
    chemical_id
At a chemical level it is adequately fast (3 - 4 seconds), but at a chapter level it is not, because so many tuples have to be visited in the heap.
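For comparison, the chapter-level shape is presumably the same aggregation with a broader prefix (this reconstruction is mine, not from the issue); the wider LIKE matches vastly more rows, which is where the heap visits come from:

```sql
-- Hypothetical chapter-level variant: same aggregation, broader BNF prefix.
EXPLAIN ANALYZE SELECT
    processing_date,
    pct_id,
    SUM(total_items) AS items,
    SUM(actual_cost) AS cost,
    SUM(quantity) AS quantity
FROM
    frontend_prescription
WHERE
    presentation_code LIKE '0212%'
GROUP BY
    processing_date,
    pct_id
```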
The only way to speed something like this up without a view would be to add more RAM to the servers, sufficient for all the prescribing table to be cached.
The £960/month Digital Ocean offering gives 128GB, which would be more than enough. It would be interesting to try this out (downside: it requires a restart) to assess the costs and benefits.
The main cost at the moment is:
If increasing the memory by that amount were sufficient to run everything off the main prescribing tables, this would be at a cost of about £500 / month.
Looking at some ANALYZE output, I don't think more memory will help in at least some important cases.
Querying for chapter 0212 per the above results in about 10 million rows. Even when these are fully cached it takes 27 seconds, including 9 seconds retrieving data from the heap (even when it is all in shared memory, because each child table takes approx 150ms). It spends 8 seconds doing the aggregate function in memory, plus a mystery 8 seconds doing the "Result" node, etc (regarding the mystery time, I've asked a question here out of interest).
Fundamentally, it appears that aggregating 10 million rows is just slow, so the current strategy is probably the best.
According to the DBA Stack Exchange answer, much of the 27 seconds above is avoidable (it comes from EXPLAIN ANALYZE instrumentation, and the SUBSTR function can be avoided), but the query would still take 12 - 15 seconds.
It may be that reverting the partitioning by date would in turn be more performant -- we partitioned because loading data was so slow, but dropping and recreating the indexes during loads made that much faster anyway. Further examining the tradeoff between one massive table and the current partitioning scheme would be the next step, but I'll park this for now.
@evansd for the record (for when we start looking at this properly) some adhoc analysis I did ages ago
Obviated by the MatrixStore!
Establish if it might be possible to optimise everything to run off the main prescribing table.
For example, could the following query be made performant? Currently it uses a (pct, presentation_code) index and has to do an index scan. It would probably be more or less as efficient as anything else to have a presentation_code index, a pct index, a practice index, and a date index; then consider if we should CLUSTER by anything in particular -- presumably presentation_code, because searching by chapter/section/chemical is presumably the most common use case, and possibly by ccg as well to keep related entities closer together.
Start the investigation by checking where in the code we actually hit the table -- I think it's in very few places. If so, we can fairly safely play with dropping and recreating indexes.
To make the table use less space, we could play "alignment tetris" (ref) and put all the integer and date columns together - might save us 8 bytes per row.
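To illustrate the alignment point (toy tables, not our schema): Postgres pads columns to their alignment requirement, so interleaving 4-byte and 8-byte columns wastes space that grouping them avoids.

```sql
-- Illustrative only. bigint needs 8-byte alignment, int needs 4-byte, so
-- alternating them inserts padding before each bigint.
CREATE TABLE example_padded (
    a int,     -- 4 bytes + 4 bytes padding before b
    b bigint,  -- 8 bytes
    c int,     -- 4 bytes + 4 bytes padding before d
    d bigint   -- 8 bytes; row data: 32 bytes
);

-- Grouping wide columns first removes the padding entirely.
CREATE TABLE example_packed (
    b bigint,  -- 8 bytes
    d bigint,  -- 8 bytes
    a int,     -- 4 bytes
    c int      -- 4 bytes; row data: 24 bytes
);
```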
We probably don't need presentation_name anywhere, as names can be looked up in our BNF table; this is a very long column.
We also probably don't need the SHA id, but we do need NIC.
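If we confirm those columns really are unused, the cleanup could look like this (the exact column names are assumptions on my part):

```sql
-- Assumed column names. DROP COLUMN only marks the column as dead; a table
-- rewrite (e.g. VACUUM FULL) is needed before the space is actually reclaimed.
ALTER TABLE frontend_prescription DROP COLUMN presentation_name;
ALTER TABLE frontend_prescription DROP COLUMN sha_id;
```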