Data points:
Yes, this will be a timeout. We need to debug the raw queries and do more indexing if necessary. It may be a side-effect of my deleting indexes. Probably urgent, as it will affect by-practice queries on the analyse form.
Trying to work out which timeout is being triggered. Nginx has a 120s timeout for the backend proxy, so it's not that.
The reason we are getting 404s is that nginx is trying to render a 500 page, can't find it, and so is defaulting to the built-in 404. We can see the upstream is closing the connection. I will fix the 500 page first; then we'll start getting errors reported by email, at least.
2016/10/03 11:22:54 [error] 26767#0: *607140 upstream prematurely closed connection while reading response header from upstream, client: 141.101.98.155, server: openprescribing.net, request: "GET /api/1.0/spending_by_practice/?format=json&code=0208020Z0AA&org=00Q HTTP/1.1", upstream: "http://unix:/webapps/openprescribing/run/gunicorn.sock:/api/1.0/spending_by_practice/?format=json&code=0208020Z0AA&org=00Q", host: "openprescribing.net"
2016/10/03 11:22:54 [error] 26767#0: *607140 open() "/webapps/openprescribing/static/500.html" failed (2: No such file or directory), client: 141.101.98.155, server: openprescribing.net, request: "GET /api/1.0/spending_by_practice/?format=json&code=0208020Z0AA&org=00Q HTTP/1.1", upstream: "http://unix:/webapps/openprescribing/run/gunicorn.sock:/api/1.0/spending_by_practice/?format=json&code=0208020Z0AA&org=00Q", host: "openprescribing.net"
The timeout looks likely to be a gunicorn setting.
I will change that (so we don't time out) and start logging slow queries in postgres so we can analyse them systematically.
Thanks! Have dim memory that we're already logging slow queries, but that might be wrong.
Yeah, I think it's caused by my deleting this index, oops - we can just add it back in since we've got plenty of space for the moment.
I wouldn't be too quick to blame removing that index for everything; there is evidence of similar issues going back to June. But yes, it clearly needs adding back in (will do it now).
Yes, we were logging slow queries (> 3000ms).
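(For reference, a minimal sketch of how that threshold is controlled, assuming superuser access; the 3000ms figure is the one above:)
-- Log any statement taking longer than 3 seconds; ALTER SYSTEM writes to
-- postgresql.auto.conf and pg_reload_conf() applies it without a restart.
ALTER SYSTEM SET log_min_duration_statement = 3000;
SELECT pg_reload_conf();
-- Confirm the running value:
SHOW log_min_duration_statement;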
Some of these queries are now returning very quickly, so it is quite possible that it's related to server load (having to service 3 or 4 analyses concurrently, for example).
Postgres logs errors like the following. My interpretation is that this is again the Postgres client (the gunicorn worker) being timed out by gunicorn before Postgres can respond. This, of course, is likely to cause end users to reload their browsers and send the query again, potentially compounding the problem.
2016-10-02 11:33:55 UTC:127.0.0.1(54945):prescribing@prescribing:[23347]: LOG: statement: SHOW default_transaction_isolation
2016-10-02 11:33:55 UTC:127.0.0.1(54945):prescribing@prescribing:[23347]: LOG: statement: SELECT pr.practice_id AS row_id, pc.name AS row_name, pc.setting AS setting, pc.ccg_id AS ccg, pr.processing_date AS date, SUM(pr.actual_cost) AS actual_cost, SUM(pr.total_items) AS items, CAST(SUM(pr.quantity) AS bigint) AS quantity FROM frontend_prescription pr JOIN frontend_practice pc ON pr.practice_id=pc.code WHERE (pr.presentation_code LIKE '0208020Z0AA%' ) AND (pr.pct_id='03Q' ) GROUP BY pr.practice_id, pc.code, date ORDER BY date, pr.practice_id
2016-10-02 11:36:21 UTC:127.0.0.1(54945):prescribing@prescribing:[23347]: LOG: could not send data to client: Broken pipe
2016-10-02 11:36:21 UTC:127.0.0.1(54945):prescribing@prescribing:[23347]: STATEMENT: SELECT pr.practice_id AS row_id, pc.name AS row_name, pc.setting AS setting, pc.ccg_id AS ccg, pr.processing_date AS date, SUM(pr.actual_cost) AS actual_cost, SUM(pr.total_items) AS items, CAST(SUM(pr.quantity) AS bigint) AS quantity FROM frontend_prescription pr JOIN frontend_practice pc ON pr.practice_id=pc.code WHERE (pr.presentation_code LIKE '0208020Z0AA%' ) AND (pr.pct_id='03Q' ) GROUP BY pr.practice_id, pc.code, date ORDER BY date, pr.practice_id
2016-10-02 11:36:21 UTC:127.0.0.1(54945):prescribing@prescribing:[23347]: FATAL: connection to client lost
2016-10-02 11:36:21 UTC:127.0.0.1(54945):prescribing@prescribing:[23347]: STATEMENT: SELECT pr.practice_id AS row_id, pc.name AS row_name, pc.setting AS setting, pc.ccg_id AS ccg, pr.processing_date AS date, SUM(pr.actual_cost) AS actual_cost, SUM(pr.total_items) AS items, CAST(SUM(pr.quantity) AS bigint) AS quantity FROM frontend_prescription pr JOIN frontend_practice pc ON pr.practice_id=pc.code WHERE (pr.presentation_code LIKE '0208020Z0AA%' ) AND (pr.pct_id='03Q' ) GROUP BY pr.practice_id, pc.code, date ORDER BY date, pr.practice_id
End users should now be getting 500s rather than 404s. That misconfiguration may well have been responsible for us missing a number of errors in the past, so we might start seeing new errors soon.
I wonder if we should consider using NewRelic to collect errors etc. The starter is $13 a month and I've found it very useful in RoR environments.
Now running in a screen session:
CREATE INDEX frontend_prescription_by_pct_and_code on frontend_prescription (presentation_code,pct_id);
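(As an aside, and only a sketch: since we only have the production database, the non-blocking variant might be safer if we do this again, assuming our Postgres version supports it.)
-- CONCURRENTLY builds the index without taking a lock that blocks writes,
-- at the cost of a slower build; it cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY frontend_prescription_by_pct_and_code
    ON frontend_prescription (presentation_code, pct_id);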
An identical(?) index already exists: "frontend_prescription_by_pct" btree (presentation_code, pct_id).
I think we might need one that's the other way around.
Thanks, I didn't see that. I've cancelled the operation.
I don't believe it matters which way round the columns are defined in a concatenated index, though I suppose the only way to be sure is to create both and run some tests (a bit difficult given we only have a production environment, but we can try it anyway...)
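If we did want to test it, something like the following would do (just a sketch: the test_* index names are made up, and building two extra indexes on a table this size is itself slow and space-hungry):
-- Build both column orderings under throwaway names, then compare plans.
CREATE INDEX test_code_then_pct ON frontend_prescription (presentation_code, pct_id);
CREATE INDEX test_pct_then_code ON frontend_prescription (pct_id, presentation_code);

EXPLAIN ANALYZE
SELECT pr.practice_id, SUM(pr.actual_cost)
  FROM frontend_prescription pr
 WHERE pr.presentation_code LIKE '0208020Z0AA%'
   AND pr.pct_id = '00Q'
 GROUP BY pr.practice_id;

-- Drop whichever index the plan never uses.
DROP INDEX test_code_then_pct;
DROP INDEX test_pct_then_code;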
Let's look at the actual query being run:
SELECT
pr.practice_id AS row_id,
pc.name AS row_name,
pc.setting AS setting,
pc.ccg_id AS ccg,
pr.processing_date AS date,
SUM(pr.actual_cost) AS actual_cost,
SUM(pr.total_items) AS items,
CAST(SUM(pr.quantity) AS bigint) AS quantity
FROM frontend_prescription pr
JOIN frontend_practice pc
ON pr.practice_id = pc.code
WHERE (pr.presentation_code LIKE '0208020Z0AA%')
AND (pr.pct_id = '00Q')
GROUP BY pr.practice_id,
pc.code,
date
ORDER BY date, pr.practice_id
The EXPLAIN ANALYZE output is as follows:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=68515.51..68533.90 rows=613 width=69) (actual time=835.586..836.393 rows=599 loops=1)
Group Key: pr.processing_date, pr.practice_id, pc.code
-> Sort (cost=68515.51..68517.04 rows=613 width=69) (actual time=835.552..835.703 rows=904 loops=1)
Sort Key: pr.processing_date, pr.practice_id, pc.code
Sort Method: quicksort Memory: 152kB
-> Nested Loop (cost=0.57..68487.12 rows=613 width=69) (actual time=72.733..833.009 rows=904 loops=1)
-> Seq Scan on frontend_practice pc (cost=0.00..644.17 rows=14717 width=38) (actual time=0.014..10.908 rows=14717 loops=1)
-> Index Scan using frontend_prescription_by_practice_and_code on frontend_prescription pr (cost=0.57..4.60 rows=1 width=31) (actual time=0.055..0.055 rows=0 loops=14717)
Index Cond: (((practice_id)::text = (pc.code)::text) AND ((presentation_code)::text ~>=~ '0208020Z0AA'::text) AND ((presentation_code)::text ~<~ '0208020Z0AB'::text))
Filter: (((presentation_code)::text ~~ '0208020Z0AA%'::text) AND ((pct_id)::text = '00Q'::text))
Rows Removed by Filter: 17
Planning time: 5.707 ms
Execution time: 836.675 ms
(13 rows)
Note it is using the index frontend_prescription_by_practice_and_code, which looks like this:
Index "public.frontend_prescription_by_practice_and_code"
Column | Type | Definition
-------------------+-----------------------+-------------------
practice_id | character varying(6) | practice_id
presentation_code | character varying(15) | presentation_code
btree, for table "public.frontend_prescription"
The index @annapowellsmith pointed out looks like this:
Index "public.frontend_prescription_by_pct"
Column | Type | Definition
-------------------+-----------------------+-------------------
presentation_code | character varying(15) | presentation_code
pct_id | character varying(3) | pct_id
btree, for table "public.frontend_prescription"
The planner has decided it is cheaper to use the (practice_id, presentation_code) index than the (presentation_code, pct_id) index, where the condition is WHERE (pr.presentation_code LIKE '0208020Z0AA%') AND (pr.pct_id = '00Q'). While I don't completely understand this, it seems likely this is due to the JOIN on the practice_id column; presumably because getting the join key value from an index we have to load anyway is cheaper than scanning the table after applying the index.
So my hypothesis is that the index we actually need is on (presentation_code, pct_id, practice_id). However, this will be a pretty big index and won't be usable when the condition is WHERE presentation_code LIKE 'foo' AND practice_id = 'baz' (i.e. no pct_id specified). If we wanted the same index to be usable in both cases, we'd have to add a pct_id condition to each query on practice_id (which is certainly possible, if a little ugly, by looking up all the CCGs a practice has ever been part of, perhaps in a subselect; sketched below).
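Something like this is what I mean by the subselect, purely as a sketch (the practice code A81001 is made up, and there may be a better way to enumerate a practice's historical CCGs than scanning frontend_prescription itself):
-- When filtering by practice only, derive a pct_id condition from every CCG
-- the practice has ever appeared under, so a (presentation_code, pct_id,
-- practice_id) index would still be usable.
SELECT pr.practice_id, pr.processing_date, SUM(pr.actual_cost) AS actual_cost
  FROM frontend_prescription pr
 WHERE pr.presentation_code LIKE '0208020Z0AA%'
   AND pr.practice_id = 'A81001'
   AND pr.pct_id IN (SELECT DISTINCT pct_id
                       FROM frontend_prescription
                      WHERE practice_id = 'A81001')
 GROUP BY pr.practice_id, pr.processing_date;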
Another alternative is to avoid the JOIN completely in the query. It's presumably there to ensure we are using the current CCG membership for a practice, rather than a historic membership. (I'm not sure the logic is correct in any case, as we filter on the CCG id recorded in the prescription but return the CCG id for the practice mentioned in the prescription, which would mean you can potentially get data from the API for CCGs that you think you've not asked about.)
I can't currently think of a simple way of avoiding this JOIN, short of a new (huge) materialized view or similar strategy.
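(For completeness, the kind of thing I mean is below; very much a sketch: the view name is invented, it would duplicate a lot of data, and it would need refreshing after every import.)
-- Denormalise the practice's name, setting and current CCG into one relation
-- so the API query needs no JOIN at read time.
CREATE MATERIALIZED VIEW prescription_with_practice AS
SELECT pr.presentation_code,
       pr.pct_id,
       pr.practice_id,
       pr.processing_date,
       pr.actual_cost,
       pr.total_items,
       pr.quantity,
       pc.name    AS practice_name,
       pc.setting AS setting,
       pc.ccg_id  AS current_ccg_id
  FROM frontend_prescription pr
  JOIN frontend_practice pc ON pr.practice_id = pc.code;

-- REFRESH MATERIALIZED VIEW prescription_with_practice;  -- after each import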
Pretty sure the ordering of the columns does matter, and it will speed things up to have one with pct_id first.
The query planner is an enigma wrapped in a mystery etc., but I think it's choosing to use another index rather than frontend_prescription_by_pct because this is a LIKE query: using that index would return a huge chunk of the table on the first part of the query, which would then need to be refined down by PCT id, and the planner probably thinks that would be slower than using an existing index. But the planner moves in mysterious ways.
We need the JOIN in order to get the practice setting, and we need that because we return the practice setting in the API call so that we can hide non-standard practices in the charts.
I'm running this in a screen session now: create index frontend_prescription_by_pct_and_presentation on frontend_prescription(pct_id, presentation_code varchar_pattern_ops) - will analyze and see if that improves things.
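(Once it's built, a quick way to see whether the planner ever touches it; sketch only, the numbers will obviously vary:)
-- idx_scan should climb for the new index if queries are actually using it.
SELECT indexrelname, idx_scan, idx_tup_read
  FROM pg_stat_user_indexes
 WHERE relname = 'frontend_prescription'
 ORDER BY idx_scan DESC;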
New Relic - definitely! CloudFlare offer it as an integrated service - if you look in the Apps section you'll see I had a go at setting it up at one point.
Oh, I know why it's not using frontend_prescription_by_pct - it's because it is a LIKE query, and that index hasn't been set up with varchar_pattern_ops.
Couple of observations:

- With a concatenated index on [presentation_code, pct_id], it can be used when querying on presentation_code, or on presentation_code AND pct_id, but not on pct_id alone. ("In general, a database can use a concatenated index when searching with the leading (leftmost) columns. An index with three columns can be used when searching for the first column, when searching with the first two columns together, and when searching using all columns.") I don't think there is a distinction made by the query planner between the ordering of [presentation_code, pct_id] versus [pct_id, presentation_code] (or at least, I can't see any documentation about this, and I can't think of a reason why there would be). I'm very happy to be proved wrong though!
- varchar_pattern_ops only relates to how collation is used with respect to LIKE queries; that is, LIKE queries are implemented as range queries, and "if you do use the C locale, you do not need the xxx_pattern_ops operator classes, because an index with the default operator class is usable for pattern-matching queries in the C locale". BNF codes would be fine sorted in the C locale, I think.
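Easy enough to check which situation we're in (sketch):
-- If this reports C (or POSIX), a plain btree index is already usable for
-- LIKE 'prefix%' and varchar_pattern_ops buys us nothing; otherwise only an
-- index declared with the pattern-ops operator class will be used, e.g.
--   CREATE INDEX ... ON frontend_prescription (presentation_code varchar_pattern_ops);
SHOW lc_collate;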
Responses:

On varchar_pattern_ops and the fact that this was a LIKE query: this does matter for us, because we aren't using the C locale. I think re-creating the index on frontend_prescription(pct_id, presentation_code varchar_pattern_ops) has solved the issue, but we'll see!
Closing, we can reopen if necessary.
These seem to be 404s that are cached in Cloudflare; when we purge them, they work. Example of one not yet purged:
Here is the full list of such urls (since June 2016 when the current access log starts).