ebmdatalab / openprescribing

A Django app providing a REST API and dashboards for the HSCIC's GP prescribing data
https://openprescribing.net
MIT License

Some API endpoints are returning 404s #209

Closed · sebbacon closed this issue 8 years ago

sebbacon commented 8 years ago

These seem to be 404s that are cached in Cloudflare; when we purge them, they work. Example of one not yet purged:

https://openprescribing.net/api/1.0/spending_by_practice/?format=json&code=0208020Z0AA&org=00Q

Here is the full list of such URLs (since June 2016, when the current access log starts).

sebbacon commented 8 years ago

Data points:

annapowellsmith commented 8 years ago

Yes, this will be a timeout. We need to debug the raw queries and do more indexing if necessary. It may be a side-effect of my deleting indexes. Probably urgent, as it will affect by-practice queries on the analyse form.

sebbacon commented 8 years ago

Trying to work out which timeout is being triggered. Nginx has a 120s timeout for the backend proxy, so it's not that.

The reason we are getting 404s is that nginx is trying to render a 500 page, can't find it, and so defaults to the built-in 404. We can see the upstream is closing the connection. I will fix the 500 page first; then we'll at least start getting errors reported by email.

2016/10/03 11:22:54 [error] 26767#0: *607140 upstream prematurely closed connection while reading response header from upstream, client: 141.101.98.155, server: openprescribing.net, request: "GET /api/1.0/spending_by_practice/?format=json&code=0208020Z0AA&org=00Q HTTP/1.1", upstream: "http://unix:/webapps/openprescribing/run/gunicorn.sock:/api/1.0/spending_by_practice/?format=json&code=0208020Z0AA&org=00Q", host: "openprescribing.net"
2016/10/03 11:22:54 [error] 26767#0: *607140 open() "/webapps/openprescribing/static/500.html" failed (2: No such file or directory), client: 141.101.98.155, server: openprescribing.net, request: "GET /api/1.0/spending_by_practice/?format=json&code=0208020Z0AA&org=00Q HTTP/1.1", upstream: "http://unix:/webapps/openprescribing/run/gunicorn.sock:/api/1.0/spending_by_practice/?format=json&code=0208020Z0AA&org=00Q", host: "openprescribing.net"
sebbacon commented 8 years ago

The timeout looks likely to be a gunicorn setting.

I will change that (so we don't time out) and start logging slow queries in postgres so we can analyse them systematically.
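For reference, gunicorn's default worker timeout is 30 seconds, which fits the symptoms better than nginx's 120s. On the postgres side, slow-query logging can be switched on with something like the following (a sketch: ALTER SYSTEM needs postgres 9.4+ and a superuser, and on older versions the equivalent line goes in postgresql.conf instead):

-- Log any statement that runs for longer than 3 seconds.
ALTER SYSTEM SET log_min_duration_statement = 3000;
-- Pick up the new setting without a restart.
SELECT pg_reload_conf();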

annapowellsmith commented 8 years ago

Thanks! Have dim memory that we're already logging slow queries, but that might be wrong.

annapowellsmith commented 8 years ago

Yeah, I think it's caused by my deleting this index, oops - we can just add it back in since we've got plenty of space for the moment.

sebbacon commented 8 years ago

I wouldn't be too quick to blame removing that index for everything; there is evidence of similar issues going back to June. But yes, it clearly needs adding back in (will do it now).

Yes, we were logging slow queries (> 3000ms).

Some of these queries are now returning very quickly, so it is quite possible that it's related to server load (having to service 3 or 4 analyses concurrently, for example).
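If concurrent load is the suspect, a quick way to see how many queries are in flight at any moment is pg_stat_activity (the state and query columns assume postgres 9.2+):

-- Currently running statements, longest-running first.
SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC;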

Postgres logs errors like the following. My interpretation is that we are again seeing the postgres client (the gunicorn worker) being killed by gunicorn's timeout before postgres can respond. This, of course, is likely to cause end users to reload their browsers and send the query again, potentially compounding the problem.

2016-10-02 11:33:55 UTC:127.0.0.1(54945):prescribing@prescribing:[23347]: LOG:  statement: SHOW default_transaction_isolation
2016-10-02 11:33:55 UTC:127.0.0.1(54945):prescribing@prescribing:[23347]: LOG:  statement: SELECT pr.practice_id AS row_id, pc.name AS row_name, pc.setting AS setting, pc.ccg_id AS ccg, pr.processing_date AS date, SUM(pr.actual_cost) AS actual_cost, SUM(pr.total_items) AS items, CAST(SUM(pr.quantity) AS bigint) AS quantity FROM frontend_prescription pr JOIN frontend_practice pc ON pr.practice_id=pc.code WHERE (pr.presentation_code LIKE '0208020Z0AA%' ) AND (pr.pct_id='03Q' ) GROUP BY pr.practice_id, pc.code, date ORDER BY date, pr.practice_id
2016-10-02 11:36:21 UTC:127.0.0.1(54945):prescribing@prescribing:[23347]: LOG:  could not send data to client: Broken pipe
2016-10-02 11:36:21 UTC:127.0.0.1(54945):prescribing@prescribing:[23347]: STATEMENT:  SELECT pr.practice_id AS row_id, pc.name AS row_name, pc.setting AS setting, pc.ccg_id AS ccg, pr.processing_date AS date, SUM(pr.actual_cost) AS actual_cost, SUM(pr.total_items) AS items, CAST(SUM(pr.quantity) AS bigint) AS quantity FROM frontend_prescription pr JOIN frontend_practice pc ON pr.practice_id=pc.code WHERE (pr.presentation_code LIKE '0208020Z0AA%' ) AND (pr.pct_id='03Q' ) GROUP BY pr.practice_id, pc.code, date ORDER BY date, pr.practice_id
2016-10-02 11:36:21 UTC:127.0.0.1(54945):prescribing@prescribing:[23347]: FATAL:  connection to client lost
2016-10-02 11:36:21 UTC:127.0.0.1(54945):prescribing@prescribing:[23347]: STATEMENT:  SELECT pr.practice_id AS row_id, pc.name AS row_name, pc.setting AS setting, pc.ccg_id AS ccg, pr.processing_date AS date, SUM(pr.actual_cost) AS actual_cost, SUM(pr.total_items) AS items, CAST(SUM(pr.quantity) AS bigint) AS quantity FROM frontend_prescription pr JOIN frontend_practice pc ON pr.practice_id=pc.code WHERE (pr.presentation_code LIKE '0208020Z0AA%' ) AND (pr.pct_id='03Q' ) GROUP BY pr.practice_id, pc.code, date ORDER BY date, pr.practice_id

End users should now be getting 500s rather than 404s. That misconfiguration may well have been responsible for us missing a number of errors in the past, so we might start seeing new errors soon.

sebbacon commented 8 years ago

I wonder if we should consider using New Relic to collect errors etc. The starter plan is $13 a month and I've found it very useful in RoR environments.

sebbacon commented 8 years ago

Now running in a screen session:

CREATE INDEX frontend_prescription_by_pct_and_code on frontend_prescription (presentation_code,pct_id);
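(One caveat: a plain CREATE INDEX blocks writes to the table while it builds. If that proves to be a problem on the live database, the non-blocking variant would be:

CREATE INDEX CONCURRENTLY frontend_prescription_by_pct_and_code ON frontend_prescription (presentation_code, pct_id);

though it builds more slowly and can't run inside a transaction.)
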
annapowellsmith commented 8 years ago

An identical(?) index already exists: "frontend_prescription_by_pct" btree (presentation_code, pct_id).

I think we might need one that's the other way around.

sebbacon commented 8 years ago

Thanks, I didn't see that. I've cancelled the operation.

I don't believe it matters which way round the columns are defined in a concatenated index, though I suppose the only way to be sure is to create both and run some tests (a bit difficult given we only have a production environment, but we can try it anyway...)
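Since DDL is transactional in postgres, one relatively low-risk way to test a candidate ordering on production is to build it inside a transaction, check the plan, and roll back. A sketch (the index name is just for illustration, and note that a plain CREATE INDEX blocks writes while it runs, so this wants doing at a quiet moment):

BEGIN;
CREATE INDEX test_pct_then_code ON frontend_prescription (pct_id, presentation_code);
EXPLAIN ANALYZE
SELECT practice_id, SUM(actual_cost)
FROM frontend_prescription
WHERE presentation_code LIKE '0208020Z0AA%' AND pct_id = '00Q'
GROUP BY practice_id;
ROLLBACK;  -- the test index (and its lock) disappears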

Let's look at the actual query being run:

SELECT
  pr.practice_id AS row_id,
  pc.name AS row_name,
  pc.setting AS setting,
  pc.ccg_id AS ccg,
  pr.processing_date AS date,
  SUM(pr.actual_cost) AS actual_cost,
  SUM(pr.total_items) AS items,
  CAST(SUM(pr.quantity) AS bigint) AS quantity
FROM frontend_prescription pr
JOIN frontend_practice pc
  ON pr.practice_id = pc.code
WHERE (pr.presentation_code LIKE '0208020Z0AA%')
AND (pr.pct_id = '00Q')
GROUP BY pr.practice_id,
         pc.code,
         date
ORDER BY date, pr.practice_id

The EXPLAIN ANALYZE output is as follows:

                                                                                         QUERY PLAN                                                                                         
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=68515.51..68533.90 rows=613 width=69) (actual time=835.586..836.393 rows=599 loops=1)
   Group Key: pr.processing_date, pr.practice_id, pc.code
   ->  Sort  (cost=68515.51..68517.04 rows=613 width=69) (actual time=835.552..835.703 rows=904 loops=1)
         Sort Key: pr.processing_date, pr.practice_id, pc.code
         Sort Method: quicksort  Memory: 152kB
         ->  Nested Loop  (cost=0.57..68487.12 rows=613 width=69) (actual time=72.733..833.009 rows=904 loops=1)
               ->  Seq Scan on frontend_practice pc  (cost=0.00..644.17 rows=14717 width=38) (actual time=0.014..10.908 rows=14717 loops=1)
               ->  Index Scan using frontend_prescription_by_practice_and_code on frontend_prescription pr  (cost=0.57..4.60 rows=1 width=31) (actual time=0.055..0.055 rows=0 loops=14717)
                     Index Cond: (((practice_id)::text = (pc.code)::text) AND ((presentation_code)::text ~>=~ '0208020Z0AA'::text) AND ((presentation_code)::text ~<~ '0208020Z0AB'::text))
                     Filter: (((presentation_code)::text ~~ '0208020Z0AA%'::text) AND ((pct_id)::text = '00Q'::text))
                     Rows Removed by Filter: 17
 Planning time: 5.707 ms
 Execution time: 836.675 ms
(13 rows)

Note it is using the index frontend_prescription_by_practice_and_code, which looks like this:

   Index "public.frontend_prescription_by_practice_and_code"
      Column       |         Type          |    Definition     
-------------------+-----------------------+-------------------
 practice_id       | character varying(6)  | practice_id
 presentation_code | character varying(15) | presentation_code
btree, for table "public.frontend_prescription"

The index @annapowellsmith pointed out looks like this:

          Index "public.frontend_prescription_by_pct"
      Column       |         Type          |    Definition     
-------------------+-----------------------+-------------------
 presentation_code | character varying(15) | presentation_code
 pct_id            | character varying(3)  | pct_id
btree, for table "public.frontend_prescription"

The planner has decided it is cheaper to use the (practice_id, presentation_code) index than the (presentation_code, pct_id) index, even though the condition is WHERE (pr.presentation_code LIKE '0208020Z0AA%') AND (pr.pct_id = '00Q'). While I don't completely understand this, it seems likely to be due to the JOIN on the practice_id column: presumably reading the join key from an index we have to visit anyway is cheaper than scanning the table after applying the other index.

So my hypothesis is that the index we actually need is on (presentation_code, pct_id, practice_id). However, this will be a pretty big index and won't be usable when the condition is WHERE presentation_code LIKE 'foo%' AND practice_id = 'baz' (i.e. no pct_id specified). If we wanted the same index to be usable in both cases we'd have to add a pct_id condition to each query on practice_id (which is certainly possible, if a little ugly, by looking up all the CCGs a practice has ever been part of, perhaps in a subselect).
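For concreteness, the hypothesized index would be something like this (the name is illustrative; the varchar_pattern_ops operator class, which comes up below, would be needed on the leading column for the LIKE-prefix condition to use it under a non-C locale):

CREATE INDEX frontend_prescription_by_code_pct_practice ON frontend_prescription (presentation_code varchar_pattern_ops, pct_id, practice_id);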

Another alternative is to avoid the JOIN completely in the query. It's presumably there to ensure we are using the current CCG membership for a practice, rather than a historic membership. (I'm not sure the logic is correct in any case: we filter on the CCG id recorded in the prescription, but return the CCG id of the practice mentioned in the prescription, which means you could potentially get data from the API for CCGs you think you've not asked about.)

I can't currently think of a simple way of avoiding this JOIN, short of a new (huge) materialized view or similar strategy.
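For the record, the materialized view route would look something like this (assuming postgres 9.3+; the shape is illustrative, and it would duplicate most of the prescription table and need refreshing after every import, which is what makes it unattractive):

CREATE MATERIALIZED VIEW frontend_prescription_denormalized AS
SELECT pr.practice_id, pc.name, pc.setting, pc.ccg_id,
       pr.pct_id, pr.presentation_code, pr.processing_date,
       pr.actual_cost, pr.total_items, pr.quantity
FROM frontend_prescription pr
JOIN frontend_practice pc ON pr.practice_id = pc.code;

-- after each monthly import:
REFRESH MATERIALIZED VIEW frontend_prescription_denormalized;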

annapowellsmith commented 8 years ago

Pretty sure the ordering of the columns does matter, and it will speed things up to have one with pct_id first.

The query planner is an enigma wrapped in a mystery etc, but I think it's choosing to use another index rather than frontend_prescription_by_pct because this is a LIKE query - using that index would return a huge chunk of the table on the first part of the condition, and then it would need to refine it down by PCT ID, and the planner probably thinks that would be slower than using an existing index. But the planner moves in mysterious ways.

We need the JOIN in order to get practice setting, and we need that because we return practice setting in the API call so that we can hide non-standard practices in the charts.

I'm running this in a screen session now; will analyze and see if that improves things:

CREATE INDEX frontend_prescription_by_pct_and_presentation ON frontend_prescription (pct_id, presentation_code varchar_pattern_ops);

annapowellsmith commented 8 years ago

New Relic - definitely! Cloudflare offer it as an integrated service - if you look in the Apps section you'll see I had a go at setting it up at one point.

annapowellsmith commented 8 years ago

Oh, I know why it's not using frontend_prescription_by_pct - it's because it is a LIKE query, and that index hasn't been set up with varchar_pattern_ops.
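To spell that out: under a non-C locale, a plain btree index on a varchar column can't serve LIKE 'prefix%' at all, whereas one declared with varchar_pattern_ops lets the planner rewrite the LIKE into the ~>=~ / ~<~ range scan visible in the EXPLAIN output above. Illustrative index names:

-- Not usable for LIKE 'prefix%' under a non-C locale:
CREATE INDEX idx_code_plain ON frontend_prescription (presentation_code);

-- Usable: the planner turns LIKE 'prefix%' into a range scan on this index.
CREATE INDEX idx_code_pattern ON frontend_prescription (presentation_code varchar_pattern_ops);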

sebbacon commented 8 years ago

Couple of observations:

annapowellsmith commented 8 years ago

Responses:

I think re-creating the index on frontend_prescription(pct_id, presentation_code varchar_pattern_ops) has solved the issue, but we'll see!

Closing; we can reopen if necessary.