ebmdatalab / price-per-dose

2 stars 1 forks source link

Handle formulation swaps #8

Closed sebbacon closed 7 years ago

sebbacon commented 7 years ago

We have come up with two possible formulation equivalents:

HelenCEBM commented 7 years ago

The following code extracts all items in the BNF having different possible formulations (but with all other parts of the presentation description being identical, i.e. same chemical and dose size, e.g.500mg tablets can only link up with other formulations of the same dose size, such as 500mg capsules).

Note:



WITH presentations AS (

SELECT
    presentation_code,
    presentation,
    REGEXP_EXTRACT(presentation, r'([^_]+)_') AS before_underscore,
    REGEXP_EXTRACT(presentation, r'[^_]+_(.*?)(?:[0-9]+.*)?$') AS after_underscore,
    REGEXP_EXTRACT(presentation, r'[^_]+_.*?([0-9]+.*)?$') AS remainder
  FROM
    ebmdatalab.hscic.bnf
  WHERE
    REGEXP_EXTRACT(presentation, r'([^_]+)_') NOT LIKE "%Gppe%"
    AND SUBSTR(presentation_code,10,2) = 'AA'
    AND SUBSTR(presentation_code,1,2) < '20'
  )

SELECT alts.*, sum(sept.quantity) AS Sept_Quantity, sum(net_cost) AS Sept_NetCost 
FROM 
  (  
  SELECT 
    a.presentation_code AS code,
    a.presentation,
    a.after_underscore AS formulation,
    b.presentation_code AS alt_code,
    b.presentation AS alt_presentation,
    b.after_underscore AS alt_formulation
  FROM presentations a
    INNER JOIN presentations b
    ON a.before_underscore = b.before_underscore and a.remainder= b.remainder
    AND a.after_underscore <> b.after_underscore
  ORDER BY code
  ) alts
INNER JOIN ebmdatalab.tmp_eu.prescribing_sept sept
ON alts.code = sept.bnf_code
GROUP BY code,
    presentation,
    formulation,
    alt_code,
    alt_presentation,
    alt_formulation

ORDER BY code
sebbacon commented 7 years ago

Pivotted data here.

@richiecroker, I suggest you go through the list (it's sorted by quantity) marking the final column Y or N for if the two formulations really are equivalent.

I think we probably should be maintaining a manually curated list of equivalents rather than trying to to this automatically, as there are reputational risks to things which aren't considered clinically equivalent slipping in (though you were sceptical there were any such things @richiecroker, we should go through the list to check anyway)

sebbacon commented 7 years ago

For testing purposes, I'm going to try the following equivalents against current data and see what happens:

richiecroker commented 7 years ago

Are we able to get the alternate formulations quantity on the same pivot?

richiecroker commented 7 years ago

The Google Sheet has now been updated:

I've made the following assumptions of whether this is true:

richiecroker commented 7 years ago

The new set of data is more granular than the old - the existing HSCIC dataset only has a 'total quantity', which is the sum of all the quantity x items for that presentation, whereas the newer set has the number of items for each different quantity. So, using our existing HSCIC data feed, we should only have to to calculate proxy_ADQ x quantity.