Closed sebbacon closed 7 years ago
The following code extracts all items in the BNF having different possible formulations (but with all other parts of the presentation description being identical, i.e. same chemical and dose size, e.g.500mg tablets
can only link up with other formulations of the same dose size, such as 500mg capsules
).
Note:
Heavy Cap
or Oral Susp
. REGEXP
to find all words appearing before a number (normally this is the dose size). If no number is present then all words after the underscore are returned.
Next steps:
WITH presentations AS (
SELECT
presentation_code,
presentation,
REGEXP_EXTRACT(presentation, r'([^_]+)_') AS before_underscore,
REGEXP_EXTRACT(presentation, r'[^_]+_(.*?)(?:[0-9]+.*)?$') AS after_underscore,
REGEXP_EXTRACT(presentation, r'[^_]+_.*?([0-9]+.*)?$') AS remainder
FROM
ebmdatalab.hscic.bnf
WHERE
REGEXP_EXTRACT(presentation, r'([^_]+)_') NOT LIKE "%Gppe%"
AND SUBSTR(presentation_code,10,2) = 'AA'
AND SUBSTR(presentation_code,1,2) < '20'
)
SELECT alts.*, sum(sept.quantity) AS Sept_Quantity, sum(net_cost) AS Sept_NetCost
FROM
(
SELECT
a.presentation_code AS code,
a.presentation,
a.after_underscore AS formulation,
b.presentation_code AS alt_code,
b.presentation AS alt_presentation,
b.after_underscore AS alt_formulation
FROM presentations a
INNER JOIN presentations b
ON a.before_underscore = b.before_underscore and a.remainder= b.remainder
AND a.after_underscore <> b.after_underscore
ORDER BY code
) alts
INNER JOIN ebmdatalab.tmp_eu.prescribing_sept sept
ON alts.code = sept.bnf_code
GROUP BY code,
presentation,
formulation,
alt_code,
alt_presentation,
alt_formulation
ORDER BY code
Pivotted data here.
@richiecroker, I suggest you go through the list (it's sorted by quantity) marking the final column Y or N for if the two formulations really are equivalent.
I think we probably should be maintaining a manually curated list of equivalents rather than trying to to this automatically, as there are reputational risks to things which aren't considered clinically equivalent slipping in (though you were sceptical there were any such things @richiecroker, we should go through the list to check anyway)
For testing purposes, I'm going to try the following equivalents against current data and see what happens:
0407010H0AAAAAA
) vs Paracet_Tab 500mg (0407010H0AAAMAM
)040702040AAACAC
) vs Tramadol HCl_Cap 100mg M/R (040702040AAAHAH
)0407010F0AAAHAH
) vs Co-Codamol_Cap 30mg/500mg, (0407010F0AAADAD
)Are we able to get the alternate formulations quantity on the same pivot?
The Google Sheet has now been updated:
I've made the following assumptions of whether this is true:
The new set of data is more granular than the old - the existing HSCIC dataset only has a 'total quantity', which is the sum of all the quantity x items
for that presentation, whereas the newer set has the number of items for each different quantity. So, using our existing HSCIC data feed, we should only have to to calculate proxy_ADQ x quantity
.
We have come up with two possible formulation equivalents:
Capsules and tablets
Suspensions and solutions
[x] First, write a query to work out how many possible paired tabs/caps there are. We think we just need to extract the text between the underscore and the first space, group by the remainder and then the extracted bit
[x] Make a decision about if we want to do this globally
[x] At a minimum, just do it for Tramadol