OHDSI / CommonDataModel

Definition and DDLs for the OMOP Common Data Model (CDM)
https://ohdsi.github.io/CommonDataModel

Create model for dissecting the sig field into computable structured data #689

Open cgreich opened 1 month ago

cgreich commented 1 month ago

This is a placeholder, so a proposal could be worked out:

Instead of the free-text sig field (which may even be in a non-English language), we need the actual frequency information. So "2 mL tablets 3 times a day" would become something like:

sig_amount - numeric (2)
sig_unit_concept_id - concept (concept_id for liquid units, most often "mL")
frequency - numeric (3)
frequency_unit_concept_id - concept (concept_id for "day")
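To make the placeholder a bit more concrete, here is a minimal Python sketch of the four proposed fields. The class name `StructuredSig` and the two concept_id constants are hypothetical placeholders (they are not real OMOP concept_ids), and nothing here is part of the CDM yet.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder values for illustration only: NOT real OMOP concept_ids.
ML_CONCEPT_ID = 1
DAY_CONCEPT_ID = 2


@dataclass
class StructuredSig:
    """Hypothetical container for the proposed fields (not part of the CDM)."""
    sig_amount: float               # amount per administration, e.g. 2
    sig_unit_concept_id: int        # concept_id of the amount unit, e.g. "mL"
    frequency: float                # administrations per frequency unit, e.g. 3
    frequency_unit_concept_id: int  # concept_id of the time unit, e.g. "day"
    sig: Optional[str] = None       # the original free text, kept as-is


example = StructuredSig(2, ML_CONCEPT_ID, 3, DAY_CONCEPT_ID, "2 mL 3 times a day")

# Only valid as a daily amount because the frequency unit here is "day".
amount_per_day = example.sig_amount * example.frequency
```

With the frequency unit fixed to "day", the daily amount is simply the product of the two numeric fields.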

marcel1334 commented 1 month ago

My suggestion would be to skip the sig_unit_concept_id attribute. Just like the existing quantity has the unit of the drug_strength denominator, we can use the same unit for the sig_amount. This will make calculations a lot easier since several numbers are all in the same unit.

cgreich commented 1 month ago

Makes sense, @marcel1334. As I said, this is not thought through yet.

Also, we may need to cut frequency into numerator and denominator, to allow for "1 tablet every other day".

marcel1334 commented 1 month ago

It's a feature I wanted to bring up some years ago as well, so I'm happy to see that it's on the table now. Thanks.

Regarding the "1 tablet every other day": in the Netherlands we have a special frequency table for the time units, and "per 2 days" is one of the possible units. Here we have 4 base attributes: frequency, frequencyunit, amount, amountunit (the last one I suggested skipping in the CDM), plus an additional text attribute that can hold optional dose instructions. Example: "ZN 1KDUB". ZN is the code for "if needed" and 1KDUB means "first time double dose". It can also include codes for special dose schemas. Here in the Netherlands it's a simple text value, not database-normalized, but it works well enough. Some of these special values are needed in the dose/day and/or duration calculations. Normalizing would mean an extra table where every code is a separate record, but I think leaving the special codes out or using the text variant would work. Because not all instructions will end up in structured form, I think we need to keep the current sig attribute, or rename it to source_sig.

marcel1334 commented 1 month ago

Or do you mean to use "numerator and denominator" instead of the "timeunit" attribute? So that "1 per 2 days" is just 1/2, "2 times per month" is 2/30, and "5 times per year" is 5/365. In that case, we do not need a timeunit as a concept and a lookup table to convert "per week" to 7 days in our calculations. Then we can simply calculate with the available numbers. I would vote for this.
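A small sketch of how an ETL could produce such a numerator/denominator pair, with the denominator always in days. The function name `frequency_per_day` and the `DAYS_PER_UNIT` table are hypothetical; the month = 30 and year = 365 simplifications are the ones used in the examples above. Note that the conversion table then lives only in the ETL, not in the queries.

```python
from fractions import Fraction

# Days per time unit, using the same simplifications as the examples above.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}


def frequency_per_day(times: int, every: int, unit: str) -> Fraction:
    """Administrations per day for 'times per (every x unit)'."""
    return Fraction(times, every * DAYS_PER_UNIT[unit])


every_other_day = frequency_per_day(1, 2, "day")  # 1 per 2 days
twice_a_month = frequency_per_day(2, 1, "month")  # 2 per 30 days
five_a_year = frequency_per_day(5, 1, "year")     # 5 per 365 days
```

`Fraction` reduces the pair automatically (2/30 is stored as 1/15), which does not change any calculation; a real table would probably store numerator and denominator as two plain numeric columns.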

There are a lot of dynamics in dose instructions. Putting everything in structured attributes would need a very complex set of attributes/tables. Another thing that I see in our data is "3-4 times per day 2-4 tablets". We take the averages and simply calculate with these (in this example 3.5 x 3 = 10.5 tablets per day). Some of the special dose codes we use in the ETL calculations are things like "first time double dose" or the famous "3 weeks and 1 stop week". Sometimes in our research we adjust for the "if needed" part, but in most cases we ignore it. Other examples we ignore are "use before/after/during the meal", "use with water", etc. Two other very frequently used codes are "as known", where we look for the dose instructions in previous prescriptions, and "see product instructions", where we use the DDD from WHO. The original dose instructions, including the details, can be kept in the source_sig.
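The averaging of ranges described above can be sketched in a few lines of Python. The function names are hypothetical and this only illustrates the arithmetic, not the actual IPCI ETL.

```python
def midpoint(low: float, high: float) -> float:
    """Average of a range such as '3-4' or '2-4'."""
    return (low + high) / 2


def amount_per_day(freq_low: float, freq_high: float,
                   amount_low: float, amount_high: float) -> float:
    """Average daily amount for 'freq_low-freq_high times per day, amount_low-amount_high units'."""
    return midpoint(freq_low, freq_high) * midpoint(amount_low, amount_high)


# "3-4 times per day 2-4 tablets": 3.5 x 3 = 10.5 tablets per day
tablets_per_day = amount_per_day(3, 4, 2, 4)
```

A fixed dose is just the case where low and high are equal, so the same function covers "3 times per day 2 tablets".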

For our IPCI database we already extract the attributes from the incoming free-text instructions in our "non-CDM" database (we have used this in our research for many years). So from an ETL point of view, we will be able to fill this for most of our drug_exposures.

So far some thoughts from our side. Looking forward to some additional attributes in the drug_exposure table to store the dose instructions in a structured form.

Cheers, Marcel

cgreich commented 1 month ago

Yes, all these need to be considered. There are:

MelaniePhilofsky commented 1 month ago

If people are going to parse the sig field, shouldn't the data be put into the fields already in the drug_exposure table?

Example: take two pills twice a day for one week. Quantity = 28, start date - end date = 7 days.

The meaning is the same.

cgreich commented 1 month ago

Totally. These are connected. Essentially: Days_supply = quantity / frequency per day.

The problem we are solving is that we get 2 out of 3 (sometimes 3 out of 3) in the data. In US prescriptions, it is days_supply and quantity. But in other countries that may be different, and you get the frequency (from the sig) and either days_supply or quantity. We want to have all such cases covered and always be able to calculate the dose.

It may also help with debugging.
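As a rough sketch of the "2 out of 3" idea in Python: given any two of quantity, days_supply and the amount used per day, derive the third, and when all three are present, check them against each other. The function names are hypothetical, and `per_day` is assumed here to mean quantity units consumed per day (sig amount times daily frequency), which is one reading of "frequency per day" above.

```python
from typing import Optional, Tuple


def complete_triple(quantity: Optional[float],
                    days_supply: Optional[float],
                    per_day: Optional[float]) -> Tuple[float, float, float]:
    """Fill in the one missing value using days_supply = quantity / per_day."""
    missing = [v for v in (quantity, days_supply, per_day) if v is None]
    if len(missing) > 1:
        raise ValueError("need at least two of the three values")
    if quantity is None:
        quantity = days_supply * per_day
    elif days_supply is None:
        if per_day == 0:
            raise ValueError("per_day must be non-zero")
        days_supply = quantity / per_day
    elif per_day is None:
        if days_supply == 0:
            raise ValueError("days_supply must be non-zero")
        per_day = quantity / days_supply
    return quantity, days_supply, per_day


def is_consistent(quantity: float, days_supply: float, per_day: float,
                  tol: float = 1e-9) -> bool:
    """Debugging aid: do three supplied values agree with each other?"""
    return abs(quantity - days_supply * per_day) <= tol


# US-style data: quantity and days_supply known, daily amount derived.
us_style = complete_triple(30, 10, None)
# Sig-style data: daily amount and days_supply known, quantity derived.
sig_style = complete_triple(None, 10, 3)
```

When all three values arrive in the source, `is_consistent` is the kind of check that could flag records worth a second look.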

MelaniePhilofsky commented 1 month ago

Then why do we need:

"sig_amount - numeric (2)" Isn't this number the quantity?

And "frequency" can be derived from the start date - end date (both mandatory fields) or days supply field and the quantity field. Or days supply can be derived if we have quantity and frequency.

cgreich commented 1 month ago

No. The quantity is the total quantity handed out to the patient for a period of time (days_supply). The sig_amount is the "3" in a sig "Take 3 times daily".

But you are right. A simplified proposal could be to get rid of the sig, use the source string to parse out the frequency, and fill in days_supply and quantity. The problem is that the sig is often not that clean. See above for all the funny situations ("2-4 tablets", "as needed", etc.). Do we want to burden the ETL schmock with figuring all that out?

tiozab commented 1 month ago

I like the idea that information from the sig trumps quantity and duration derived from other information, as @MelaniePhilofsky did; especially for dose, the sig is the most reliable information. However, the "for how long" may not always be available in the sig fields (in which case the duration has to be derived from other fields). In that way, days supply and duration may not always need to be the same value, especially if we know that the package would have lasted for 14 days (days supply) but only 10 days of use were prescribed (= 10 days of duration).

I would also like to keep the source sig, because it is often nice to double-check what the source was.

Moreover, I think it does not hurt to create additional fields and spread out the source sig information to two more fields (only two if we standardise the information, more if we don't). Possible standardisations: "number" of "something id" per day, the "something" being a dose form or volume (using the rule of thumb that around 20 drops equal 1 ml).
E.g. "2" "tablets" per day
E.g. "2" "ml" per day
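A tiny Python sketch of the drops-to-volume standardisation, assuming the 20 drops per ml rule of thumb mentioned above (an approximation, since real drop size varies by product). The names are hypothetical.

```python
# Rule of thumb from the comment above, not an exact conversion.
DROPS_PER_ML = 20


def drops_to_ml(drops: float) -> float:
    """Convert a number of drops to an approximate volume in ml."""
    return drops / DROPS_PER_ML


# "10 drops 3 times a day" standardised to "ml per day"
ml_per_day = drops_to_ml(10) * 3
```

The result could then be stored as "1.5" "ml" per day, in the same two-field shape as the tablet example.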

marcel1334 commented 1 month ago

Other examples related to this:

I would also vote to keep the sig as a source, and the same for the other source attributes in the other tables. Especially with the dose_per_day, when calculations give results that do not make any sense, the raw dose text can help to find out what's going on. Even if it is not in your own language, you can often recognise it. Having the source_text attributes is important. Otherwise we run fully blind on the structured data and cannot validate what's going on. There are also multiple studies where the source fields are used in the queries.

Just to be clear about where I stand on this: I'm NOT trying to get a very, very complex database model to capture all the different situations I bring up. I just want to put them on the table to get more insight and help us come to a usable and pragmatic solution. If you do a study where the dose instructions are optional or give the patient freedom, you know the calculations will have issues. And for the asthma exacerbation drugs you also know how it works and have to adjust for this.

I think adding a single "amount_per_day" floating-point attribute, where the unit is the same as the drug_strength denominator (just like the quantity), will work in lots of situations and already make a lot of people very happy.

cgreich commented 1 month ago

Should we collect a good sample of sig strings and then come together to make a decision?

marcel1334 commented 1 month ago

I have 700.000+ unique sig texts for you. I will send the top 1000 (including the extracted parameters) to you by email.

tiozab commented 1 month ago

I think it would be nice to have it from more than 1 database? I will also send @cgreich the top 1000 sigs from CPRD GOLD by email.