OHDSI / CohortGenerator

An R package for instantiating cohorts using data in the CDM.
https://ohdsi.github.io/CohortGenerator/

Documentation on cohort stats tables #79

Open ablack3 opened 1 year ago

ablack3 commented 1 year ago

I'm not aware of any documentation that describes how to interpret the stats/attrition tables created during cohort generation.

@catalamarti decoded the inclusion_result table:

The table has 4 columns: cohort_definition_id, inclusion_rule_mask, person_count, and mode_id.

inclusion_rule_mask interpretation:

Each inclusion rule that a person satisfies contributes 2^(inclusion_id) to the mask, and inclusion_rule_mask is the sum of those contributions. Each mask value therefore identifies one subset of the cohort: the people who satisfy exactly that combination of inclusion rules.

Example: Let's say we have 3 inclusion rules (inclusion rule 0, inclusion rule 1, and inclusion rule 2). The first inclusion rule contributes 2^0=1, the second 2^1=2, and the third 2^2=4. So, for example, individuals who satisfy the second and third rules but not the first are recorded under inclusion_rule_mask = 2 + 4 = 6.

See the table below for all combinations for the three rules example:

inclusion 0   inclusion 1   inclusion 2   inclusion_rule_mask
  no            no            no            0
  yes           no            no            1
  no            yes           no            2
  yes           yes           no            3
  no            no            yes           4
  yes           no            yes           5
  no            yes           yes           6
  yes           yes           yes           7
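The table above can be reproduced with a few lines of base R. This is only a minimal sketch: the column names inclusion_0 to inclusion_2 and the helper rule_met() are made up for illustration and are not part of CohortGenerator.

```r
# Enumerate every mask value for three inclusion rules (ids 0, 1, 2).
n_rules <- 3L
mask <- 0:(2^n_rules - 1)  # 0..7

# Hypothetical helper: rule i is satisfied when bit i (value 2^i) is set.
rule_met <- function(mask, rule_id) {
  bit <- as.integer(2^rule_id)
  bitwAnd(mask, bit) == bit
}

combos <- data.frame(
  inclusion_0 = rule_met(mask, 0),
  inclusion_1 = rule_met(mask, 1),
  inclusion_2 = rule_met(mask, 2),
  inclusion_rule_mask = mask
)
print(combos)
```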

In this case, we will build our attrition as:

  all qualifying initial events:  0+1+2+3+4+5+6+7
  satisfy inclusion 0:            1+3+5+7
  satisfy inclusion 0 and 1:      3+7
  satisfy inclusion 0, 1, and 2:  7

Can we add this to the CohortGenerator vignette or put it somewhere else?

anthonysena commented 1 year ago

Thanks @ablack3 for raising this issue and agree this would be useful to document, either in this package or in circe-be. Tagging @chrisknoll since I am unsure what resources (beyond your write up above) exist. If we could link to those resource(s) in this issue, we could then add it to the CohortGenerator package (or link to it from the CG package).

chrisknoll commented 1 year ago

Hi everyone, sorry for the late reply here. I just wanted to clarify something:

In this case, we will build our attrition as:

  all qualifying initial events:  0+1+2+3+4+5+6+7
  satisfy inclusion 0:            1+3+5+7
  satisfy inclusion 0 and 1:      3+7
  satisfy inclusion 0, 1, and 2:  7

The 'all qualifying initial events' line is correct: if you want the count of people who had entry events, you just sum up all the rows (including the 0 row, since we record the number of people who matched 0 rules). I was a little confused when @ablack3 described it as 0+1+2+3+4+5+6+7, but that's all the combinations of 3 inclusion rules, so it's technically correct. There's just a simpler implementation: sum up person_count over every row in inclusion_result.

To find the rows that match certain inclusion rules, you would use the bitwise AND operator & to test whether the number in the inclusion_rule_mask column has the bits set for the inclusion rules you want to test. So 'satisfy inclusion 0' means: is the first bit (2^0 = 1) set? To find out, you would do inclusion_rule_mask & 1 = 1. In this single-bit case any result > 0 would indicate that the flag is set; in the multi-bit test, however, it becomes clearer why you compare against the value itself:

'satisfy inclusion 0 and 1' means that the bits you are looking for are 2^0 + 2^1 = 1 + 2 = 3 (the same as the inclusion_rule_mask in the table above). To find the rows that have those 2 bits set: inclusion_rule_mask & 3 = 3. Why the = 3? Because if you tested a row where inclusion_rule_mask was 1, the bitwise AND would give 1 & 3 = 1 (i.e. only the 1 bit of the 3 is set). What you want to ensure is that the bitwise AND returns exactly the bits you are testing for. 3 & 3 = 3 and 7 & 3 = 3. The other rows are: 0 & 3 = 0, 1 & 3 = 1, 2 & 3 = 2, 4 & 3 = 0, 5 & 3 = 1, 6 & 3 = 2. Note how some of these give a number > 0, but not the number you are testing for. Anything that is not 3 does not match on the first AND second rule.
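To see the difference between "some bit is set" and "all the bits I asked for are set", here is a small base R sketch using bitwAnd(). The variable names (masks, wanted, check) are illustrative only.

```r
masks <- 0:7   # every possible mask for three inclusion rules
wanted <- 3L   # bits for rule 0 (value 1) and rule 1 (value 2)

anded <- bitwAnd(masks, wanted)

check <- data.frame(
  inclusion_rule_mask = masks,
  anded = anded,
  any_bit_set = anded > 0,        # too loose: true if either rule matched
  all_bits_set = anded == wanted  # true only if both rules matched
)
print(check)
```

Only the rows with inclusion_rule_mask 3 and 7 pass the all_bits_set test, which agrees with the "3+7" line of the attrition.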

I hope this clarifies things. I was personally a little confused when I read "satisfy inclusion 0: 1+3+5+7", but I now understand that to mean you add up person_count for the rows where inclusion_rule_mask = 1 or 3 or 5 or 7. Mechanically, that is filter(bitwAnd(inclusion_rule_mask, 1L) == 1L) %>% summarise(sum(person_count)) (in R dplyr pseudocode :) )
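Applying the same filter at every step gives a sequential attrition table. The sketch below is not a CohortGenerator function: attrition_table() is a hypothetical helper, the person counts are invented, and it assumes inclusion_result has already been read into a local data frame and filtered to a single cohort_definition_id and a single mode_id.

```r
# Hypothetical inclusion_result rows for one cohort and one mode_id.
# The person counts are invented purely for illustration.
inclusion_result <- data.frame(
  inclusion_rule_mask = 0:7,
  person_count = c(10, 20, 5, 15, 8, 12, 4, 26)
)

# Row k counts persons who satisfy inclusion rules 0..k-1;
# row 0 is all qualifying initial events (no rule required).
attrition_table <- function(inclusion_result, n_rules) {
  # 2^k - 1 has the lowest k bits set: 0, 1, 3, 7, ...
  required_bits <- as.integer(2^(0:n_rules) - 1)
  remaining <- vapply(required_bits, function(bits) {
    keep <- bitwAnd(inclusion_result$inclusion_rule_mask, bits) == bits
    sum(inclusion_result$person_count[keep])
  }, numeric(1))
  data.frame(
    rules_applied = 0:n_rules,
    persons_remaining = remaining
  )
}

attrition <- attrition_table(inclusion_result, n_rules = 3L)
print(attrition)
```

With these invented counts, persons_remaining comes out as 100, 73, 41, and 26, matching the sums over mask rows 0 to 7, then 1+3+5+7, then 3+7, then 7.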

ablack3 commented 1 year ago

Thanks @chrisknoll. This would be really helpful to add to a vignette or some other documentation. So actually I have to give @catalamarti credit for decoding this. I just read over it and posted his description here.

pa-nathaniel commented 10 months ago

Jumping in this thread as I got here while trying to figure out how to obtain a cohort attrition table from cohorts created by CohortGenerator::generateCohortSet().

Essentially we're trying to figure out how to generate a table where each row represents an inclusion criterion, with a column that shows the number of persons remaining in the cohort after that inclusion criterion is applied.

Have there been updates to the documentation here on how to create this?

FYI I also posted a similar question in https://forums.ohdsi.org/t/how-to-get-attrition-table-in-hades/19746.

chrisknoll commented 10 months ago

Hi, I think there has been a request to provide a function that can read the cohort generation stats tables (that store the individual inclusion rule matches, and the combination of inclusion rules described above).

I've seen implementations of building this attrition table in JavaScript (Atlas does it this way), and we've also done it internally using R code (I believe @gowthamrao and Joel Swerdel have implemented this). I think it would make sense to expose a CohortGenerator function to read the results in the generation stats tables and produce attrition tables, and I'm happy to help make a PR to implement this feature.

pa-nathaniel commented 10 months ago

Thanks @chrisknoll ! We're trying right now to come up with something (will share if we get it right), but until then would love to see what you and others have come up with.

anthonysena commented 9 months ago

Relating this to #123: the two issues are a bit different, but the documentation should cover both approaches.