Clamp `minCellCount` entries in `*.csv` results files

OHDSI / CohortIncidence

Contains the Java and R assets to perform Incidence calculations on a CDM

https://ohdsi.github.io/CohortIncidence/


Clamp `minCellCount` entries in `*.csv` results files #31

Open msuchard opened 1 year ago

msuchard commented 1 year ago

Most data sources cannot release CohortIncidence results files as they contain entries with counts < minCellCount. Please implement before next round of network studies.

chrisknoll commented 1 year ago

@msuchard, I have prioritized this, but I need the rules for minCellCount. Does it apply to the number of outcomes? Persons at risk? Pre-exclude persons at risk? Please let me know what the min-cell rules are and I'll implement them.

The reason why I left this out is that different contexts may have different min-cell rules and I was considering leaving it to the publishers to trim their results and not depend on a specific design choice in the tool.

chrisknoll commented 1 year ago

@msuchard, I'm still waiting for your response. There is some confusion about how minCell is being applied in different HADES tools: in some cases, the records that don't satisfy min cell are removed (I believe FeatureExtraction works this way). In other cases, the count is replaced by the negative of the minCell parameter. These are very different behaviors, so I'd like to understand what the standard approach is.

Alternatively, we can leave this alone in this tool, and leave it to the users of the tool to trim their own results based on their own rules.
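For illustration, the two behaviors described above can be sketched side by side. This is only a rough sketch in Python, not code from any HADES package: the function names, the `outcomes` column, and the choice to treat every count strictly below the threshold (including zero) as small are all assumptions.

```python
def drop_small_cells(rows, column, min_cell_count):
    """Behavior 1 (assumed): remove rows whose count is below the threshold."""
    return [row for row in rows if row[column] >= min_cell_count]


def mask_small_cells(rows, column, min_cell_count):
    """Behavior 2 (assumed): keep every row, replace small counts with -min_cell_count."""
    masked = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is not modified
        if row[column] < min_cell_count:
            row[column] = -min_cell_count
        masked.append(row)
    return masked


# Hypothetical result rows, only for demonstration.
rows = [
    {"outcome_id": 1, "outcomes": 3},
    {"outcome_id": 2, "outcomes": 25},
]

print(drop_small_cells(rows, "outcomes", 10))
print(mask_small_cells(rows, "outcomes", 10))
```

The first approach changes the number of rows a downstream consumer sees, while the second keeps the row and signals the censoring through a negative value, which is why the two are not interchangeable.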

chrisknoll commented 1 year ago

Spoke to @anthonysena and we're going to put this logic in CohortIncidenceModule where we have the CSV/dataframe to filter.

chrisknoll commented 1 year ago

This has been addressed in CohortIncidenceModule.

msuchard commented 1 year ago

thank you @chrisknoll !!!

msuchard commented 1 year ago

Hi @anthonysena, @chrisknoll and @pbr6cornell --

I don't think the solution currently in CohortIncidenceModule will pass muster. It is still possible to back calculate patient or event counts that are less than minCellCount through the reported incidence_proportion_p100p and incidence_rate_p100py.

As a hypothetical example, suppose I have 798 patients at risk and a (currently not truncated in the output) incidence proportion of 0.12531328 per 100 persons. Even though my outcome count is truncated to -10, I can still back-calculate the count: 798 × 0.12531328 / 100 = 1.
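The back-calculation is simple enough to sketch. The snippet below is a minimal illustration in Python with made-up round numbers; the function name is hypothetical, and it assumes the `p100p` suffix means the proportion is expressed per 100 persons.

```python
def back_calculate_count(persons_at_risk, incidence_proportion_p100p):
    """Recover a censored count from an uncensored proportion (per 100 persons)."""
    return round(persons_at_risk * incidence_proportion_p100p / 100)


# Hypothetical row: the outcome count was replaced by -10 in the results file,
# but the persons at risk and the proportion were left as computed.
persons_at_risk = 800
incidence_proportion_p100p = 0.125

recovered = back_calculate_count(persons_at_risk, incidence_proportion_p100p)
print(recovered)
```

Any reader of the results file can run the same arithmetic, so masking the count alone does not protect the small cell.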

I think both the reported proportion and rate also need to be truncated.

When the outcome count is truncated, then the truncated proportion / rate is < minCellCount / reported patients.

When the patient count is truncated and the outcome count is not 0, then things get a little undefined in terms of > or < ... so can we just return rates of NA here?
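One way to read the rule in the two paragraphs above is as an upper bound that replaces the exact proportion. The sketch below is an interpretation, not an agreed design: the function name is hypothetical, the factor of 100 is an assumption taken from the `p100p` column suffix, and a negative person count is assumed to mean the count was itself truncated.

```python
def proportion_upper_bound_p100p(persons_at_risk, min_cell_count):
    """Upper bound on the proportion (per 100 persons) when the outcome count is censored.

    Returns None (standing in for NA) when the person count is itself truncated,
    because the bound is then undefined.
    """
    if persons_at_risk < 0:  # assumed convention: negative means truncated
        return None
    return 100 * min_cell_count / persons_at_risk


print(proportion_upper_bound_p100p(800, 10))
print(proportion_upper_bound_p100p(-10, 10))
```

The same idea would apply to the rate, with person-time in the denominator instead of the person count.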

Let me know when the Main.R in the module is updated and I'll give it a try.

chrisknoll commented 1 year ago

These are excellent, albeit late, points!

How about we return NA in any calculated column where one of the source columns was truncated? So, if persons at risk or outcomes < minCell value, those will be set to -minCellValue, and the proportions and rates will be set to NA?
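A minimal sketch of this proposal, written in Python rather than the module's R and operating on plain dictionaries, might look like the following. The `persons_at_risk` and `outcomes` keys and the function name are assumptions for illustration; `incidence_proportion_p100p` and `incidence_rate_p100py` are the column names mentioned earlier in the thread, and `None` stands in for NA.

```python
def censor_row(row, min_cell_count):
    """Return a censored copy of one result row (a plain dict)."""
    censored = dict(row)
    truncated = False

    # Replace any small source count with the negative of the threshold.
    for column in ("persons_at_risk", "outcomes"):
        if censored[column] < min_cell_count:
            censored[column] = -min_cell_count
            truncated = True

    # If any source count was truncated, blank the derived columns so the
    # count cannot be back-calculated from them.
    if truncated:
        censored["incidence_proportion_p100p"] = None
        censored["incidence_rate_p100py"] = None
    return censored


# Hypothetical row, only for demonstration.
row = {
    "persons_at_risk": 800,
    "outcomes": 1,
    "incidence_proportion_p100p": 0.125,
    "incidence_rate_p100py": 0.3,
}
print(censor_row(row, 10))
```

Rows where both counts meet the threshold pass through unchanged, so only the affected rows lose their proportion and rate.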