OHDSI / CohortMethod

An R package for performing new-user cohort studies in an observational database in the OMOP Common Data Model.
https://ohdsi.github.io/CohortMethod

Feature request: exclude highly correlated covariates from propensity score calculation #155

Closed SulevR closed 1 month ago

SulevR commented 10 months ago

Currently, CohortMethod::createPs() checks whether any covariates are highly correlated with the treatment. This is a very useful feature: whenever such a correlation is found, I think the propensity scores become highly biased and hardly usable for matching. However, there is no automatic way to exclude such covariates from the propensity score calculation. The only option is to add them manually to the exclude list, which is... painful, as it interrupts the automated flow. Furthermore, there is no good way to identify and remove the highly correlated covariates programmatically (CohortMethod::createPs() only prints them).

I tried to solve this problem by relying on the cohort definitions, excluding the concept_ids listed there from the CohortMethodData. However, some highly correlated events cannot be removed this way (for instance, when a diagnostic test result is the index event, but the examination or the taking of the test, which is not part of the cohort definition, is also highly correlated).

As a second attempt, I reused some code from the CohortMethod::createPs() function to identify the correlated covariates. But this approach has several problems: 1) I have to build the studyPopulation twice, which is time-consuming (first to identify the correlations, then to apply the exclude list); 2) some correlated covariates have no concept_id to exclude (e.g. index year may be correlated) and require a more advanced exclusion mechanism (turning off a covariate settings flag); 3) it is an ugly copy-paste hack; 4) it does not work with the multi-analysis approach.
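For illustration, the kind of check described above can be sketched in base R. This is a minimal sketch under stated assumptions, not the code createPs() actually uses: the function name findHighCorrelation, the dense covariate matrix, the example covariate names, and the 0.5 threshold are all hypothetical.

```r
# Hypothetical helper, NOT part of CohortMethod: flag covariates whose absolute
# Pearson correlation with the treatment indicator exceeds a threshold.
# The default threshold of 0.5 is an assumption chosen for illustration.
findHighCorrelation <- function(treatment, covariates, threshold = 0.5) {
  stopifnot(length(treatment) == nrow(covariates))
  empty <- data.frame(covariate = character(0), correlation = numeric(0),
                      stringsAsFactors = FALSE)
  # Drop constant columns: their correlation with anything is undefined.
  varies <- apply(covariates, 2, function(x) length(unique(x)) > 1)
  covariates <- covariates[, varies, drop = FALSE]
  if (ncol(covariates) == 0) {
    return(empty)
  }
  # One correlation per remaining covariate column.
  correlation <- as.vector(cor(covariates, treatment))
  flagged <- abs(correlation) > threshold
  data.frame(covariate = colnames(covariates)[flagged],
             correlation = correlation[flagged],
             stringsAsFactors = FALSE)
}

# Toy data: six persons, three binary covariates (all names are made up).
treatment <- c(1, 1, 1, 0, 0, 0)
covariates <- cbind(
  indexTest = c(1, 1, 1, 0, 0, 0), # tracks the treatment exactly
  age65plus = c(1, 0, 1, 1, 0, 1), # unrelated to the treatment
  constant  = c(1, 1, 1, 1, 1, 1)  # no variation, dropped before cor()
)
highCorrelation <- findHighCorrelation(treatment, covariates)
print(highCorrelation)
```

The returned data frame only reports candidates; whether to exclude them, redesign the comparison, or stop is left to the caller.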

So, I'm having trouble finding the best way to automate this step and am asking for advice.

It seems most reasonable to have a feature in the CohortMethod package that automatically excludes the highly correlated covariates from propensity score fitting. I know that doing too much in the background will make us lazy, but when conducting many analyses with dozens of cohorts in a row, this manual step in the middle is very inconvenient.

schuemie commented 10 months ago

I don't think automatically removing them is a good idea. We need to review whether they reflect actual, relevant differences between target and comparator. If so, we may need to redesign the study, or stop altogether.

I do think we should do a better job of recording the high-correlation covariates. Also the Comparator Selection Tool does a great job of identifying high-correlation covariates beforehand.

SulevR commented 10 months ago

I'm still urging you to create a separate function for determining the correlated covariates (and returning these to the user). This would help to identify these covariates BEFORE fitting the propensity score models, prevent the analysis from running into a fatal error, and let the user decide what to do with them. Receiving the list would allow us to remove them manually or automatically. I think it would still be helpful to have an option to exclude such covariates automatically if they appear.

I need to do this automatically because I'm not studying just two cohorts plus an outcome but a large set of cohorts, and I'm trying to determine the potential risk factors (cohorts) of the outcome. I'm also using a large set of outcomes. So, I'm trying to use the CohortMethod package for the dirty work of identifying such risk factors/cohorts from a large set of cohorts, and later continue working on these in more detail. Therefore, requiring manual exclusion for each cohort pair in this dirty-work stage is very inconvenient, and I cannot see a reason why the package cannot have an option to do this automatically (with the necessary warning messages). Once the potential candidate risk factors have been found, one should review the highly correlated covariates one by one, for sure.
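To make the batch scenario concrete, here is a minimal base R sketch of screening many comparisons without halting the whole run. It does not remove any covariates: a comparison that fails is skipped and recorded for later manual review. Everything in it is an assumption for illustration. In particular, fitComparison is a hypothetical stand-in for the real fitting step, and the maxAbsCorrelation field and the 0.5 cutoff are made up.

```r
# Hypothetical stand-in for a per-comparison fit that stops on high correlation.
# It is NOT a CohortMethod function; the 0.5 cutoff is an assumption.
fitComparison <- function(comparison) {
  if (comparison$maxAbsCorrelation > 0.5) {
    stop("High correlation between covariate(s) and treatment detected")
  }
  paste("fitted", comparison$name)
}

# Run every comparison; on error, record the comparison for review and continue.
runAll <- function(comparisons) {
  results <- list()
  needsReview <- character(0)
  for (comparison in comparisons) {
    outcome <- tryCatch(
      fitComparison(comparison),
      error = function(e) {
        message(sprintf("Skipping %s: %s", comparison$name, conditionMessage(e)))
        NULL
      }
    )
    if (is.null(outcome)) {
      needsReview <- c(needsReview, comparison$name)
    } else {
      results[[comparison$name]] <- outcome
    }
  }
  list(results = results, needsReview = needsReview)
}

# Toy input: two made-up comparisons, one of which trips the check.
comparisons <- list(
  list(name = "A_vs_B", maxAbsCorrelation = 0.1),
  list(name = "A_vs_C", maxAbsCorrelation = 0.9)
)
screening <- runAll(comparisons)
print(screening$needsReview)
```

The design choice here is to keep the stop-on-high-correlation behaviour per comparison while letting the rest of the batch proceed, so the flagged pairs can still be reviewed one by one afterwards.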

schuemie commented 10 months ago

That is what the createPs() function currently does: It runs a separate analysis to identify high-correlation covariates, and reports these back to the user. But we will need to stop the analysis if high-correlation covariates are found, because we need user input to decide whether we are ok with removing the covariates.

Automatically removing the covariates may lead to invalid causal estimates, which I don't want to encourage.