OHDSI / PatientLevelPrediction

An R package for performing patient level prediction in an observational database in the OMOP Common Data Model.
https://ohdsi.github.io/PatientLevelPrediction

Perfect multicollinearity #352

Open AniekMarkus opened 1 year ago

AniekMarkus commented 1 year ago

Is your feature request related to a problem? Please describe.

LASSO may behave undesirably when variables are perfectly multicollinear. If two variables x_A and x_B have a correlation of 1, LASSO can split the coefficient value arbitrarily between the two, leading to a less sparse model than possible (i.e. with more variables selected). It can also make models harder to explain at a later stage (i.e. why both variables are included in a model if they carry the same information).

This special case might be common in PLP because variables are created in a data-driven way with FeatureExtraction. Parents and children in the hierarchy might be perfectly correlated (quite likely when there are few descendants; for me it occurred with groups based on more/less detailed ATC codes), as might the short-, medium- and long-term groups (less likely).

Describe the solution you'd like

It would be best to avoid this problem by checking for perfectly correlated variables and removing them before model development. However, this requires an efficient way of detecting perfect correlation in a large set of variables.

Describe alternatives you've considered

Alternatively, we could;

egillax commented 1 year ago

A little bit of data to map out this issue.

When extracting a target cohort on the IPCI database with the default covariate settings from FeatureExtraction, I get 12185 features, of which 48% are perfectly correlated with at least one other feature. When looking at high instead of perfect correlation (above 0.8), the percentage rises to 82%.
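For readers who want to reproduce this kind of measurement, here is a minimal sketch of how such ratios could be computed on a small dense matrix. It is not the code used for the numbers above. The function name `correlated_feature_ratios`, the use of absolute Pearson correlation, and the choice to drop constant columns from the denominator are all assumptions made for illustration.

```python
import numpy as np

def correlated_feature_ratios(X, high=0.8, tol=1e-12):
    """Return (perfect_ratio, high_ratio) over the non-constant columns of X.

    Assumptions (not taken from the thread): absolute Pearson correlation
    is used, and constant columns are excluded because their correlation
    is undefined.
    """
    X = np.asarray(X, dtype=float)
    X = X[:, X.std(axis=0) > 0]            # drop constant columns
    if X.shape[1] < 2:
        return 0.0, 0.0
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)            # ignore self-correlation
    max_corr = corr.max(axis=0)            # strongest partner per feature
    perfect_ratio = float(np.mean(max_corr >= 1.0 - tol))
    high_ratio = float(np.mean(max_corr > high))
    return perfect_ratio, high_ratio

# Toy data: columns 0 and 1 are identical, columns 2 and 3 are not.
X = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
])
perfect_ratio, high_ratio = correlated_feature_ratios(X)
print(perfect_ratio, high_ratio)
```

A dense correlation matrix grows quadratically with the number of features, so this approach only suits small feature sets or subsets. With tens of thousands of sparse covariates a different strategy would be needed.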

I checked this for various settings below:

| covariateSettings | Total features | Ratio of features with perfect correlation to at least one other | Ratio of features with high correlation (>0.8) to at least one other |
| --- | --- | --- | --- |
| default set from FE | 12185 | 48% | 82% |
| age, gender, conditions, procedures, and drug exposures in three time windows | 16828 | 17% | 50% |
| age, gender, conditions, procedures, and drug exposures in one time window | 5990 | 6.4% | 8.3% |
| age, gender and condition occurrence | 1191 | 20% | 23% |
| age, gender and condition era | 1209 | 21% | 23% |
| age, gender and condition group era | 2519 | 65% | 81% |
| age, gender and drug exposures | 4765 | 2.7% | 4.6% |
| age, gender and drug era | 1368 | 6.8% | 12.2% |
| age, gender and drug era group | 2306 | 31% | 57% |
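On the question of detecting perfect correlation efficiently in a large set of variables: for non-constant binary covariates, a correlation of exactly 1 means the two covariates are non-zero for exactly the same rows. One possible approach is therefore to group covariates by their set of row ids in a single pass. The sketch below assumes a hypothetical input format (a dict from covariate id to row ids) and a hypothetical function name. It is not part of PatientLevelPrediction or FeatureExtraction, and it does not cover continuous covariates or correlations of -1.

```python
from collections import defaultdict

def group_identical_covariates(rows_by_covariate):
    """Group binary covariates that are non-zero for exactly the same rows.

    rows_by_covariate: dict mapping covariate id -> iterable of row ids
    where the covariate equals 1 (hypothetical input format).
    Returns a list of groups, each a sorted list of covariate ids that
    share an identical, non-empty set of rows.
    """
    groups = defaultdict(list)
    for cov_id, rows in rows_by_covariate.items():
        key = frozenset(rows)              # order-independent signature
        if key:                            # skip covariates never present
            groups[key].append(cov_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]

# Toy example with made-up covariate ids.
example = {
    101: [1, 2, 5],
    102: [5, 2, 1],    # same rows as 101, different order
    103: [2, 3],
    104: [1, 2, 5],
    105: [3],
}
print(group_identical_covariates(example))  # [[101, 102, 104]]
```

The cost is linear in the total number of non-zero entries, which avoids building a full correlation matrix. One representative per group could then be kept and the rest dropped before model development. Note that a covariate present in every row is constant and would still be grouped here, so it may need separate handling.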

A few notes about this.