Open AniekMarkus opened 1 year ago
A little bit of data to map out this issue.
When extracting a target cohort on the IPCI database with the default covariate settings from FeatureExtraction, I get 12185 features, of which 48% are perfectly correlated with at least one other feature. When looking at high rather than perfect correlation (above 0.8), the percentage rises to 82%.
I checked this for various settings below:
covariateSettings | Total features | Share of features perfectly correlated with at least one other | Share of features highly correlated (>0.8) with at least one other |
---|---|---|---|
default set from FE | 12185 | 48% | 82% |
age, gender, conditions, procedures, and drug exposures in three time windows | 16828 | 17% | 50% |
age, gender, conditions, procedures, and drug exposures in one time window | 5990 | 6.4% | 8.3% |
age, gender and condition occurrence | 1191 | 20% | 23% |
age, gender and condition era | 1209 | 21% | 23% |
age, gender and condition group era | 2519 | 65% | 81% |
age, gender and drug exposures | 4765 | 2.7% | 4.6% |
age, gender and drug era | 1368 | 6.8% | 12.2% |
age, gender and drug era group | 2306 | 31% | 57% |
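
For reference, a minimal sketch of how shares like the ones in the table could be computed. This is not the code used for the numbers above: it is plain Python rather than R, the helper names `pearson` and `correlated_feature_ratios` are hypothetical, and it assumes Pearson correlation taken in absolute value, features passed as a list of columns, and constant columns counted as uncorrelated with everything.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant column
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def correlated_feature_ratios(columns, high_threshold=0.8, tol=1e-9):
    """Return (share_perfect, share_high) over all features in `columns`.

    A feature counts as 'perfect' if its largest absolute correlation with any
    other feature is 1 (within `tol`), and as 'high' if it exceeds
    `high_threshold`. Both choices are assumptions of this sketch.
    """
    n = len(columns)
    if n == 0:
        return 0.0, 0.0
    max_abs = [0.0] * n  # strongest absolute correlation seen per feature
    for i, j in combinations(range(n), 2):
        r = pearson(columns[i], columns[j])
        if r is None:
            continue
        r = abs(r)
        max_abs[i] = max(max_abs[i], r)
        max_abs[j] = max(max_abs[j], r)
    n_perfect = sum(m >= 1.0 - tol for m in max_abs)
    n_high = sum(m > high_threshold for m in max_abs)
    return n_perfect / n, n_high / n


# Toy data (made up for illustration): 16 rows, 5 binary features.
a = [1, 0] * 8                    # base feature
e = a[:-1] + [1]                  # a with one value flipped: high but not perfect
c = [1, 1, 0, 0] * 4              # uncorrelated with a
d = [1, 1, 1, 1, 0, 0, 0, 0] * 2  # uncorrelated with a and c

print(correlated_feature_ratios([a, a, e, c, d]))  # → (0.4, 0.6)
```

The all-pairs loop only illustrates the definition. With 12185 features it would visit roughly 74 million pairs, which is why an efficient detection method matters at this scale.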
A few notes about this.
Is your feature request related to a problem? Please describe.
LASSO may behave undesirably under perfect multicollinearity between variables. If two variables x_A and x_B have a correlation equal to 1, LASSO can split the coefficient value arbitrarily between the two variables, leading to a less sparse model than possible (i.e. with more variables selected). It may also make models harder to explain at a later stage (i.e. why both variables are included in a model if they carry the same information). This special case might be common in PLP because FeatureExtraction creates variables in a data-driven way. Hence, parents and children in the hierarchy might be perfectly correlated (quite likely when there are few descendants; for me it occurred with groups based on more/less detailed ATC codes), as might short-, medium- and long-term groups (less likely).

Describe the solution you'd like
It would be best to avoid this problem by checking for perfectly correlated variables and removing them before model development. However, this requires finding an efficient way of detecting perfect correlation in a large set of variables.

Describe alternatives you've considered
Alternatively, we could;