dhimmel / elevcan

Elevation and Cancer Incidence
https://doi.org/10.7717/peerj.705

Incorrect R-squared values for lasso models #2

Closed dhimmel closed 8 years ago

dhimmel commented 8 years ago

The glmnet upgrade to version 2 introduced a bug where the methods package is not properly loaded. In the course of diagnosing that issue, I discovered a second issue which was brought to light by the upgrade. I briefly mentioned the second issue before knowing its cause:

However, if I run the analysis by launching an R session from the project's root directory and then run source('./code/run.R'), the code progresses past create-models.R before another error occurs.

Now, I have tracked down the cause. We were improperly computing R2 values for our lasso models. I corrected the faulty code after evaluating several methods for the R2 computation.

Prior to the fix, we were extracting R2 values directly from a cv.glmnet object. This class is poorly documented, and the glmnet vignette now cautions:

We do not encourage users to extract the components directly except for viewing the selected values of λ.

So essentially, we were reporting an R2 for a model based on a λ evaluated during cross-validation, but not the model with the optimal λ that we intended. Our faulty method for extracting R2 started throwing an error due to a glmnet update that brought a:

Major upgrade to CV; let each model use its own lambdas, then predict at original set.
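As a rough, language-agnostic illustration (this is not the project's R code, and the thread does not say which of the evaluated methods was adopted), one common way to get an R2 that corresponds to a specific fitted model is to compute it from that model's own predictions instead of reading it from internal cross-validation components. The function name `r_squared` and the toy numbers are hypothetical:

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error left after applying the model.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data standing in for observed incidence and the
# predictions of the model refit at the selected lambda.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(r_squared(observed, predicted))
```

With this approach, the reported R2 depends only on the predictions of the model that was actually selected, so it cannot silently refer to a different λ.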

We will keep this thread updated with information on this issue.

dhimmel commented 8 years ago

Corrected R2 values

I updated our analysis with the correct lasso R2 values. The old (incorrect) and new (correct) values are:

| Cancer | Old Lasso R2 | New Lasso R2 |
|---|---|---|
| Lung | 67.1% | 68.9% |
| Breast | 51.3% | 54.5% |
| Colorectal | 27.4% | 31.9% |
| Prostate | 7.8% | 15.0% |

For all four cancers, the faulty method underestimated the lasso R2. The underestimation was minimal for lung cancer and largest for prostate cancer. The new values are more concordant with the best-subset R2 values. As expected, the best-subset values are still higher, but now the discrepancy is smaller.
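The comparison above can be checked with a small sketch. The lasso values are copied from the table; the best-subset values are assumed to be the rounded percentages quoted from the paper later in this thread (70%, 57%, 34%, 19%), so the gaps are only approximate:

```python
from decimal import Decimal

# Lasso R2 values (percent) from the table in this comment.
old_lasso = {"Lung": Decimal("67.1"), "Breast": Decimal("51.3"),
             "Colorectal": Decimal("27.4"), "Prostate": Decimal("7.8")}
new_lasso = {"Lung": Decimal("68.9"), "Breast": Decimal("54.5"),
             "Colorectal": Decimal("31.9"), "Prostate": Decimal("15.0")}
# Best-subset R2 values (percent): rounded figures from the paper's
# paragraph, used here as an assumption.
best_subset = {"Lung": Decimal("70"), "Breast": Decimal("57"),
               "Colorectal": Decimal("34"), "Prostate": Decimal("19")}

# Increase in lasso R2 after the fix, in percentage points.
gains = {c: new_lasso[c] - old_lasso[c] for c in old_lasso}
# Gap between best-subset and lasso R2, before and after the fix.
old_gap = {c: best_subset[c] - old_lasso[c] for c in old_lasso}
new_gap = {c: best_subset[c] - new_lasso[c] for c in old_lasso}

for cancer in old_lasso:
    print(f"{cancer}: +{gains[cancer]} points, "
          f"gap to best subset {old_gap[cancer]} -> {new_gap[cancer]}")
```

`Decimal` is used instead of floats so that the one-decimal differences come out exact.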

The conclusions of our study are not affected by this change. To contextualize the change: the old values suggested that the best-subset approach overfit more than the new values indicate. However, the main conclusions we drew from the lasso approach were based on the models themselves, which were not affected by this issue. Essentially, the lasso models now appear to explain slightly more variation in cancer incidence.

Errors in the publication

Accordingly, the following paragraph of the paper has errors. The lasso values (the first, non-parenthesized number in each pair, bolded in the original) should be replaced with 69%, 55%, 32%, and 15%, respectively:

The lasso (and best subset) models explained 67% (70%) of variation in lung cancer incidence, 51% (57%) in breast, 29% (34%) in colorectal, and 9% (19%) in prostate, (Tables 3 and 2) mirroring a previously described trend in fraction of risk attributable to modifiable factors for each of the four cancers (Danaei et al., 2005).

In addition, the R2 column of Table 3 should be updated according to the above table.
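The whole-percent replacements follow from the corrected table values if halves are rounded up, which is an assumption about the rounding rule rather than something stated in the thread. A minimal sketch (the helper name `to_whole_percent` is hypothetical):

```python
from decimal import Decimal, ROUND_HALF_UP

# Corrected lasso R2 values (percent), as strings to keep them exact.
new_lasso = {"Lung": "68.9", "Breast": "54.5",
             "Colorectal": "31.9", "Prostate": "15.0"}


def to_whole_percent(value):
    """Round a percentage string to a whole number, halves rounding up."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


rounded = {cancer: to_whole_percent(v) for cancer, v in new_lasso.items()}
print(rounded)
# {'Lung': 69, 'Breast': 55, 'Colorectal': 32, 'Prostate': 15}
```

Explicit half-up rounding matters for the breast value: Python's built-in `round(54.5)` rounds half to even and returns 54, not 55.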

You may notice that the old lasso R2 values for the colorectal and prostate models differ slightly between the paragraph and the table. The table contains the values actually produced by the old (faulty) method; the two paragraph values were not properly updated in the manuscript text at 4352de689efa8b19d71c87849d361ebe9a5b234a.

dhimmel commented 8 years ago

Comments added on PeerJ

I added comments on the online PeerJ article using the questions feature. Now both Table 3 and Paragraph 37 reference the inaccuracy and link to this issue.