dhimmel commented 8 years ago

The glmnet upgrade to version 2 introduced a bug where the methods package is not properly loaded. In the course of diagnosing that issue, I discovered a second issue which was brought to light by the upgrade. I briefly mentioned the second issue before knowing its cause:

However, if I run the analysis by launching an R session from the project's root directory and then run source('./code/run.R'), the code progresses past create-models.R before another error occurs.

Now, I have tracked down the cause. We were improperly computing R² values for our lasso models. I corrected the faulty code after evaluating several methods for the R² computation.

Prior to the fix, we were extracting R² values directly from a cv.glmnet object. This class is poorly documented, and the glmnet vignette now cautions:

We do not encourage users to extract the components directly except for viewing the selected values of λ.

In other words, we were reporting the R² of a model fit with one of the λ values evaluated during cross-validation, not of the model with the optimal λ as we intended. Our faulty method for extracting R² started throwing an error due to a glmnet update that brought a:

Major upgrade to CV; let each model use its own lambdas, then predict at original set.
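
For readers who want to see the distinction in code, here is a minimal sketch of one way to obtain an R² for the model at the cross-validation-selected λ through the documented predict interface, rather than by reading components off the cv.glmnet object. It is not necessarily the method adopted in this repository: the data are simulated, the variable names are hypothetical, and the resulting R² is in-sample.

```r
library(glmnet)

# Simulated data for illustration only (hypothetical; not the study's data)
set.seed(0)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- as.numeric(X[, 1] - 0.5 * X[, 2] + rnorm(n))

# Cross-validated lasso (alpha = 1 selects the lasso penalty)
cv_fit <- cv.glmnet(X, y, alpha = 1)

# Predict from the model at the CV-selected lambda using predict(),
# instead of extracting internal components of the cv.glmnet object
y_hat <- as.numeric(predict(cv_fit, newx = X, s = "lambda.min"))

# In-sample R-squared: 1 - residual sum of squares / total sum of squares
r_squared <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
print(r_squared)
```

Passing s = "lambda.min" asks for the λ with the lowest cross-validated error; s = "lambda.1se" would give the more regularized alternative.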

We will keep this thread updated with information on this issue.

dhimmel commented 8 years ago

Corrected R² values

I updated our analysis with the correct lasso R² values. The old (incorrect) and new (correct) values are:

Cancer	Old Lasso R²	New Lasso R²
Lung	67.1%	68.9%
Breast	51.3%	54.5%
Colorectal	27.4%	31.9%
Prostate	7.8%	15.0%

For all four cancers, the faulty method underestimated the lasso R². The underestimation was minimal for lung cancer and largest for prostate cancer. The new values are more concordant with the best-subset R² values. As expected, the best-subset values are still higher, but now the discrepancy is smaller.
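
The size of the underestimation can be checked directly from the table above. The short sketch below uses only the old and new values reported there; the data frame and column names are chosen for illustration. The increases come to 1.8, 3.2, 4.5, and 7.2 percentage points for lung, breast, colorectal, and prostate cancer, respectively.

```r
# R-squared values (percent) copied from the table above
r2 <- data.frame(
  cancer = c("Lung", "Breast", "Colorectal", "Prostate"),
  old = c(67.1, 51.3, 27.4, 7.8),
  new = c(68.9, 54.5, 31.9, 15.0)
)

# Percentage-point increase from the old (incorrect) to the new (correct) value
r2$increase <- r2$new - r2$old
print(r2)
```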

The conclusions of our study are not affected by this change. To contextualize the change, the old values suggested more overfitting by the best-subset approach than the new values do. However, the main conclusions we drew from the lasso approach were based on the models themselves, which were not affected by this issue. Essentially, the lasso models now appear to explain slightly more variation in cancer incidence.

Errors in the publication

Accordingly, the following paragraph of the paper has errors. The lasso values (67%, 51%, 29%, and 9%) should be replaced with 69%, 55%, 32%, and 15%, respectively:

The lasso (and best subset) models explained 67% (70%) of variation in lung cancer incidence, 51% (57%) in breast, 29% (34%) in colorectal, and 9% (19%) in prostate, (Tables 3 and 2) mirroring a previously described trend in fraction of risk attributable to modifiable factors for each of the four cancers (Danaei et al., 2005).

In addition, the R² column of Table 3 should be updated according to the above table.

You may notice that the old lasso R² values for the colorectal and prostate models differ slightly between the paragraph and the table. The table contains the values that the faulty method actually produced; the two paragraph values were not properly updated in the manuscript text at commit 4352de689efa8b19d71c87849d361ebe9a5b234a.

dhimmel commented 8 years ago

Comments added on PeerJ

I added comments on the online PeerJ article using the questions feature. Now both Table 3 and Paragraph 37 reference the inaccuracy and link to this issue.

dhimmel / elevcan

Incorrect R-squared values for lasso models #2
