Closed benkeser closed 6 years ago
Thanks David, this seems fair to me. Also helpful to know that the `CC_*` methods are being used.
Out of curiosity, is there an advantage to using the `CC_*` algorithms over NNLS? Maybe the fact that it's a true convex combination makes it theoretically preferable?
Sure thing.
I haven't noticed an appreciable difference in performance between NNLS and CC (though I haven't studied it directly). I think the CC method is better motivated by the theory, in that the convex weights minimize cross-validated risk directly. NNLS (as far as I can tell) minimizes cv-risk over non-negative weights and then re-scales by the sum. However, `coef/sum(coef)` isn't necessarily the minimizer of cv-risk over convex weights (I don't think). Obviously, simulations have shown that NNLS works out pretty darn well, so it's probably not a big deal.
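To make the distinction concrete, here's a small sketch (not from the thread) contrasting the two routes on simulated data. It assumes the `nnls` and `quadprog` packages; `Z` and `y` are stand-ins for the matrix of cross-validated predictions and the outcome:

```r
library(nnls)      # non-negative least squares
library(quadprog)  # quadratic programming for the convex-combination fit

set.seed(1)
n <- 500
y <- rnorm(n)
# Z: cross-validated predictions from 3 hypothetical candidate learners
Z <- cbind(y + rnorm(n, sd = 0.5), y + rnorm(n, sd = 1), rnorm(n))

# NNLS route: minimize ||y - Z w||^2 over w >= 0, then re-scale to sum to 1
fit <- nnls(Z, y)
w_nnls <- fit$x / sum(fit$x)

# CC route: minimize the same least-squares risk directly over the simplex
D <- crossprod(Z)               # t(Z) %*% Z
d <- crossprod(Z, y)            # t(Z) %*% y
A <- cbind(rep(1, 3), diag(3))  # first column enforces sum(w) = 1; rest w >= 0
sol <- solve.QP(Dmat = D, dvec = d, Amat = A, bvec = c(1, rep(0, 3)), meq = 1)
w_cc <- sol$solution

# The two weight vectors need not coincide
round(rbind(w_nnls, w_cc), 3)
```

Both solve a least-squares problem; the only difference is whether the sum-to-one constraint is inside the optimization (CC) or applied afterwards by rescaling (NNLS).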
It occurs to me that if you're interested in deploying this, it might be worth benchmarking the line

```r
colDup <- which(duplicated(Z, MARGIN = 2))
```

in situations with large `Z`. `duplicated` is pretty fast, but might be slow in large data sets.
A faster option is to check for duplicated values in `cvRisk`. Of course, pathological situations might occur where two distinct methods give identical cv risks. I'll leave it to you guys to determine what's preferable.
Thanks for the response!
Good point. Maybe the best approach is to check for duplication in `cvRisk`, and if there is duplication, use `duplicated()` on `Z` to find the duplicated algorithms. We could even restrict the `duplicated()` analysis to the subset of `Z` with duplicates in `cvRisk`.
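That two-stage check might look something like this sketch (variable names are illustrative; `cvRisk` and `Z` as in the package internals):

```r
# Stage 1: cheap check -- flag every entry of cvRisk that ties another entry
tied <- duplicated(cvRisk) | duplicated(cvRisk, fromLast = TRUE)

colDup <- integer(0)
if (any(tied)) {
  # Stage 2: run the expensive column comparison only on the tied subset
  sub <- which(tied)
  dupInSub <- duplicated(Z[, sub, drop = FALSE], MARGIN = 2)
  colDup <- sub[dupInSub]  # indices (in the full Z) of truly duplicated columns
}
```

Note the `fromLast = TRUE` pass: `duplicated()` alone marks only the second and later occurrences, and we need every member of a tied group before comparing columns.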
Yep, that's probably the smart way to do it. Just a bit of bookkeeping to add.
I just ran into a bug with the CC methods. They don't like it when the columns of `Z` are duplicated: `solve.QP` throws an error because `D` is not positive definite. While it could be user error if methods end up duplicated (see my example below), it can also pop up, e.g., when `SL.glm` and `SL.step` are used and `SL.step` lands on the full model. Here's an example that reproduces:

Here's my proposed fix -- get rid of duplicated columns in `Z` before calling `compute`, throw a warning, and add in 0 weight for algorithms that are duplicated.
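For reference, a hypothetical sketch of both pieces (this is not the code from the thread). The reproducer lists the same wrapper twice so that two columns of `Z` collide and `t(Z) %*% Z` becomes singular; the fix drops duplicated columns before the solver runs, warns, and pads the dropped algorithms back in with weight 0. The `compute` call and `libraryNames` argument are assumptions about the method's internals:

```r
library(SuperLearner)

# Reproducer sketch: duplicated library entries => duplicated columns in Z,
# so solve.QP errors because D = t(Z) %*% Z is not positive definite
set.seed(1)
n <- 100
X <- data.frame(x = rnorm(n))
Y <- X$x + rnorm(n)
# sl <- SuperLearner(Y, X, SL.library = c("SL.glm", "SL.glm"),
#                    method = "method.CC_LS")  # errors before the fix

# Fix sketch, inside the method's computeCoef:
computeCoef_dedup <- function(Z, Y, libraryNames, ...) {
  colDup <- which(duplicated(Z, MARGIN = 2))
  if (length(colDup) > 0) {
    warning("Duplicated columns in Z; assigning 0 weight to: ",
            paste(libraryNames[colDup], collapse = ", "))
    coefKeep <- compute(Z[, -colDup, drop = FALSE], Y)  # QP on unique columns only
    coef <- numeric(ncol(Z))
    coef[-colDup] <- coefKeep  # duplicated algorithms receive weight 0
  } else {
    coef <- compute(Z, Y)
  }
  coef
}
```

Since the dropped column is an exact copy of a retained one, assigning its weight to the retained copy leaves the ensemble's predictions unchanged.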