Banding branch - Githubissues

bcollica commented 4 years ago

Here are the additions in this branch:

bandingEst() was added to estimators.R. As of right now, it only takes a single value for k, not a vector. This can be modified later if we want to expand the function and have it be useful outside of the cvCovEst() function. Also, I included argument checks that raise errors if k is either negative or not an integer value.
Two simple tests for bandingEst() were added to test-estimators.R. These test for the two extreme values of the hyper-parameter, k. If you like, I can add additional tests that expect warnings when k is either negative or non-integer.
I included bandingEst() in test-cvCovEst.R. At first, I got an error that the column 'estimator' did not exist in dplyr::group_by(). But when I called devtools::document() and ran devtools::check(), all the tests passed.

Also, I started my R project by cloning the repository from github, and it asked me to name my project. I named it CovEst_Project.Rproj, but now that file shows up in the repository on my machine in addition to cvCovEst.Rproj. I'm not sure if these are conflicting files, or if one is being changed and not the other. Let me know if I should modify or correct this somehow, I'm still figuring out how it all works.

nhejazi commented 4 years ago

This looks great, Brian. Just a few notes to add to Phil’s comments above

Re: .Rproj files, it’s good not to have two of those floating around, but we should keep at least one of them in the repo, I think. With the cvCovEst.Rproj file, I think you’d be able to open up the repo as an existing project in RStudio. It’s also useful for other project management tools like here (https://cran.r-project.org/web/packages/here/index.html) and (I believe) for identifying the top level of a package in devtools::check
I agree that removing the checks in the function is a good idea to avoid them being repeatedly run. It’d be worth preserving this information in the note in the Roxygen documentation so that we (and, eventually, users) can refer to it.
Mostly stylistic, but the general R idiom for looping is expressed in the *apply family of functions, so we should use these where possible. Importantly, if we can easily vectorize some aspects of the nested for loop above, we could potentially see dramatic performance gains (especially since this whole process is itself repeated V times in cross-validation). For now, when re-writing the loops, lapply or vapply (instead of sapply) are both good ways to go; lapply will return a list while vapply allows you to specify the output class (sapply, by contrast, tries to guess a reasonable output class, making it less desirable for package-grade code). If a refresher on converting from for loops to apply style loops might be useful, I’d recommend https://www.jottr.org/2019/01/11/parallelize-a-for-loop-by-rewriting-it-as-an-lapply-call/

Hope this helps, and thanks again for the excellent work on this PR.

PhilBoileau commented 4 years ago

Thanks for all the explanations and corrections, Nima! I had no clue that .Rproj files are what make here and devtools::check tick. I'll add the original project file back to the repo.

Brian, you should be able to work from cvCovEst.Rproj in RStudio by clicking the icon in the top right corner, clicking on "Open Project", navigating to the location where the cvCovEst pkg is cloned, and then selecting cvCovEst.Rproj. Thanks again for your great work!

bcollica commented 4 years ago

Hey guys,

Thanks for the feedback and explanations! I've made those changes and committed them to the branch. Will that update the current pull request, or do I need to submit another one? Just getting used to the flow of Github.

The bandingEst() function definitely works faster now using lapply(). I also tried to exploit symmetry and only perform calculations for what would be the "lower triangular" covariance matrix. This way, each subsequent loop over a column has one less calculation to perform. Then it simply reflects the matrix about the diagonal. I tested it on two simulated data frames, one with 4,000 variables and another with 10,000. The 4,000 one took about 2 seconds, the 10,000 one took about 80 seconds. Not sure if that's good or bad?

Let me know if there's anything you'd like me to change or if you think the performance should be improved somehow.

Brian

PhilBoileau commented 4 years ago

Thanks, Brian! This looks great. Fantastic idea to exploit the symmetry of these estimates! I noticed that the bandingEst function was outputting a message that caused some tests to fail. I patched that up, and pushed to the branch. Once all checks pass on Travis, I'll merge your PR.

PhilBoileau / cvCovEst

Banding branch #6