hfuchs5 commented 1 year ago

What changes are proposed in this pull request? Avoid use of pivot_*() functions in this because it was taking too long. I used base R matrices instead. Now it takes 10 seconds to run according to tictoc package.

If there is an GitHub issue associated with this pull request, please provide link.

259

Reviewer Checklist (if item does not apply, mark is as complete)

[x] PR branch has pulled the most recent updates from main branch. Ensure the pull request branch and your local version match and both have the latest updates from the main branch.
[x] If a new function was added, function included in _pkgdown.yml
[x] If a bug was fixed, a unit test was added for the bug check
[x] Run pkgdown::build_site(). Check the R console for errors, and review the rendered website.
[x] Code coverage is suitable for any new functions/features. Review coverage with withr::with_envvar(new = c("NOT_CRAN" = "true"), covr::report()). Begin in a fresh R session without any packages loaded.
[x] R CMD Check runs without errors, warnings, and notes
[x] usethis::use_spell_check() runs with no spelling errors in documentation

When the branch is ready to be merged into master:

[x] Update NEWS.md with the changes from this pull request under the heading "# cbioportalR (development version)". If there is an issue associated with the pull request, reference it in parentheses at the end update (see NEWS.md for examples).
- [x] Run codemetar::write_codemeta()
- [x] Run usethis::use_spell_check() again
- [x] Approve Pull Request
- [x] Merge the PR

michaelcurry1123 commented 1 year ago

in summarize_by_gene() on line 35 when you name the columns it is possible that they have the same name. After this you cannot use mutate or other dplyr functions. If you are trying to name the columns after the sample_id you can maybe do something like this: all_bin2[rownames(all_bin2)== "sample_id",]

karissawhiting commented 1 year ago

@hfuchs5 I made some updates to simplify how to handle sample_id (you're right that rownames can't be avoided unfortunately!) and add a few speed enhancements (e.g. using table() instead of summarise). I also added a check for numeric columns up front instead of transforming later which seems to speed things even more!

        test replications                elapsed            relative   user.self   sys.self
1      summarize_by_gene_new           500 161.339     1.000   156.353      5.092
2      summarize_by_gene               500 229.061    1.420   222.734      6.407

Could you test it out with your large data set to make sure it still is faster?

Thanks for your work on this. I think we are close!

hfuchs5 commented 1 year ago

It ran through my dataset with 1000+ samples in 1.14 seconds! It looks good to me and I'm going to merge.

MSKCC-Epi-Bio / gnomeR

Faster summarize by gene #266

259