MSKCC-Epi-Bio / gnomeR

Package to wrangle and visualize genomic data in R
https://mskcc-epi-bio.github.io/gnomeR/
Other
26 stars 19 forks source link

Faster summarize by gene #266

Closed hfuchs5 closed 1 year ago

hfuchs5 commented 1 year ago

What changes are proposed in this pull request? Avoid use of pivot_*() functions in this because it was taking too long. I used base R matrices instead. Now it takes 10 seconds to run according to tictoc package.

If there is an GitHub issue associated with this pull request, please provide link.

259


Reviewer Checklist (if item does not apply, mark is as complete)

When the branch is ready to be merged into master:

michaelcurry1123 commented 1 year ago

in summarize_by_gene() on line 35 when you name the columns it is possible that they have the same name. After this you cannot use mutate or other dplyr functions. If you are trying to name the columns after the sample_id you can maybe do something like this: all_bin2[rownames(all_bin2)== "sample_id",]

karissawhiting commented 1 year ago

@hfuchs5 I made some updates to simplify how to handle sample_id (you're right that rownames can't be avoided unfortunately!) and add a few speed enhancements (e.g. using table() instead of summarise). I also added a check for numeric columns up front instead of transforming later which seems to speed things even more!

        test replications                elapsed            relative   user.self   sys.self
1      summarize_by_gene_new           500 161.339     1.000   156.353      5.092
2      summarize_by_gene               500 229.061    1.420   222.734      6.407

Could you test it out with your large data set to make sure it still is faster?

Thanks for your work on this. I think we are close!

hfuchs5 commented 1 year ago

It ran through my dataset with 1000+ samples in 1.14 seconds! It looks good to me and I'm going to merge.