MSKCC-Epi-Bio / gnomeR

Package to wrangle and visualize genomic data in R
https://mskcc-epi-bio.github.io/gnomeR/

tbl_genomic: freq_cutoff issue #253

Closed (hfuchs5 closed 1 year ago)

hfuchs5 commented 1 year ago

Works great when using gene_subset (3 genes and 10 genes are both fine).

I'm not sure it works with freq_cutoff (with my data, at least). I originally thought the problem was that my cutoff was too low, but even with freq_cutoff = 0.9 or 0.95 it can take more than 10 minutes to run.

freq_cutoff_by_gene = F is of interest to me, but it's unclear whether it will work, since the freq_cutoff argument wasn't working in general before.

michaelcurry1123 commented 1 year ago

Try creating a helper function to calculate the frequency cutoff for genes.

michaelcurry1123 commented 1 year ago

Create a helper function to filter the data by user-specified genes.

karissawhiting commented 1 year ago

@carokos could you give an example of when it didn't work as expected? This may be a messaging/documentation issue on our part. When freq_cutoff_by_gene = F, it decides whether a gene appears in the resulting table based on whether the overall gene alteration frequency is greater than the cutoff (aggregating Alt/del/fus to make the cut). However, the table then includes every alteration type that contributed to that calculation, even types that are individually under the cutoff, such as low-frequency fusions. I'm not sure whether this was the issue you were running into or whether it was a bug.

We are open to changing this to make it more useful and less confusing! How would this function be most useful to you in your workflows?
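The gene-level behavior described above can be illustrated with a small base R sketch. This is not the gnomeR implementation: `filter_by_gene_frequency()`, the toy data, and the `GENE` / `GENE.Del` / `GENE.fus` column naming are all assumptions made for the example.

```r
# Toy binary matrix: one row per sample, one 0/1 column per alteration.
# The column naming (gene name plus optional .Amp/.Del/.fus suffix) is assumed.
gene_binary <- data.frame(
  sample_id = paste0("S", 1:5),
  TP53      = c(1, 1, 0, 1, 0),
  TP53.Del  = c(0, 0, 1, 0, 0),
  KRAS      = c(0, 1, 0, 0, 0),
  KRAS.fus  = c(0, 0, 0, 0, 1),
  EGFR      = c(0, 0, 0, 0, 1)
)

# Hypothetical helper (not part of gnomeR): keep every column of any gene
# whose overall alteration frequency is greater than `cutoff`.
filter_by_gene_frequency <- function(data, cutoff, id_col = "sample_id") {
  alt_cols <- setdiff(names(data), id_col)

  # Strip the alteration-type suffix to recover the gene for each column.
  gene_of <- sub("\\.(Amp|Del|fus)$", "", alt_cols)

  # A sample counts as altered in a gene if any of that gene's columns is 1.
  gene_freq <- vapply(unique(gene_of), function(g) {
    cols <- alt_cols[gene_of == g]
    mean(rowSums(data[, cols, drop = FALSE], na.rm = TRUE) > 0)
  }, numeric(1))

  keep_genes <- names(gene_freq)[gene_freq > cutoff]

  # Keep all columns of the retained genes, including low-frequency ones.
  data[, c(id_col, alt_cols[gene_of %in% keep_genes]), drop = FALSE]
}

filtered <- filter_by_gene_frequency(gene_binary, cutoff = 0.30)
names(filtered)
# "sample_id" "TP53" "TP53.Del" "KRAS" "KRAS.fus"
```

In this toy data KRAS.fus occurs in only 1 of 5 samples (0.20), below the 0.30 cutoff, but it is kept because KRAS as a whole is altered in 2 of 5 samples (0.40). EGFR (0.20 overall) is dropped entirely.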

carokos commented 1 year ago

I ran binary_matrix %>% tbl_genomic(freq_cutoff = 0.30) and it took 46 minutes to produce a table containing only the two genes that reach that cutoff. It just seems like this function takes too long for what it's doing, but maybe I'm just too impatient... I can send the data that I'm using privately, but I have about 500 samples and about 250 genes.

I haven't tested the freq_cutoff_by_gene argument yet, since I was just now able to fully run tbl_genomic(freq_cutoff = 0.30) (due to the time). I'd love to play around with this feature more in the future! (I don't have any great suggestions or comments at this point - sorry!)

karissawhiting commented 1 year ago

We decided to remove these arguments from tbl_genomic() and now provide a pre-processing helper function, subset_by_frequency(), instead.

michaelcurry1123 commented 1 year ago

@karissawhiting new job, who's this?