MSKCC-Epi-Bio / gnomeR

Package to wrangle and visualize genomic data in R
https://mskcc-epi-bio.github.io/gnomeR/
Other
26 stars 16 forks source link

Add summarize by patient #346

Open jalavery opened 1 month ago

jalavery commented 1 month ago

What changes are proposed in this pull request?

Let me know if there's anything you want me to add or modify once you have a chance to take a look!

If there is an GitHub issue associated with this pull request, please provide link.

345


Reviewer Checklist (if item does not apply, mark is as complete)

When the branch is ready to be merged into master:

karissawhiting commented 2 weeks ago

Thank you @jalavery this looks fabulous!!! Code looks perfect, though through the processing of reviewing I thought about two potential amendments to functionality that I wanted to run by you:

1) I took out the extract_patient_id() part within the function and instead require users to input a data frame that already has a patient ID column. This will allow it to be more flexible to non IMPACT samples. If the input data frame doesn't have a patient ID column already, I added this suggestion in the error message:

To extract patient IDs from IMPACT sample IDs (e.g. P-XXXXXX-TXX-IMX), use gnomeR::extract_patient_id(data$sample_id)

The one annoying side effect is that in other functions, if you have patient_id in your data, you have to explicitly then pass it to other_vars argument or you get an error when it's present in the input data:

Error in `.abort_if_not_numeric()` at gnomeR/R/summarize-by-gene.R:61:3:
! All alterations in your gene binary must be numeric and only can have values of 0, 1, or NA. Please
  coerce the following columns to numeric or pass them to the `other_vars` argument before proceeding:
  patient_id

We could consider adding it as ignored automatically in other functions if it's present? Idk...

2) At the end of the function we join back the other_vars from the input data frame, however, other variables may be on sample level and this will cause the resulting data frame to not be one unique observation per patient. Do we think this will be confusing? Maybe we could just add a warning with number of unique patients and number of resulting df rows if they differ?

jalavery commented 2 weeks ago

Thank you for taking a look! The updates all sound good to me - For #1: I like the modification to make it more general than just handling IMPACT samples. As a user I'd be okay having to specify the patient id in the other_vars argument. For #2: Great call. I think a warning would be informative. I prob won't have a chance to tinker with this today/tomorrow before going OOO next week, but can take a look once I'm back!