Fix bug in pivot_cna_longer and add option for CNA high level only

What changes are proposed in this pull request?

This PR fixes a bug in CNA processing in pivot_cna_longer().

Additionally, I changed the way CNA is coded internally and added an argument, high_level_cna_only, that lets users annotate only high-level deletions/amplifications. By default, any type of alteration is counted as an event.

Another change: pivot_cna_longer() now returns only alteration events; neutral (unaltered) entries are filtered out.
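
To make the described behavior concrete, here is a minimal sketch of pivoting a wide CNA table to long format, dropping neutral entries, and optionally keeping only high-level events. It is not gnomeR's implementation: the helper name pivot_cna_sketch(), the toy data, the -2/-1/0/1/2 value coding, and the "deletion"/"amplification" labels are all assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Toy wide-format CNA table (hypothetical data): one row per gene,
# one column per sample. Assumes the -2/-1/0/1/2 discrete coding.
wide_cna <- tibble::tribble(
  ~Hugo_Symbol, ~sample_1, ~sample_2, ~sample_3,
  "TP53",       -2,         0,        -1,
  "MYC",         2,         1,         0
)

# Hypothetical helper, NOT the gnomeR function
pivot_cna_sketch <- function(wide, high_level_cna_only = FALSE) {
  long <- wide %>%
    pivot_longer(
      cols = -Hugo_Symbol,
      names_to = "sample_id",
      values_to = "value"
    ) %>%
    # Drop neutral (0) and missing entries so only events remain
    filter(!is.na(value), value != 0)

  if (high_level_cna_only) {
    # Keep only values of -2 and 2 (assumed to be the high-level events)
    long <- filter(long, abs(value) == 2)
  }

  long %>%
    mutate(alteration = if_else(value < 0, "deletion", "amplification")) %>%
    select(hugo_symbol = Hugo_Symbol, sample_id, alteration)
}

all_events <- pivot_cna_sketch(wide_cna)
high_only  <- pivot_cna_sketch(wide_cna, high_level_cna_only = TRUE)
```

With this toy input, all_events keeps the four non-zero entries, while high_only keeps just the two entries coded -2 or 2.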

Reviewer Checklist (if item does not apply, mark it as complete)

[ ] PR branch has pulled the most recent updates from main branch. Ensure the pull request branch and your local version match and both have the latest updates from the main branch.
[ ] If a new function was added, function included in _pkgdown.yml
[ ] If a bug was fixed, a unit test was added for the bug check
[ ] Run pkgdown::build_site(). Check the R console for errors, and review the rendered website.
[ ] Code coverage is suitable for any new functions/features. Review coverage with withr::with_envvar(new = c("NOT_CRAN" = "true"), covr::report()). Begin in a fresh R session without any packages loaded.
[ ] R CMD Check runs without errors, warnings, and notes
[ ] usethis::use_spell_check() runs with no spelling errors in documentation

When the branch is ready to be merged into master:

- [ ] Update NEWS.md with the changes from this pull request under the heading "# cbioportalR (development version)". If there is an issue associated with the pull request, reference it in parentheses at the end of the update (see NEWS.md for examples).
- [ ] Run codemetar::write_codemeta()
- [ ] Run usethis::use_spell_check() again
- [ ] Approve Pull Request
- [ ] Merge the PR

Thank you so much for this! I re-tested and now get consistent results when running create_gene_binary() based on CNA data pulled from the portal versus CNA data that was pivoted with pivot_cna_longer() 🎉

The only discrepancy is a difference of 2 variables returned when I run it with the CNA data from the portal vs. the GENIE CNA data, but the columns are totally blank, so I'm not sure how much this matters. The column ADGRA2.Amp is only returned when create_gene_binary() is run on the portal data, and the column GPR124.Amp is only returned when create_gene_binary() is run on the transposed GENIE CNA data. Any ideas why that might be? Neither ADGRA2 nor GPR124 is in the underlying CNA files.
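
For anyone reproducing this comparison, a minimal sketch of how the differing columns could be identified. The two data frames are toy stand-ins for the create_gene_binary() results (hypothetical data and object names); only the ADGRA2.Amp and GPR124.Amp column names come from the comment above.

```r
# Toy stand-ins for the two create_gene_binary() results (hypothetical data)
binary_portal <- data.frame(sample_id = "s1", TP53.Del = 1, ADGRA2.Amp = NA)
binary_genie  <- data.frame(sample_id = "s1", TP53.Del = 1, GPR124.Amp = NA)

# Columns present in one result but not the other
only_in_portal <- setdiff(names(binary_portal), names(binary_genie))
only_in_genie  <- setdiff(names(binary_genie), names(binary_portal))

# Columns that are entirely missing (blank) in the portal-based result
is_blank <- vapply(binary_portal, function(x) all(is.na(x)), logical(1))
blank_cols <- names(binary_portal)[is_blank]
```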

Super minor: I also tested the messaging when putting in the GENIE CNA data before it's pivoted. The error returned is Error in sanitize_cna_input() at gnomeR/R/create-gene-binary.R:132:4: ! The following required columns are missing in your mutations data: sample_id and alteration. Is your data in wide format? If so, it must be long format. See gnomeR::pivot_cna_long() to reformat

Rather than saying the columns are missing in your "mutations data", should it say "CNA data"?
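
One way such a message could be made data-type aware is sketched below. The helper check_required_cols() and its data_label argument are hypothetical and do not reflect how sanitize_cna_input() is actually written; the sketch only illustrates passing the data type into the error text.

```r
# Hypothetical validator; the name and arguments are illustrative only
check_required_cols <- function(data, required, data_label) {
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop(
      "The following required columns are missing in your ", data_label,
      " data: ", paste(missing_cols, collapse = " and "), ".",
      call. = FALSE
    )
  }
  invisible(data)
}

# Wide-format toy input that lacks the long-format columns
wide_input <- data.frame(Hugo_Symbol = "GPR124", sample_1 = 2)

# Capture the error message instead of stopping the script
msg <- tryCatch(
  check_required_cols(wide_input, c("sample_id", "alteration"), "CNA"),
  error = function(e) conditionMessage(e)
)
```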

Thanks @jalavery!

It looks like these genes are aliases of each other:

library(cbioportalR)
library(gnomeR)
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
#> 
#> Attaching package: 'gnomeR'
#> The following object is masked from 'package:cbioportalR':
#> 
#>     impact_gene_info

set_cbioportal_db("public")
#> ✔ You are successfully connected!
#> ✔ base_url for this R session is now set to "www.cbioportal.org/api"

get_alias("ADGRA2")
#> # A tibble: 6 × 2
#>   hugo_symbol alias        
#>   <chr>       <chr>        
#> 1 ADGRA2      DKFZp434C211 
#> 2 ADGRA2      DKFZp434J0911
#> 3 ADGRA2      FLJ14390     
#> 4 ADGRA2      GPR124       
#> 5 ADGRA2      KIAA1531     
#> 6 ADGRA2      TEM5

But neither is in any of the IMPACT panels:

which_impact_panel("ADGRA2")
#> # A tibble: 1 × 5
#>   genes_in_panel IMPACT341 IMPACT410 IMPACT468 IMPACT505
#>   <chr>          <chr>     <chr>     <chr>     <chr>    
#> 1 ADGRA2         no        no        no        no
which_impact_panel("GPR124")
#> # A tibble: 1 × 5
#>   genes_in_panel IMPACT341 IMPACT410 IMPACT468 IMPACT505
#>   <chr>          <chr>     <chr>     <chr>     <chr>    
#> 1 GPR124         no        no        no        no

vetted_alias <- gnomeR::impact_alias_table %>% 
  tidyr::unnest(everything())

vetted_alias %>% dplyr::filter(hugo_symbol %in% c("ADGRA2", "GPR124"))
#> # A tibble: 0 × 4
#> # … with 4 variables: hugo_symbol <chr>, alias <chr>, entrez_id <int>,
#> #   alias_entrez_id <int>
vetted_alias %>% dplyr::filter(alias %in% c("ADGRA2", "GPR124"))
#> # A tibble: 0 × 4
#> # … with 4 variables: hugo_symbol <chr>, alias <chr>, entrez_id <int>,
#> #   alias_entrez_id <int>

They are actually different within the raw data itself:

# It is ADGRA2 in cBioPortal data
# (dplyr:: prefixes added because dplyr is not attached above)
x <- all_cna_from_portal %>%
  dplyr::filter(hugoGeneSymbol %in% c("GPR124", "ADGRA2")) %>%
  dplyr::select(hugoGeneSymbol, sampleId)
x
#> # hugoGeneSymbol sampleId               
#> # <chr>          <chr>                  
#> # 1 ADGRA2         GENIE-VICC-199259-unk-1

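# It is GPR124 in GENIE data
y <- nsclc_public_cna %>%
  dplyr::filter(Hugo_Symbol %in% c("GPR124", "ADGRA2"))

# Keep Hugo_Symbol plus any sample column whose CNA values sum to more than 0
y %>% dplyr::select(Hugo_Symbol, which(purrr::map_dbl(y, ~ sum(as.numeric(.x), na.rm = TRUE)) > 0))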

#> # Hugo_Symbol GENIE.VICC.199259.unk.1
#> # 1      GPR124                       2

This is where alias resolution gets really important, but unfortunately I don't think it's feasible to support every panel for vetted aliases. Right now we have vetted aliases for all IMPACT panels only and use gnomeR::impact_alias_table as our reference alias dictionary. It contains the main IMPACT genes, but some aliases for non-IMPACT genes will not be caught.

Here are my suggestions for moving forward:

1) Make sure the current limitations of the alias functionality are explicitly documented (maybe also add something to the alias message in the console saying that IMPACT genes were checked? I think it says "common" right now).
2) I remembered that I originally wrote the recode_alias and resolve_alias functions with future expansion in mind. There is an alias_table argument that currently uses gnomeR::impact_alias_table, but in the future we can explore providing more comprehensive alias lists (or connecting to another service/database that already does this). Should we turn this into an issue?
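
To illustrate the idea of a swappable alias dictionary, here is a minimal sketch of recoding gene names against a user-supplied table. The helper recode_alias_sketch() and the two-column table layout are assumptions for illustration, not the actual recode_alias/resolve_alias code; the ADGRA2 aliases are taken from the get_alias() output above.

```r
# Toy alias table (assumed layout: accepted symbol plus one alias per row)
toy_alias_table <- data.frame(
  hugo_symbol = c("ADGRA2", "ADGRA2"),
  alias       = c("GPR124", "TEM5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace any alias with its accepted symbol,
# leaving names that are not in the alias table unchanged
recode_alias_sketch <- function(genes, alias_table) {
  lookup  <- setNames(alias_table$hugo_symbol, alias_table$alias)
  recoded <- unname(lookup[genes])
  ifelse(is.na(recoded), genes, recoded)
}

recoded_genes <- recode_alias_sketch(
  c("GPR124", "ADGRA2", "TP53"),
  alias_table = toy_alias_table
)
```

Here GPR124 is mapped to ADGRA2, while ADGRA2 and TP53 pass through untouched. A gene whose alias is missing from the supplied table is silently left as-is, which is the limitation discussed above.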

Let me know if you have any questions or thoughts.

Thanks!

Thank you for looking into this! I realized I was searching for the genes on the CNA files incorrectly, which is why I thought that they weren't on there, sorry about that. What you have above makes sense to me.

The current messaging about recoding is: "To ensure gene with multiple names/aliases are correctly grouped together, the following genes in your dataframe have been recoded (you can prevent this with recode_aliases = FALSE):". What about adding a sentence to the documentation under recode_alias? Something like: "Currently, alias recoding is only available for genes that are on MSK IMPACT panels"?

I think opening an issue for this could be good, though it's comforting that only 1 gene was affected by this, so it doesn't feel like such a big issue if there isn't an easy way to support non-IMPACT genes.

Let me know if it's helpful to chat about this!

Thanks @jalavery! I have updated the docs with your suggestion and opened issue #225 to address expanding this functionality to non-IMPACT genes. Feel free to edit/add details there if you have any suggestions or would like to work on it.

MSKCC-Epi-Bio / gnomeR

Fix bug in pivot_cna_longer and add option for CNA high level only #221