PoisonAlien / maftools

Summarize, Analyze and Visualize MAF files from TCGA or in-house studies.
http://bioconductor.org/packages/release/bioc/html/maftools.html
MIT License

When adding a cnTable to a maftools object, Start_Position and End_Position are blank for genes with no mutations, leading to errors #1013

Closed grantn5 closed 4 months ago

grantn5 commented 6 months ago

Describe the issue

If you add a cnTable to a maftools object via read.maf(), the resulting object contains NAs in the Start_Position and End_Position columns for genes that have no mutations; those genes are simply annotated as AMP or DEL in the @data slot of the MAF object. As a result, subsetting the MAF by a range throws an error. Could this be avoided by adding the start and end of the gene?

Command

Please post your commands and the output (errors or any unexpected output)

my_maf <- read.maf(
    maf = "path_to_maf",
    clinicalData = df,
    cnTable = CNV_df
)

subsetMaf(maf = my_maf, ranges = my_bed)

! NA values in data.table 'x' start column: 'Start_Position'. All rows with NA values in the range columns must be removed for foverlaps() to work.
Last error traceback:

I think it would be good to extend the cnTable argument so that you can pass Chromosome, Start_Position and End_Position in the cnTable and avoid this error.
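The kind of table this proposal has in mind could be assembled with a plain left join in base R. This is only a sketch: `gene_coords` is a hypothetical lookup table, the two rows of `CNV_df` are toy data, and the DNMT3A coordinates are borrowed from the maintainer's example further down the thread. Whether read.maf() makes use of the extra columns depends on the installed maftools version.

```r
# Toy CN table in the three-column layout used in this issue (hypothetical rows)
CNV_df <- data.frame(
  Gene = c("DNMT3A", "DNMT3A"),
  Sample_name = c("TCGA-AB-2898", "TCGA-AB-2879"),
  CN = c("Amp", "Del"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup of gene coordinates, one row per gene
gene_coords <- data.frame(
  Gene = "DNMT3A",
  Chromosome = "2",
  Start_Position = 25450743,
  End_Position = 25565459,
  stringsAsFactors = FALSE
)

# Left join: keep every CN row and fill in coordinates where the gene is known
CNV_with_loci <- merge(CNV_df, gene_coords, by = "Gene", all.x = TRUE)

# Genes absent from the lookup would still carry NA coordinates
missing_loci <- is.na(CNV_with_loci$Start_Position)
```

Checking `missing_loci` before calling read.maf() shows which genes would still end up without positions.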

Session info

Run sessionInfo() and post the output below

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] maftools_2.18.0

loaded via a namespace (and not attached):
 [1] vctrs_0.6.2        cli_3.6.1          knitr_1.43         rlang_1.1.1       
 [5] xfun_0.39          processx_3.8.1     targets_1.4.0      data.table_1.14.8 
 [9] glue_1.6.2         backports_1.4.1    ps_1.7.5           fansi_1.0.4       
[13] grid_4.3.0         tibble_3.2.1       base64url_1.4      yaml_2.3.7        
[17] lifecycle_1.0.3    DNAcopy_1.76.0     compiler_4.3.0     codetools_0.2-19  
[21] igraph_1.6.0       RColorBrewer_1.1-3 pkgconfig_2.0.3    lattice_0.21-8    
[25] digest_0.6.31      R6_2.5.1           tidyselect_1.2.0   utf8_1.2.3        
[29] splines_4.3.0      pillar_1.9.0       callr_3.7.3        magrittr_2.0.3    
[33] Matrix_1.5-4       tools_4.3.0        survival_3.5-5    

PoisonAlien commented 5 months ago

Hi,

Thanks for the issue. Good point indeed. I will make the changes.

PoisonAlien commented 5 months ago

Hi,

Thanks again for the issue. I have pushed a fix for handling NAs. You can add three more columns to your CNV_df (Chromosome, Start_Position, and End_Position) and it should work. I have also added a keepNA argument to subsetMaf() that either removes or keeps rows with NA loci after subsetting by ranges.

Example with the data from the vignette:

#path to TCGA LAML MAF file
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
#clinical information containing survival information and histology. This is optional
laml.clin = system.file('extdata', 'tcga_laml_annot.tsv', package = 'maftools')
laml = read.maf(maf = laml.maf,
                clinicalData = laml.clin,
                verbose = FALSE)

set.seed(seed = 1024)
barcodes = as.character(getSampleSummary(x = laml)[,Tumor_Sample_Barcode])
#Random 20 samples
dummy.samples = sample(x = barcodes,
                       size = 20,
                       replace = FALSE)

#Generate random CN status for above samples
cn.status = sample(
  x = c('ShallowAmp', 'DeepDel', 'Del', 'Amp'),
  size = length(dummy.samples),
  replace = TRUE
)

custom.cn.data = data.frame(
  Gene = "DNMT3A",
  Sample_name = dummy.samples,
  CN = cn.status,
  stringsAsFactors = FALSE
)

#Adding start and end position to cn data
custom.cn.data$Start_Position = 25450743
custom.cn.data$End_Position = 25565459

head(custom.cn.data)
    Gene  Sample_name         CN Start_Position End_Position
1 DNMT3A TCGA-AB-2898 ShallowAmp       25450743     25565459
2 DNMT3A TCGA-AB-2879        Del       25450743     25565459
3 DNMT3A TCGA-AB-2920        Amp       25450743     25565459
4 DNMT3A TCGA-AB-2866        Del       25450743     25565459
5 DNMT3A TCGA-AB-2892        Del       25450743     25565459
6 DNMT3A TCGA-AB-2863 ShallowAmp       25450743     25565459

# MAF with cndata including start and end position
laml.plus.cn.withLoci = read.maf(maf = laml.maf,
                        cnTable = custom.cn.data,
                        verbose = FALSE)

# MAF with cndata minus the start and end position
laml.plus.cn.noLoci = read.maf(maf = laml.maf,
                        cnTable = custom.cn.data[,c("Gene", "Sample_name", "CN")],
                        verbose = FALSE)

#Subset for ranges
maftools::subsetMaf(maf = laml.plus.cn.withLoci, ranges = data.frame(chromosome = 2, start = 25450743, end = 25565459))
54 variants within provided ranges
-Processing clinical data
An object of class  MAF 
                  ID          summary  Mean Median
              <char>           <char> <num>  <num>
1:        NCBI_Build               37    NA     NA
2:            Center genome.wustl.edu    NA     NA
3:           Samples               48    NA     NA
4:            nGenes                1    NA     NA
5:   Frame_Shift_Del                4 0.083      0
6: Missense_Mutation               39 0.812      1
7: Nonsense_Mutation                5 0.104      0
8:       Splice_Site                6 0.125      0
9:             total               54 1.125      1

#When loci info is not available, subsetMaf throws a warning.
maftools::subsetMaf(maf = laml.plus.cn.noLoci, ranges = data.frame(chromosome = 2, start = 25450743, end = 25565459))
54 variants within provided ranges
-Processing clinical data
An object of class  MAF 
                  ID          summary  Mean Median
              <char>           <char> <num>  <num>
1:        NCBI_Build               37    NA     NA
2:            Center genome.wustl.edu    NA     NA
3:           Samples               48    NA     NA
4:            nGenes                1    NA     NA
5:   Frame_Shift_Del                4 0.083      0
6: Missense_Mutation               39 0.812      1
7: Nonsense_Mutation                5 0.104      0
8:       Splice_Site                6 0.125      0
9:             total               54 1.125      1
Warning message:
In maftools::subsetMaf(maf = laml.plus.cn.noLoci, ranges = data.frame(chromosome = 2,  :
  Removed 20 rows with no loci info.

#Keep variants with missing loci
maftools::subsetMaf(maf = laml.plus.cn.noLoci, ranges = data.frame(chromosome = 2, start = 25450743, end = 25565459), keepNA = TRUE)
54 variants within provided ranges
-Processing clinical data
An object of class  MAF 
                   ID          summary     Mean Median
               <char>           <char>    <num>  <num>
 1:        NCBI_Build               37       NA     NA
 2:            Center genome.wustl.edu       NA     NA
 3:           Samples               64       NA     NA
 4:            nGenes                1       NA     NA
 5:           DeepDel                4 0.062500      0
 6:   Frame_Shift_Del                4 0.062500      0
 7: Missense_Mutation               39 0.609375      1
 8: Nonsense_Mutation                5 0.078125      0
 9:        ShallowAmp                6 0.093750      0
10:       Splice_Site                6 0.093750      0
11:             total               64 1.000000      1
12:               Amp                4 0.062500      0
13:               Del                6 0.093750      0
14:         CNV_total               10 0.156250      0
Warning message:
In maftools::subsetMaf(maf = laml.plus.cn.noLoci, ranges = data.frame(chromosome = 2,  :
  Added back 20 rows with no loci info.

You will have to install from GitHub to get these changes. Please let me know if this works for you.
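As a rough mental model of what keepNA controls in the outputs above, the base R sketch below sets aside rows without loci, filters the remaining rows by interval overlap, and optionally adds the set-aside rows back. It is not the maftools implementation (the error message earlier in the thread indicates that the real code uses data.table's foverlaps()); `subset_by_range` is a hypothetical helper and the three variants are toy data.

```r
# Toy variant table: two rows with loci, one CN row without (hypothetical data)
variants <- data.frame(
  Hugo_Symbol = c("DNMT3A", "DNMT3A", "DNMT3A"),
  Chromosome = c("2", "2", NA),
  Start_Position = c(25457242, 30000000, NA),
  End_Position = c(25457242, 30000000, NA),
  Variant_Classification = c("Missense_Mutation", "Missense_Mutation", "Amp"),
  stringsAsFactors = FALSE
)

# Hypothetical helper, for illustration only
subset_by_range <- function(x, chr, start, end, keepNA = FALSE) {
  # Rows that cannot take part in an overlap test
  no_loci <- is.na(x$Chromosome) | is.na(x$Start_Position) | is.na(x$End_Position)
  with_loci <- x[!no_loci, , drop = FALSE]

  # Two closed intervals overlap when each starts before the other ends
  hit <- with_loci$Chromosome == chr &
    with_loci$Start_Position <= end &
    with_loci$End_Position >= start
  out <- with_loci[hit, , drop = FALSE]

  # Optionally add back the rows that had no loci
  if (keepNA) {
    out <- rbind(out, x[no_loci, , drop = FALSE])
  }
  out
}

dropped <- subset_by_range(variants, "2", 25450743, 25565459)
kept <- subset_by_range(variants, "2", 25450743, 25565459, keepNA = TRUE)
```

Here `dropped` contains only the variant inside the range, while `kept` also contains the CN row that has no coordinates.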

grantn5 commented 5 months ago

Hi, that is amazing, thank you so much for sorting so quickly.

I will install from GitHub and let you know if I run into any issues!

grantn5 commented 4 months ago

Hi @PoisonAlien, just following up on this: I ran into no issues using the command. However, the function documentation and package wiki are now out of date and need to be updated.

PoisonAlien commented 4 months ago

Hi, thank you for testing. I have updated the package documentation and vignette, but the update has not been pushed to Bioconductor yet. I will close the issue; please feel free to reopen if needed.