PoisonAlien / maftools

Summarize, Analyze and Visualize MAF files from TCGA or in-house studies.
http://bioconductor.org/packages/release/bioc/html/maftools.html
MIT License
448 stars 222 forks source link

Help about error with oncoplot function #154

Closed Jasonmbg closed 6 years ago

Jasonmbg commented 6 years ago

Dear Anand,

i would like to ask you some important questions regarding your great R package maftools, and especially regarding the function oncoplot. I tried through the vignette to proceed, but without success.

In detail, through the R package TCGAbiolinks i have used a small gene signature (12 genes) with the COAD TCGA RNA-Seq dataset, that i used with consensus clustering, and showed some interesting grouped survival patterns-

Now, my collaborators asked me if i can found any mutational patterns identified for the 12 genes, and if feasible plot them with the same samples that resulted from the clustering analysis, with the exact cluster membership (groups HC column below)-

Thus, my questions and issues are the following:

A) My process for downloading mutational data:

library(dplyr)
library(DT)
library(maftools)
library(TCGAbiolinks)

maf.COAD <- GDCquery_Maf("COAD", pipelines = "muse") %>% read.maf

selected.signature <-  c("NCOA1","CCT7","UGP2"...)

maf.COAD
An object of class  MAF 
                        ID                    summary    Mean Median
 1:             NCBI_Build                     GRCh38      NA     NA
 2:                 Center BCM;WUGSC;WUGSC;BCM;BCM;BI      NA     NA
 3:                Samples                        399      NA     NA
 4:                 nGenes                      17616      NA     NA
 5:      Missense_Mutation                     118256 296.381     81
 6:      Nonsense_Mutation                       9381  23.511      7
 7:       Nonstop_Mutation                         95   0.238      0
 8:            Splice_Site                       2412   6.045      2
 9: Translation_Start_Site                        152   0.381      0
10:                  total                     130296 326.556     89

dim(pdat.clusters) # the annotated data frame from the transcriptomic data, with the rows as the TCGA patient names, and in the columns some important clinical information:

[1] 456   5

head(pdat.clusters)
                                        subtype_MSI_status subtype_expression_subtype
TCGA-3L-AA1B-01A-11R-A37K-07       NotAvailable               NotAvailable
TCGA-T9-A92H-01A-11R-A37K-07       NotAvailable               NotAvailable
TCGA-CA-6716-01A-11R-1839-07       NotAvailable               NotAvailable
TCGA-CM-6680-01A-11R-1839-07       NotAvailable               NotAvailable
TCGA-D5-6928-01A-11R-1928-07       NotAvailable               NotAvailable
TCGA-A6-A565-01A-31R-A28H-07       NotAvailable               NotAvailable
                                            subtype_histological_type groupsHC
TCGA-3L-AA1B-01A-11R-A37K-07              NotAvailable      EC1
TCGA-T9-A92H-01A-11R-A37K-07              NotAvailable      EC1
TCGA-CA-6716-01A-11R-1839-07              NotAvailable      EC1
TCGA-CM-6680-01A-11R-1839-07              NotAvailable      EC1
TCGA-D5-6928-01A-11R-1928-07              NotAvailable      EC1
TCGA-A6-A565-01A-31R-A28H-07              NotAvailable      EC1
                                                Tumor_Sample_Barcodes
TCGA-3L-AA1B-01A-11R-A37K-07 TCGA-3L-AA1B-01A-11R-A37K-07
TCGA-T9-A92H-01A-11R-A37K-07 TCGA-T9-A92H-01A-11R-A37K-07
TCGA-CA-6716-01A-11R-1839-07 TCGA-CA-6716-01A-11R-1839-07
TCGA-CM-6680-01A-11R-1839-07 TCGA-CM-6680-01A-11R-1839-07
TCGA-D5-6928-01A-11R-1928-07 TCGA-D5-6928-01A-11R-1928-07
TCGA-A6-A565-01A-31R-A28H-07 TCGA-A6-A565-01A-31R-A28H-07

So, my issue that clearly my above data frame from the RNA-Seq transcriptomic dataset, includes more patients (456) than the clinical mutation data(399).

Thus, my goal is-if feasible-to create with the oncoplot for the same patients in both types of data, and plot these with the selected genes, as also annotate with the groupsHC column from above, even if some data are missing.

On this premise, i naively tried:

oncoplot(maf = maf.COAD, genes=selected.signature,
 removeNonMutated = FALSE, annotationDat=pdat.clusters,clinicalFeatures = 'groupsHC',sortByAnnotation = TRUE)

Error in `[.data.frame`(annotationDat, , c("Tumor_Sample_Barcode", clinicalFeatures)) : 
  undefined columns selected

So, in your opinion how this could fixed ?

Thank you in advance,

Efstathios-Iason

ShixiangWang commented 6 years ago

@Jasonmbg I think you should read the doc of oncoplot function, it says annotationDat argument is used to provide a custom clinical data, and specify which columns to be drawn using clinicalFeatures arguments.

Once you provide data to annotationDat to oncoplot function, it will not use clinical data in the MAF object. The problem is that the function can only process one dataframe used to annotate while you provide two. I suggest you merge them into one, like MAF@clinicalData <- merge(MAF@clinicalData, pdat.cluster) or something else like this and then use the oncoplot function.

Jasonmbg commented 6 years ago

Dear @ShixiangWang , thank you for your quick answer and comment-of course i have read initially the description of the function oncoplot, but because due to my specific goal desribed above, I'm a bit struggling to set this correctly, as also keep only the same patients between the transcriptomic data-pdat.clusters-and the plotted clinical samples of the relative MAF file. Moreover, i tried your suggestion, but still returned an error:

maf.COAD@clinical.data
             Tumor_Sample_Barcode
  1: TCGA-AA-A010-01A-01D-A17O-10
  2: TCGA-CA-6717-01A-11D-1835-10
  3: TCGA-AZ-4315-01A-01D-1408-10
  4: TCGA-AA-3984-01A-02D-1981-10
  5: TCGA-AA-A00N-01A-02D-A17O-10
 ---                             
395: TCGA-F4-6704-01A-11D-1835-10
396: TCGA-AA-3972-01A-01W-0995-10
397: TCGA-A6-5664-01A-21D-1835-10
398: TCGA-CA-5255-01A-11D-1835-10
399: TCGA-AZ-4323-01A-21D-1835-10

head(pdat.clusters)
                             subtype_MSI_status subtype_expression_subtype
TCGA-3L-AA1B-01A-11R-A37K-07       NotAvailable               NotAvailable
TCGA-T9-A92H-01A-11R-A37K-07       NotAvailable               NotAvailable
TCGA-CA-6716-01A-11R-1839-07       NotAvailable               NotAvailable
TCGA-CM-6680-01A-11R-1839-07       NotAvailable               NotAvailable
TCGA-D5-6928-01A-11R-1928-07       NotAvailable               NotAvailable
TCGA-A6-A565-01A-31R-A28H-07       NotAvailable               NotAvailable
                             subtype_histological_type groupsHC
TCGA-3L-AA1B-01A-11R-A37K-07              NotAvailable      EC1
TCGA-T9-A92H-01A-11R-A37K-07              NotAvailable      EC1
TCGA-CA-6716-01A-11R-1839-07              NotAvailable      EC1
TCGA-CM-6680-01A-11R-1839-07              NotAvailable      EC1
TCGA-D5-6928-01A-11R-1928-07              NotAvailable      EC1
TCGA-A6-A565-01A-31R-A28H-07              NotAvailable      EC1
                                     Tumor_Sample_Barcode
TCGA-3L-AA1B-01A-11R-A37K-07 TCGA-3L-AA1B-01A-11R-A37K-07
TCGA-T9-A92H-01A-11R-A37K-07 TCGA-T9-A92H-01A-11R-A37K-07
TCGA-CA-6716-01A-11R-1839-07 TCGA-CA-6716-01A-11R-1839-07
TCGA-CM-6680-01A-11R-1839-07 TCGA-CM-6680-01A-11R-1839-07
TCGA-D5-6928-01A-11R-1928-07 TCGA-D5-6928-01A-11R-1928-07
TCGA-A6-A565-01A-31R-A28H-07 TCGA-A6-A565-01A-31R-A28H-07

maf.COAD@clinicalData <- merge(maf.COAD@clinicalData, pdat.clusters)
Error in merge(maf.COAD@clinicalData, pdat.clusters) : 
  no slot of name "clinicalData" for this object of class "MAF"
PoisonAlien commented 6 years ago

Hi Jason,

Sorry for late reply. Yes you can do it.. You will have to pass your annotation file (from RNA seq) while reading your maf file. Something like this

#Doanload maf to local file
maf.COAD <- GDCquery_Maf("COAD", pipelines = "muse", save.csv = T, directory = "./")
#pass pdat.clusters as clinical data
coad = read.maf(maf = "70cb1255-ec99-4c08-b482-415f8375be3f/TCGA.COAD.muse.70cb1255-ec99-4c08-b482-415f8375be3f.DR-10.0.somatic.maf", 
clinicalData = pdat.clusters #These are your annotations
#Draw
oncoplot(maf = maf.COAD, genes=selected.signature, clinicalFeatures = "groupsHC")
)

If you want to do some enrichment for within groupsHC groups

coad.ce = clinicalEnrichment(maf = coad.maf, clinicalFeature = "groupsHC")
plotEnrichmentResults(enrich_res = coad.ce, pVal = 0.05)

Let me know if you have any issues. Also please post your sessionInfo , it would be easier to know if youre using latest version.

P.S: Thanks @ShixiangWang , you were almost close.

ShixiangWang commented 6 years ago

@PoisonAlien Great!

@Jasonmbg I just not use real code..

Maybe you can use the following code

maf.COAD@clinical.data = dplyr::full_join(maf.COAD@clinical.data, pdat.clusters, by="Tumor_Sample_Barcode") %>% data.table::data.table()

to merge the feature data.

I used maf built in maftools to test the code.

Jasonmbg commented 6 years ago

Dear @PoisonAlien ,

thank you for your suggestions and advice-however, when i runned your code chunk:

maf.COAD <- GDCquery_Maf("COAD", pipelines = "muse", save.csv = T, directory = "./")

File created: .//TCGA.COAD.muse.70cb1255-ec99-4c08-b482-415f8375be3f.DR-10.0.somatic.maf.csv
Warning message:
Unknown or uninitialised column: 'sample'. 

coad = read.maf(maf = "TCGA.COAD.muse.70cb1255-ec99-4c08-b482-415f8375be3f.DR-10.0.somatic.maf", clinicalData = pdat.clusters)
reading maf..

Error in data.table::fread(input = maf, sep = "\t", stringsAsFactors = FALSE,  : 
  File 'TCGA.COAD.muse.70cb1255-ec99-4c08-b482-415f8375be3f.DR-10.0.somatic.maf' does not exist; getwd()=='C:/Users/stathis/Desktop/COAD.Mutations'. Include correct full path, or one or more spaces to consider the input a system command.

So, what do you think of this error ? something with windows ?

getwd()
[1] "C:/Users/stathis/Desktop/COAD.Mutations"
list.files()
[1] "pdata.COAD.clusters.reordered.txt"                                          
[2] "TCGA-COAD"                                                                  
[3] "TCGA.COAD.muse.70cb1255-ec99-4c08-b482-415f8375be3f.DR-10.0.somatic.maf.csv"
Jasonmbg commented 6 years ago

@ShixiangWang thank you for your time-iit still returned an error

Error in dplyr::full_join(maf.COAD@clinical.data, pdat.clusters, by = "Tumor_Sample_Barcode") : 
  trying to get slot "clinical.data" from an object (class "tbl_df") that is not an S4 object
PoisonAlien commented 6 years ago

Hi, point to the downloaded file. I guess its inside "TCGA-COAD" directory ?

Jasonmbg commented 6 years ago

Dear PoisonAlien,

the TCGA-COAD directory has:

(TCGA-COAD>harmonized>Simple_Nucleotide_Variation>Masked_Somatic_Mutation>70cb1255-ec99-4c08-b482-415f8375be3f>TCGA.COAD.muse.70cb1255-ec99-4c08-b482-415f8375be3f.DR-10.0.somatic.maf)

but the relative file is in WinRAR mode-

and also, there is directly a relative csv without entering the above file, with name:

"TCGA.COAD.muse.70cb1255-ec99-4c08-b482-415f8375be3f.DR-10.0.somatic.maf.csv"

like i printed above:

list.files() [1] "pdata.COAD.clusters.reordered.txt"
[2] "TCGA-COAD"
[3] "TCGA.COAD.muse.70cb1255-ec99-4c08-b482-415f8375be3f.DR-10.0.somatic.maf.csv"

PoisonAlien commented 6 years ago

Do you get this error for both the files ? (the one inside COAD-COAD directory and the csv file) Becaus your error says file not found, maybe you entered wrong file name ?

coad = read.maf(maf = "TCGA-COAD>harmonized>Simple_Nucleotide_Variation>Masked_Somatic_Mutation>70cb1255-ec99-4c08-b482-415f8375be3f>TCGA.COAD.muse.70cb1255-ec99-4c08-b482-415f8375be3f.DR-10.0.somatic.maf")

#or

coad = read.maf(maf = "TCGA.COAD.muse.70cb1255-ec99-4c08-b482-415f8375be3f.DR-10.0.somatic.maf.csv")

Make sure the file names and paths are correct.

Jasonmbg commented 6 years ago

@PoisonAlien , i tried some modifications with setting current paths with Rstudio-perhaps a bug of Rstudio-rerun again with the relative .csv file, but still now another error appeared:

coad = read.maf(maf = "TCGA.COAD.muse.70cb1255-ec99-4c08-b482-415f8375be3f.DR-10.0.somatic.maf.csv", clinicalData = pdat.clusters)
reading maf..
Error in validateMaf(maf = maf, isTCGA = isTCGA, rdup = removeDuplicatedVariants,  : 
  missing required fields from MAF: Chromosome
PoisonAlien commented 6 years ago

Omg. so many errors ! Can you give it a last try, and read the compressed file within the TCGA-COAD directory..

If you still get an error, can you share your MAF file and pdata file ? Its hard to guess the issue without raw/reproducible file.. You can email me at anandmt3@gmail.com if you don't want to attach it here.

Jasonmbg commented 6 years ago

Dear @PoisonAlien ,unfortunately i will contact you on email because various things continue to appear-