BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
291 stars 111 forks source link

GDCprepare gene models for RNA-seq data #91

Open mattiasaine opened 7 years ago

mattiasaine commented 7 years ago

I have been looking into using the RNA-seq data but have some questions on the gene annotations for the V2-RNA-seq data that I get from running your GDC-prepare-function.

How do the 21022 genes that one obtains expression for via your GDCprepare-function map to the gene models in the original TCGA-gaf available here? It seems the coordinates for each gene given by your package have not been obtained from the GAF as they simply do not match the start:end coordinates for many if not most genes.

Using the "TCGA.hg19.June2011.gaf" there are 20806 unique gene identifiers of the form symbol|entrez. Your data however returns expression for 21022 genes mapping to 19947 unique symbol|entrez combos. Every row can obtain a unique identifier by adding the ENSG-id as well so I guess that is how you arrive at 21022 but I downloaded some TCGA data via the ftp before they shut it down and then the "rsem.genes.results"-file had 20531 data rows and the symbol|entrez combos were unique for each row. How can your pipeline generate data for more genes than the original GAF had models for?

Opening one of the downloaded files manually instead of running the GDCprepare I can see that the "rsem.genes.results"-file is still 20531 entries long but somehow your pipeline has extended this in some unexplained way. Does the GDC-prepare function duplicate some data or does it aggregate estimates across genes depending on how the tx-id's map to your (presumably updated) ensembl gene models?

I really love the package for downloading and processing the data but if it does too much weird stuff under the hood it might cause reproducibility issues compared to getting the data from other avenues. Why did you not simply use the gene models from the GAF and could this be added as an option for 100% reverse-compatibility with other ways of getting the data as well as prior publications?

tiagochst commented 7 years ago

Hello,

If I'm not wrong you are talking about this file: TCGA.hg19.June2011.gaf. The hg19 from June 2011 is the GRCh37.p3 (see pictures below from ensemble database).

screen shot 2017-05-17 at 8 57 14 am screen shot 2017-05-17 at 8 56 59 am

The genome of reference had some fixes over the time and we decided to use the last version available to prepare the SummarizedExperiment. For legacy data (data aligned to hg19) TCGAbiolinks is using GRCh37.p13 and for harmonized data (data aligned to hg38) now it is using GRCh38.p7.

I believe that using an old version that has been fixed over the time is not correct. That's why we prefered to use the last version available. Also, it is very important that in a publication the genome of reference used in the analysis are highlighted by the authors for reproducibility.

In terms of our code, if you set in GDCprepare function the argument summarizedExperiment to FALSE you will get a flat table without any annotation or mapping. Then the user can map on its own way.

Finnaly, You are right about the numbers (code below). The problem is simple, we used entrez gene id to map (for this case "ABCD4|5826" we select 5826 to annotate), but they might be mapped to multiple ENSG-id as showed below.

screen shot 2017-05-17 at 9 14 21 am

Unfortunately yes, the final final SummarizedExperiment id will have three lines with the same gene expression levels for this entrez gene.

library(TCGAbiolinks)
query <- GDCquery(project = "TCGA-GBM",
                  legacy = TRUE,
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq",
                  file.type = "normalized_results",
                  experimental.strategy = "RNA-Seq",
                  barcode = c( "TCGA-06-2569-01A-01R-1849-01","TCGA-06-AABW-11A-31R-A36H-07" ))
GDCdownload(query, method = "api")
data <- GDCprepare(query,summarizedExperiment = FALSE)
#> dim(data)
#[1] 20531     2
se <- GDCprepare(query,summarizedExperiment = TRUE)
#> se
#class: RangedSummarizedExperiment 
#dim: 21022 2 
#metadata(0):
#    assays(1): normalized_count
#rownames(21022): A1BG A2M ... TICAM2 SLC25A5-AS1
#rowData names(3): gene_id entrezgene ensembl_gene_id
#colnames(2): TCGA-06-2569-01A-01R-1849-01 TCGA-06-AABW-11A-31R-A36H-07
#colData names(106): patient barcode ... subtype_Telomere.length.estimate.in.blood.normal..Kb. subtype_Telomere.length.estimate.in.tumor..Kb.

Best regards, Tiago

mattiasaine commented 7 years ago

Thank you for the answer, really appreciate it!

Good to know about the summarizedExperiment = FALSE option. I think for me personally this is better as I do a lot of integration with other data so most things need to be custom mapped anyway but maybe put a disclaimer somewhere as I'm not sure everyone will think of these issues automatically. While I'm all for making the TCGA/GDC-data more accessible, the more user friendly you make this package the more likely that someone will make a mistake somewhere.

A related question for me then would be why this type of remapping doesn't seem to have been performed for the methylation data? As far as I can see the gene annotations are still the Illumina defaults when running GDCprepare on 450K-data. Now I don't know whether your analysis modules use the hg19-coordinates for the methylation probes to map to genes or whether you just use the metadata "Gene_Symbol" column, but this could introduce some inconsistencies in addition to making the extra annotations from the 450K platform (e.g. TSS, UTR, exon) inaccurate as these are mapped to different transcript models.

Also it wasn't 100% clear to me if your update changed the symbols of the symbol|entrez combos as well? I know you mapped using entrez, but did you then update the symbols as well or only add the ENSG-id with it's new coordinates?

I guess this may be another instance where the user needs to tweak stuff to their liking right now but you could add an updated 450K annotation to the GDC-prepare function to get the correct probe context with respect to your new gex mappings.

Code below is my check of the annotations for your 450K pipeline showing that illumina defaults are used.

library(TCGAbiolinks)
library(SummarizedExperiment)

##get some 450K data
meth.query <- GDCquery(project="TCGA-GBM",data.category = "DNA methylation",
platform = "Illumina Human Methylation 450",
legacy = TRUE,
access = "open",
barcode = c( "TCGA-06-5856","TCGA-26-5134" )
)

##download methylation data
GDCdownload(meth.query,chunks.per.download = 5,method = 'api')

dat<-GDCprepare(meth.query,save = FALSE)

##fetch and parse Illumina probe info for Infinium 450 platform from GEO
download.file("ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL13nnn/GPL13534/suppl/GPL13534_HumanMethylation450_15017482_v.1.1.csv.gz","GPL13534_HumanMethylation450_15017482_v.1.1.csv.gz")

pAnnot<-readLines(con="GPL13534_HumanMethylation450_15017482_v.1.1.csv.gz")

head(pAnnot,10)
# [1] "Illumina, Inc.,,,,,,,,,,,,,,,,,,,,,,,,,"                                                                                                                                                                                                                                                                                                                                                                                                                                   
# [2] "[Heading],,,,,,,,,,,,,,,,,,,,,,,,,,"                                                                                                                                                                                                                                                                                                                                                                                                                                       
# [3] "Descriptor File Name,BS0010894-AQP_content.bpm,,,,,,,,,,,,,,,,,,,,,,,,,"                                                                                                                                                                                                                                                                                                                                                                                                   
# [4] "Assay Format,Infinium 2,,,,,,,,,,,,,,,,,,,,,,,,,"                                                                                                                                                                                                                                                                                                                                                                                                                          
# [5] "Date Manufactured,6/11/2008,,,,,,,,,,,,,,,,,,,,,,,,,"                                                                                                                                                                                                                                                                                                                                                                                                                      
# [6] "Loci Count ,485553,,,,,,,,,,,,,,,,,,,,,,,,,"                                                                                                                                                                                                                                                                                                                                                                                                                               
# [7] "[Assay],,,,,,,,,,,,,,,,,,,,,,,,,,"                                                                                                                                                                                                                                                                                                                                                                                                                                         
# [8] "IlmnID,Name,AddressA_ID,AlleleA_ProbeSeq,AddressB_ID,AlleleB_ProbeSeq,Infinium_Design_Type,Next_Base,Color_Channel,Forward_Sequence,Genome_Build,CHR,MAPINFO,SourceSeq,Chromosome_36,Coordinate_36,Strand,Probe_SNPs,Probe_SNPs_10,Random_Loci,Methyl27_Loci,UCSC_RefGene_Name,UCSC_RefGene_Accession,UCSC_RefGene_Group,UCSC_CpG_Islands_Name,Relation_to_UCSC_CpG_Island,Phantom,DMR,Enhancer,HMM_Island,Regulatory_Feature_Name,Regulatory_Feature_Group,DHS"           
# [9] "cg00035864,cg00035864,31729416,AAAACACTAACAATCTTATCCACATAAACCCTTAAATTTATCTCAAATTC,,,II,,,AATCCAAAGATGATGGAGGAGTGCCCGCTCATGATGTGAAGTACCTGCTCAGCTGGAAAC[CG]AATTTGAGATAAATTCAAGGGTCTATGTGGACAAGACTGCTAGTGTCTCTCTCTGGATTG,37,Y,8553009,AGACACTAGCAGTCTTGTCCACATAGACCCTTGAATTTATCTCAAATTCG,Y,8613009,F,,,,,TTTY18,NR_001550,TSS1500,,,,,,,,,"                                                                                                                                   
#[10] "cg00050873,cg00050873,32735311,ACAAAAAAACAACACACAACTATAATAATTTTTAAAATAAATAAACCCCA,31717405,ACGAAAAAACAACGCACAACTATAATAATTTTTAAAATAAATAAACCCCG,I,A,Red,TATCTCTGTCTGGCGAGGAGGCAACGCACAACTGTGGTGGTTTTTGGAGTGGGTGGACCC[CG]GCCAAGACGGCCTGGGCTGACCAGAGACGGGAGGCAGAAAAAGTGGGCAGGTGGTTGCAG,37,Y,9363356,CGGGGTCCACCCACTCCAAAAACCACCACAGTTGTGCGTTGCCTCCTCGC,Y,9973356,R,,,,,TSPY4;FAM197Y2,NM_001164471;NR_001553,Body;TSS1500,chrY:9363680-9363943,N_Shore,,,,Y:9973136-9976273,,,"

pAnnot<-pAnnot[-1:-7]
pHeader<-pAnnot[1]
pAnnot<-pAnnot[-1]

pHeader<-unlist(strsplit(pHeader,","))

length(pHeader)
#[1] 33

##match to biolinks
length( match(rownames(dat),sub(",.+","",pAnnot) ) )
#[1] 485577

pAnnot<-pAnnot[ match(rownames(dat),sub(",.+","",pAnnot) ) ]

identical( rownames(dat),sub(",.+","",pAnnot) )
#[1] TRUE

pAnnot<-lapply(pAnnot,function(x) unlist(strsplit(x,",")) )

unique( unlist( lapply(pAnnot,length) ) )
#[1] 33 32

table( unlist( lapply(pAnnot,length) ) )
#    32     33
#425661  59916

pAnnot<-lapply(pAnnot,function(x) c(x,rep("",33-length(x))) )

unique( unlist( lapply(pAnnot,length) ) )
#[1] 33

pAnnot<-do.call("rbind",pAnnot)
colnames(pAnnot)<-pHeader
rm(pHeader)
rownames(pAnnot)<-pAnnot[,"IlmnID"]

all.equal(rownames(pAnnot),rownames(dat))
#[1] TRUE

##Are coordinates equal
table(pAnnot[,"MAPINFO"]==as.character(start(dat)))
# FALSE   TRUE
#    65 485512 

##the rs-probes have no coordinates
pAnnot[pAnnot[,"MAPINFO"]=="","MAPINFO"]<-"0"

all(as.integer(pAnnot[,"MAPINFO"])==start(dat))
#[1] TRUE

##Are symbols from default illumina file?
table(pAnnot[,"UCSC_RefGene_Name"]==elementMetadata(dat)$Gene_Symbol)
# FALSE   TRUE
#144886 340626

##Duplicate symbol removal
newSyms<-unlist(lapply(pAnnot[,"UCSC_RefGene_Name"],function(x) { paste(unique(unlist(strsplit(x,";"))),collapse=";") }))

all(newSyms==elementMetadata(dat)$Gene_Symbol)
#[1] TRUE
tiagochst commented 7 years ago

I'll add a notice in the documentation about the summarizedExperiment = FALSE and that you are how we are using our own annotation.

You are right, there was no update for DNA methylation metadata. You can get the most updated version here: http://zwdzwd.github.io/InfiniumAnnotation.The hg38 annotation from GDC is using it, but not the hg19.

There was a reason why we did not update the hg19 annotation, I believe because when we did there were almost no changes. But you are right, they should be updated. I'm updating those information on another package, i just did not had time to add the code in this package, but I'll try do it soon.

For the symbol|entrez, only entrez are kept, everything else is updated. Because, entrez gene id were supposed to be fixed while symbols might change over the time (this is one of the reasons we have mapping like ?|entrez)

ghost commented 6 years ago

I am beginner of TCGAbiolinks. I have got following error after running the command


PRAD <- GDCquery(project ="TCGA-PRAD",
                 data.category = "Transcriptome Profiling",
                 data.type = "miRNA Expression Quantification",
                 experimental.strategy = "miRNA-Seq",
                 workflow.type = "BCGSC miRNA Profiling",
                 access = "open",
                 file.type = "mirbase21",
                 sample.type = c("Primary solid Tumor","Recurrent Solid Tumor","Solid Tissue Normal","Recurrent Blood Derived Cancer - Peripheral Blood"))

#download the above query data which is "PRAD"
GDCdownload(PRAD)                 

# prepare data
data <- GDCprepare(PRAD)
dim(data)

#save data 
#data <- GDCprepare(data, dir = "Prostate", save = TRUE, filename = "PRAD_trial.rda")                 
data<-GDCprepare(data, directory = "GDCdata", summarizedExperiment = TRUE, save=TRUE, save.filename = "PRAD_trial.rda")

The error message is ######### Error in if (any(duplicated(query$results[[1]]$cases)) & query$data.type != : argument is of length zero ########## How to solve the problem ?

tiagochst commented 6 years ago

@Heissluft Your code is right until #save data, after that you do GDCprepare(data) but it should be GDCprepare(query) (in your case GDCprepare(PRAD)) If you change the last line to: data<-GDCprepare(PRAD, directory = "GDCdata", summarizedExperiment = TRUE, save=TRUE, save.filename = "PRAD_trial.rda"), it should work.

ghost commented 6 years ago

@tiagochst thanks....you saved lot of my time. Would you like to suggest any reference for following steps after preparing .rda file. For example. I would like to compare all clinical parameters with DE mirnas in controls and prostate cancer patients based on their PSA value or Gleason score ? I used following command query<-TCGAquery_subtype(tumor = "PRAD") which gave me a table of clinical features. How to link the clinical data with .rda file to find out DE miRNAs Any reference which is sort of "step by step guide" would be appreciated. Thanks

ghost commented 6 years ago

@tiagochst It creates the file but when I run

`

>data

` afterwards, I don't get anything such as

#> dim(data)
#[1] 20531     2
se <- GDCprepare(query,summarizedExperiment = TRUE)
#> se
#class: RangedSummarizedExperiment 
#dim: 21022 2 
#metadata(0):
#    assays(1): normalized_count
#rownames(21022): A1BG A2M ... TICAM2 SLC25A5-AS1
#rowData names(3): gene_id entrezgene ensembl_gene_id
#colnames(2): TCGA-06-2569-01A-01R-1849-01 TCGA-06-AABW-11A-31R-A36H-07
#colData names(106): patient barcode ... subtype_Telomere.length.estimate.in.blood.normal..Kb. subtype_Telomere.length.estimate.in.tumor..Kb.

This means it did not create a matrix required for edgeR ?

tiagochst commented 6 years ago

Hello,

TCGAquery_subtype returns a table with data from TCGA marker papers. It is not the same object as the one returned by GDCquery.

screen shot 2018-01-02 at 11 27 56 am

The information you get fromTCGAquery_subtype is automatically added to the ColData matrix in the Summarized Experiment with subtype_ prefix. However, for miRNAs, as we are not able to map all miRNA to genomic positions, TCGAbiolinks does not create a Summarized Experiment. It will be always a data frame.

screen shot 2018-01-02 at 11 25 49 am

What I understood, you are trying to do a DE (Tumor vs Normal) and correlate the DE miRNA with clinical features. The best way to visualize would be to create a heatmap for the DE miRNA adding the clinical information on top of each sample. @claudiacava works with miRNA and has the SpidermiR package that has function to perform DE for miRNA. I believe she is the best person to help you.

Code


query <- GDCquery(project ="TCGA-PRAD",
                 data.category = "Transcriptome Profiling",
                 data.type = "miRNA Expression Quantification",
                 experimental.strategy = "miRNA-Seq",
                 workflow.type = "BCGSC miRNA Profiling",
                 access = "open",
                 file.type = "mirbase21",
                 sample.type = c("Primary solid Tumor", 
                                 "Recurrent Solid Tumor","
                                 Solid Tissue Normal",
                                 "Recurrent Blood Derived Cancer - Peripheral Blood"))

# download the above query data which is "PRAD"
GDCdownload(query)

# save data
data <- GDCprepare(query, 
                   directory = "GDCdata", 
                   summarizedExperiment = TRUE, 
                   save = TRUE, 
                   save.filename = "PRAD_trial.rda")
torongs82 commented 6 years ago

Hi @Heissluft and @tiagochst happy new year 2018. If you are interested to perform Differentially Expression Analysis with microRNA you can follow an example in our vignette : http://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html#mirna_expression_data:_downstream_analysis_brca .

require(TCGAbiolinks)
CancerProject <- "TCGA-BRCA"
DataDirectory <- paste0("../GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","miRNA_gene_quantification",".rda")

query.miR <- GDCquery(project = CancerProject, 
                  data.category = "Gene expression",
                  data.type = "miRNA gene quantification",
                  file.type = "hg19.mirna",
                  legacy = TRUE)

samplesDown.miR <- getResults(query.miR,cols=c("cases"))

dataSmTP.miR <- TCGAquery_SampleTypes(barcode = samplesDown.miR,
                                  typesample = "TP")

dataSmNT.miR <- TCGAquery_SampleTypes(barcode = samplesDown.miR,
                                  typesample = "NT")
dataSmTP_short.miR <- dataSmTP.miR[1:10]
dataSmNT_short.miR <- dataSmNT.miR[1:10]

queryDown.miR <- GDCquery(project = CancerProject, 
                      data.category = "Gene expression",
                      data.type = "miRNA gene quantification",
                      file.type = "hg19.mirna",
                      legacy = TRUE,
                      barcode = c(dataSmTP_short.miR, dataSmNT_short.miR))

GDCdownload(query = queryDown.miR,
            directory = DataDirectory)

dataAssy.miR <- GDCprepare(query = queryDown.miR, 
                           save = TRUE, 
                           save.filename = FileNameData, 
                           summarizedExperiment = TRUE,
                           directory =DataDirectory )
rownames(dataAssy.miR) <- dataAssy.miR$miRNA_ID

# using read_count's data 
read_countData <-  colnames(dataAssy.miR)[grep("count", colnames(dataAssy.miR))]
dataAssy.miR <- dataAssy.miR[,read_countData]
colnames(dataAssy.miR) <- gsub("read_count_","", colnames(dataAssy.miR))

dataFilt <- TCGAanalyze_Filtering(tabDF = dataAssy.miR,
                                  method = "quantile", 
                                  qnt.cut =  0.25)   

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,dataSmNT_short.miR],
                            mat2 = dataFilt[,dataSmTP_short.miR],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            method = "glmLRT")  

Can you explain with more details what do you mean for 'Would you like to suggest any reference for following steps after preparing *.rda file. For example. I would like to compare all clinical parameters with DE mirnas in controls and prostate cancer patients based on their PSA value or Gleason score ?' I can think about how to help you.

Best, Antonio.

ghost commented 6 years ago

@torongs82 Thanks for detailed R code. If you look at the first terminal screen shot shared by @tiagochst , there is "Reviewed_Gleason_Category". I would like to split them Gleanson category as "below 7" and "above 7", then do DE analysis ? There might be other variables in the clinical data as well such as PSA (prostate specific antigen) value to be used for similar analysis. I have one additional question about TCGA biobank which you may know better about. What is the difference between "Harmonized data; Data. category -Transcriptome profiling; Data.type- miRNA expression quantification" AND "Legacy data; Data.category- Gene expression; Data.type-miRNA gene quantification". I am interested in getting raw counts (as we get in miRNA sequencing) to run into edgeR OR DEseq ?
Thanks

ghost commented 6 years ago

hI @torongs82 @tiagochst Thanks for detailed R code. I managed to follow your code and find results. Now, I would like to compare the DE miRNAs using vital information. e.g. only patients with available data (dead or alive) in clinical.info. I would like to exclude those patients without vital information. How can I do it while downloading "miRNA gene quantification" data. E.g. dataSmTP.miR data includes miRNA only from those with dead/ alive information. Thanks

ghost commented 5 years ago

Hi @torongs82 @tiagochst. I also really thank you for detailed R code. When I follow up your code I had one question. I wonder whether I need Preprocessing or Normalization process using TCGAanalyze_Preprocessing or TCGA analyze_Normalization in the analysis of DE-miRNA. Because I tried two processes above, but It didn't work well (I think miRNA data have dataframe type and two functions above do not input dataframe type.) If you give me reply on what I asked, It will be greatly appreciated. I look forward to hearing you. Thank you.