How to use TCGAbiolinks to download raw RSEM gene counts for specific type of cancer (COAD RNA-Seq dataset)

Jasonmbg commented 7 years ago

Dear Antonio,

after our discussion, i also start a post here in order to get full suggestions and feedback !! Briefly, based on my current specific project regarding colorectal cancer, i would like to download either the COAD RNA-Seq dataset/or the READ dataset, in order to perform some differential gene expression (if also any normal samples are available) and also some comparisons if any molecular subtypes are also present. So, my specific questions are the following:

1) in order to download "raw RSEM gene counts", for further analysis:

query.exp <- GDCquery(project= "TCGA-COAD", 

data.category = "Gene expression", 
data.type = "Gene expression quantification",
 platform = "Illumina HiSeq", experimental.strategy = "RNA-Seq"
 legacy = TRUE)

GDCdownload(query, method = "api")
expdat <- GDCprepare(query = query.exp,
save = TRUE, save.filename = "exp.rda")

however, in the above section, as the dataset has the label "legacy" (also present in the tutorial as info), i should also include the argument file.type= "results" ?

or my above implementation is wrong ?

2) Secondly, the argument legacy also changes the reference genome alignment results ? for instance, if i do not use for the READ dataset the argument lecagy=TRUE ? and essentially it would be more approapriate to use the approach alignment against hg38 ?

3) Moreover,, if i would like to see the sample_type for the downloaded data-for istance if there are any normal samples, except tumors-, i should use:

clin <- GDCquery_clinic("TCGA-COAD", type = "clinical", 
save.csv = TRUE) ??

4) Finally, it is better to use/install the latest version of github, in order to implement the latest capabilities ?

Please excuse me for any naive questions (i have also read the paper and relative vignette), but i would like to be certain for any approach in order to use appropriately the data !!

tiagochst commented 7 years ago

Hi,

1) Yes, you will need to specify file.type= "results". The data you aligned against hg19 which is in the legacy archive and this is the only way to filter the files for this case. Just to complement, gene expression quantification for IlluminaHiSeq_RNASeqV2 platform we has the following files source:

.rsem.genes.results
.rsem.genes.normalized_results

# Gene expression aligned against hg19.
query.exp.hg19 <- GDCquery(project = "TCGA-COAD",
                           data.category = "Gene expression",
                           data.type = "Gene expression quantification",
                           platform = "Illumina HiSeq", 
                           file.type  = "results",
                           experimental.strategy = "RNA-Seq",
                           legacy = TRUE)
GDCdownload(query, method = "api")
expdat <- GDCprepare(query = query.exp, save = TRUE, save.filename = "exp.rda")
sample.info <- SummarizedExperiment::colData(expdat)

2) Yes, it will change. If you set legacy to FALSE it will download from the harmonized database. The description is below:

There are two available sources to download GDC data using TCGAbiolinks:

GDC Legacy Archive: provides access to an unmodified copy of data that was previously stored in CGHub and in the TCGA Data Portal hosted by the TCGA Data Coordinating Center (DCC), in which uses as references GRCh37 (hg19) and GRCh36 (hg18).
GDC harmonized database: data available was harmonized against GRCh38 (hg38) using GDC Bioinformatics Pipelines which provides methods to the standardization of biospecimen and clinical data.

3) No. The clinical data only accounts for the patient information and not its samples, for samples information please use either the biospecimen biospecimen <- GDCquery_clinic("TCGA-COAD", type = "biospecimen") ). But the best way it to access the samples matrix in the object prepare with SummarizedExperiment::colData(expdat) to see the samples you downloaded. The column definition has the information you want.

4) Yes, GitHub has the most updated version of the code. However, it might has some bugs. We are trying to keep an updated version in Bioconductor devel version, but it takes some days to be available.

Sorry for the delay.

Jasonmbg commented 7 years ago

Dear Tiago,

thank you for your comprehensive answer and valuable comments !! Just some thoughts on this matter, to be certain that i have understood you correctly:

1) If im correct, the legacy option refers to other alignment and preprocessing options, as there have been to the older TCGA repository? That is as you have mentioned hg19 (and or hg18 regarding the specific dataset queried)?

2) Also regarding your answer to the first question: leaving file.type= "results", refers to the rsem gene counts, not the normalized ones, right ?

3) On the other hand, if i would like to use the harmonized version-with the more recent genomes, etc-, i should use the following command for the same dataset as above:

query.exp.hg38 <- GDCquery(project = "TCGA-COAD",
                              data.category = "Trascriptome Profiling",
                              data.type = "Gene expression quantification",
                              workflow.type="HTSeq-Counts)

? without specifying anything else ? and then the same commands for download you used above?

4) Regarding the samples information, such as which type of sample (i.e normal or cancer), the function SummarizedExperiment::colData(expdat) you proposed is something alternative than the GDCquery_clinic ? Moreover, if i would like to access information about the patients (i.e. sex, age etc), i then have necessary to use the function i included in my post ? and then somehow merge it with the sample information, for a complete-like phenodata table ?

5) Finally, to access possible molecular subtype information, i should use something like the following:

COAD_path_subtypes <- TCGAquery_subtype(tumor = "coad")

? and then merge the information with the other sample information ?

Thank you for your consideration and valuable help on this matter !!

tiagochst commented 7 years ago

1) Yes, setting Legacy to TRUE will access the same data that was available in the TCGA repository. The data was aligned mainly against Genome Reference Consortium GRCh37 (hg19), but there were also some aligned against hg18.

2) Right.

file.type= "results" will select the .rsem.genes.results files
file.type= "normalized_results" will select the .rsem.genes.normalized_results files

3) Right. For RNA-Seq gene expression, the files accessible in the GDC Data Portal are described in GDC documentation. The code for each one is below.

Raw Read Counts: The number of reads aligned to each protein-coding gene, calculated by HT-Seq.

query.exp.hg38 <- GDCquery(project = "TCGA-COAD",
                              data.category = "Trascriptome Profiling",
                              data.type = "Gene expression quantification",
                              workflow.type="HTSeq-Counts")

FPKM: A normalized expression value that takes into account each protein-coding gene length and the number of reads mappable to all protein-coding genes.

query.exp.hg38 <- GDCquery(project = "TCGA-COAD",
                              data.category = "Trascriptome Profiling",
                              data.type = "Gene expression quantification",
                              workflow.type="HTSeq - FPKM-UQ")

FPKM-UQ: A normalized raw read count in which gene expression values, in FPKM, are divided by the 75th percentile value.

query.exp.hg38 <- GDCquery(project = "TCGA-COAD",
                              data.category = "Trascriptome Profiling",
                              data.type = "Gene expression quantification",
                              workflow.type="HTSeq - FPKM")

4) SummarizedExperiment::colData(expdat) is the complete phenodata table. We merged clinical, sample and subtype information on that table. GDCquery_clinic is inside this table (it is a subset), so you don't actually need to use GDCquery_clinic.

5) Subtype information is in the sample matrix SummarizedExperiment::colData(expdat) will also give that matrix. Just to summarize:

SummarizedExperiment::colData(expdat) = GDCquery_clinic+ TCGAquery_subtype + Sample information

woodhaha commented 7 years ago

Thanks @tiagochst for your works on develop the pkg TCGAbiolinks and TCGAbiolinksGUI.

venkan commented 5 years ago

Hi @tiagochst

I'm interested in downloading .rsem.genes.normalized_results files for TCGA-BRCA data using TCGAbiolinks from Harmonized database.

I tried this way but there is an error. Could you please check the following and tell me what I have to change for downloading.

query <- GDCquery(project = "TCGA-BRCA",
                  experimental.strategy = "RNA-Seq",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  file.type  = "normalized_results",
                  platform = "Illumina HiSeq",
                  legacy = FALSE)

Following is the error I got.

Error in GDCquery(project = "TCGA-BRCA", experimental.strategy = "RNA-Seq",  : 
  Please set a valid platform argument from the list below:
  =>

I want RSEM normalized counts for both protein coding genes and lncRNAs also.

torongs82 commented 5 years ago

Hi @venkan thank you for using our tool TCGAbiolinks and for writing here.

For harmonized database you can retrieve the htseq counts:

DataDirectory <- paste0("../GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","HTSeq_Counts",".rda")

query <- GDCquery(project = CancerProject,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts")

samplesDown <- getResults(query,cols = c("cases"))

dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "TP")
dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "NT")
dataSmTP_short <- dataSmTP[1:10]
dataSmNT_short <- dataSmNT[1:10]

queryDown <- GDCquery(project = CancerProject, 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification", 
                      workflow.type = "HTSeq - Counts", 
                      barcode = c(dataSmTP_short, dataSmNT_short))

GDCdownload(query = queryDown)

dataPrep1 <- GDCprepare(query = queryDown, 
                        save = TRUE, 
                        save.filename = "TCGA_BRCA_HTSeq_Countds.rda")

is this fine for you?

venkan commented 5 years ago

Hi @torongs82 thanq for the reply.

I already downloaded HTSeq raw counts from the harmonized database using TCGAbiolinks. But for some analysis I'm interested in RSEM normalized counts.

Is there a way to download rsem normalized counts for both pc and lncRNA genes from harmonized database?

torongs82 commented 5 years ago

hi @venkan for the harmonized GDC databases (mRNA) you can see here the different datafiles you can retrieve using TCGAbiolinks:

https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/

is the FPKM, or FPKM-UQ fine for you?

you can retrieve the RSEM normalized counts by for the data aligned against hg19 (legacy = TRUE)

venkan commented 5 years ago

Hi @torongs82 I actually need RSEM normalized counts for the data aligned against hg38. Anyways thanq for the reply.

torongs82 commented 5 years ago

Hi @venkan I see your point according to my knowledge we have only RSEM normalized counts for hg19 in GDC, but I can be wrong, @tiagochst please can you double check?

tiagochst commented 5 years ago

You are right @torongs82. We have RSEM only for hg19. For hg38 only FPKM.

venkan commented 5 years ago

ok no problem. Thank you @tiagochst @torongs82

husain223 commented 4 years ago

I am working on TCGA data set of LUAD. i have downloaded the data mi RNA and clinical data in CSV format can anyone explain me how can i preprocess and normalize data for further work using TCGAbiolinks package

w2felix commented 4 years ago

@tiagochst do you know by chance, whether one could also retrieve the FPKM values for the hg19 aligned samples?

khalvatib commented 3 years ago

hi,i want download rna seq lung cancer expression data from gdc . can anyone explain me how can i preprocess and normalize data for further work using TCGAbiolinks package thankx .

khalvatib commented 3 years ago

source code out should be represented gene expression matrix row: sample column: gene

BioinformaticsFMRP / TCGAbiolinks

How to use TCGAbiolinks to download raw RSEM gene counts for specific type of cancer (COAD RNA-Seq dataset) #59