BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
298 stars 112 forks source link

raw count not integer #190

Open arrayprofile opened 6 years ago

arrayprofile commented 6 years ago

Hi, I downloaded ovarian cancer RNA-seq data, but find out not all raw count data are integers, can someone explain why?


query <- GDCquery(project = "TCGA-OV",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq", 
                  file.type  = "results",
                  experimental.strategy = "RNA-Seq",
                  legacy = TRUE)

GDCdownload(query, method = "api", files.per.chunk = 10)
data <- GDCprepare(query)

library(SummarizedExperiment)
count<-assay(data,"raw_count")

 count[1:5,1:5]
            TCGA-24-0975-01A-02R-1565-13 TCGA-36-1575-01A-01R-1566-13
A1BG|1                            141.00                       127.02
A2M|2                           13424.91                      5503.76
NAT1|9                            235.00                       147.00
NAT2|10                             1.00                         1.00
SERPINA3|12                      1550.00                        77.00
            TCGA-13-0799-01A-01R-1564-13 TCGA-24-1556-01A-01R-1566-13
A1BG|1                           1106.68                       298.13
A2M|2                            4694.70                    123026.35
NAT1|9                            296.00                       335.00
NAT2|10                             6.00                         4.00
SERPINA3|12                        28.00                      2535.00
            TCGA-24-1548-01A-01R-1566-13
A1BG|1                            125.90
A2M|2                           17995.42
NAT1|9                            333.00
NAT2|10                            11.00
SERPINA3|12                       461.00

Gene "A2M" has non-integer raw counts, why?

Also what does the digit after the gene symbol mean (e.g., "2" in A2M|2)?

torongs82 commented 6 years ago

Hi @arrayprofile thank your for using TCGAbiolinks. If you consider data.type as “Gene expression quantification" and file.type as "results" you are using Level 3 expression data that uses MapSplice (Wang et al., 2010) to do the alignment and RSEM to perform the quantitation (Li et al., 2010).

The digit after the gene symbol is the GeneID For the https://www.ncbi.nlm.nih.gov/gene/?term=2%5Buid%5D

tiagochst commented 6 years ago

Hello,

A2M|2 means: Gene Symbol| Entrez Gene ID

screen shot 2018-02-02 at 8 44 05 pm

Also, I check the files and the raw counts are not integers.

screen shot 2018-02-02 at 8 42 50 pm

I think because they are the "estimated counts" produced by RSEM. Source: https://www.biostars.org/p/253526/

arrayprofile commented 6 years ago

Thank you! Is it possible to get real raw counts (integers) using TCGAbiolinks?

torongs82 commented 6 years ago

thanks @tiagochst , yes @arrayprofile you can consider the HT-seq and data harmonized against GRCh38 if you want real raw counts (integers). Please consider our vignette https://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/analysis.html

and the section : HTSeq data: Downstream analysis BRCA

arrayprofile commented 6 years ago

Thanks. The harmonized data has 3 options for workflow.type: HTSeq - Counts, HTSeq - FPKM-UQ and HTSeq - FPKM.

What's the difference between the 3 options?

tiagochst commented 6 years ago

There is a description in this link: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/

arrayprofile commented 6 years ago

Great, thank you!

One question, is RSEM estimated counts already normalized for both seq depth and seq length?

arrayprofile commented 6 years ago

I tried harmonized database:

query <- GDCquery(project = "TCGA-OV", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "HTSeq - Counts") GDCdownload(query, method = "api", files.per.chunk = 10) data <- GDCprepare(query, save=T, save.filename="TCGA.OV.raw.RData", remove.files.prepared=T)

count.raw<-assay(data)

but now I get ~50000 rows, instead of ~20000 rows when I used legacy database, and the first 2 rows are: TCGA-24-0982-01A-01R-1565-13 TCGA-24-1567-01A-01R-1566-13 ENSG00000000003 1965 8746 ENSG00000000005 14 7 ENSG00000000419 4023 5091 TCGA-25-1320-01A-01R-1565-13 TCGA-09-2045-01A-01R-1568-13 ENSG00000000003 6883 2174 ENSG00000000005 2 0 ENSG00000000419 4050 1158 TCGA-WR-A838-01A-12R-A406-31 ENSG00000000003 6136 ENSG00000000005 8 ENSG00000000419 2954

Is that because the raw counts from harmonized database is not on gene level? How do I go from here to get gene-level counts? Thank you!

arrayprofile commented 6 years ago

what does "ENSG00000000003" represent? Can anyone help? Thanks!

arrayprofile commented 6 years ago

I figured out "ENSG00000000003" is ensemble ID, rowData() will extract the gene name. But harmonized dataset have 55388 unique genes, while legacy dataset have 19947 unique genes! Why is there such a huge difference just because of diffreent genome annotation?

cccfran commented 5 years ago

Any followup on the difference of number of genes between legacy and harmonized dataset? Similarly for TCGA-BRCA, legacy dataset returns 20531 genes while the harmonized returns 56537 genes for workflow.type = "HTSeq - Counts". Thanks.