Open arrayprofile opened 6 years ago
Hi @arrayprofile, thank you for using TCGAbiolinks. If you set data.type to "Gene expression quantification" and file.type to "results", you are using Level 3 expression data, which uses MapSplice (Wang et al., 2010) for the alignment and RSEM (Li et al., 2010) for the quantification.
The digit after the gene symbol is the Entrez Gene ID. For example, see https://www.ncbi.nlm.nih.gov/gene/?term=2%5Buid%5D
Hello,
A2M|2 means: Gene Symbol| Entrez Gene ID
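As a minimal sketch of how such identifiers can be separated in base R: the three row names below are a toy subset chosen for illustration, and the object names (`ids`, `gene_info`) are made up for this example rather than taken from TCGAbiolinks.

```r
# Toy row names in the "Symbol|EntrezID" form used by the legacy files.
ids <- c("A2M|2", "A1BG|1", "TP53|7157")

# Split each identifier on the literal "|" character (fixed = TRUE avoids
# treating "|" as a regular-expression alternation).
parts <- strsplit(ids, "|", fixed = TRUE)

# Collect the two pieces into a data frame: symbol and Entrez Gene ID.
gene_info <- data.frame(
  symbol    = vapply(parts, `[`, character(1), 1),
  entrez_id = as.integer(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
print(gene_info)
```

With real data, `ids` would be something like `rownames(assay(data))`.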
Also, I checked the files, and the raw counts are not integers.
I think this is because they are the "estimated counts" produced by RSEM. Source: https://www.biostars.org/p/253526/
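A quick way to confirm this on a count vector is to test for fractional parts. The sketch below uses invented values, not real TCGA data; RSEM distributes ambiguously mapped reads across candidate genes, so fractional totals are expected.

```r
# Invented values standing in for one sample's RSEM estimated counts.
est_counts <- c(A2M = 15234.37, A1BG = 120.00, TP53 = 2045.81)

# TRUE where a value has no fractional part.
is_whole <- est_counts == round(est_counts)
print(is_whole)

# TRUE if at least one value is fractional, which is what one would
# expect for RSEM estimated counts.
any_fractional <- any(!is_whole)
print(any_fractional)
```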
Thank you! Is it possible to get real raw counts (integers) using TCGAbiolinks?
Thanks, @tiagochst. Yes, @arrayprofile, you can use the HTSeq data harmonized against GRCh38 if you want real raw counts (integers). Please see our vignette https://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/analysis.html
and the section "HTSeq data: Downstream analysis BRCA".
Thanks. The harmonized data has 3 options for workflow.type: HTSeq - Counts, HTSeq - FPKM-UQ and HTSeq - FPKM.
What's the difference between the 3 options?
There is a description in this link: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
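As a rough illustration of how the three relate, the sketch below computes FPKM and FPKM-UQ from raw counts using the formulas as I recall them from the GDC page linked above (please verify against that page; the exact gene set used for the denominators may differ between GDC releases). All counts and gene lengths are invented, and every toy gene is treated as protein-coding for simplicity.

```r
# Invented raw counts and gene lengths (in bp) for a single sample.
counts  <- c(geneA = 100, geneB = 400, geneC = 1000, geneD = 0)
lengths <- c(geneA = 1000, geneB = 2000, geneC = 5000, geneD = 1500)

# Denominators: total reads over (toy) protein-coding genes, and the
# 75th percentile of the per-gene read counts in the sample.
total_pc <- sum(counts)
uq       <- unname(quantile(counts, 0.75))

# FPKM    = count * 1e9 / (total protein-coding reads * gene length)
# FPKM-UQ = count * 1e9 / (upper-quartile count       * gene length)
fpkm    <- counts * 1e9 / (total_pc * lengths)
fpkm_uq <- counts * 1e9 / (uq * lengths)

print(round(cbind(counts, fpkm, fpkm_uq), 2))
```

"HTSeq - Counts" corresponds to the unnormalized `counts` column; the other two are length- and depth-scaled versions of it.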
Great, thank you!
One question: are RSEM estimated counts already normalized for both sequencing depth and sequence length?
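I tried the harmonized database:
# Packages providing GDCquery/GDCdownload/GDCprepare and assay()
library(TCGAbiolinks)
library(SummarizedExperiment)
# Query, download, and prepare the HTSeq raw counts for TCGA-OV
query <- GDCquery(project = "TCGA-OV",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts")
GDCdownload(query, method = "api", files.per.chunk = 10)
data <- GDCprepare(query, save = TRUE, save.filename = "TCGA.OV.raw.RData", remove.files.prepared = TRUE)
# Extract the count matrix (genes x samples)
count.raw <- assay(data)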
but now I get ~50000 rows, instead of the ~20000 rows I got with the legacy database, and the first 3 rows are:
                TCGA-24-0982-01A-01R-1565-13 TCGA-24-1567-01A-01R-1566-13
ENSG00000000003                         1965                         8746
ENSG00000000005                           14                            7
ENSG00000000419                         4023                         5091
                TCGA-25-1320-01A-01R-1565-13 TCGA-09-2045-01A-01R-1568-13
ENSG00000000003                         6883                         2174
ENSG00000000005                            2                            0
ENSG00000000419                         4050                         1158
                TCGA-WR-A838-01A-12R-A406-31
ENSG00000000003                         6136
ENSG00000000005                            8
ENSG00000000419                         2954
Is that because the raw counts from the harmonized database are not at the gene level? How do I go from here to get gene-level counts? Thank you!
what does "ENSG00000000003" represent? Can anyone help? Thanks!
I figured out that "ENSG00000000003" is an Ensembl ID, and rowData() will extract the gene name. But the harmonized dataset has 55388 unique genes, while the legacy dataset has 19947 unique genes! Why is there such a huge difference just because of a different genome annotation?
Any follow-up on the difference in the number of genes between the legacy and harmonized datasets? Similarly, for TCGA-BRCA, the legacy dataset returns 20531 genes, while the harmonized one returns 56537 genes for workflow.type = "HTSeq - Counts". Thanks.
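One likely contributor, offered here as an assumption rather than something confirmed in this thread, is that the harmonized annotation also lists non-coding genes and pseudogenes, while the legacy files mostly list genes with an Entrez ID. The base R sketch below shows how one might tabulate gene types and keep only protein-coding rows. The annotation table is a toy stand-in for what rowData() returns; its column names (`ensembl_gene_id`, `external_gene_name`, `gene_type`) and the simplified biotype labels are hypothetical and should be checked against the real object. Counts for the first three genes are taken from the post above; the rest are invented.

```r
# Toy stand-in for the gene annotation (column names are assumptions).
annot <- data.frame(
  ensembl_gene_id    = c("ENSG00000000003", "ENSG00000000005",
                         "ENSG00000000419", "ENSG00000223972",
                         "ENSG00000227232"),
  external_gene_name = c("TSPAN6", "TNMD", "DPM1", "DDX11L1", "WASH7P"),
  gene_type          = c("protein_coding", "protein_coding",
                         "protein_coding", "pseudogene", "pseudogene"),
  stringsAsFactors = FALSE
)

# Toy count matrix with Ensembl IDs as row names (filled column by column).
counts <- matrix(c(1965, 14, 4023, 3, 57,
                   8746,  7, 5091, 0, 80),
                 ncol = 2,
                 dimnames = list(annot$ensembl_gene_id,
                                 c("sample1", "sample2")))

# How many genes of each type does the annotation contain?
print(table(annot$gene_type))

# Keep protein-coding genes only and label the rows by gene symbol.
keep <- annot$gene_type == "protein_coding"
coding_counts <- counts[keep, , drop = FALSE]
rownames(coding_counts) <- annot$external_gene_name[keep]
print(coding_counts)
```

With the real object, `annot` would come from `as.data.frame(rowData(data))` and `counts` from `assay(data)`.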
Hi, I downloaded ovarian cancer RNA-seq data, but found that not all of the raw count data are integers. Can someone explain why?
Gene "A2M" has non-integer raw counts, why?
Also what does the digit after the gene symbol mean (e.g., "2" in A2M|2)?