BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
295 stars 112 forks source link

*URGENT* Gene ID totally missing in downloaded expression data.. A bug or a problem with GDC data? #454

Closed ghost closed 3 years ago

ghost commented 3 years ago

Hello everybody. I am running some analyses in these days on different types of tumors (BRCA, COAD) and I noticed that when I download gene expression data I have always the same gene that is missing in the gene IDs, which is UHRF1. EDIT 12.03.21: The gene is missing only when using legacy data (hg19), with GCh38 the gene is present.

I used this code:

query.exp <- GDCquery(project = "TCGA-BRCA",
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "results",
                      experimental.strategy = "RNA-Seq",
                      sample.type = c("Solid Tissue Normal", "Primary Tumor"))

# Collect the barcodes of only TP samples
dataTP <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"TP")

# Collect the barcodes of only NT samples
dataNT <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"NT")

# Filter:
dataNT.filt <- dataNT[1:50] # keep only 50 normals
dataTP.filt <- dataTP[1:50]# keep only 50 tumors

exp.barcodes <- c(dataNT.filt, dataTP.filt)

query.exp <- GDCquery(project = TCGAprj,
                      legacy = TRUE,
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type = "results",
                      barcode = exp.barcodes,
                      experimental.strategy = "RNA-Seq")

GDCdownload(query.exp)

BRCA.exp <- GDCprepare(query.exp, 
                       save = TRUE, 
                       summarizedExperiment = TRUE, 
                       save.filename = "BRCAexp.rda")

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
BRCAMatrix <- assay(BRCA.exp,"raw_count")

If I inspect BRCAMatrix searching for UHRF1 gene, there is no trace of it (it should be named: "UHRF1|29128" ).

If I perform normalization and then try to search for UHRF1 gene I get this:

dataNorm <- TCGAanalyze_Normalization(tabDF = BRCA.exp,
                                      geneInfo = geneInfo,
                                      method = "gcContent") 

> dataNorm["UHRF1", 1]
> Error in dataNorm["UHRF1", 1] : subscript out of bounds

I should expect the corresponding value after normalization but there is no trace of this gene. I also searched for UHRF1 aliases (I.e., ICBP90, RNF106, TDRD22, HUHRF1, hUHRF1, Np95, NP95, HuNp95, HNP95, HNp95...) but I wasn't able to find it. I don't know if this is a bug in the code or a problem with GDC, but I am also thinking if there are other genes that are missing... Can somebody help me? Maybe I am doing something wrong or there is another "name" or ID that I should search for.

Thank you!!