Hello everybody. I am running some analyses in these days on different types of tumors (BRCA, COAD) and I noticed that when I download gene expression data I have always the same gene that is missing in the gene IDs, which is UHRF1.
EDIT 12.03.21: The gene is missing only when using legacy data (hg19), with GCh38 the gene is present.
I used this code:
query.exp <- GDCquery(project = "TCGA-BRCA",
legacy = TRUE,
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "results",
experimental.strategy = "RNA-Seq",
sample.type = c("Solid Tissue Normal", "Primary Tumor"))
# Collect the barcodes of only TP samples
dataTP <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"TP")
# Collect the barcodes of only NT samples
dataNT <- TCGAquery_SampleTypes(getResults(query.exp,cols="cases"),"NT")
# Filter:
dataNT.filt <- dataNT[1:50] # keep only 50 normals
dataTP.filt <- dataTP[1:50]# keep only 50 tumors
exp.barcodes <- c(dataNT.filt, dataTP.filt)
query.exp <- GDCquery(project = TCGAprj,
legacy = TRUE,
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "results",
barcode = exp.barcodes,
experimental.strategy = "RNA-Seq")
GDCdownload(query.exp)
BRCA.exp <- GDCprepare(query.exp,
save = TRUE,
summarizedExperiment = TRUE,
save.filename = "BRCAexp.rda")
# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
BRCAMatrix <- assay(BRCA.exp,"raw_count")
If I inspect BRCAMatrix searching for UHRF1 gene, there is no trace of it (it should be named: "UHRF1|29128" ).
If I perform normalization and then try to search for UHRF1 gene I get this:
dataNorm <- TCGAanalyze_Normalization(tabDF = BRCA.exp,
geneInfo = geneInfo,
method = "gcContent")
> dataNorm["UHRF1", 1]
> Error in dataNorm["UHRF1", 1] : subscript out of bounds
I should expect the corresponding value after normalization but there is no trace of this gene.
I also searched for UHRF1 aliases (I.e., ICBP90, RNF106, TDRD22, HUHRF1, hUHRF1, Np95, NP95, HuNp95, HNP95, HNp95...) but I wasn't able to find it.
I don't know if this is a bug in the code or a problem with GDC, but I am also thinking if there are other genes that are missing...
Can somebody help me? Maybe I am doing something wrong or there is another "name" or ID that I should search for.
Hello everybody. I am running some analyses in these days on different types of tumors (BRCA, COAD) and I noticed that when I download gene expression data I have always the same gene that is missing in the gene IDs, which is UHRF1. EDIT 12.03.21: The gene is missing only when using legacy data (hg19), with GCh38 the gene is present.
I used this code:
If I inspect BRCAMatrix searching for UHRF1 gene, there is no trace of it (it should be named: "UHRF1|29128" ).
If I perform normalization and then try to search for UHRF1 gene I get this:
I should expect the corresponding value after normalization but there is no trace of this gene. I also searched for UHRF1 aliases (I.e., ICBP90, RNF106, TDRD22, HUHRF1, hUHRF1, Np95, NP95, HuNp95, HNP95, HNp95...) but I wasn't able to find it. I don't know if this is a bug in the code or a problem with GDC, but I am also thinking if there are other genes that are missing... Can somebody help me? Maybe I am doing something wrong or there is another "name" or ID that I should search for.
Thank you!!