BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

geneInfoHT table used by TCGAanalyze_Normalization is missing information for many genes #492

Closed. HelenePetersen closed this issue 2 years ago.

HelenePetersen commented 2 years ago

Hi, I have experienced a problem with the TCGAanalyze_Normalization function and the geneInfoHT table. I am trying to download, prepare, and normalize BRCA RNA-Seq data from TCGA, so I am running the following:

#download and prepare data
dataClin <- GDCquery_clinic("TCGA-BRCA", type = "clinical")
#Only include female samples
female_barcodes <- dataClin$submitter_id[which(dataClin$gender == "female")]

query.exp <- GDCquery("TCGA-BRCA",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",
                      sample.type = c("Primary solid Tumor","Solid Tissue Normal"),
                      legacy = FALSE,
                      barcode = female_barcodes)

#Un-comment if the data has not yet been downloaded
#GDCdownload(query.exp, directory = "GDCdatas")

#data prepared into a summarizedExperiment
exp.data <- GDCprepare(query = query.exp, directory = "GDCdatas")

######################################################################
#Preprocessing
dataPrep_BRCA <- TCGAanalyze_Preprocessing(
  object = exp.data,
  cor.cut = 0.6
)

##################################################################
#Normalization with GC-content
dataNorm.gc_BRCA <- TCGAanalyze_Normalization(
  tabDF = dataPrep_BRCA,
  geneInfo = geneInfoHT,
  method = "gcContent"
)

The output after normalization contains far fewer genes than I would expect:

#Investigate which genes have been removed
gene.dif.GC <- setdiff(rownames(dataPrep_BRCA),rownames(dataNorm.gc_BRCA))
write.table(gene.dif.GC, file = "Removed.genes.GC.BRCA.txt")
write.table(geneInfoHT, file = "geneInfoHT.txt")
$ wc Removed.genes.GC.BRCA.txt
33229  66457 852826 Removed.genes.GC.BRCA.txt

33228 genes are removed out of 56380 genes in the original dataset. A similar issue is also reported in #164.
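
For reference, the same counts can be checked directly in R (assuming the objects from the code above are still in the workspace):

#quick in-R check of the same numbers reported by wc above
length(gene.dif.GC)    # 33228 genes removed by the normalization step
nrow(dataPrep_BRCA)    # 56380 genes in the original prepared matrix
nrow(dataNorm.gc_BRCA) # genes remaining after GC-content normalization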

Investigating the issue, it looks like this happens because information is missing for many of the ENSEMBL IDs in the geneInfoHT table, since the table only contains information for 23486 genes:

$ wc geneInfoHT.txt
23487  70460 953160 geneInfoHT.txt
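
The overlap can also be checked directly against the table shipped with the package (assuming geneInfoHT here is the lazy-loaded TCGAbiolinks data object used in the normalization call above):

#how many genes of the prepared matrix actually have length/GC annotation
nrow(geneInfoHT)                                        # 23486 annotated genes
sum(rownames(dataPrep_BRCA) %in% rownames(geneInfoHT))  # genes that can be normalized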

As a solution, I have created a new table by calculating GC content and gene length values for all ENSEMBL IDs in biomart using the getGeneLengthAndGCContent() function from EDASeq. This produces a complete drop-in replacement for the existing geneInfoHT table used by TCGAanalyze_Normalization, so that no genes are removed when performing GC content normalization.

We used the following code to obtain the complete table (approximately 8 hours of running time):

library(EDASeq)
library(biomaRt)
#get ENSEMBL gene IDs for hg38
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
biomart_getID <- getBM(attributes = c("ensembl_gene_id"), mart = ensembl)
#get gene length and GC content for all IDs (matrix with ENSEMBL IDs as row names)
getdata <- getGeneLengthAndGCContent(biomart_getID$ensembl_gene_id, org = "hsa", mode = "biomart")
saveRDS(getdata, file = "getGLGC_download.RDS")
save(getdata, file = "getGLGC_download.rda")
#save the output as a data frame with the column names expected by TCGAanalyze_Normalization
geneInfoHT <- data.frame(geneLength = getdata[, 1],
                         gcContent = getdata[, 2])
#save the final table
save(geneInfoHT, file = "geneInfoHT.rda")
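
As a sketch (the file name is simply the one saved above), the rebuilt table can then be passed to the normalization call in place of the packaged object:

#load the rebuilt table and re-run the GC-content normalization with it
load("geneInfoHT.rda")
dataNorm.gc_BRCA.full <- TCGAanalyze_Normalization(
  tabDF = dataPrep_BRCA,
  geneInfo = geneInfoHT,
  method = "gcContent"
)
#with the complete table, no genes should be dropped for lack of annotation
length(setdiff(rownames(dataPrep_BRCA), rownames(dataNorm.gc_BRCA.full)))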

If this looks fine to you, I can open a PR to replace the current geneInfoHT table with this one. Please let me know, thank you!

ryanhchung commented 2 years ago

Hi Helene,

I applied your suggestion and got an error at this line:

getdata <- getGeneLengthAndGCContent(biomart_getID$ensembl_gene_id, org = "hsa", mode = c("biomart"))

which shows:

Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: [www.ensembl.org:80] Operation timed out after 300000 milliseconds with 240556795 out of -1 bytes received

Did you also face the same error? If so, how did you deal with it?

Thanks!

HelenePetersen commented 2 years ago

We have not experienced this issue; it looks like it has something to do with the connection to biomart.
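
One thing that might help (untested on our side, purely a suggestion) is pointing biomaRt at an Ensembl mirror before re-running the download, in case the main www.ensembl.org site is the part that is timing out:

library(biomaRt)
#connect through a regional Ensembl mirror instead of the main site
ensembl <- useEnsembl(biomart = "ensembl",
                      dataset = "hsapiens_gene_ensembl",
                      mirror = "useast")
biomart_getID <- getBM(attributes = c("ensembl_gene_id"), mart = ensembl)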

tiagochst commented 2 years ago

@HelenePetersen Thank you so much for this. I will work on updating it.

tiagochst commented 2 years ago

@HelenePetersen By any chance, could you send the data?

tiagochst commented 2 years ago

@HelenePetersen The data should be fixed now.