BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

[HELP] How to match hg38 ENSG code with gene name after DEA ? #400

Closed ghost closed 4 years ago

ghost commented 4 years ago

Hi everybody, I'm stuck on an issue with gene names. I downloaded data mapped to hg38, and at the end of the DEA I get only ENSG codes instead of gene names. I would like to match the two in the data frame of resulting DEGs, both to interpret them more easily and to plot them in a readable way. Can somebody help me? Here is some code as an example:

# query for NT & TP PRAD expr:
PRAD.query <- GDCquery(project = "TCGA-PRAD",
                       data.category = "Transcriptome Profiling",
                       data.type = "Gene Expression Quantification",
                       workflow.type = "HTSeq - Counts",
                       barcode = PRAD.barcodes,
                       sample.type = c("Solid Tissue Normal", "Primary Tumor"))
# Download:
GDCdownload(PRAD.query)
# Prepare data:
PRAD.final <- GDCprepare(PRAD.query, 
                         save = TRUE, 
                         summarizedExperiment = TRUE, 
                         save.filename = "PRADfinal.rda")
# normalization of genes
dataNorm <- TCGAanalyze_Normalization(tabDF = PRAD.final, 
                                      geneInfo = geneInfoHT, 
                                      method = "gcContent")
# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25)

# Diff.expr.analysis (DEA) NT vs ERG+
DEG_CTRLvsERG <- TCGAanalyze_DEA(mat1 = dataFilt[,barcodes.NT],
                                 mat2 = dataFilt[,barcodes.TP.ERG],
                                 Cond1type = "Normal",
                                 Cond2type = "TP_ERG+",
                                 fdr.cut = 0.01 ,
                                 logFC.cut = 1,
                                 method = "glmLRT")

After this point, I obtain a data frame with logFC, FDR, etc. as columns and Ensembl IDs as row names, like ENSG00000000005, ENSG00000002726, ENSG00000004468... Is there a way in the TCGAbiolinks package to automate matching these IDs to gene names?

Thank you!!

theroseven commented 4 years ago

The following code may help you get the gene names:

gene.cut <- which(row.names(PRAD.final) %in% row.names(DEG_CTRLvsERG))
DEG_CTRLvsERG$Gene_name <- PRAD.final@rowRanges$external_gene_name[gene.cut]
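One caveat with the lookup above: `which(... %in% ...)` returns positions in the row order of `PRAD.final`, so the names line up with the DEG table only if both objects list the genes in the same order. A minimal sketch of an order-independent alternative uses base R `match()`. The `annotation` and `deg` tables below are toy stand-ins (hypothetical values, not output from the thread), built only so the example runs on its own:

```r
# Toy stand-in for the gene annotation carried by the SummarizedExperiment.
# Assumption: the real object exposes Ensembl IDs and an
# `external_gene_name` column, as used in the answer above.
annotation <- data.frame(
  ensembl_gene_id    = c("ENSG00000000003", "ENSG00000000005",
                         "ENSG00000000419", "ENSG00000000457"),
  external_gene_name = c("TSPAN6", "TNMD", "DPM1", "SCYL3"),
  stringsAsFactors   = FALSE
)

# Toy stand-in for the DEA result: Ensembl IDs as row names,
# deliberately in a different order from the annotation table.
deg <- data.frame(
  logFC     = c(2.1, -1.7),
  FDR       = c(0.001, 0.004),
  row.names = c("ENSG00000000419", "ENSG00000000005")
)

# match() returns, for each DEG row, the position of its ID in the
# annotation table (NA if the ID is absent), so names stay aligned
# with the DEG rows regardless of ordering.
idx <- match(rownames(deg), annotation$ensembl_gene_id)
deg$Gene_name <- annotation$external_gene_name[idx]

print(deg)
```

With the real objects, the same pattern would presumably read `match(rownames(DEG_CTRLvsERG), rownames(PRAD.final))` followed by indexing `SummarizedExperiment::rowData(PRAD.final)$external_gene_name`, assuming the row names of both objects are the same unversioned Ensembl IDs.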

ghost commented 4 years ago

@theroseven It's working! Thank you!