YuLab-SMU / clusterProfiler

:bar_chart: A universal enrichment tool for interpreting omics data
https://yulab-smu.top/biomedical-knowledge-mining-book/
1.02k stars 256 forks source link

clusterProfiler bitr()与bioMart getBM()转换得到的gene list不同 #531

Open Youpu-Chen opened 1 year ago

Youpu-Chen commented 1 year ago

余老师您好,我在使用clusterProfiler和bioMart进行Entrez gene ID转Ensembl ID时发现一个问题,即clusterProfiler bitr()与bioMart getBM()转换得到的gene list不同

workflow如下,

数据来源:GAD database

以下脚本为提取GAD数据中的Entrez ID,

# GAD数据处理
gene.attri.edge <- read.delim('GAD High Level Gene-Disease Associations/gene_attribute_edges.txt', sep = '\t', header = T)
names(gene.attri.edge) <- c('GeneSym', 'Unknown', 'GeneID', 'Disease', 'DiseaseClass', 'Unknown', 'weight')
gene.attri.edge <- gene.attri.edge[-1,]
gene.attri.edge$weight = round(as.numeric(gene.attri.edge$weight), 0)
unique(gene.attri.edge$Disease)

table(gene.attri.edge$weight == 1)
length(unique(gene.attri.edge$GeneID))  # 8016 for high-level

gene.id <- data.frame('Gene.id' = unique(gene.attri.edge$GeneID))
write.table(gene.id, 'geneid.highlevel.txt', quote = F, sep = '\t', row.names = F, col.names = T)

提取与GWAS catalog中Ensembl ID进行匹配,

rm(list = ls())
options(stringsAsFactors = F)

# load dataset
gwas.catalog.df <- read.delim('gwas_catalog_v1.0.2-associations_e108_r2023-01-14.tsv', sep = '\t', header = T, comment.char = "#", quote = "")
gad.gene.list <- read.delim('../GAD/geneid.highlevel.txt', sep = '\t', header = T)

# load pakages
# install.packages('tidyverse')
library(tidyverse)
confirm.ensembl.geneid <- gwas.catalog.df$SNP_GENE_IDS
confirm.ensembl.geneid <- confirm.ensembl.geneid[ confirm.ensembl.geneid != '' ]

confirm.ensembl.geneid.df <- data.frame(
  'id' = confirm.ensembl.geneid
)

seperate.confirm.ensembl.geneid.df <- confirm.ensembl.geneid.df %>% separate_rows(id, sep = ',')
unique.seperate.confirm.ensembl.geneid.df <- unique(seperate.confirm.ensembl.geneid.df)

### Entrez to Ensembl
# convert.1
entrez2ensembl.1 = bitr(gad.gene.list$Gene.id, fromType="ENTREZID", toType="ENSEMBL", OrgDb="org.Hs.eg.db")

# convert.2
entrez2ensembl.2 <- getBM(attributes = c("entrezgene_id", "ensembl_gene_id"),
                          filters = "entrezgene_id",
                          values = gad.gene.list$Gene.id,
                          mart = ensembl.geneid)
d.entrez2ensembl.2 <- entrez2ensembl.1[duplicated(entrez2ensembl.1$entrezgene_id), ]
### conbine
names(entrez2ensembl.1) <- c("entrezgene_id", "ensembl_gene_id")
# names(d.entrez2ensembl.2)
c.df <- merge(entrez2ensembl.1, d.entrez2ensembl.2, by='entrezgene_id', all=F)
names(c.df)
table(c.df$ensembl_gene_id.x == c.df$ensembl_gene_id.y)

最终交集为1592(GAD数据集为8000个gene)

huerqiang commented 1 year ago

clusterProfiler进行人类Entrez id和Ensembl ID转换所用的数据来源于org.Hs.eg.db包,我试了一下8016个gene有7943个可以成功转换,结果可靠。