TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.

ENSEMBL genes are not being merged together #153

Open gaurav opened 1 year ago

gaurav commented 1 year ago

As noted in https://github.com/NCATSTranslator/Feedback/issues/340, Babel has multiple distinct Ensembl gene identifiers for the TNF gene that are not being merged into a single clique.

Here are the RENCI-dev results: https://nodenormalization-sri.renci.org/1.3/get_normalized_nodes?curie=NCBIGene:7124&CHEMBL1825&curie=ENSEMBL%3AENSG00000230108&curie=ENSEMBL%3AENSG00000228849&conflate=true

gaurav commented 1 year ago

I haven't had a chance to dig deeply into this yet, but it looks like the NCBIGeneENSEMBL concord is built from babel_downloads/NCBIGene/gene2ensembl.gz, which only maps NCBIGene:7124 to ENSG00000232810:

gene2ensembl.gz:9606    7124    ENSG00000232810 NM_000594.4 ENST00000449264.3   NP_000585.2 ENSP00000398698.2
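The same lookup can be scripted instead of grepped. This is a minimal sketch, not Babel's code: the function names are hypothetical, and it assumes the tab-separated NCBI column order visible in the line above (tax_id, GeneID, Ensembl gene identifier, then RNA and protein columns).

```python
import gzip

def ensembl_genes_for(lines, gene_id):
    """Return the set of Ensembl gene IDs listed for one NCBI GeneID.

    `lines` is any iterable of tab-separated gene2ensembl rows.
    Assumed columns: 0 = tax_id, 1 = GeneID, 2 = Ensembl gene identifier.
    """
    found = set()
    for line in lines:
        if line.startswith("#"):  # skip a header/comment line if present
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        if fields[1] == str(gene_id):
            found.add(fields[2])
    return found

def ensembl_genes_in_file(path, gene_id):
    """Same lookup, reading a gzip-compressed file in text mode."""
    with gzip.open(path, "rt") as handle:
        return ensembl_genes_for(handle, gene_id)
```

Calling `ensembl_genes_in_file("babel_downloads/NCBIGene/gene2ensembl.gz", 7124)` would show every Ensembl gene ID that file lists for TNF.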

However, babel_downloads/ENSEMBL/hsapiens_gene_ensembl/BioMart.tsv has many more mappings, including to the identifiers specified above, which appear to all link to HGNC:11892:

havana  TNF ENSP00000389265 DIF CHR_HSCHR6_MHC_APD_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000228978
havana  TNF ENSP00000389265 TNF-alpha   CHR_HSCHR6_MHC_APD_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000228978
havana  TNF ENSP00000389265 TNFA    CHR_HSCHR6_MHC_APD_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000228978
havana  TNF ENSP00000389265 TNFSF2  CHR_HSCHR6_MHC_APD_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000228978
ensembl_havana  TNF ENSP00000365290 DIF CHR_HSCHR6_MHC_COX_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000204490
ensembl_havana  TNF ENSP00000365290 TNF-alpha   CHR_HSCHR6_MHC_COX_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000204490
ensembl_havana  TNF ENSP00000365290 TNFA    CHR_HSCHR6_MHC_COX_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000204490
ensembl_havana  TNF ENSP00000365290 TNFSF2  CHR_HSCHR6_MHC_COX_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000204490
ensembl_havana  TNF ENSP00000389490 DIF CHR_HSCHR6_MHC_MCF_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000223952
ensembl_havana  TNF ENSP00000389490 TNF-alpha   CHR_HSCHR6_MHC_MCF_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000223952
ensembl_havana  TNF ENSP00000389490 TNFA    CHR_HSCHR6_MHC_MCF_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000223952
ensembl_havana  TNF ENSP00000389490 TNFSF2  CHR_HSCHR6_MHC_MCF_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000223952
ensembl_havana  TNF ENSP00000410668 DIF CHR_HSCHR6_MHC_DBB_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000228849
ensembl_havana  TNF ENSP00000410668 TNF-alpha   CHR_HSCHR6_MHC_DBB_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000228849
ensembl_havana  TNF ENSP00000410668 TNFA    CHR_HSCHR6_MHC_DBB_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000228849
ensembl_havana  TNF ENSP00000410668 TNFSF2  CHR_HSCHR6_MHC_DBB_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000228849
ensembl_havana  TNF ENSP00000392858 DIF CHR_HSCHR6_MHC_MANN_CTG1    protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000228321
ensembl_havana  TNF ENSP00000392858 TNF-alpha   CHR_HSCHR6_MHC_MANN_CTG1    protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000228321
ensembl_havana  TNF ENSP00000392858 TNFA    CHR_HSCHR6_MHC_MANN_CTG1    protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000228321
ensembl_havana  TNF ENSP00000392858 TNFSF2  CHR_HSCHR6_MHC_MANN_CTG1    protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000228321
ensembl_havana  TNF ENSP00000389492 DIF CHR_HSCHR6_MHC_SSTO_CTG1    protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000230108
ensembl_havana  TNF ENSP00000389492 TNF-alpha   CHR_HSCHR6_MHC_SSTO_CTG1    protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000230108
ensembl_havana  TNF ENSP00000389492 TNFA    CHR_HSCHR6_MHC_SSTO_CTG1    protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000230108
ensembl_havana  TNF ENSP00000389492 TNFSF2  CHR_HSCHR6_MHC_SSTO_CTG1    protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000230108
ensembl_havana  TNF ENSP00000372988 DIF CHR_HSCHR6_MHC_QBL_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000206439
ensembl_havana  TNF ENSP00000372988 TNF-alpha   CHR_HSCHR6_MHC_QBL_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000206439
ensembl_havana  TNF ENSP00000372988 TNFA    CHR_HSCHR6_MHC_QBL_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000206439
ensembl_havana  TNF ENSP00000372988 TNFSF2  CHR_HSCHR6_MHC_QBL_CTG1 protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000206439
ensembl_havana  TNF ENSP00000398698 DIF 6   protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000232810
ensembl_havana  TNF ENSP00000398698 TNF-alpha   6   protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000232810
ensembl_havana  TNF ENSP00000398698 TNFA    6   protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000232810
ensembl_havana  TNF ENSP00000398698 TNFSF2  6   protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000232810
ensembl_havana  TNF ENSP00000514308 DIF 6   protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000232810
ensembl_havana  TNF ENSP00000514308 TNF-alpha   6   protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000232810
ensembl_havana  TNF ENSP00000514308 TNFA    6   protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000232810
ensembl_havana  TNF ENSP00000514308 TNFSF2  6   protein_coding  HGNC Symbol 7124.0  tumor necrosis factor [Source:HGNC Symbol;Acc:HGNC:11892]   ENSG00000232810

So those might be the missing mappings we need to include in the NCBIGeneENSEMBL concord.
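A rough sketch of how those pairs could be pulled out of BioMart.tsv follows. It is not the existing Babel code. It assumes the file is tab-separated and that, as in the rows above, the NCBI gene ID is the eighth field and the Ensembl gene ID is the tenth; both indices are parameters because the real column order has not been checked. The function name is hypothetical.

```python
import csv

def biomart_pairs(lines, ncbi_col=7, ensembl_col=9):
    """Collect unique (NCBIGene CURIE, ENSEMBL CURIE) pairs from BioMart rows.

    Column indices are assumptions inferred from the sample rows.
    Rows whose NCBI field is not numeric (header, blank) are skipped.
    """
    pairs = set()
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= max(ncbi_col, ensembl_col):
            continue
        try:
            # BioMart.tsv stores the ID as a float string such as "7124.0".
            gene_id = int(float(row[ncbi_col]))
        except (ValueError, OverflowError):
            continue
        ensembl = row[ensembl_col].strip()
        if ensembl:
            pairs.add((f"NCBIGene:{gene_id}", f"ENSEMBL:{ensembl}"))
    return pairs
```

Because the result is a set, the four synonym rows per Ensembl gene (DIF, TNF-alpha, TNFA, TNFSF2) collapse into a single pair.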

gaurav commented 1 year ago

It looks like there is some code in Babel for generating an ENSEMBL concord:

https://github.com/TranslatorSRI/Babel/blob/f3748b881082f7f573409e8e75822cd02b6becb5/src/snakefiles/gene.snakefile#L69-L75

https://github.com/TranslatorSRI/Babel/blob/f3748b881082f7f573409e8e75822cd02b6becb5/src/createcompendia/gene.py#L25-L74

However, this code is not currently being run, because gene_concords doesn't include "ENSEMBL":

https://github.com/TranslatorSRI/Babel/blob/f3748b881082f7f573409e8e75822cd02b6becb5/config.json#L19

@cbizon Do you know if this was deactivated deliberately? I'm currently trying to re-run Babel after adding "ENSEMBL" to the list of concords to see if the concord can be generated correctly and if it includes the mappings we're looking for.

cbizon commented 1 year ago

I seem to recall that the ensembl mappings led to some very unpleasant merges. We'll want to be careful with them.
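One check that could be run before enabling the concord is sketched below. It rests on an assumption the thread does not confirm: that the bad merges come from a single ENSEMBL ID being linked to more than one NCBIGene ID, which would pull otherwise distinct genes into one clique. The function name is hypothetical, and the input is a collection of (NCBIGene CURIE, ENSEMBL CURIE) pairs.

```python
from collections import defaultdict

def ensembl_with_multiple_genes(pairs):
    """Map each ENSEMBL ID linked to more than one NCBIGene ID to those IDs.

    One NCBIGene ID with many ENSEMBL IDs (the TNF case) is not flagged;
    only the reverse direction, which would merge distinct genes, is.
    """
    genes_by_ensembl = defaultdict(set)
    for ncbi, ensembl in pairs:
        genes_by_ensembl[ensembl].add(ncbi)
    return {
        ensembl: sorted(genes)
        for ensembl, genes in genes_by_ensembl.items()
        if len(genes) > 1
    }
```

An empty result would not prove the concord is safe, but any entries it does return are candidates to review by hand before they reach the compendium.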