RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Monoclonal antibodies all lumped together #1211

Closed dkoslicki closed 1 year ago

dkoslicki commented 3 years ago

Check out the KG for this: https://arax.ncats.io/?r=666

The SME says they shouldn't all be lumped together just because they're antibodies.

saramsey commented 3 years ago

Can you paste a screencap in the issue? I don't see any results in that linked JSON

Screen Shot 2021-02-02 at 10 32 27 AM

edeutsch commented 3 years ago

I'm guessing the objection is to this?

[image]

dkoslicki commented 3 years ago

@saramsey I didn't resultify(), just looked at the KG (as Eric mentioned)

saramsey commented 3 years ago

In KG2.5.1, the node UniProtKB:P01857 looks fine to me:

Screen Shot 2021-02-02 at 2 11 16 PM

I wonder if this is a KG2c issue or synonymization issue?

dkoslicki commented 3 years ago

Yes, this is probably a KG2C issue

amykglen commented 3 years ago

this is a weird one - I think there's another instance of #1074 going on here.

for example, synonymizer.get_canonical_curies("UMLS:C0966225") results in:

{
  "UMLS:C0966225":{
    "preferred_curie":"UniProtKB:P01857",
    "preferred_name":"IGHG1",
    "preferred_type":"protein"
  }
}

but when you look up the equivalent curies for that preferred curie (synonymizer.get_equivalent_nodes("UniProtKB:P01857")), UMLS:C0966225 doesn't appear in them:

{
  "UniProtKB:P01857":{
    "HGNC:5525":"KG2",
    "ORPHANET:122589":"KG2",
    "PR:P01857":"KG2",
    "PR:000008959":"KG2",
    "UniProtKB:P01857":"KG1,KG2",
    "ENSEMBL:ENSG00000211896":"KG2",
    "ENSEMBL:ENSG00000277633":"KG2",
    "NCBIGene:3500":"KG2"
  }
}

this shouldn't be possible, in my understanding of the synonymizer.

the same thing is true for most of the other concepts listed in the all_names in the screenshot @edeutsch posted (they map to a preferred curie of UniProtKB:P01857, but they don't appear in UniProtKB:P01857's equivalent curies, according to the synonymizer). interestingly, the curies that the synonymizer does return as equivalent curies for UniProtKB:P01857 all appear to be good things to merge:

Screen Shot 2021-02-03 at 10 11 58 AM

but the synonymizer is lying, in a sense, about what its equivalent curies really are for this concept. :)

all of these curies map to the preferred curie UniProtKB:P01857, but don't appear in its equivalent curies:

["UMLS:C0966225", "ATC:R03DX05", "UMLS:C0728747", "ATC:L01XC03", "UMLS:C0393022", "ATC:L01XC02", "UMLS:C0676831", "ATC:L04AC02", "UMLS:C0663182", "ATC:L04AC01", "UMLS:C1122087", "ATC:L04AB04", "UMLS:C0666743", "ATC:L04AB02", "UMLS:C5201282", "CPT:80145", "UMLS:C5201962", "CPT:80230", "UMLS:C0879399", "DRUGBANK:DB00081", "DRUGBANK:DB00051", "DRUGBANK:DB00043", "DRUGBANK:DB00072", "DRUGBANK:DB00074", "DRUGBANK:DB00073", "DRUGBANK:DB00065", "DRUGBANK:DB00111", "LOINC:LP173592-9", "LOINC:LP220253-1", "LOINC:LP35095-6", "LOINC:MTHU047400", "LOINC:MTHU060561", "LOINC:MTHU061217", "UMLS:C0295415", "UMLS:C1314901", "MESH:D000069283", "MESH:D000069285", "MESH:D000069444", "UMLS:C3887777", "UMLS:C4287770", "UMLS:C4308993", "UMLS:C0281549", "UMLS:C0382306", "MESH:D000077561", "UMLS:C0757163", "MESH:D000077552", "UMLS:C4764375", "UMLS:C3886461", "UMLS:C0910794", "UMLS:C4704670", "UMLS:C4048611", "UMLS:C4704669", "MESH:D000068879", "UMLS:C4764377", "MESH:D000068878", "UMLS:C4764398", "UMLS:C4087073", "UMLS:C4688627", "UMLS:C4732952", "UMLS:C4048259", "UMLS:C4048257", "UMLS:C4048258", "UMLS:C4727807", "UMLS:C4727806", "UMLS:C4732958", "UMLS:C5207089", "UMLS:C1522155", "UMLS:C4287794", "UMLS:C3887778", "UMLS:C4745344", "UMLS:C4331946", "UMLS:C4331945", "UMLS:C3467755", "UMLS:C1522477", "NCIT:C65216", "NCIT:C52186", "UMLS:C4721708", "NCIT:C29299", "NCIT:C1569", "NCIT:C1647", "NCIT:C1702", "NCIT:C1789", "NCIT:C2543", "NDDF:001331", "NDDF:007640", "NDDF:009871", "NDDF:007236", "PDQ:CDR0000037818", "PDQ:CDR0000042006", "PDQ:CDR0000038698", "PDQ:CDR0000042265", "PDQ:CDR0000699067", "PDQ:CDR0000459956", "PDQ:CDR0000042613", "PDQ:CDR0000791684", "RXNORM:196102", "RXNORM:263010", "RXNORM:302379", "RXNORM:190353", "RXNORM:224905", "RXNORM:191831", "RXNORM:121191", "RXNORM:327361", "SNOMED:407318006", "SNOMED:407317001", "SNOMED:386977009", "SNOMED:386978004", "SNOMED:386891004", "SNOMED:386919002", "SNOMED:387003001", "SNOMED:406443008", "VANDF:4021140", "VANDF:4021126", "VANDF:4021113", 
"VANDF:4021104", "VANDF:4021402", "VANDF:4021399", "VANDF:4021370", "VANDF:4021083", "CHEBI:63583", "CHEBI:64357", "CHEMBL.COMPOUND:CHEMBL1201576", "CHEMBL.COMPOUND:CHEMBL1201585", "CHEMBL.COMPOUND:CHEMBL1201605", "CHEMBL.COMPOUND:CHEMBL1201581", "CHEMBL.COMPOUND:CHEMBL1201439", "CHEMBL.COMPOUND:CHEMBL1201580", "CHEMBL.COMPOUND:CHEMBL1201589", "CHEMBL.COMPOUND:CHEMBL1201604", "PathWhiz.ProteinComplex:826"]
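
The mismatch described above can be checked mechanically. Here is a minimal sketch, assuming the two synonymizer responses have the dict shapes shown earlier in this comment. `find_unlisted_members` is a hypothetical helper, not part of the synonymizer, and the `HGNC:5525` entry in `canonical_info` is assumed here only to give a consistent case for contrast:

```python
def find_unlisted_members(canonical_info, equivalent_nodes):
    """Return curies that map to a preferred curie whose own
    equivalent-curie cluster does not list them."""
    unlisted = []
    for curie, info in canonical_info.items():
        if not info:
            continue  # curie not recognized; nothing to compare
        preferred = info["preferred_curie"]
        cluster = equivalent_nodes.get(preferred) or {}
        if curie not in cluster:
            unlisted.append(curie)
    return sorted(unlisted)


# Shapes mirror the get_canonical_curies / get_equivalent_nodes output above.
canonical_info = {
    "UMLS:C0966225": {"preferred_curie": "UniProtKB:P01857"},
    "HGNC:5525": {"preferred_curie": "UniProtKB:P01857"},  # assumed for contrast
}
equivalent_nodes = {
    "UniProtKB:P01857": {
        "HGNC:5525": "KG2",
        "UniProtKB:P01857": "KG1,KG2",
    }
}

print(find_unlisted_members(canonical_info, equivalent_nodes))
```

With this sample data only `UMLS:C0966225` is reported, which is the inconsistency described in this comment.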

amykglen commented 3 years ago

not fixed with KG2.6.7.1c: https://arax.ncats.io/?term=Trastuzumab

rtroper commented 3 years ago

Hi All - I'm wondering how actively you're working on this issue. This is important for Workflow C. We'd like to be able to look at PMIDs on NGD edges as part of the demo. Here's an example of some results from query C.1: https://arax.ncats.io/?r=12173. Interestingly, the PMIDs for natalizumab and ocrelizumab link to legitimate research results. But trastuzumab links to papers on rituximab, adalimumab, and daclizumab. Not sure why some are impacted and others are not.

edeutsch commented 3 years ago

err, ahh, not very actively, I suppose. Until you reminded us!

@amykglen @saramsey I'm seeing this:

$ grep -i UNIPROTKB:P01857 kg2_equivalencies.tsv
DRUGBANK:DB00074    UniProtKB:P01857
DRUGBANK:DB00051    UniProtKB:P01857
REACT:R-HSA-1478805 UniProtKB:P01857
DRUGBANK:DB00072    UniProtKB:P01857
DRUGBANK:DB00043    UniProtKB:P01857
DRUGBANK:DB00073    UniProtKB:P01857
DRUGBANK:DB00065    UniProtKB:P01857
DRUGBANK:DB00111    UniProtKB:P01857
DRUGBANK:DB00081    UniProtKB:P01857

$ egrep -i 'DRUGBANK:DB00074|DRUGBANK:DB00051|DRUGBANK:DB00072|DRUGBANK:DB00043|DRUGBANK:DB00073|DRUGBANK:DB00065|DRUGBANK:DB00111|DRUGBANK:DB00081|UniProtKB:P01857' kg2_node_info.tsv 
DRUGBANK:DB00111    Daclizumab  Daclizumab  biolink:MolecularEntity
DRUGBANK:DB00081    Tositumomab Tositumomab biolink:MolecularEntity
DRUGBANK:DB00051    Adalimumab  Adalimumab  biolink:MolecularEntity
DRUGBANK:DB00043    Omalizumab  Omalizumab  biolink:MolecularEntity
DRUGBANK:DB00072    Trastuzumab Trastuzumab biolink:MolecularEntity
DRUGBANK:DB00074    Basiliximab Basiliximab biolink:MolecularEntity
DRUGBANK:DB00073    Rituximab   Rituximab   biolink:MolecularEntity
DRUGBANK:DB00065    Infliximab  Infliximab  biolink:MolecularEntity
UniProtKB:P01857    IGHG1   Immunoglobulin heavy constant gamma 1 {ECO:0000303|PubMed:11340299, ECO:0000303|Ref.11} biolink:Protein

In a way, yes, they are all this antibody. But I suppose we don't want to call them equivalent.

@amykglen @saramsey are you able to follow up and resolve this? It would be good to use this example to search for more similar issues; there may be some.

thanks!
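
One way to search for similar cases is to join the two files and flag equivalence pairs whose nodes have different categories. This is a rough sketch. It assumes both files are tab-separated, that kg2_equivalencies.tsv has two curie columns, and that the category is the last column of kg2_node_info.tsv, as the grep output above suggests. `load_categories` and `cross_category_pairs` are hypothetical helpers, and the sample rows are abbreviated from the output above:

```python
import csv
import io


def load_categories(node_info_tsv):
    """Map each curie (first column) to its category (last column)."""
    reader = csv.reader(io.StringIO(node_info_tsv), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    return {row[0]: row[-1] for row in reader if row}


def cross_category_pairs(equivalencies_tsv, categories):
    """Return (curie_a, curie_b, category_a, category_b) for each
    equivalence row whose two nodes have different known categories."""
    reader = csv.reader(io.StringIO(equivalencies_tsv), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    flagged = []
    for row in reader:
        if len(row) != 2:
            continue  # skip rows that do not fit the assumed layout
        left, right = row
        cat_left, cat_right = categories.get(left), categories.get(right)
        if cat_left is None or cat_right is None:
            continue  # category unknown; cannot compare
        if cat_left != cat_right:
            flagged.append((left, right, cat_left, cat_right))
    return flagged


node_info = (
    "DRUGBANK:DB00072\tTrastuzumab\tTrastuzumab\tbiolink:MolecularEntity\n"
    "UniProtKB:P01857\tIGHG1\tImmunoglobulin heavy constant gamma 1\tbiolink:Protein\n"
)
equivalencies = (
    "DRUGBANK:DB00072\tUniProtKB:P01857\n"
    "REACT:R-HSA-1478805\tUniProtKB:P01857\n"
)

flagged = cross_category_pairs(equivalencies, load_categories(node_info))
for pair in flagged:
    print(pair)
```

A strict category inequality will probably over-report, since some cross-category merges (gene and protein, for example) may be intentional, so the output is a list of candidates to review rather than a list of errors.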

rtroper commented 3 years ago

I can imagine in queries for biolink:Protein, perhaps wanting to treat these as equivalent (emphasis on gross structure), but in the case of biolink:Drug or biolink:ChemicalEntity, wanting to maintain a distinction based on binding target (emphasis on activity). For synonymization, could you have different behavior based on the curie prefix (whether it falls logically in one or another biolink category)?
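
To make the prefix idea concrete, here is a sketch of a merge rule keyed on curie prefix. It is only an illustration of the suggestion, not how the synonymizer behaves. The prefix groupings and the `may_merge` helper are assumptions, and it covers only the simplest variant (never merge across groups) rather than query-dependent behavior:

```python
# Hypothetical prefix groupings, chosen only for illustration.
DRUG_PREFIXES = frozenset({"DRUGBANK", "CHEMBL.COMPOUND", "RXNORM"})
PROTEIN_PREFIXES = frozenset({"UniProtKB", "PR"})


def prefix_group(curie):
    """Return 'drug', 'protein', or None based on the curie prefix."""
    prefix = curie.split(":", 1)[0]
    if prefix in DRUG_PREFIXES:
        return "drug"
    if prefix in PROTEIN_PREFIXES:
        return "protein"
    return None


def may_merge(curie_a, curie_b):
    """Allow a merge unless the two curies fall in different known groups."""
    group_a, group_b = prefix_group(curie_a), prefix_group(curie_b)
    if group_a is None or group_b is None:
        return True  # no opinion for prefixes outside the known groups
    return group_a == group_b


print(may_merge("DRUGBANK:DB00072", "UniProtKB:P01857"))
print(may_merge("PR:P01857", "UniProtKB:P01857"))
```

Under these assumed groupings the trastuzumab/IGHG1 pair is rejected while the two protein identifiers can still merge.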

dkoslicki commented 3 years ago

Marking high prio due to it being critical for the Dec demo

amykglen commented 3 years ago

I'll investigate a bit more and write up an issue in the RTX-KG2 repo about the offending same_as edges.

depending on timing/plans for our next KG2 build and how fast we want this fix rolled out, one option is to manually delete the problem edges from the KG2.7.2 kg2_equivalencies.tsv and re-run the 2.7.2 synonymizer and KG2c builds.
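
The manual-deletion option could be scripted along these lines. This is a sketch assuming kg2_equivalencies.tsv holds one tab-separated curie pair per line. `drop_problem_edges` is a hypothetical helper, and the two problem edges listed are taken from the grep output earlier in this thread:

```python
def drop_problem_edges(lines, problem_edges):
    """Return the lines whose curie pair is not in problem_edges.

    problem_edges is a set of frozensets, so pair order does not matter.
    Lines that do not hold exactly two fields are kept unchanged.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2 and frozenset(fields) in problem_edges:
            continue
        kept.append(line)
    return kept


problem_edges = {
    frozenset({"DRUGBANK:DB00074", "UniProtKB:P01857"}),
    frozenset({"DRUGBANK:DB00072", "UniProtKB:P01857"}),
}
rows = [
    "DRUGBANK:DB00074\tUniProtKB:P01857\n",
    "REACT:R-HSA-1478805\tUniProtKB:P01857\n",
    "DRUGBANK:DB00072\tUniProtKB:P01857\n",
]

cleaned = drop_problem_edges(rows, problem_edges)
print("".join(cleaned), end="")
```

The cleaned lines would then be written to a new file before re-running the synonymizer and KG2c builds, leaving the original file untouched for comparison.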

saramsey commented 3 years ago

This issue was raised (by Dr. Hadlock from Multiomics Provider) during the Translator mini-hackathon on Thursday Aug. 26. Our team's response was that, assuming the root cause is RTX-KG2 issue 131, we are confident that we can get a fix rolled into production before the September Relay (Sept. 27). They described the issue as blocking for the workflow. I will work on issue 131 under the assumption that it is the root cause, and I will post updates in that issue on the RTX-KG2 repo.

saramsey commented 3 years ago

I believe I have fixed this issue in the RTX-KG2 code; see issue 131. We anticipate this fix will be in the next build of RTX-KG2, which would be version KG2.7.3.

amykglen commented 3 years ago

confirmed this was resolved in KG2.7.3 (IGHG1 looks good now: https://arax.ncats.io/?term=UniProtKB:P01857, and the 'mabs' look good too, e.g.: https://arax.ncats.io/?term=basiliximab)

good to close, @dkoslicki?

amykglen commented 1 year ago

I guess we had already fixed this one a while back, but it looks good in the new synonymizer too (#2003):

Cluster for DRUGBANK:DB00111 (MESH:D000077561) has 8 nodes:

id category name in_SRI in_KG2pre is_cluster_rep
CHEMBL.COMPOUND:CHEMBL1201605 ChemicalEntity DACLIZUMAB X X
DRUGBANK:DB00111 ChemicalEntity Daclizumab X X
DrugCentral:4953 ChemicalEntity daclizumab X X
GTOPDB:6880 SmallMolecule daclizumab X
KEGG.DRUG:D03639 Drug Daclizumab (USAN/INN) X
MESH:D000077561 ChemicalEntity Daclizumab X X X
NCIT:C1569 Drug Daclizumab X
RXNORM:190353 Drug daclizumab X

Cluster for DRUGBANK:DB00081 (CHEMBL.COMPOUND:CHEMBL1201604) has 4 nodes:

id category name in_SRI in_KG2pre is_cluster_rep
CHEMBL.COMPOUND:CHEMBL1201604 ChemicalEntity TOSITUMOMAB X X X
DRUGBANK:DB00081 ChemicalEntity Tositumomab X
NCIT:C2543 Drug Tositumomab X
RXNORM:263010 Drug tositumomab X

Cluster for DRUGBANK:DB00051 (CHEMBL.COMPOUND:CHEMBL1201580) has 5 nodes:

id category name in_SRI in_KG2pre is_cluster_rep
CHEMBL.COMPOUND:CHEMBL1201580 ChemicalEntity ADALIMUMAB X X X
DRUGBANK:DB00051 ChemicalEntity Adalimumab X
KEGG.DRUG:D02597 Drug Adalimumab (USAN/INN) X
NCIT:C65216 Drug Adalimumab X
RXNORM:327361 Drug adalimumab X

Cluster for DRUGBANK:DB00043 (CHEMBL.COMPOUND:CHEMBL1201589) has 4 nodes:

id category name in_SRI in_KG2pre is_cluster_rep
CHEMBL.COMPOUND:CHEMBL1201589 ChemicalEntity OMALIZUMAB X X X
DRUGBANK:DB00043 ChemicalEntity Omalizumab X
NCIT:C29299 Drug Omalizumab X
RXNORM:302379 Drug omalizumab X

Cluster for DRUGBANK:DB00072 (UNII:P188ANX8CK) has 14 nodes:

id category name in_SRI in_KG2pre is_cluster_rep
DRUGBANK:DB00072 ChemicalEntity Trastuzumab X X
DrugCentral:4979 ChemicalEntity trastuzumab X X
KEGG.DRUG:D03257 Drug Trastuzumab (USAN/INN) X
MESH:C112748 ChemicalEntity [OBSOLETE] trastuzumab X
NCIT:C1647 Drug Trastuzumab X
PathWhiz.ProteinComplex:826 MolecularEntity Trastuzumab X
RXNORM:224905 Drug trastuzumab X
UMLS:C0728747 Protein trastuzumab X X
UMLS:C4541579 Protein trastuzumab-dkst X X
UMLS:C4741882 Protein trastuzumab-pkrb X X
UMLS:C4758794 Protein trastuzumab-dttb X X
UMLS:C4764376 Protein trastuzumab-qyyp X X
UMLS:C5187552 Protein trastuzumab-anns X X
UNII:P188ANX8CK ChemicalEntity TRASTUZUMAB X X

the SRI seems to break IGHG1 into two clusters, one for Gene and one for Protein:

Cluster for NCBIGene:3500 has 2 nodes:

id category name in_SRI in_KG2pre is_cluster_rep
NCBIGene:3500 Gene IGHG1 X X X
OMIM:147100 Gene IGHG1 X X

Cluster for UniProtKB:P01857 has 8 nodes:

id category name in_SRI in_KG2pre is_cluster_rep
ENSEMBL:ENSP00000374991 Protein X
ENSEMBL:ENSP00000374991.2 Protein X
ENSEMBL:ENSP00000488387 Protein X
ENSEMBL:ENSP00000488387.1 Protein X
PR:P01857 Protein immunoglobulin heavy constant gamma 1 (human) X X
REACT:R-HSA-1478805 Protein IGHG1 [extracellular region] X
UMLS:C1453819 Protein IGHG1 protein, human X X
UniProtKB:P01857 Protein IGHG1_HUMAN Immunoglobulin heavy constant gamma 1 (sprot) X X X

amykglen commented 1 year ago

confirmed fixed on our dev instances:

https://arax.ncats.io/devLM/?term=Daclizumab
https://arax.ncats.io/devLM/?term=Tositumomab
https://arax.ncats.io/devLM/?term=Omalizumab
https://arax.ncats.io/devLM/?term=Trastuzumab
https://arax.ncats.io/devLM/?term=IGHG1