dkoslicki closed this issue 1 year ago
Can you paste a screencap in the issue? I don't see any results in that linked JSON
I'm guessing the objection is to this??
@saramsey I didn't resultify(), just looked at the KG (as Eric mentioned)
In KG2.5.1, the node UniProtKB:P01857 looks fine to me:
I wonder if this is a KG2c issue or synonymization issue?
Yes, this is probably a KG2C issue
this is a weird one - I think there's another instance of #1074 going on here.
for example, synonymizer.get_canonical_curies("UMLS:C0966225") results in:
{
  "UMLS:C0966225": {
    "preferred_curie": "UniProtKB:P01857",
    "preferred_name": "IGHG1",
    "preferred_type": "protein"
  }
}
but when you look up the equivalent curies for that preferred curie (synonymizer.get_equivalent_nodes("UniProtKB:P01857")), UMLS:C0966225 doesn't appear in them:
{
  "UniProtKB:P01857": {
    "HGNC:5525": "KG2",
    "ORPHANET:122589": "KG2",
    "PR:P01857": "KG2",
    "PR:000008959": "KG2",
    "UniProtKB:P01857": "KG1,KG2",
    "ENSEMBL:ENSG00000211896": "KG2",
    "ENSEMBL:ENSG00000277633": "KG2",
    "NCBIGene:3500": "KG2"
  }
}
this shouldn't be possible, in my understanding of the synonymizer.
the same thing is true for most of the other concepts listed in the all_names that @edeutsch posted a screenshot of (they map to a preferred curie of UniProtKB:P01857, but they don't appear in UniProtKB:P01857's equivalent curies, according to the synonymizer).
interestingly, the curies that the synonymizer does return as equivalent curies for UniProtKB:P01857 all appear to be good things to merge. but the synonymizer is lying, in a sense, about what its equivalent curies really are for this concept. :)
all of these curies map to the preferred curie UniProtKB:P01857, but don't appear in its equivalent curies:
["UMLS:C0966225", "ATC:R03DX05", "UMLS:C0728747", "ATC:L01XC03", "UMLS:C0393022", "ATC:L01XC02", "UMLS:C0676831", "ATC:L04AC02", "UMLS:C0663182", "ATC:L04AC01", "UMLS:C1122087", "ATC:L04AB04", "UMLS:C0666743", "ATC:L04AB02", "UMLS:C5201282", "CPT:80145", "UMLS:C5201962", "CPT:80230", "UMLS:C0879399", "DRUGBANK:DB00081", "DRUGBANK:DB00051", "DRUGBANK:DB00043", "DRUGBANK:DB00072", "DRUGBANK:DB00074", "DRUGBANK:DB00073", "DRUGBANK:DB00065", "DRUGBANK:DB00111", "LOINC:LP173592-9", "LOINC:LP220253-1", "LOINC:LP35095-6", "LOINC:MTHU047400", "LOINC:MTHU060561", "LOINC:MTHU061217", "UMLS:C0295415", "UMLS:C1314901", "MESH:D000069283", "MESH:D000069285", "MESH:D000069444", "UMLS:C3887777", "UMLS:C4287770", "UMLS:C4308993", "UMLS:C0281549", "UMLS:C0382306", "MESH:D000077561", "UMLS:C0757163", "MESH:D000077552", "UMLS:C4764375", "UMLS:C3886461", "UMLS:C0910794", "UMLS:C4704670", "UMLS:C4048611", "UMLS:C4704669", "MESH:D000068879", "UMLS:C4764377", "MESH:D000068878", "UMLS:C4764398", "UMLS:C4087073", "UMLS:C4688627", "UMLS:C4732952", "UMLS:C4048259", "UMLS:C4048257", "UMLS:C4048258", "UMLS:C4727807", "UMLS:C4727806", "UMLS:C4732958", "UMLS:C5207089", "UMLS:C1522155", "UMLS:C4287794", "UMLS:C3887778", "UMLS:C4745344", "UMLS:C4331946", "UMLS:C4331945", "UMLS:C3467755", "UMLS:C1522477", "NCIT:C65216", "NCIT:C52186", "UMLS:C4721708", "NCIT:C29299", "NCIT:C1569", "NCIT:C1647", "NCIT:C1702", "NCIT:C1789", "NCIT:C2543", "NDDF:001331", "NDDF:007640", "NDDF:009871", "NDDF:007236", "PDQ:CDR0000037818", "PDQ:CDR0000042006", "PDQ:CDR0000038698", "PDQ:CDR0000042265", "PDQ:CDR0000699067", "PDQ:CDR0000459956", "PDQ:CDR0000042613", "PDQ:CDR0000791684", "RXNORM:196102", "RXNORM:263010", "RXNORM:302379", "RXNORM:190353", "RXNORM:224905", "RXNORM:191831", "RXNORM:121191", "RXNORM:327361", "SNOMED:407318006", "SNOMED:407317001", "SNOMED:386977009", "SNOMED:386978004", "SNOMED:386891004", "SNOMED:386919002", "SNOMED:387003001", "SNOMED:406443008", "VANDF:4021140", "VANDF:4021126", "VANDF:4021113", 
"VANDF:4021104", "VANDF:4021402", "VANDF:4021399", "VANDF:4021370", "VANDF:4021083", "CHEBI:63583", "CHEBI:64357", "CHEMBL.COMPOUND:CHEMBL1201576", "CHEMBL.COMPOUND:CHEMBL1201585", "CHEMBL.COMPOUND:CHEMBL1201605", "CHEMBL.COMPOUND:CHEMBL1201581", "CHEMBL.COMPOUND:CHEMBL1201439", "CHEMBL.COMPOUND:CHEMBL1201580", "CHEMBL.COMPOUND:CHEMBL1201589", "CHEMBL.COMPOUND:CHEMBL1201604", "PathWhiz.ProteinComplex:826"]
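The inconsistency described above (a curie maps to a preferred curie that does not list it as equivalent) can be checked mechanically. A minimal sketch, assuming only the two synonymizer methods and the response shapes pasted in this comment. `find_orphaned_curies` and `StubSynonymizer` are hypothetical names for illustration; the stub replays an abridged copy of the responses so the snippet runs on its own, and its HGNC:5525 canonical mapping is an assumption.

```python
def find_orphaned_curies(synonymizer, curies):
    """Return curies whose preferred curie does not list them as equivalent."""
    orphans = []
    for curie in curies:
        info = synonymizer.get_canonical_curies(curie).get(curie)
        if not info:
            continue  # curie is unknown to the synonymizer
        preferred = info["preferred_curie"]
        equivalents = synonymizer.get_equivalent_nodes(preferred).get(preferred) or {}
        if curie not in equivalents:
            orphans.append(curie)
    return orphans


class StubSynonymizer:
    """Hypothetical stand-in that replays abridged responses from this thread."""

    _preferred = {
        "UMLS:C0966225": "UniProtKB:P01857",
        "HGNC:5525": "UniProtKB:P01857",  # assumed for illustration
    }
    _equivalents = {
        "UniProtKB:P01857": {
            "HGNC:5525": "KG2",
            "PR:P01857": "KG2",
            "UniProtKB:P01857": "KG1,KG2",
            "NCBIGene:3500": "KG2",
        }
    }

    def get_canonical_curies(self, curie):
        if curie not in self._preferred:
            return {curie: None}
        return {curie: {"preferred_curie": self._preferred[curie]}}

    def get_equivalent_nodes(self, curie):
        return {curie: self._equivalents.get(curie)}


print(find_orphaned_curies(StubSynonymizer(), ["UMLS:C0966225", "HGNC:5525"]))
# prints ['UMLS:C0966225']
```

Passing the real synonymizer object in place of the stub should work as long as it returns the same shapes for single-curie calls, which is what the examples above show.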
not fixed with KG2.6.7.1c: https://arax.ncats.io/?term=Trastuzumab
Hi All - I'm wondering how actively you're working on this issue. This is important for Workflow C. We'd like to be able to look at PMIDs on NGD edges as part of the demo. Here's an example of some results from query C.1: https://arax.ncats.io/?r=12173. Interestingly, the PMIDs for natalizumab and ocrelizumab link to legitimate research results. But trastuzumab links to papers on rituximab, adalimumab, and daclizumab. Not sure why some are impacted and others are not.
err, ahh, not very actively, I suppose. Until you reminded us!
@amykglen @saramsey I'm seeing this:
$ grep -i UNIPROTKB:P01857 kg2_equivalencies.tsv
DRUGBANK:DB00074 UniProtKB:P01857
DRUGBANK:DB00051 UniProtKB:P01857
REACT:R-HSA-1478805 UniProtKB:P01857
DRUGBANK:DB00072 UniProtKB:P01857
DRUGBANK:DB00043 UniProtKB:P01857
DRUGBANK:DB00073 UniProtKB:P01857
DRUGBANK:DB00065 UniProtKB:P01857
DRUGBANK:DB00111 UniProtKB:P01857
DRUGBANK:DB00081 UniProtKB:P01857
$ egrep -i 'DRUGBANK:DB00074|DRUGBANK:DB00051|DRUGBANK:DB00072|DRUGBANK:DB00043|DRUGBANK:DB00073|DRUGBANK:DB00065|DRUGBANK:DB00111|DRUGBANK:DB00081|UniProtKB:P01857' kg2_node_info.tsv
DRUGBANK:DB00111 Daclizumab Daclizumab biolink:MolecularEntity
DRUGBANK:DB00081 Tositumomab Tositumomab biolink:MolecularEntity
DRUGBANK:DB00051 Adalimumab Adalimumab biolink:MolecularEntity
DRUGBANK:DB00043 Omalizumab Omalizumab biolink:MolecularEntity
DRUGBANK:DB00072 Trastuzumab Trastuzumab biolink:MolecularEntity
DRUGBANK:DB00074 Basiliximab Basiliximab biolink:MolecularEntity
DRUGBANK:DB00073 Rituximab Rituximab biolink:MolecularEntity
DRUGBANK:DB00065 Infliximab Infliximab biolink:MolecularEntity
UniProtKB:P01857 IGHG1 Immunoglobulin heavy constant gamma 1 {ECO:0000303|PubMed:11340299, ECO:0000303|Ref.11} biolink:Protein
In a way, yes, they are all this antibody. But I suppose we don't want to call them equivalent.
@amykglen @saramsey are you able to follow up and resolve this? It would be good to use this example to search for similar issues; there may be more.
thanks!
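One way to search for similar cases is to flag equivalence pairs whose two nodes have different categories. A rough sketch, assuming both files are tab-separated without a header, that kg2_equivalencies.tsv holds one curie pair per row, and that the category is the last column of kg2_node_info.tsv (as the grep output above suggests). The function names are hypothetical. This is a coarse filter: some cross-category pairs may be intentional, so hits would still need review.

```python
def load_categories(node_info_path):
    """Map curie -> category, taking the category from the last column."""
    categories = {}
    with open(node_info_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 2:
                categories[cols[0]] = cols[-1]
    return categories


def cross_category_pairs(equivalencies_path, categories):
    """Yield (curie_a, curie_b, category_a, category_b) when the categories differ."""
    with open(equivalencies_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 2:
                continue
            curie_a, curie_b = cols[0], cols[1]
            cat_a, cat_b = categories.get(curie_a), categories.get(curie_b)
            # Skip pairs where either node has no known category.
            if cat_a and cat_b and cat_a != cat_b:
                yield curie_a, curie_b, cat_a, cat_b


# Example call (paths are assumptions):
# cats = load_categories("kg2_node_info.tsv")
# for hit in cross_category_pairs("kg2_equivalencies.tsv", cats):
#     print(*hit, sep="\t")
```

On the rows shown above, this would report each DRUGBANK curie paired with UniProtKB:P01857 as biolink:MolecularEntity vs. biolink:Protein.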
I can imagine in queries for biolink:Protein, perhaps wanting to treat these as equivalent (emphasis on gross structure), but in the case of biolink:Drug or biolink:ChemicalEntity, wanting to maintain a distinction based on binding target (emphasis on activity). For synonymization, could you have different behavior based on the curie prefix (whether it falls logically in one or another biolink category)?
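To make the prefix idea concrete, here is a minimal sketch of a merge guard keyed on curie prefix. It is not how the synonymizer works; the `PREFIX_GROUP` table, the group names, and `may_merge` are all hypothetical, and the sketch simplifies the suggestion above to a single yes/no decision rather than per-query behavior.

```python
# Hypothetical grouping of curie prefixes; not taken from the synonymizer.
PREFIX_GROUP = {
    "UniProtKB": "protein",
    "PR": "protein",
    "DRUGBANK": "drug",
    "CHEMBL.COMPOUND": "drug",
    "RXNORM": "drug",
}


def prefix_of(curie):
    """Return the part of a curie before the first colon."""
    return curie.split(":", 1)[0]


def may_merge(curie_a, curie_b):
    """Allow a merge unless both prefixes are known and fall in different groups."""
    group_a = PREFIX_GROUP.get(prefix_of(curie_a))
    group_b = PREFIX_GROUP.get(prefix_of(curie_b))
    return group_a is None or group_b is None or group_a == group_b


print(may_merge("DRUGBANK:DB00072", "UniProtKB:P01857"))  # prints False
print(may_merge("PR:P01857", "UniProtKB:P01857"))         # prints True
```

Prefixes that span several categories (UMLS, for example) are left out of the table on purpose, so the guard never blocks them.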
Marking high prio due to it being critical for the Dec demo
I'll investigate a bit more and write up an issue in the RTX-KG2 repo about the offending same_as edges.
depending on timing/plans for our next KG2 build and how fast we want this fix rolled out, one option is to manually delete the problem edges from the KG2.7.2 kg2_equivalencies.tsv and re-run the 2.7.2 synonymizer and KG2c builds.
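The manual deletion could be scripted rather than done by hand. A sketch, assuming kg2_equivalencies.tsv is a tab-separated file with one curie pair per row; `drop_edges` and `PROBLEM_PAIRS` are hypothetical names, and the pair list is taken from the DRUGBANK rows in the grep output earlier in this thread.

```python
# The eight DRUGBANK -> UniProtKB:P01857 pairs from the grep output above.
PROBLEM_PAIRS = [
    (f"DRUGBANK:{drug_id}", "UniProtKB:P01857")
    for drug_id in (
        "DB00074", "DB00051", "DB00072", "DB00043",
        "DB00073", "DB00065", "DB00111", "DB00081",
    )
]


def drop_edges(in_path, out_path, bad_pairs):
    """Copy a two-column TSV, skipping rows whose pair (in either order) is in bad_pairs.

    Returns (rows_kept, rows_dropped).
    """
    bad = {frozenset(pair) for pair in bad_pairs}
    kept = dropped = 0
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 2 and frozenset(cols[:2]) in bad:
                dropped += 1
                continue
            dst.write(line)
            kept += 1
    return kept, dropped


# Example call (file names are assumptions):
# kept, dropped = drop_edges("kg2_equivalencies.tsv", "kg2_equivalencies.fixed.tsv", PROBLEM_PAIRS)
```

Writing to a new file keeps the original around for comparison, and the returned counts give a quick sanity check that only the expected rows were removed.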
This issue was raised (by Dr. Hadlock from Multiomics Provider) during the Translator mini-hackathon on Thursday Aug. 26. Our team's response was that assuming the root cause is RTX-KG2 issue 131, we are confident that we can get a fix rolled into production before the September Relay (Sept. 27). They described the issue as blocking for the workflow. I will work issue 131 under the assumption it is the root cause and I will post updates in that issue on the RTX-KG2 repo.
I believe I have fixed this issue in the RTX-KG2 code, see issue 131. We anticipate this fix will be in the next build of RTX-KG2, which would be version KG2.7.3.
confirmed this was resolved in KG2.7.3 (IGHG1 looks good now: https://arax.ncats.io/?term=UniProtKB:P01857, and the 'mabs' look good too, e.g.: https://arax.ncats.io/?term=basiliximab)
good to close, @dkoslicki?
I guess we had already fixed this one a while back, but it looks good in the new synonymizer too (#2003):
Cluster for DRUGBANK:DB00111 (MESH:D000077561) has 8 nodes:
id | category | name | in_SRI | in_KG2pre | is_cluster_rep |
---|---|---|---|---|---|
CHEMBL.COMPOUND:CHEMBL1201605 | ChemicalEntity | DACLIZUMAB | X | X | |
DRUGBANK:DB00111 | ChemicalEntity | Daclizumab | X | X | |
DrugCentral:4953 | ChemicalEntity | daclizumab | X | X | |
GTOPDB:6880 | SmallMolecule | daclizumab | X | ||
KEGG.DRUG:D03639 | Drug | Daclizumab (USAN/INN) | X | ||
MESH:D000077561 | ChemicalEntity | Daclizumab | X | X | X |
NCIT:C1569 | Drug | Daclizumab | X | ||
RXNORM:190353 | Drug | daclizumab | X |
Cluster for DRUGBANK:DB00081 (CHEMBL.COMPOUND:CHEMBL1201604) has 4 nodes:
id | category | name | in_SRI | in_KG2pre | is_cluster_rep |
---|---|---|---|---|---|
CHEMBL.COMPOUND:CHEMBL1201604 | ChemicalEntity | TOSITUMOMAB | X | X | X |
DRUGBANK:DB00081 | ChemicalEntity | Tositumomab | X | ||
NCIT:C2543 | Drug | Tositumomab | X | ||
RXNORM:263010 | Drug | tositumomab | X |
Cluster for DRUGBANK:DB00051 (CHEMBL.COMPOUND:CHEMBL1201580) has 5 nodes:
id | category | name | in_SRI | in_KG2pre | is_cluster_rep |
---|---|---|---|---|---|
CHEMBL.COMPOUND:CHEMBL1201580 | ChemicalEntity | ADALIMUMAB | X | X | X |
DRUGBANK:DB00051 | ChemicalEntity | Adalimumab | X | ||
KEGG.DRUG:D02597 | Drug | Adalimumab (USAN/INN) | X | ||
NCIT:C65216 | Drug | Adalimumab | X | ||
RXNORM:327361 | Drug | adalimumab | X |
Cluster for DRUGBANK:DB00043 (CHEMBL.COMPOUND:CHEMBL1201589) has 4 nodes:
id | category | name | in_SRI | in_KG2pre | is_cluster_rep |
---|---|---|---|---|---|
CHEMBL.COMPOUND:CHEMBL1201589 | ChemicalEntity | OMALIZUMAB | X | X | X |
DRUGBANK:DB00043 | ChemicalEntity | Omalizumab | X | ||
NCIT:C29299 | Drug | Omalizumab | X | ||
RXNORM:302379 | Drug | omalizumab | X |
Cluster for DRUGBANK:DB00072 (UNII:P188ANX8CK) has 14 nodes:
id | category | name | in_SRI | in_KG2pre | is_cluster_rep |
---|---|---|---|---|---|
DRUGBANK:DB00072 | ChemicalEntity | Trastuzumab | X | X | |
DrugCentral:4979 | ChemicalEntity | trastuzumab | X | X | |
KEGG.DRUG:D03257 | Drug | Trastuzumab (USAN/INN) | X | ||
MESH:C112748 | ChemicalEntity | [OBSOLETE] trastuzumab | X | ||
NCIT:C1647 | Drug | Trastuzumab | X | ||
PathWhiz.ProteinComplex:826 | MolecularEntity | Trastuzumab | X | ||
RXNORM:224905 | Drug | trastuzumab | X | ||
UMLS:C0728747 | Protein | trastuzumab | X | X | |
UMLS:C4541579 | Protein | trastuzumab-dkst | X | X | |
UMLS:C4741882 | Protein | trastuzumab-pkrb | X | X | |
UMLS:C4758794 | Protein | trastuzumab-dttb | X | X | |
UMLS:C4764376 | Protein | trastuzumab-qyyp | X | X | |
UMLS:C5187552 | Protein | trastuzumab-anns | X | X | |
UNII:P188ANX8CK | ChemicalEntity | TRASTUZUMAB | X | X |
the SRI seems to break IGHG1 into two clusters, one for Gene and one for Protein:
Cluster for NCBIGene:3500 has 2 nodes:
id | category | name | in_SRI | in_KG2pre | is_cluster_rep |
---|---|---|---|---|---|
NCBIGene:3500 | Gene | IGHG1 | X | X | X |
OMIM:147100 | Gene | IGHG1 | X | X |
Cluster for UniProtKB:P01857 has 8 nodes:
id | category | name | in_SRI | in_KG2pre | is_cluster_rep |
---|---|---|---|---|---|
ENSEMBL:ENSP00000374991 | Protein | X | |||
ENSEMBL:ENSP00000374991.2 | Protein | X | |||
ENSEMBL:ENSP00000488387 | Protein | X | |||
ENSEMBL:ENSP00000488387.1 | Protein | X | |||
PR:P01857 | Protein | immunoglobulin heavy constant gamma 1 (human) | X | X | |
REACT:R-HSA-1478805 | Protein | IGHG1 [extracellular region] | X | ||
UMLS:C1453819 | Protein | IGHG1 protein, human | X | X | |
UniProtKB:P01857 | Protein | IGHG1_HUMAN Immunoglobulin heavy constant gamma 1 (sprot) | X | X | X |
Check out the KG for this: https://arax.ncats.io/?r=666
The SME says they shouldn't all be lumped together just because they're antibodies.