Open gaurav opened 6 months ago
As of PR #342, we can use the DuckDB database to generate a list of the 74,886 duplicated CURIEs: duplicate_curies-2024sep9.csv. Some of these don't look right yet, but I figured it was worth sharing as a first pass, knowing that the results might change later.
Here's the distribution by the types of Biolink entities being combined:
1 biolink:CellularComponent||biolink:ComplexMolecularMixture
1 biolink:ChemicalEntity||biolink:GrossAnatomicalStructure
1 biolink:ChemicalEntity||biolink:Polypeptide||biolink:Protein
1 biolink:Disease||biolink:MolecularMixture
1 biolink:Disease||biolink:OrganismTaxon
1 biolink:GrossAnatomicalStructure||biolink:SmallMolecule
1 biolink:MolecularMixture||biolink:OrganismTaxon
1 biolink:OrganismTaxon||biolink:SmallMolecule
1 biolink:PhenotypicFeature||biolink:Protein
2 biolink:AnatomicalEntity||biolink:ChemicalEntity||biolink:Protein
2 biolink:Cell||biolink:Disease
2 biolink:ChemicalEntity||biolink:MolecularMixture||biolink:Protein
3 biolink:CellularComponent||biolink:ChemicalEntity
3 biolink:Disease||biolink:SmallMolecule
3 biolink:GrossAnatomicalStructure||biolink:OrganismTaxon
4 biolink:Cell||biolink:PhenotypicFeature
4 biolink:ChemicalEntity||biolink:PhenotypicFeature
4 biolink:GrossAnatomicalStructure||biolink:PhenotypicFeature
6 biolink:AnatomicalEntity||biolink:ComplexMolecularMixture
6 biolink:CellularComponent||biolink:Disease
9 biolink:ChemicalEntity||biolink:Disease
12 biolink:Cell||biolink:OrganismTaxon
13 biolink:CellularComponent||biolink:PhenotypicFeature
13 biolink:Cell||biolink:ChemicalEntity
14 biolink:Disease||biolink:GrossAnatomicalStructure
15 biolink:AnatomicalEntity||biolink:ChemicalEntity
15 biolink:AnatomicalEntity||biolink:Disease
17 biolink:ComplexMolecularMixture||biolink:Protein
25 biolink:ChemicalEntity||biolink:Protein||biolink:SmallMolecule
28 biolink:ChemicalEntity||biolink:OrganismTaxon
34 biolink:Polypeptide||biolink:Protein
37 biolink:AnatomicalEntity||biolink:OrganismTaxon
55 biolink:Disease||biolink:Gene
56 biolink:AnatomicalEntity||biolink:PhenotypicFeature
57 biolink:ChemicalEntity||biolink:Drug||biolink:Protein
81 biolink:MolecularMixture||biolink:Protein
123 biolink:Drug
264 biolink:Drug||biolink:Protein
628 biolink:MacromolecularComplex
764 biolink:Protein||biolink:SmallMolecule
6600 biolink:Gene||biolink:Protein
65977 biolink:ChemicalEntity||biolink:Protein
Some of these are duplicated only within the DuckDB database (MacromolecularComplex) and aren't actually duplicated in the compendia files, while others are duplicated in the compendia files but are returned correctly by NodeNorm (which I think is because the system happens to pick the correct clique to merge on by chance). Either way, they all represent something wrong that needs to be fixed somewhere.
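For anyone who wants to reproduce a tally like the one above, here is a minimal sketch. It assumes the duplicates can be exported as (CURIE, Biolink type) pairs with one row per clique membership; that layout and the name `type_distribution` are my own illustration, not the actual schema from PR #342.

```python
from collections import Counter, defaultdict


def type_distribution(rows):
    """Tally duplicated CURIEs by the combination of Biolink types they appear under.

    rows: iterable of (curie, biolink_type) pairs, one per clique the CURIE
    appears in (an assumed layout, not the real DuckDB schema).
    """
    types_by_curie = defaultdict(set)
    memberships = Counter()
    for curie, biolink_type in rows:
        types_by_curie[curie].add(biolink_type)
        memberships[curie] += 1

    distribution = Counter()
    for curie, count in memberships.items():
        if count > 1:  # only CURIEs that appear in more than one clique
            key = "||".join(sorted(types_by_curie[curie]))
            distribution[key] += 1
    return distribution
```

A CURIE that shows up twice under the same type produces a single-type key such as `biolink:Drug`, which matches the single-type rows in the list above.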
I took out some duplicates and ended up with 33,557 results instead: duplicate_curies-2024sep9-2.csv
Biolink type distribution:
1 biolink:AnatomicalEntity||biolink:OrganismTaxon
1 biolink:ChemicalEntity||biolink:GrossAnatomicalStructure
1 biolink:ChemicalEntity||biolink:Polypeptide||biolink:Protein
1 biolink:Disease||biolink:MolecularMixture
1 biolink:Disease||biolink:OrganismTaxon
1 biolink:GrossAnatomicalStructure||biolink:SmallMolecule
1 biolink:MolecularMixture||biolink:OrganismTaxon
1 biolink:OrganismTaxon||biolink:SmallMolecule
1 biolink:PhenotypicFeature||biolink:Protein
1 biolink:Polypeptide||biolink:Protein
2 biolink:AnatomicalEntity||biolink:ChemicalEntity||biolink:Protein
2 biolink:Cell||biolink:Disease
2 biolink:ChemicalEntity||biolink:MolecularMixture||biolink:Protein
3 biolink:Cell||biolink:OrganismTaxon
3 biolink:Disease||biolink:SmallMolecule
3 biolink:GrossAnatomicalStructure||biolink:OrganismTaxon
4 biolink:Cell||biolink:PhenotypicFeature
4 biolink:ChemicalEntity||biolink:PhenotypicFeature
4 biolink:GrossAnatomicalStructure||biolink:PhenotypicFeature
6 biolink:AnatomicalEntity||biolink:ComplexMolecularMixture
6 biolink:CellularComponent||biolink:Disease
8 biolink:ChemicalEntity||biolink:OrganismTaxon
9 biolink:ChemicalEntity||biolink:Disease
11 biolink:Cell||biolink:ChemicalEntity
12 biolink:AnatomicalEntity||biolink:ChemicalEntity
12 biolink:CellularComponent||biolink:PhenotypicFeature
14 biolink:Disease||biolink:GrossAnatomicalStructure
15 biolink:AnatomicalEntity||biolink:Disease
17 biolink:ComplexMolecularMixture||biolink:Protein
25 biolink:ChemicalEntity||biolink:Protein||biolink:SmallMolecule
55 biolink:Disease||biolink:Gene
56 biolink:AnatomicalEntity||biolink:PhenotypicFeature
57 biolink:ChemicalEntity||biolink:Drug||biolink:Protein
81 biolink:MolecularMixture||biolink:Protein
263 biolink:Drug||biolink:Protein
764 biolink:Protein||biolink:SmallMolecule
32109 biolink:ChemicalEntity||biolink:Protein
Next step:
A good example is UMLS:C0006050, which is present both as its own clique and in a completely different clique, since the two cliques are typed as a chemical and a protein respectively: https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=UMLS%3AC0006050&curie=UNII%3AE211KPY694&conflate=true&drug_chemical_conflate=true
This is possible because db1 contains the JSON of all the identifiers, so as long as the cliques have different preferred IDs, nothing breaks.
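The lookup above can be scripted if we want to spot-check more CURIEs. This is only a sketch: the endpoint and query parameters are taken from the URL above, but the response shape used in `preferred_ids` (each queried CURIE mapping to an object with `id.identifier`, or to null when nothing is found) is my assumption and should be checked against a real response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

NODENORM_URL = "https://nodenormalization-sri.renci.org/get_normalized_nodes"


def build_lookup_url(curies, conflate=True, drug_chemical_conflate=True):
    """Build a get_normalized_nodes URL with one `curie` parameter per CURIE."""
    params = [("curie", curie) for curie in curies]
    params.append(("conflate", str(conflate).lower()))
    params.append(("drug_chemical_conflate", str(drug_chemical_conflate).lower()))
    return f"{NODENORM_URL}?{urlencode(params)}"


def fetch(curies):
    """Fetch and parse the JSON response (needs network access)."""
    with urlopen(build_lookup_url(curies), timeout=30) as response:
        return json.load(response)


def preferred_ids(response):
    """Map each queried CURIE to its clique's preferred ID.

    Assumes the response shape described in the note above; a missing
    (null) result maps to None.
    """
    return {
        queried: (result["id"]["identifier"] if result else None)
        for queried, result in response.items()
    }
```

Calling `preferred_ids(fetch(["UMLS:C0006050", "UNII:E211KPY694"]))` would show whether the two queries resolve to different preferred IDs.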
It's probably not possible to fix this without overhauling how the Redis databases in NodeNorm work, but it would be nice to have some sort of index-wide test (#225) to catch when this happens.
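As a starting point for that kind of index-wide test, here is a rough sketch of the core check. It assumes the cliques can be iterated as (preferred ID, member CURIEs) pairs; the function name and input shape are hypothetical, and reading the compendia files into that shape is left out.

```python
from collections import defaultdict


def find_cross_clique_curies(cliques):
    """Return CURIEs that occur in more than one clique.

    cliques: iterable of (preferred_id, member_curies) pairs (assumed shape).
    Returns a dict mapping each offending CURIE to the sorted preferred IDs
    of the cliques that contain it.
    """
    cliques_by_curie = defaultdict(set)
    for preferred_id, members in cliques:
        # Treat the preferred ID as a member even if it is not listed.
        for curie in set(members) | {preferred_id}:
            cliques_by_curie[curie].add(preferred_id)
    return {
        curie: sorted(preferred)
        for curie, preferred in cliques_by_curie.items()
        if len(preferred) > 1
    }
```

Because this keys on the preferred ID, it would not flag two cliques that share the same preferred ID, so that case would need a separate check.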