TranslatorSRI / NodeNormalization

Service that produces Translator compliant nodes given a curie
MIT License

`semantic_types` key in curie_to_bl_type_db includes duplicates #235

Open gaurav opened 7 months ago

Something has gone wrong with how the `semantic_types` key in `curie_to_bl_type_db` (also known as semantic-count) is set. It is supposed to hold a unique list of the semantic types stored in this NodeNorm instance, but it currently (2023nov5) contains 3,331 entries, with the same Biolink types appearing again and again. Here is a sample:

2493) "biolink:GeneOrGeneProduct"
2494) "biolink:Entity"
2495) "biolink:NamedThing"
2496) "biolink:BiologicalEntity"
2497) "biolink:GeneFamily"
2498) "biolink:GeneGroupingMixin"
2499) "biolink:Human"
2500) "biolink:NucleicAcidEntity"
2501) "biolink:ClinicalAttribute"
2502) "biolink:Food"
2503) "biolink:OrganismAttribute"
2504) "biolink:Attribute"
2505) "biolink:MolecularActivity"
2506) "biolink:PhysiologicalProcess"
2507) "biolink:Event"
2508) "biolink:Device"
2509) "biolink:GeographicLocation"
2510) "biolink:PlanetaryEntity"
2511) "biolink:Phenomenon"
2512) "biolink:Behavior"
2513) "biolink:Activity"
2514) "biolink:Procedure"
2515) "biolink:ActivityAndBehavior"
2516) "biolink:Agent"
2517) "biolink:AdministrativeEntity"
2518) "biolink:Cohort"
2519) "biolink:PopulationOfIndividualOrganisms"
2520) "biolink:StudyPopulation"
2521) "biolink:Drug"
2522) "biolink:MolecularMixture"
2523) "biolink:Publication"
2524) "biolink:InformationContentEntity"
2525) "biolink:PhysicalEntity"
2526) "biolink:BiologicalProcess"
2527) "biolink:Occurrent"
2528) "biolink:BiologicalProcessOrActivity"
2529) "biolink:Disease"
2530) "biolink:DiseaseOrPhenotypicFeature"
2531) "biolink:CellularComponent"
2532) "biolink:Cell"
2533) "biolink:SubjectOfInvestigation"
2534) "biolink:OrganismalEntity"
2535) "biolink:AnatomicalEntity"
2536) "biolink:ComplexMolecularMixture"
2537) "biolink:ChemicalMixture"
2538) "biolink:SmallMolecule"
2539) "biolink:ChemicalOrDrugOrTreatment"
2540) "biolink:ChemicalEntity"
2541) "biolink:MolecularEntity"
2542) "biolink:Protein"
2543) "biolink:ChemicalEntityOrProteinOrPolypeptide"
2544) "biolink:GeneProductMixin"
2545) "biolink:Polypeptide"
2546) "biolink:Gene"
2547) "biolink:MacromolecularMachineMixin"
2548) "biolink:PhysicalEssenceOrOccurrent"
2549) "biolink:ThingWithTaxon"
2550) "biolink:OntologyClass"
2551) "biolink:PhysicalEssence"
2552) "biolink:ChemicalEntityOrGeneOrGeneProduct"
2553) "biolink:GenomicEntity"
2554) "biolink:GeneOrGeneProduct"
2555) "biolink:Entity"

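Presumably the bug is in the loader. One possible cause is that each Biolink type is being added along with all of its ancestors, so that ancestors shared by several types (such as `biolink:Entity` and `biolink:NamedThing`) are stored once per type.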
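
To illustrate the suspected failure mode, here is a minimal sketch in plain Python. It is not the actual loader code: the `ANCESTORS` table is a simplified, hypothetical stand-in for the Biolink Model hierarchy, and the function names `load_as_list` and `load_as_set` are made up for this example. It only shows that appending "type plus ancestors" to a list repeats the shared ancestors, while collecting into a set does not.

```python
# Hypothetical, simplified ancestor table. A real loader would derive this
# from the Biolink Model rather than hard-coding it.
ANCESTORS = {
    "biolink:Gene": [
        "biolink:GeneOrGeneProduct",
        "biolink:BiologicalEntity",
        "biolink:NamedThing",
        "biolink:Entity",
    ],
    "biolink:Protein": [
        "biolink:GeneOrGeneProduct",
        "biolink:BiologicalEntity",
        "biolink:NamedThing",
        "biolink:Entity",
    ],
}


def load_as_list(leaf_types):
    """Suspected behaviour: append every type and its ancestors to a list."""
    semantic_types = []
    for leaf in leaf_types:
        semantic_types.append(leaf)
        semantic_types.extend(ANCESTORS[leaf])  # shared ancestors repeat here
    return semantic_types


def load_as_set(leaf_types):
    """Alternative: collect into a set so each type is stored only once."""
    semantic_types = set()
    for leaf in leaf_types:
        semantic_types.add(leaf)
        semantic_types.update(ANCESTORS[leaf])
    return semantic_types


as_list = load_as_list(["biolink:Gene", "biolink:Protein"])
as_set = load_as_set(["biolink:Gene", "biolink:Protein"])
print(len(as_list), len(as_set))  # 10 6
```

If the key is stored in Redis, the analogous choice would be between a list written with `RPUSH` and a set written with `SADD`; whether the loader actually does either is an assumption that still needs checking.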

I've worked around this at the endpoint by uniquifying the result (PR #232), but it would be good to figure out what's going wrong in the loader and fix it there.
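
For reference, an endpoint-side deduplication of this kind can be sketched as follows. This is an illustration only, not the code in PR #232; the helper name `unique_types` is hypothetical. It uses `dict.fromkeys`, which keeps the first occurrence of each value and preserves order.

```python
def unique_types(types):
    """Return the types with duplicates removed, keeping first-seen order."""
    return list(dict.fromkeys(types))


# A few entries taken from the listing above, including the repeats.
raw = [
    "biolink:GeneOrGeneProduct",
    "biolink:Entity",
    "biolink:NamedThing",
    "biolink:GeneOrGeneProduct",
    "biolink:Entity",
]
print(unique_types(raw))
# ['biolink:GeneOrGeneProduct', 'biolink:Entity', 'biolink:NamedThing']
```

This hides the symptom from API consumers, but the stored list still grows with every duplicate, which is why a loader-side fix is preferable.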