Knowledge-Graph-Hub / kg-phenio

A Graph for experiments doing ML on ontologies.
BSD 3-Clause "New" or "Revised" License

Node categories #95

Open kevinschaper opened 1 year ago

kevinschaper commented 1 year ago

Here are node categories from the merged monarch graph for nodes coming in from kg-phenio.

I think it might be a bit of a group effort to QC.

category prefix count(*)
biolink:AnatomicalEntity EMAPA: 989
biolink:AnatomicalEntity FBbt: 2272
biolink:AnatomicalEntity UBERON: 5271
biolink:AnatomicalEntity WBbt: 118
biolink:AnatomicalEntity ZFA: 898
biolink:BiologicalProcess GO: 2525
biolink:Cell EMAPA: 5
biolink:Cell FBbt: 12276
biolink:Cell GO: 1
biolink:Cell WBbt: 3378
biolink:Cell ZFA: 1
biolink:CellularComponent GO: 2370
biolink:CellularComponent WBbt: 2908
biolink:ChemicalEntity CHEBI: 71
biolink:ChemicalSubstance CHEBI: 14
biolink:ClinicalModifier HP: 171
biolink:Disease MESH: 2
biolink:Disease MONDO: 25802
biolink:GeneticInheritance HP: 32
biolink:GrossAnatomicalStructure EMAPA: 4876
biolink:GrossAnatomicalStructure FBbt: 4186
biolink:GrossAnatomicalStructure UBERON: 10354
biolink:GrossAnatomicalStructure WBbt: 359
biolink:GrossAnatomicalStructure ZFA: 2201
biolink:MacromolecularComplexMixin GO: 2100
biolink:MolecularActivity GO: 1168
biolink:MolecularEntity CHEBI: 3140
biolink:NamedThing EMAPA: 2875
biolink:NamedThing FBbt: 1107
biolink:NamedThing GO: 3642
biolink:NamedThing MP: 444
biolink:NamedThing MPATH: 52
biolink:NamedThing WBPhenotype: 62
biolink:NamedThing WBbt: 793
biolink:NamedThing ZFA: 118
biolink:Occurrent GO: 38529
biolink:Occurrent UBERON: 60
biolink:PathologicalEntityMixin MPATH: 839
biolink:Pathway GO: 717
biolink:PhenotypicFeature FYPO: 1
biolink:PhenotypicFeature HP: 367
biolink:PhenotypicQuality CHEBI: 123
biolink:PhenotypicQuality HP: 16375
biolink:PhenotypicQuality MONDO: 41
biolink:PhenotypicQuality MP: 13605
biolink:PhenotypicQuality WBPhenotype: 2633
biolink:PhenotypicQuality XPO: 20061
biolink:PhenotypicQuality ZP: 36602
biolink:Protein CHEBI: 21
biolink:RNAProduct CHEBI: 3
biolink:SmallMolecule CHEBI: 118
biolink:affects GOREL: 1
biolink:affects_localization_of GOREL: 1
biolink:biological_role_mixin CHEBI: 366
biolink:causes GOREL: 1
biolink:causes MONDO: 2
biolink:chemical_role_mixin CHEBI: 40
biolink:close_match CHEBI: 2
biolink:coexists_with UBERON: 18
biolink:contributes_to MONDO: 1
biolink:derives_from CHEBI: 1
biolink:disease_has_basis_in MONDO: 2
biolink:disease_has_location MONDO: 1
biolink:has_attribute OMIM: 1
biolink:has_part MONDO: 1
biolink:increases_degradation_of GOREL: 1
biolink:is_metabolite_of CHEBI: 42
biolink:located_in GOREL: 1
biolink:manifestation_of OMIM: 1
biolink:part_of CHEBI: 1
biolink:part_of MONDO: 1
biolink:part_of UBERON: 1
biolink:related_to CHEBI: 2
biolink:related_to GOREL: 2
biolink:related_to MONDO: 1
biolink:related_to UBERON: 1
biolink:same_as MONDO: 1
biolink:subclass_of CHEBI: 1
biolink:subclass_of GO: 1
biolink:superclass_of GO: 1
biolink:superclass_of OMIM: 1
biolink:treated_by MONDO: 1
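
A table like the one above can be reproduced by counting (category, CURIE prefix) pairs over the node records. This is a minimal sketch in plain Python, not the query that actually produced the table; the function names and the example node IDs are hypothetical, and each node is assumed to have a single string `category`.

```python
from collections import Counter


def curie_prefix(curie):
    """Return the CURIE prefix with its colon, e.g. 'HP:' for 'HP:0000118'."""
    prefix, sep, _ = curie.partition(":")
    return prefix + sep


def count_category_prefix(nodes):
    """Count nodes per (category, prefix) pair, sorted by category then prefix."""
    counts = Counter(
        (node["category"], curie_prefix(node["id"])) for node in nodes
    )
    return sorted(counts.items())


if __name__ == "__main__":
    # Hypothetical example records, for illustration only.
    example_nodes = [
        {"id": "HP:0000001", "category": "biolink:PhenotypicQuality"},
        {"id": "MP:0000001", "category": "biolink:PhenotypicQuality"},
        {"id": "GO:0000001", "category": "biolink:Cell"},
    ]
    for (category, prefix), n in count_category_prefix(example_nodes):
        print(category, prefix, n)
```

Rows with a small count under an unexpected prefix (for example a single GO node under biolink:Cell) are the ones most likely to need manual QC.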

kevinschaper commented 1 year ago

I think it's likely that all of these should be PhenotypicFeature?

category prefix count(*)
biolink:PhenotypicQuality HP: 16375
biolink:PhenotypicQuality MP: 13605
biolink:PhenotypicQuality WBPhenotype: 2633
biolink:PhenotypicQuality XPO: 20061
biolink:PhenotypicQuality ZP: 36602

...but I assume these need a deeper look:

category prefix count(*)
biolink:PhenotypicQuality CHEBI: 123
biolink:PhenotypicQuality MONDO: 41

caufieldjh commented 1 year ago

Thanks @kevinschaper! Yes, I believe the majority of those phenotype nodes should be PhenotypicFeature. The 123 CHEBI PhenotypicQuality nodes are attributes so they should probably all be ChemicalRole. Same for the MONDO PhenotypicQuality nodes - they're attributes, so perhaps generic Attribute would work.
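
The suggestion above could be expressed as a prefix-based remapping rule. This is only a sketch of the proposal as stated in this thread, not code from kg-phenio or universalizer: the function and constant names are hypothetical, and the CHEBI and MONDO targets are the tentative choices mentioned above ("probably" ChemicalRole, "perhaps" Attribute).

```python
# Prefixes whose PhenotypicQuality nodes are expected to be phenotypes.
PHENOTYPE_PREFIXES = {"HP", "MP", "WBPhenotype", "XPO", "ZP"}

# Tentative targets for the remaining PhenotypicQuality nodes (assumptions).
PREFIX_OVERRIDES = {
    "CHEBI": "biolink:ChemicalRole",
    "MONDO": "biolink:Attribute",
}


def proposed_category(curie, category):
    """Return the proposed category for a node; only PhenotypicQuality is remapped."""
    if category != "biolink:PhenotypicQuality":
        return category
    prefix = curie.split(":", 1)[0]
    if prefix in PHENOTYPE_PREFIXES:
        return "biolink:PhenotypicFeature"
    # Unknown prefixes keep their current category so they can be reviewed by hand.
    return PREFIX_OVERRIDES.get(prefix, category)
```

Since only "the majority" of the phenotype nodes are expected to be PhenotypicFeature, a blanket rule like this would still need spot checks.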

kevinschaper commented 1 year ago

I'm adding some filtering code in monarch-ingest to exclude nodes with invalid categories. It looks like I'm removing 42029 nodes, with categories:

'biolink:disease_has_location', 
'biolink:affects_localization_of', 
'biolink:disease_has_basis_in', 
'biolink:superclass_of', 
'biolink:has_part', 
'biolink:derives_from', 
'biolink:Occurrent', 
'biolink:related_to', 
'biolink:has_attribute', 
'biolink:same_as', 
'biolink:manifestation_of', 
'biolink:increases_degradation_of', 
'biolink:PathologicalEntityMixin', 
'biolink:located_in', 
'biolink:part_of', 
'biolink:biological_role_mixin', 
'biolink:chemical_role_mixin', 
'biolink:close_match', 
'biolink:affects', 
'biolink:subclass_of', 
'biolink:treated_by', 
'biolink:MacromolecularComplexMixin', 
'biolink:causes', 
'biolink:coexists_with', 
'biolink:contributes_to', 
'biolink:ChemicalSubstance', 
'biolink:is_metabolite_of'

matentzn commented 1 year ago

Wait, you probably shouldn't "exclude nodes with invalid categories". Wouldn't it be more correct to say that you should "include nodes with at least one valid category"?

How do you determine "invalid category"?

kevinschaper commented 1 year ago

This is what I'm doing right now:

    valid_node_categories = {f"biolink:{camelcase(cat)}" for cat in biolink_model_schema.class_descendants("named thing")}
    phenio_node_categories = set(nodes_df['category'].unique())
    invalid_node_categories = phenio_node_categories - valid_node_categories
    if invalid_node_categories:
        logger.error(f"Invalid node categories: {invalid_node_categories}")
        invalid_node_categories_df = nodes_df[nodes_df['category'].isin(invalid_node_categories)]
        logger.error(f"Removing {len(invalid_node_categories_df)} nodes with invalid categories")
        nodes_df = nodes_df[~nodes_df['category'].isin(invalid_node_categories)]

This would definitely fail on a category passed as a list, though none are right now.
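
One way to cover both points raised in this thread (keep nodes with at least one valid category, and tolerate a category passed as a list) is sketched below in plain Python rather than pandas. The function names, the node dicts, and the example IDs are hypothetical. Splitting strings on "|" assumes the usual KGX TSV convention for multivalued fields.

```python
def as_category_list(value):
    """Normalize a category cell (None, string, or iterable) to a list of strings."""
    if value is None:
        return []
    if isinstance(value, str):
        # Assumption: multivalued cells are "|"-delimited, as in KGX TSV files.
        return [part for part in value.split("|") if part]
    return list(value)


def keep_nodes_with_valid_category(nodes, valid_categories):
    """Split nodes into (kept, dropped); a node is kept if any category is valid."""
    kept, dropped = [], []
    for node in nodes:
        categories = as_category_list(node.get("category"))
        if any(cat in valid_categories for cat in categories):
            kept.append(node)
        else:
            dropped.append(node)
    return kept, dropped


if __name__ == "__main__":
    # Hypothetical example data, for illustration only.
    valid = {"biolink:Disease", "biolink:PhenotypicFeature"}
    nodes = [
        {"id": "MONDO:0000001", "category": "biolink:Disease"},
        {"id": "GOREL:0000001", "category": "biolink:affects"},
    ]
    kept, dropped = keep_nodes_with_valid_category(nodes, valid)
    print(len(kept), "kept;", len(dropped), "dropped")
```

Returning the dropped nodes instead of only logging a count makes it easier to review what was removed.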

Here's a cut-sort-uniq of the current categories in the node file:

109670 biolink:PhenotypicQuality
39246 biolink:Occurrent
28469 biolink:GrossAnatomicalStructure
25967 biolink:Disease
21547 biolink:NamedThing
17809 biolink:Cell
12285 biolink:AnatomicalEntity
5285 biolink:CellularComponent
3399 biolink:MolecularEntity
2555 biolink:BiologicalProcess
2104 biolink:MacromolecularComplexMixin
1321 biolink:CellularOrganism
1176 biolink:MolecularActivity
 839 biolink:PathologicalEntityMixin
 787 biolink:Protein
 720 biolink:Pathway
 380 biolink:PhenotypicFeature
 379 biolink:biological_role_mixin
 304 biolink:Virus
 270 biolink:BehavioralFeature
 232 biolink:LifeStage
 171 biolink:ClinicalModifier
 168 biolink:coexists_with
 104 biolink:related_to
  85 biolink:InformationContentEntity
  81 biolink:ChemicalEntity
  74 biolink:has_attribute
  71 biolink:SmallMolecule
  55 biolink:part_of
  42 biolink:located_in
  42 biolink:is_metabolite_of
  40 biolink:chemical_role_mixin
  35 biolink:GeneticInheritance
  34 biolink:increases_molecular_modification_of
  33 biolink:decreases_molecular_modification_of
  32 biolink:affects_molecular_modification_of
  31 biolink:affects
  29 biolink:has_part
  29 biolink:causes
  28 biolink:decreases_activity_of
  26 biolink:treats
  26 biolink:subclass_of
  21 biolink:Onset
  21 biolink:NucleicAcidEntity
  20 biolink:has_participant
  19 biolink:increases_activity_of
  18 biolink:directly_interacts_with
  16 biolink:has_input
  16 biolink:EvidenceType
  15 biolink:temporally_related_to
  15 biolink:participates_in
  15 biolink:occurs_in
  14 biolink:GeographicExposure
  14 biolink:ChemicalSubstance
  13 biolink:RNAProduct
  12 biolink:superclass_of
  12 biolink:OrganismalEntity
  10 biolink:has_output
  10 biolink:close_match
  10 biolink:Activity
   8 biolink:precedes
   8 biolink:SequenceFeature
   8 biolink:Agent
   7 biolink:synonym
   7 biolink:produces
   7 biolink:location_of
   7 biolink:has_unit
   7 biolink:develops_from
   7 biolink:Transcript
   6 biolink:same_as
   6 biolink:overlaps
   6 biolink:negatively_regulates
   6 biolink:caused_by
   6 biolink:actively_involved_in
   6 biolink:Procedure
   6 biolink:PhysiologicalProcess
   6 biolink:PathologicalProcess
   5 biolink:regulates
   5 biolink:quantifier_qualifier
   5 biolink:positively_regulates
   5 biolink:physically_interacts_with
   5 biolink:interacts_with
   5 biolink:disrupts
   5 biolink:derives_from
   5 biolink:contributes_to
   5 biolink:biomarker_for
   5 biolink:affects_activity_of
   5 biolink:SequenceVariant
   5 biolink:ProteinFamily
   5 biolink:PopulationOfIndividualOrganisms
   5 biolink:Device
   5 biolink:Cohort
   4 biolink:preceded_by
   4 biolink:manifestation_of
   4 biolink:has_route
   4 biolink:has_quantitative_value
   4 biolink:has_phenotype
   4 biolink:has_gene_product
   4 biolink:entity_positively_regulates_entity
   4 biolink:derives_into
   4 biolink:affects_transport_of
   4 biolink:Phenomenon
   4 biolink:OrganismTaxon
   4 biolink:Gene
   4 biolink:Drug
   4 biolink:Behavior
   3 biolink:treated_by
   3 biolink:prevents
   3 biolink:p_value
   3 biolink:mechanism_of_action
   3 biolink:lacks_part
   3 biolink:increases_degradation_of
   3 biolink:in_taxon
   3 biolink:id
   3 biolink:homologous_to
   3 biolink:gene_associated_with_condition
   3 biolink:expressed_in
   3 biolink:entity_negatively_regulates_entity
   3 biolink:decreases_response_to
   3 biolink:decreases_degradation_of
   3 biolink:capable_of
   3 biolink:associated_with
   3 biolink:affects_degradation_of
   3 biolink:Publication
   3 biolink:ProteinDomain
   3 biolink:Polypeptide
   3 biolink:MicroRNA
   3 biolink:IndividualOrganism
   3 biolink:Haplotype
   3 biolink:Genome
   3 biolink:GeneProductMixin
   3 biolink:Exon
   3 biolink:Dataset
   3 biolink:BiologicalEntity
   3 biolink:Association
   2 biolink:xref
   2 biolink:transcribed_to
   2 biolink:transcribed_from
   2 biolink:symbol
   2 biolink:summary
   2 biolink:similar_to
   2 biolink:publisher
   2 biolink:orthologous_to
   2 biolink:narrow_match
   2 biolink:model_of
   2 biolink:mentions
   2 biolink:license
   2 biolink:increases_metabolic_processing_of
   2 biolink:has_biomarker
   2 biolink:has_attribute_type
   2 biolink:format
   2 biolink:exact_match
   2 biolink:drug_regulatory_status_world_wide
   2 biolink:disease_has_basis_in
   2 biolink:diagnoses
   2 biolink:decreases_expression_of
   2 biolink:creation_date
   2 biolink:correlated_with
   2 biolink:broad_match
   2 biolink:author
   2 biolink:Treatment
   2 biolink:SiRNA
   2 biolink:PhysicalEntity
   2 biolink:MaterialSample
   2 biolink:Hospitalization
   2 biolink:GeographicLocation
   2 biolink:Genotype
   2 biolink:GenomicEntity
   2 biolink:Event
   2 biolink:ConfidenceLevel
   2 biolink:ChemicalMixture
   2 biolink:ChemicalExposure
   1 category
   1 biolink:xenologous_to
   1 biolink:volume
   1 biolink:version_of
   1 biolink:strand
   1 biolink:start_coordinate
   1 biolink:rights
   1 biolink:retrieved_on
   1 biolink:related_synonym
   1 biolink:related_condition
   1 biolink:reaction_direction
   1 biolink:published_in
   1 biolink:produced_by
   1 biolink:process_positively_regulates_process
   1 biolink:process_negatively_regulates_process
   1 biolink:predisposes
   1 biolink:positively_correlated_with
   1 biolink:phase
   1 biolink:paralogous_to
   1 biolink:pages
   1 biolink:opposite_of
   1 biolink:negatively_correlated_with
   1 biolink:narrow_synonym
   1 biolink:molecularly_interacts_with
   1 biolink:mesh_terms
   1 biolink:longitude
   1 biolink:logical_interpretation
   1 biolink:latitude
   1 biolink:issue
   1 biolink:iso_abbreviation
   1 biolink:is_synonymous_variant_of
   1 biolink:is_splice_site_variant_of
   1 biolink:is_sequence_variant_of
   1 biolink:is_non_coding_variant_of
   1 biolink:is_nearby_variant_of
   1 biolink:is_missense_variant_of
   1 biolink:is_frameshift_variant_of
   1 biolink:is_excipient_of
   1 biolink:is_active_ingredient_of
   1 biolink:iri
   1 biolink:interacting_molecules_category
   1 biolink:increases_uptake_of
   1 biolink:increases_synthesis_of
   1 biolink:increases_splicing_of
   1 biolink:increases_secretion_of
   1 biolink:increases_response_to
   1 biolink:increases_mutation_rate_of
   1 biolink:increases_molecular_interaction
   1 biolink:increases_localization_of
   1 biolink:increases_folding_of
   1 biolink:increases_expression_of
   1 biolink:in_linkage_disequilibrium_with
   1 biolink:has_variant_part
   1 biolink:has_topic
   1 biolink:has_taxonomic_rank
   1 biolink:has_stressor
   1 biolink:has_side_effect
   1 biolink:has_sequence_location
   1 biolink:has_receptor
   1 biolink:has_plasma_membrane_part
   1 biolink:has_nutrient
   1 biolink:has_not_completed
   1 biolink:has_molecular_consequence
   1 biolink:has_increased_amount
   1 biolink:has_evidence
   1 biolink:has_decreased_amount
   1 biolink:has_count
   1 biolink:has_completed
   1 biolink:has_chemical_formula
   1 biolink:has_active_ingredient
   1 biolink:genetically_interacts_with
   1 biolink:genetic_association
   1 biolink:gene_product_of
   1 biolink:expresses
   1 biolink:exact_synonym
   1 biolink:exacerbates
   1 biolink:entity_regulates_entity
   1 biolink:end_coordinate
   1 biolink:enables
   1 biolink:enabled_by
   1 biolink:editor
   1 biolink:distribution_download_url
   1 biolink:disease_has_location
   1 biolink:decreases_uptake_of
   1 biolink:decreases_transport_of
   1 biolink:decreases_synthesis_of
   1 biolink:decreases_stability_of
   1 biolink:decreases_splicing_of
   1 biolink:decreases_secretion_of
   1 biolink:decreases_mutation_rate_of
   1 biolink:decreases_molecular_interaction
   1 biolink:decreases_metabolic_processing_of
   1 biolink:decreases_localization_of
   1 biolink:decreases_folding_of
   1 biolink:created_with
   1 biolink:contributor
   1 biolink:contraindicated_for
   1 biolink:condition_associated_with_gene
   1 biolink:colocalizes_with
   1 biolink:chi_squared_statistic
   1 biolink:chemically_similar_to
   1 biolink:chapter
   1 biolink:broad_synonym
   1 biolink:associated_with_sensitivity_to
   1 biolink:ameliorates
   1 biolink:affects_uptake_of
   1 biolink:affects_synthesis_of
   1 biolink:affects_splicing_of
   1 biolink:affects_secretion_of
   1 biolink:affects_response_to
   1 biolink:affects_mutation_rate_of
   1 biolink:affects_localization_of
   1 biolink:affects_folding_of
   1 biolink:affects_expression_in
   1 biolink:affects_abundance_of
   1 biolink:acts_upstream_of_or_within_negative_effect
   1 biolink:actively_involves
   1 biolink:Zygosity
   1 biolink:TaxonomicRank
   1 biolink:Snv
   1 biolink:SequenceFeatureRelationship
   1 biolink:ReagentTargetedGene
   1 biolink:ProcessedMaterial
   1 biolink:PhenotypicSex
   1 biolink:PairwiseGeneToGeneInteraction
   1 biolink:OrganismAttribute
   1 biolink:NoncodingRNAProduct
   1 biolink:GenotypicSex
   1 biolink:GenomicSequenceLocalization
   1 biolink:GeneToPhenotypicFeatureAssociation
   1 biolink:GeneToGoTermAssociation
   1 biolink:GeneToDiseaseAssociation
   1 biolink:Food
   1 biolink:ExposureEvent
   1 biolink:EnvironmentalProcess
   1 biolink:EnvironmentalFeature
   1 biolink:DrugExposure
   1 biolink:DiseaseOrPhenotypicFeature
   1 biolink:DatasetDistribution
   1 biolink:CodingSequence
   1 biolink:ClinicalMeasurement
   1 biolink:ClinicalAttribute
   1 biolink:ChemicalToPathwayAssociation
   1 biolink:ChemicalToDiseaseOrPhenotypicFeatureAssociation
   1 biolink:ChemicalGeneInteractionAssociation
   1 biolink:CellLine
   1 biolink:BiologicalSex
   1 biolink:Attribute
   1 biolink:ActivityAndBehavior

caufieldjh commented 1 year ago

The predicate categories like part_of are essentially the Biolink mappings, e.g.,

BFO:0000050     biolink:part_of part_of a core relation that holds between a part and its whole Graph
BSPO:0001106    biolink:part_of proximalmost_part_of            Graph
BSPO:0001108    biolink:part_of distalmost_part_of              Graph
BSPO:0001113    biolink:part_of preaxialmost part of            Graph
BSPO:0001115    biolink:part_of postaxialmost part of           Graph
CHEBI:is_substituent_group_from biolink:part_of                 Graph

but biolink:Occurrent is a category assigned to developmental stages in Uberon, behaviors in NBO, and a whole bunch of GO processes, among >30K other nodes. Is there a clear reason to exclude them, other than Occurrent being a very abstract category?
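
To review the two kinds of flagged categories separately, a rough split by naming style may help: Biolink predicates and other slots are written in snake_case, while classes and mixins such as Occurrent are CamelCase. This is a naming heuristic only, with hypothetical function names; it does not check anything against the Biolink Model itself.

```python
def looks_like_class(category):
    """True if the local name after the prefix starts with an uppercase letter."""
    local_name = category.split(":", 1)[-1]
    return bool(local_name) and local_name[0].isupper()


def split_categories(categories):
    """Return (class_like, slot_like) as two sorted lists."""
    class_like = sorted(c for c in categories if looks_like_class(c))
    slot_like = sorted(c for c in categories if not looks_like_class(c))
    return class_like, slot_like
```

Under this split, biolink:part_of lands in the slot-like group, while biolink:Occurrent and biolink:PathologicalEntityMixin land in the class-like group and still need a decision of their own.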

kevinschaper commented 1 year ago

Oh, maybe biolink:Occurrent is newer in the model and I'm hitting that error because I need to upgrade (the ingest is currently at 3.1.1).

caufieldjh commented 1 year ago

It looks like all of MPATH comes in as biolink:PathologicalEntityMixin and that may not be a helpful categorization, but I think it's due to the root being defined here in Biolink: https://github.com/biolink/biolink-model/blob/c65f94c54167268b0d671cd9420a7a60e7a0ec6b/biolink-model.yaml#LL8533C1-L8540C19

caufieldjh commented 11 months ago

As per the KG construction crew meeting on Aug 14, the pre-normalization versions of KG-Phenio appear to have correct categories (at least PhenotypicFeature vs. PhenotypicQuality), but these then get changed by universalizer.

caufieldjh commented 11 months ago

Universalizer 0.0.4 appears to yield the correct node category for phenotypes (PhenotypicFeature) when run locally, as per @kevinschaper. The most recent graph build, for comparison, uses universalizer 0.0.7, so something that changed in that interval (including dependencies) may be responsible for the difference.
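
To narrow down where the categories change, the node categories from two builds (for example pre- and post-normalization, or two universalizer versions) could be compared directly. This is a minimal sketch with hypothetical function names; it assumes each build has already been loaded into a dict mapping node ID to a single category string.

```python
from collections import Counter


def category_changes(before, after):
    """Return {node_id: (old_category, new_category)} for nodes whose category differs."""
    shared_ids = sorted(before.keys() & after.keys())
    return {
        node_id: (before[node_id], after[node_id])
        for node_id in shared_ids
        if before[node_id] != after[node_id]
    }


def summarize_changes(changes):
    """Count how many nodes moved along each (old_category, new_category) transition."""
    return Counter(changes.values())
```

A summary dominated by a single transition, such as PhenotypicFeature to PhenotypicQuality, would point at one mapping rule rather than at scattered per-node differences.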