Open kevinschaper opened 1 year ago
I think it's likely that all of these should be PhenotypicFeature?
category | prefix | count(*) |
---|---|---|
biolink:PhenotypicQuality | HP: | 16375 |
biolink:PhenotypicQuality | MP: | 13605 |
biolink:PhenotypicQuality | WBPhenotype: | 2633 |
biolink:PhenotypicQuality | XPO: | 20061 |
biolink:PhenotypicQuality | ZP: | 36602 |
...but I assume these need a deeper look:
category | prefix | count(*) |
---|---|---|
biolink:PhenotypicQuality | CHEBI: | 123 |
biolink:PhenotypicQuality | MONDO: | 41 |
Thanks @kevinschaper! Yes, I believe the majority of those phenotype nodes should be PhenotypicFeature. The 123 CHEBI PhenotypicQuality nodes are attributes so they should probably all be ChemicalRole. Same for the MONDO PhenotypicQuality nodes - they're attributes, so perhaps generic Attribute would work.
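The remapping suggested above could be sketched as a prefix-based lookup. This is only an illustration, not code from monarch-ingest or kg-phenio: `PREFIX_TO_CATEGORY` and `recategorize` are hypothetical names, and the CHEBI and MONDO targets are just the tentative suggestions from this comment.

```python
# Hypothetical prefix-to-category table built from the suggestions in this thread.
PREFIX_TO_CATEGORY = {
    "HP": "biolink:PhenotypicFeature",
    "MP": "biolink:PhenotypicFeature",
    "WBPhenotype": "biolink:PhenotypicFeature",
    "XPO": "biolink:PhenotypicFeature",
    "ZP": "biolink:PhenotypicFeature",
    "CHEBI": "biolink:ChemicalRole",  # tentative, per the comment above
    "MONDO": "biolink:Attribute",     # tentative, per the comment above
}


def recategorize(node_id: str, category: str) -> str:
    """Return a replacement category for nodes currently typed PhenotypicQuality."""
    if category != "biolink:PhenotypicQuality":
        # Only the PhenotypicQuality nodes are in question here.
        return category
    prefix = node_id.split(":", 1)[0]
    # Unknown prefixes keep their current category so they can be reviewed by hand.
    return PREFIX_TO_CATEGORY.get(prefix, category)
```

Nodes with a prefix outside the table are left unchanged rather than guessed at, which keeps the "needs a deeper look" cases visible.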
I'm adding some filtering code in monarch-ingest to exclude nodes with invalid categories. It looks like I'm removing 42029 nodes, with categories:
'biolink:disease_has_location',
'biolink:affects_localization_of',
'biolink:disease_has_basis_in',
'biolink:superclass_of',
'biolink:has_part',
'biolink:derives_from',
'biolink:Occurrent',
'biolink:related_to',
'biolink:has_attribute',
'biolink:same_as',
'biolink:manifestation_of',
'biolink:increases_degradation_of',
'biolink:PathologicalEntityMixin',
'biolink:located_in',
'biolink:part_of',
'biolink:biological_role_mixin',
'biolink:chemical_role_mixin',
'biolink:close_match',
'biolink:affects',
'biolink:subclass_of',
'biolink:treated_by',
'biolink:MacromolecularComplexMixin',
'biolink:causes',
'biolink:coexists_with',
'biolink:contributes_to',
'biolink:ChemicalSubstance',
'biolink:is_metabolite_of'
Wait, you probably don't want to "exclude nodes with invalid categories" -> wouldn't it be more correct to say that you should "include nodes with at least one valid category"?
How do you determine "invalid category"?
This is what I'm doing right now:
valid_node_categories = {f"biolink:{camelcase(cat)}" for cat in biolink_model_schema.class_descendants("named thing")}
phenio_node_categories = set(nodes_df['category'].unique())
invalid_node_categories = phenio_node_categories - valid_node_categories
if invalid_node_categories:
    logger.error(f"Invalid node categories: {invalid_node_categories}")
    invalid_node_categories_df = nodes_df[nodes_df['category'].isin(invalid_node_categories)]
    logger.error(f"Removing {len(invalid_node_categories_df)} nodes with invalid categories")
    nodes_df = nodes_df[~nodes_df['category'].isin(invalid_node_categories)]
Which definitely would fail on a category passed as a list, though none are right now.
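One way to cover both points above (multi-valued categories, and keeping any node with at least one valid category) is sketched below. It assumes a multi-valued category would arrive either as a Python list or as a `|`-delimited string. `split_categories` and `has_valid_category` are hypothetical helpers, and the small `valid` set only stands in for the set derived from the Biolink model.

```python
from typing import List, Set


def split_categories(value) -> List[str]:
    """Normalize a category cell (str, list, or missing) to a list of strings."""
    if value is None or isinstance(value, float):
        # Missing values (None, or NaN from pandas) carry no categories.
        return []
    if isinstance(value, str):
        # Assumption: multi-valued cells are pipe-delimited.
        return [c.strip() for c in value.split("|") if c.strip()]
    return [str(c).strip() for c in value if str(c).strip()]


def has_valid_category(value, valid: Set[str]) -> bool:
    """True if at least one category in the cell is in the valid set."""
    return any(c in valid for c in split_categories(value))


# Stand-in for the set built from class_descendants("named thing").
valid = {"biolink:Disease", "biolink:PhenotypicFeature"}

# With pandas, the filter could then be written as:
# nodes_df = nodes_df[nodes_df["category"].apply(has_valid_category, valid=valid)]
```

This keeps a node whenever any one of its categories is valid, rather than dropping it because one category is not.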
Here's a cut-sort-uniq of current categories in the node file
109670 biolink:PhenotypicQuality
39246 biolink:Occurrent
28469 biolink:GrossAnatomicalStructure
25967 biolink:Disease
21547 biolink:NamedThing
17809 biolink:Cell
12285 biolink:AnatomicalEntity
5285 biolink:CellularComponent
3399 biolink:MolecularEntity
2555 biolink:BiologicalProcess
2104 biolink:MacromolecularComplexMixin
1321 biolink:CellularOrganism
1176 biolink:MolecularActivity
839 biolink:PathologicalEntityMixin
787 biolink:Protein
720 biolink:Pathway
380 biolink:PhenotypicFeature
379 biolink:biological_role_mixin
304 biolink:Virus
270 biolink:BehavioralFeature
232 biolink:LifeStage
171 biolink:ClinicalModifier
168 biolink:coexists_with
104 biolink:related_to
85 biolink:InformationContentEntity
81 biolink:ChemicalEntity
74 biolink:has_attribute
71 biolink:SmallMolecule
55 biolink:part_of
42 biolink:located_in
42 biolink:is_metabolite_of
40 biolink:chemical_role_mixin
35 biolink:GeneticInheritance
34 biolink:increases_molecular_modification_of
33 biolink:decreases_molecular_modification_of
32 biolink:affects_molecular_modification_of
31 biolink:affects
29 biolink:has_part
29 biolink:causes
28 biolink:decreases_activity_of
26 biolink:treats
26 biolink:subclass_of
21 biolink:Onset
21 biolink:NucleicAcidEntity
20 biolink:has_participant
19 biolink:increases_activity_of
18 biolink:directly_interacts_with
16 biolink:has_input
16 biolink:EvidenceType
15 biolink:temporally_related_to
15 biolink:participates_in
15 biolink:occurs_in
14 biolink:GeographicExposure
14 biolink:ChemicalSubstance
13 biolink:RNAProduct
12 biolink:superclass_of
12 biolink:OrganismalEntity
10 biolink:has_output
10 biolink:close_match
10 biolink:Activity
8 biolink:precedes
8 biolink:SequenceFeature
8 biolink:Agent
7 biolink:synonym
7 biolink:produces
7 biolink:location_of
7 biolink:has_unit
7 biolink:develops_from
7 biolink:Transcript
6 biolink:same_as
6 biolink:overlaps
6 biolink:negatively_regulates
6 biolink:caused_by
6 biolink:actively_involved_in
6 biolink:Procedure
6 biolink:PhysiologicalProcess
6 biolink:PathologicalProcess
5 biolink:regulates
5 biolink:quantifier_qualifier
5 biolink:positively_regulates
5 biolink:physically_interacts_with
5 biolink:interacts_with
5 biolink:disrupts
5 biolink:derives_from
5 biolink:contributes_to
5 biolink:biomarker_for
5 biolink:affects_activity_of
5 biolink:SequenceVariant
5 biolink:ProteinFamily
5 biolink:PopulationOfIndividualOrganisms
5 biolink:Device
5 biolink:Cohort
4 biolink:preceded_by
4 biolink:manifestation_of
4 biolink:has_route
4 biolink:has_quantitative_value
4 biolink:has_phenotype
4 biolink:has_gene_product
4 biolink:entity_positively_regulates_entity
4 biolink:derives_into
4 biolink:affects_transport_of
4 biolink:Phenomenon
4 biolink:OrganismTaxon
4 biolink:Gene
4 biolink:Drug
4 biolink:Behavior
3 biolink:treated_by
3 biolink:prevents
3 biolink:p_value
3 biolink:mechanism_of_action
3 biolink:lacks_part
3 biolink:increases_degradation_of
3 biolink:in_taxon
3 biolink:id
3 biolink:homologous_to
3 biolink:gene_associated_with_condition
3 biolink:expressed_in
3 biolink:entity_negatively_regulates_entity
3 biolink:decreases_response_to
3 biolink:decreases_degradation_of
3 biolink:capable_of
3 biolink:associated_with
3 biolink:affects_degradation_of
3 biolink:Publication
3 biolink:ProteinDomain
3 biolink:Polypeptide
3 biolink:MicroRNA
3 biolink:IndividualOrganism
3 biolink:Haplotype
3 biolink:Genome
3 biolink:GeneProductMixin
3 biolink:Exon
3 biolink:Dataset
3 biolink:BiologicalEntity
3 biolink:Association
2 biolink:xref
2 biolink:transcribed_to
2 biolink:transcribed_from
2 biolink:symbol
2 biolink:summary
2 biolink:similar_to
2 biolink:publisher
2 biolink:orthologous_to
2 biolink:narrow_match
2 biolink:model_of
2 biolink:mentions
2 biolink:license
2 biolink:increases_metabolic_processing_of
2 biolink:has_biomarker
2 biolink:has_attribute_type
2 biolink:format
2 biolink:exact_match
2 biolink:drug_regulatory_status_world_wide
2 biolink:disease_has_basis_in
2 biolink:diagnoses
2 biolink:decreases_expression_of
2 biolink:creation_date
2 biolink:correlated_with
2 biolink:broad_match
2 biolink:author
2 biolink:Treatment
2 biolink:SiRNA
2 biolink:PhysicalEntity
2 biolink:MaterialSample
2 biolink:Hospitalization
2 biolink:GeographicLocation
2 biolink:Genotype
2 biolink:GenomicEntity
2 biolink:Event
2 biolink:ConfidenceLevel
2 biolink:ChemicalMixture
2 biolink:ChemicalExposure
1 category
1 biolink:xenologous_to
1 biolink:volume
1 biolink:version_of
1 biolink:strand
1 biolink:start_coordinate
1 biolink:rights
1 biolink:retrieved_on
1 biolink:related_synonym
1 biolink:related_condition
1 biolink:reaction_direction
1 biolink:published_in
1 biolink:produced_by
1 biolink:process_positively_regulates_process
1 biolink:process_negatively_regulates_process
1 biolink:predisposes
1 biolink:positively_correlated_with
1 biolink:phase
1 biolink:paralogous_to
1 biolink:pages
1 biolink:opposite_of
1 biolink:negatively_correlated_with
1 biolink:narrow_synonym
1 biolink:molecularly_interacts_with
1 biolink:mesh_terms
1 biolink:longitude
1 biolink:logical_interpretation
1 biolink:latitude
1 biolink:issue
1 biolink:iso_abbreviation
1 biolink:is_synonymous_variant_of
1 biolink:is_splice_site_variant_of
1 biolink:is_sequence_variant_of
1 biolink:is_non_coding_variant_of
1 biolink:is_nearby_variant_of
1 biolink:is_missense_variant_of
1 biolink:is_frameshift_variant_of
1 biolink:is_excipient_of
1 biolink:is_active_ingredient_of
1 biolink:iri
1 biolink:interacting_molecules_category
1 biolink:increases_uptake_of
1 biolink:increases_synthesis_of
1 biolink:increases_splicing_of
1 biolink:increases_secretion_of
1 biolink:increases_response_to
1 biolink:increases_mutation_rate_of
1 biolink:increases_molecular_interaction
1 biolink:increases_localization_of
1 biolink:increases_folding_of
1 biolink:increases_expression_of
1 biolink:in_linkage_disequilibrium_with
1 biolink:has_variant_part
1 biolink:has_topic
1 biolink:has_taxonomic_rank
1 biolink:has_stressor
1 biolink:has_side_effect
1 biolink:has_sequence_location
1 biolink:has_receptor
1 biolink:has_plasma_membrane_part
1 biolink:has_nutrient
1 biolink:has_not_completed
1 biolink:has_molecular_consequence
1 biolink:has_increased_amount
1 biolink:has_evidence
1 biolink:has_decreased_amount
1 biolink:has_count
1 biolink:has_completed
1 biolink:has_chemical_formula
1 biolink:has_active_ingredient
1 biolink:genetically_interacts_with
1 biolink:genetic_association
1 biolink:gene_product_of
1 biolink:expresses
1 biolink:exact_synonym
1 biolink:exacerbates
1 biolink:entity_regulates_entity
1 biolink:end_coordinate
1 biolink:enables
1 biolink:enabled_by
1 biolink:editor
1 biolink:distribution_download_url
1 biolink:disease_has_location
1 biolink:decreases_uptake_of
1 biolink:decreases_transport_of
1 biolink:decreases_synthesis_of
1 biolink:decreases_stability_of
1 biolink:decreases_splicing_of
1 biolink:decreases_secretion_of
1 biolink:decreases_mutation_rate_of
1 biolink:decreases_molecular_interaction
1 biolink:decreases_metabolic_processing_of
1 biolink:decreases_localization_of
1 biolink:decreases_folding_of
1 biolink:created_with
1 biolink:contributor
1 biolink:contraindicated_for
1 biolink:condition_associated_with_gene
1 biolink:colocalizes_with
1 biolink:chi_squared_statistic
1 biolink:chemically_similar_to
1 biolink:chapter
1 biolink:broad_synonym
1 biolink:associated_with_sensitivity_to
1 biolink:ameliorates
1 biolink:affects_uptake_of
1 biolink:affects_synthesis_of
1 biolink:affects_splicing_of
1 biolink:affects_secretion_of
1 biolink:affects_response_to
1 biolink:affects_mutation_rate_of
1 biolink:affects_localization_of
1 biolink:affects_folding_of
1 biolink:affects_expression_in
1 biolink:affects_abundance_of
1 biolink:acts_upstream_of_or_within_negative_effect
1 biolink:actively_involves
1 biolink:Zygosity
1 biolink:TaxonomicRank
1 biolink:Snv
1 biolink:SequenceFeatureRelationship
1 biolink:ReagentTargetedGene
1 biolink:ProcessedMaterial
1 biolink:PhenotypicSex
1 biolink:PairwiseGeneToGeneInteraction
1 biolink:OrganismAttribute
1 biolink:NoncodingRNAProduct
1 biolink:GenotypicSex
1 biolink:GenomicSequenceLocalization
1 biolink:GeneToPhenotypicFeatureAssociation
1 biolink:GeneToGoTermAssociation
1 biolink:GeneToDiseaseAssociation
1 biolink:Food
1 biolink:ExposureEvent
1 biolink:EnvironmentalProcess
1 biolink:EnvironmentalFeature
1 biolink:DrugExposure
1 biolink:DiseaseOrPhenotypicFeature
1 biolink:DatasetDistribution
1 biolink:CodingSequence
1 biolink:ClinicalMeasurement
1 biolink:ClinicalAttribute
1 biolink:ChemicalToPathwayAssociation
1 biolink:ChemicalToDiseaseOrPhenotypicFeatureAssociation
1 biolink:ChemicalGeneInteractionAssociation
1 biolink:CellLine
1 biolink:BiologicalSex
1 biolink:Attribute
1 biolink:ActivityAndBehavior
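A rough way to triage a listing like the one above is to separate class-style names (CamelCase) from slot-style names (snake_case, the way predicates and other slots are written). This is only a naming heuristic, not a check against the model, and `triage_categories` is a hypothetical helper.

```python
import re
from collections import Counter

# CamelCase local name after the biolink: prefix, e.g. biolink:PhenotypicFeature.
CLASS_STYLE = re.compile(r"^biolink:[A-Z][A-Za-z0-9]*$")


def triage_categories(categories):
    """Count categories and split them into class-style and everything else."""
    counts = Counter(categories)
    class_like = {c: n for c, n in counts.items() if CLASS_STYLE.match(c)}
    other = {c: n for c, n in counts.items() if not CLASS_STYLE.match(c)}
    return class_like, other
```

CamelCase mixins such as `biolink:PathologicalEntityMixin` still land in the class-style bucket, so this does not replace the model-based check. It only makes the predicate-like values easy to pull out.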
The predicate categories like part_of are essentially the Biolink mappings, e.g.,
BFO:0000050 biolink:part_of part_of a core relation that holds between a part and its whole Graph
BSPO:0001106 biolink:part_of proximalmost_part_of Graph
BSPO:0001108 biolink:part_of distalmost_part_of Graph
BSPO:0001113 biolink:part_of preaxialmost part of Graph
BSPO:0001115 biolink:part_of postaxialmost part of Graph
CHEBI:is_substituent_group_from biolink:part_of Graph
but biolink:Occurrent is a category assigned to developmental stages in Uberon, behaviors in NBO, and a whole bunch of GO processes, among >30K other nodes. Is there a clear reason to exclude them, other than Occurrent being a very abstract category?
Oh, maybe biolink:Occurrent is newer in the model and I'm hitting that error because I need to upgrade (the ingest is currently at 3.1.1).
It looks like all of MPATH comes in as biolink:PathologicalEntityMixin and that may not be a helpful categorization, but I think it's due to the root being defined here in Biolink:
https://github.com/biolink/biolink-model/blob/c65f94c54167268b0d671cd9420a7a60e7a0ec6b/biolink-model.yaml#LL8533C1-L8540C19
As per KG construction crew meeting on Aug 14, the pre-normalization versions of KG-Phenio appear to have correct categories (at least PhenotypicFeature vs. PhenotypicQuality) but then get changed by universalizer.
Universalizer 0.0.4 appears to yield the correct node category for phenotypes (PhenotypicFeature) as per @kevinschaper when run locally. The most recent graph build, for comparison, uses universalizer 0.0.7, so something that changed in that interval (including dependencies) may be responsible for the difference.
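To narrow down where the categories change, the node categories before and after normalization could be diffed by node ID. The sketch below assumes both node tables have already been loaded into plain `id -> category` dictionaries. `category_changes` is a hypothetical helper and the sample IDs and categories are made up for illustration.

```python
from collections import Counter


def category_changes(before: dict, after: dict) -> Counter:
    """Count (old_category, new_category) pairs for nodes whose category changed."""
    changes = Counter()
    for node_id, old in before.items():
        new = after.get(node_id)
        # Nodes missing from the normalized output are ignored here.
        if new is not None and new != old:
            changes[(old, new)] += 1
    return changes


# Illustrative data only, not taken from an actual KG-Phenio build.
before = {"HP:0000001": "biolink:PhenotypicFeature", "MONDO:0000001": "biolink:Disease"}
after = {"HP:0000001": "biolink:PhenotypicQuality", "MONDO:0000001": "biolink:Disease"}
print(category_changes(before, after).most_common())
```

Running the same comparison against the output of each universalizer version would show which category pairs differ between them.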
Here are node categories from the merged monarch graph for nodes coming in from kg-phenio.
I think it might be a bit of a group effort to QC.