biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

Clean up structure of 'frequency' slots and associated mixins #1175

Open RichardBruskiewich opened 2 years ago

RichardBruskiewich commented 2 years ago

There are a number of existing slots/mixins/classes relating to - broadly speaking - phenotypic frequency. Recent discussions suggest room for improvement in their definition, structure and relationships.

This issue was originally inspired by some work in Monarch (thus relating to the SRI Reference KG) but may have broader impact on other Biolink Model users (e.g. Translator). A number of ideas are now on the table and are discussed below.

@cmungall @kevinschaper @sierra-moxon

Core Issue

The concept of "phenotypic frequency" (in the broad sense, perhaps including genetic and non-genetic, disease-related and non-disease-related expressed features of biological systems) is currently represented in the Biolink Model in a heterogeneous and, one might say, somewhat inconsistent, incomplete or inefficient manner.

This is likely a reflection of the diversity of semantics of the concept within biology and, more specifically, within projects currently using the Biolink Model (e.g. Monarch, Translator).

Data with phenotypic frequency comes into knowledge graphs from various sources (e.g. HPOA, model organism data, etc.). It may be quantitative (e.g. actual percentages or ratios) or qualified/categorical (e.g. annotated with an HPO term).

The frequency itself may be stated with reference to an entire population or a subsample (a general cohort with controls, or a specific study of patients). The frequency may simply be an observation of the incidence of a phenotype, disease or other feature, or may be made with reference to some knowledge of the underlying genotype (e.g. genetic penetrance and expressivity).

The purpose of this issue is to review the overall representation of the concept within the Biolink Model, with a view towards a more concise, complete and efficient representation. A complementary concern is how best to present frequency annotation computationally, e.g. in various project-relevant knowledge graph representations, including KGX and Python code (i.e. pydantic).

Relevant Current Biolink Models

Types

  frequency value:
    typeof: string
    uri: UO:0000105

  percentage frequency value:
    typeof: double
    uri: UO:0000187

  quotient:
    aliases: [ 'ratio' ]
    typeof: double
    uri: UO:0010006

Slots

  frequency qualifier:
    description: >-
      a qualifier used in a phenotypic association to state how frequent the phenotype is observed in the subject
    is_a: association slot
    range: frequency value

(Note: we ignore several relative frequency association slots here as not specific to phenotypes)

Mixins (abridged definitions)

  relationship quantifier:
    mixin: true

  frequency quantifier:
    is_a: relationship quantifier
    mixin: true
    slots:
      - has count
      - has total
      - has quotient
      - has percentage

where the above slots come from:


  ## Statistics

  aggregate statistic:
    is_a: node property
    abstract: true

  has count:
    description: >-
      number of things with a particular property
    is_a: aggregate statistic
    range: integer
    exact_mappings:
      - LOINC:has_count

  has total:
    description: >-
      total number of things in a particular reference set
    is_a: aggregate statistic
    range: integer

  has quotient:
    is_a: aggregate statistic
    range: double

  has percentage:
    description: >-
      equivalent to has quotient multiplied by 100
    is_a: aggregate statistic
    range: double
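The relationship between the four statistic slots above can be sketched in a few lines of Python. This is only an illustration, not Biolink or KGX code: the function name `complete_statistics` is hypothetical, and the check that the count cannot exceed the total is an assumption that the model itself does not state.

```python
def complete_statistics(has_count: int, has_total: int) -> dict:
    """Derive 'has quotient' and 'has percentage' from a count and a total."""
    if has_total <= 0:
        raise ValueError("has_total must be a positive integer")
    # Assumption (not stated in the model): the count is a subset of the total.
    if not 0 <= has_count <= has_total:
        raise ValueError("has_count must lie between 0 and has_total")
    has_quotient = has_count / has_total
    return {
        "has_count": has_count,
        "has_total": has_total,
        "has_quotient": has_quotient,
        # Per the slot description: has quotient multiplied by 100.
        "has_percentage": has_quotient * 100,
    }


stats = complete_statistics(3, 12)
print(stats["has_quotient"], stats["has_percentage"])
```

Since two of the four values fully determine the others, a data source that supplies all four independently could in principle disagree with itself, which is part of the redundancy this issue is concerned with.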

The frequency quantifier mixin is used directly only in the variant to population association below.

In contrast, the frequency qualifier slot is captured in another mixin:

  frequency qualifier mixin:
    mixin: true
    description: >-
      Qualifier for frequency type associations
    slots:
      - frequency qualifier

  entity to feature or disease qualifiers mixin:
    description: >-
      Qualifiers for entity to disease or phenotype associations.
    mixin: true
    is_a: frequency qualifier mixin
    slots:
      - severity qualifier
      - onset qualifier

  entity to phenotypic feature association mixin:
    mixin: true
    is_a: entity to feature or disease qualifiers mixin
    defining_slots:
      - object
    slot_usage:
      object:
        range: phenotypic feature
        values_from: [ 'upheno', 'hp', 'mp', 'wbphenotype' ]
    slots:
      - sex qualifier

  entity to disease association mixin:
    description: >-
      mixin class for any association whose object (target node) is a disease
    mixin: true
    is_a: entity to feature or disease qualifiers mixin

Associations (abridged)

  variant to population association:
    description: >-
      An association between a variant and a population, where the variant has
      particular frequency in the population
    mixins:
      - frequency quantifier
      - frequency qualifier mixin
    slot_usage:
      subject:
        range: sequence variant
        description: >-
          an allele that has a certain frequency in a given population
      object:
        range: population of individual organisms
        description: >-
          the population that is observed to have the frequency
      has quotient:
        description: >-
          frequency of allele in population, expressed as a number with allele
          divided by number in reference population, aka allele frequency
        examples:
          - value: "0.0001666"
      has count:
        description: >-
          number in object population that carry a particular allele, aka allele count
        examples:
          - value: "4"
            description: 4 individuals in gnomad set
      has total:
        description: >-
          number all populations that carry a particular allele, aka allele number
        examples:
          - value: "24014"
            description: 24014 individuals in gnomad set
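As a quick sanity check on the example values above, the stated quotient can be compared against count divided by total. The sketch below assumes a plain dictionary record (not an actual KGX edge) whose values are strings, as they appear in the YAML examples; the helper name `quotient_is_consistent` and the tolerance are hypothetical choices.

```python
import math

# Example values copied from the slot_usage examples above (given there as strings).
edge = {"has_count": "4", "has_total": "24014", "has_quotient": "0.0001666"}


def quotient_is_consistent(record: dict, rel_tol: float = 1e-3) -> bool:
    """Check that the stated quotient matches count / total within a tolerance."""
    count = int(record["has_count"])
    total = int(record["has_total"])
    stated = float(record["has_quotient"])
    return math.isclose(count / total, stated, rel_tol=rel_tol)


print(quotient_is_consistent(edge))
```

The tolerance is needed because the example quotient is a rounded decimal, so an exact equality test against 4 / 24014 would fail.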

Associations linked to the entity to phenotypic feature association mixin:

Associations linked to the entity to disease association mixin:

General Questions, Observations and Concerns

RichardBruskiewich commented 1 year ago

Model Revision Ideas

  1. Add 'frequency term' alongside 'frequency value' and designate it typeof: uriorcurie
  2. Consider somehow converting the 'quotient' model to a 2-tuple numerator/denominator integer model (I'm not sure how to express this in LinkML... someone please chime in...)
  3. Should we (can we) convert aggregate statistic and its child slots into 'association slot' or just generic 'slot' instances? Is the 'has' prefix in their slot names superfluous, or might we replace 'has' with 'frequency'? We could then rename the frequency quantifier to frequency quantifier mixin, for consistency with the frequency qualifier mixin.
  4. Move the frequency qualifier mixin in entity to feature or disease qualifiers mixin out of the is_a hierarchy and into a mixins list, then add the frequency quantifier mixin alongside it (@sierra-moxon seems to think that this is permitted)
  entity to feature or disease qualifiers mixin:
    description: >-
      Qualifiers for entity to disease or phenotype associations.
    mixin: true
    mixins:
      - frequency qualifier mixin
      - frequency quantifier mixin
    slots:
      - severity qualifier
      - onset qualifier

This will automatically integrate the frequency quantifier mixin into all of the same association definitions as the frequency qualifier mixin.
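The propagation effect of idea 4 can be mimicked with plain Python dataclasses and multiple inheritance. This is a rough analogy only: the class names below are hand-written stand-ins for the mixins discussed here, not the Python classes actually generated from the Biolink Model, and LinkML's own mixin resolution may differ in detail.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class FrequencyQualifierMixin:
    frequency_qualifier: Optional[str] = None


@dataclass
class FrequencyQuantifierMixin:
    has_count: Optional[int] = None
    has_total: Optional[int] = None
    has_quotient: Optional[float] = None
    has_percentage: Optional[float] = None


# Both frequency mixins are mixed in side by side, as in the proposed YAML.
@dataclass
class EntityToFeatureOrDiseaseQualifiersMixin(
    FrequencyQualifierMixin, FrequencyQuantifierMixin
):
    severity_qualifier: Optional[str] = None
    onset_qualifier: Optional[str] = None


# A downstream mixin inherits the quantifier slots without further changes.
@dataclass
class EntityToPhenotypicFeatureAssociationMixin(
    EntityToFeatureOrDiseaseQualifiersMixin
):
    sex_qualifier: Optional[str] = None


slot_names = [f.name for f in fields(EntityToPhenotypicFeatureAssociationMixin)]
print(slot_names)
```

In this analogy the downstream class ends up with both the qualifier slot and the four quantifier slots, with no edits to its own definition.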

I'm not sure if this is a complete set of ideas, but it's a start. Open for comments!
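On idea 2 above, here is a minimal Python sketch of what a numerator/denominator pair could look like on the code side. It says nothing about how to express this in LinkML, which remains an open question; the `FrequencyRatio` name is hypothetical.

```python
from fractions import Fraction
from typing import NamedTuple


class FrequencyRatio(NamedTuple):
    """A frequency kept as the two observed integers rather than one double."""

    numerator: int    # e.g. the allele count
    denominator: int  # e.g. the allele number

    def as_quotient(self) -> float:
        return self.numerator / self.denominator


ratio = FrequencyRatio(numerator=4, denominator=24014)

# The standard library's Fraction reduces to lowest terms, which would
# discard the originally observed count and total.
reduced = Fraction(ratio.numerator, ratio.denominator)
print(ratio, reduced)
```

One design point this brings out: a reduced fraction (2/12007 here) no longer records the cohort size, so a plain integer pair seems closer to the intent of has count and has total than a normalised ratio type.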