start converting `string_serialization`s to enumerations

turbomam commented 1 month ago

`string_serialization` isn't actionable the way `range` or `pattern` is. It's intended as a hint for how a software agent might want to populate these values, given some other inputs.

I think I have dragged my feet on this task because:

- I was worried about the generation of OWL URIs for permissible values that contain punctuation (but this is probably under control now)
- I still don't have any plan for expressing hybrids between enumerations and algebraic expressions, like the `{timestamp}` suffix for `add_recov_method`

  add_recov_method:
    annotations:
      Expected_value: enumeration;timestamp
    description: Additional (i.e. Secondary, tertiary, etc.) recovery methods deployed
      for increase of hydrocarbon recovery from resource and start date for each one
      of them. If "other" is specified, please propose entry in "additional info"
      field
    title: secondary and tertiary recovery methods and start date
    examples:
    - value: Polymer Addition;2018-06-21T14:30Z
    keywords:
    - date
    - method
    - recover
    - secondary
    - start
    string_serialization: '[Water Injection|Dump Flood|Gas Injection|Wag Immiscible
      Injection|Polymer Addition|Surfactant Addition|Not Applicable|other];{timestamp}'
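
One possible direction for the hybrid case, offered only as a sketch and not as a settled plan: LinkML's `structured_pattern` can interpolate named fragments from schema-level `settings` into a regular expression. The `timestamp` setting and its deliberately loose regex are assumptions made for illustration and are not taken from mixs.yaml. Note that this keeps the allowed values as a regex alternation, so they would not become a true enumeration.

```yaml
# Sketch only: the `timestamp` setting and its regex are illustrative assumptions
settings:
  timestamp: "\\d+-\\d+-\\d+T\\d+:\\d+Z"   # loose placeholder, not a full ISO 8601 check

slots:
  add_recov_method:
    range: string
    structured_pattern:
      # {timestamp} is replaced by the setting above when interpolated is true
      syntax: "^(Water Injection|Dump Flood|Gas Injection|Wag Immiscible Injection|Polymer Addition|Surfactant Addition|Not Applicable|other);{timestamp}$"
      interpolated: true
      partial_match: false
```

Under these assumptions the existing example value `Polymer Addition;2018-06-21T14:30Z` would match.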

As of this date, I see 267 `string_serialization`s in mixs.yaml.

turbomam commented 1 month ago

  IFSAC_category:
    annotations:
      Expected_value: IFSAC term
    description: 'The IFSAC food categorization scheme has five distinct levels to
      ...snip...
      dairy products. An IFSAC food category chart is available from https://www.cdc.gov/foodsafety/ifsac/projects/food-categorization-scheme.html
      PMID: 28926300'
    title: Interagency Food Safety Analytics Collaboration (IFSAC) category
    examples:
    - value: Plants:Produce:Vegetables:Herbs:Dried Herbs
    keywords:
    - food
    string_serialization: '{text}'
    slot_uri: MIXS:0001179
    multivalued: true
    required: true

- remove `string_serialization`
- set `range` explicitly to `string` (don't rely on the schema's `default_range`)
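
A minimal sketch of how the slot might look after those two changes. Only the keys relevant to the change are shown; the description and keywords are assumed to stay exactly as they are.

```yaml
IFSAC_category:
  annotations:
    Expected_value: IFSAC term
  title: Interagency Food Safety Analytics Collaboration (IFSAC) category
  examples:
  - value: Plants:Produce:Vegetables:Herbs:Dried Herbs
  range: string          # set explicitly instead of relying on default_range
  # string_serialization: '{text}' has been removed
  slot_uri: MIXS:0001179
  multivalued: true
  required: true
```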

There are 42 instances of this pattern.

It looks like most or all of them are cases in which there's an uncodified `Expected_value` annotation.

turbomam commented 1 month ago

Here's an example where the Expected_value annotation and the string_serialization taken together do provide some useful guidance.

  assembly_name:
    annotations:
      Expected_value: name and version of assembly
    description: Name/version of the assembly provided by the submitter that is used
      in the genome browsers and in the community
    title: assembly name
    examples:
    - value: HuRef, JCVI_ISG_i3_1.0
    in_subset:
    - sequencing
    string_serialization: '{text} {text}'
    slot_uri: MIXS:0000057

turbomam commented 1 month ago

  assembly_qual:
    annotations:
      Expected_value: enumeration
    description: 'The assembly quality category is based on sets of criteria outlined
      ...snip...
      which no genome size could be estimated'
    title: assembly quality
    examples:
    - value: High-quality draft genome
    in_subset:
    - sequencing
    keywords:
    - quality
    string_serialization: '[Finished genome|High-quality draft genome|Medium-quality
      draft genome|Low-quality draft genome|Genome fragment(s)]'
    slot_uri: MIXS:0000056

So create an `AssemblyQualEnum` with these permissible values:

- Finished genome
- High-quality draft genome
- Medium-quality draft genome
- Low-quality draft genome
- Genome fragment(s)
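
A sketch of what that could look like in the schema, assuming the usual top-level `enums` and `slots` sections. Only the keys that change are shown on the slot; per-value descriptions or meanings could be added later.

```yaml
enums:
  AssemblyQualEnum:
    permissible_values:
      Finished genome:
      High-quality draft genome:
      Medium-quality draft genome:
      Low-quality draft genome:
      Genome fragment(s):       # parentheses: one of the punctuation cases for OWL URIs

slots:
  assembly_qual:
    range: AssemblyQualEnum     # replaces the bracketed string_serialization
    slot_uri: MIXS:0000056
```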

turbomam commented 1 month ago

  biol_stat:
    annotations:
      Expected_value: enumeration
    description: The level of genome modification
    title: biological status
    examples:
      - value: natural
    keywords:
      - status
    string_serialization: '[wild|natural|semi-natural|inbred line|breeder''s line|hybrid|clonal
      selection|mutant]'
    slot_uri: MIXS:0000858
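
By analogy with `assembly_qual`, this one looks like a plain enumeration too. A sketch, where the enum name `BiolStatEnum` is a hypothetical choice and not something already in the schema:

```yaml
enums:
  BiolStatEnum:                 # hypothetical name
    permissible_values:
      wild:
      natural:
      semi-natural:
      inbred line:
      "breeder's line":         # apostrophe: another punctuation case to check
      hybrid:
      clonal selection:
      mutant:

slots:
  biol_stat:
    range: BiolStatEnum         # replaces the bracketed string_serialization
    slot_uri: MIXS:0000858
```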

turbomam commented 1 month ago

  compl_score:
    annotations:
      Expected_value: quality;percent completeness
    description: 'Completeness score is typically based on either the fraction of
      markers found as compared to a database or the percent of a genome found as
      compared to a closely related reference genome. High Quality Draft: >90%, Medium
      Quality Draft: >50%, and Low Quality Draft: < 50% should have the indicated
      completeness scores'
    title: completeness score
    examples:
      - value: med;60%
    in_subset:
      - sequencing
    keywords:
      - score
    string_serialization: '[high|med|low];{percentage}'
    slot_uri: MIXS:0000069

turbomam commented 1 month ago

  contam_screen_param:
    annotations:
      Expected_value: enumeration;value or name
    description: Specific parameters used in the decontamination software, such as
      reference database, coverage, and kmers. Combinations of these parameters may
      also be used, i.e. kmer and coverage, or reference database and kmer
    title: contamination screening parameters
    examples:
      - value: kmer
    in_subset:
      - sequencing
    keywords:
      - parameter
    string_serialization: '[ref db|kmer|coverage|combination];{text|integer}'
    slot_uri: MIXS:0000073

GenomicsStandardsConsortium / mixs

start converting `string_serialization`s to enumerations #839