ga4gh / refget

GA4GH Refget specifications docs
https://ga4gh.github.io/refget

Minimal and extended schemas proposal #50

Open nsheff opened 1 year ago

nsheff commented 1 year ago

We decided to start with two schemas: a minimal schema, posted now as the one implementations should target, and an extended schema, which is still under evaluation to decide whether its attributes should move into the minimal schema. Drafts of both follow for comment and revision:

Minimal seqcol schema

description: "A collection of biological sequences, defined by the GA4GH Sequence Collections standard."
$id: "/schemas/seqcol_base"
version: 0.1.0
type: object
properties:
  lengths:
    type: array
    collated: true
    description: "Number of elements, such as nucleotides or amino acids, in each sequence."
    items:
      type: integer
  names:
    type: array
    collated: true
    description: "Human-readable identifiers of each sequence (e.g. chromosome names or accessions)."
    items:
      type: string
  sequences:
    type: array
    collated: true
    description: "Digests of sequences computed using the GA4GH digest algorithm (sha512t24u)."
    items:
      type: string
  sorted_name_length_pairs:
    type: array
    description: "Sorted digests of names+lengths pairs, computed following the seqcol specification."
    items:
      type: string
required:
  - lengths
  - names
inherent:
  - lengths
  - names
  - sequences
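
To make the `required` and `collated` keywords in the draft above concrete, here is a minimal Python sketch. It assumes that `collated: true` means the arrays line up element by element (one entry per sequence, in the same order), which is how the seqcol specification describes collated attributes as far as I understand it. The helper `check_seqcol` and the toy collection are hypothetical and not part of any refget tooling.

```python
from typing import Any

# Attributes marked `collated: true` in the draft minimal schema above.
COLLATED = ("lengths", "names", "sequences")
# Attributes listed under `required` in the same draft.
REQUIRED = ("lengths", "names")


def check_seqcol(seqcol: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the object passed."""
    problems = []
    for attr in REQUIRED:
        if attr not in seqcol:
            problems.append(f"missing required attribute: {attr}")
    # Assumption: collated arrays are read element-wise, so every collated
    # array that is present must have the same number of entries.
    sizes = {attr: len(seqcol[attr]) for attr in COLLATED if attr in seqcol}
    if len(set(sizes.values())) > 1:
        problems.append(f"collated arrays differ in length: {sizes}")
    return problems


# Toy collection; the names and lengths are made up for illustration.
example = {
    "names": ["chr1", "chr2", "chrM"],
    "lengths": [1000, 800, 16],
}
print(check_seqcol(example))
print(check_seqcol({"names": ["chr1"], "lengths": [1000, 800]}))
```

The first call reports no problems even though `sequences` is absent, because this draft lists only `lengths` and `names` as required. The second call reports the mismatched array lengths.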

Extended seqcol schema

$ref: "/schemas/seqcol_base"
$id: "/schemas/seqcol_extended"
properties:
  masks:
    type: array
    collated: true
    description: "Digests of subsequence masks indicating subsequences to be excluded from an analysis, such as repeats."
    items:
      type: string
  priorities:
    type: array
    collated: true
    description: "Annotation of whether each sequence is a primary or secondary component in the collection."
    items:
      type: boolean
  topologies:
    type: array
    collated: true
    description: "Annotation of whether each sequence represents a linear or other topology."
    items:
      type: string
      enum: ["circular", "linear"]
      default: "linear"
  molecule_types:
    type: array
    collated: true
    description: "Designation of the type of molecule for each sequence, such as RNA, DNA, or protein."
    items:
      type: string
  alphabets:
    type: array
    collated: true
    description: "The set of characters actually present in each sequence."
    items:
      type: string
  alphabet_domains:
    type: array
    collated: true
    description: "The set of characters that could be included in each sequence."
    items:
      type: string
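
The `topologies` entry above combines an `enum` with a `default`. JSON Schema validators generally treat `default` as an annotation and do not insert it into the data, so filling it in is left to the application. The sketch below shows one hypothetical way to do that in Python; `with_topologies` is an illustrative helper, not an API of any seqcol implementation.

```python
ALLOWED_TOPOLOGIES = {"circular", "linear"}  # enum from the draft extended schema
DEFAULT_TOPOLOGY = "linear"                  # default from the draft extended schema


def with_topologies(seqcol: dict) -> dict:
    """Return a copy of seqcol whose `topologies` array is present and valid."""
    n = len(seqcol["names"])
    # Assumption: a missing array means every sequence takes the default value.
    topologies = seqcol.get("topologies", [DEFAULT_TOPOLOGY] * n)
    if len(topologies) != n:
        raise ValueError("topologies must have one entry per sequence")
    bad = sorted(set(topologies) - ALLOWED_TOPOLOGIES)
    if bad:
        raise ValueError(f"unsupported topology values: {bad}")
    return {**seqcol, "topologies": list(topologies)}


# Toy collection; names and lengths are made up for illustration.
filled = with_topologies({"names": ["chr1", "chrM"], "lengths": [1000, 16]})
print(filled["topologies"])
```

Passing an explicit array such as `["linear", "circular"]` keeps it unchanged, while a value outside the enum raises `ValueError`.
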

nsheff commented 1 week ago

The latest minimal schema has been updated to this:

description: "A collection of biological sequences."
type: object
properties:
  lengths:
    type: array
    collated: true
    description: "Number of elements, such as nucleotides or amino acids, in each sequence."
    items:
      type: integer
  names:
    type: array
    collated: true
    description: "Human-readable labels of each sequence (chromosome names)."
    items:
      type: string
  sequences:
    type: array
    collated: true
    items:
      type: string
      description: "Refget sequences v2 identifiers for sequences."
  accessions:
    type: array
    collated: true
    items:
      type: string
      description: "Unique external accessions for the sequences."
required:
  - names
  - lengths
  - sequences
ga4gh:
  inherent:
    - names
    - sequences
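
The thread does not explain the `inherent` list. As I understand the seqcol specification, inherent attributes are the ones that contribute to a collection's digest, so two collections that differ only in non-inherent attributes (here `lengths` or `accessions`) would share an identifier. The Python sketch below illustrates that idea only. `sha512t24u` follows the GA4GH digest definition (SHA-512 truncated to 24 bytes, then base64url), but `inherent_view` and `toy_digest` are hypothetical helpers, the accessions are made up, and the compact sorted JSON is an assumed stand-in for the canonical serialization and layered digesting that the specification requires.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """GA4GH digest: SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def inherent_view(seqcol: dict, inherent: list[str]) -> dict:
    """Keep only the attributes listed under ga4gh.inherent."""
    return {k: v for k, v in seqcol.items() if k in inherent}


def toy_digest(obj) -> str:
    # Assumption: compact, key-sorted JSON stands in for the canonical
    # serialization; this is NOT the normative seqcol digest algorithm.
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return sha512t24u(payload.encode("utf-8"))


INHERENT = ["names", "sequences"]  # from the latest minimal schema above

# Toy collections that differ only in the non-inherent `accessions` array.
seq_ids = ["SQ." + sha512t24u(b"ACGT"), "SQ." + sha512t24u(b"GGCC")]
a = {"names": ["chr1", "chr2"], "lengths": [4, 4], "sequences": seq_ids,
     "accessions": ["acc:0001", "acc:0002"]}
b = {**a, "accessions": ["other:A", "other:B"]}

print(toy_digest(inherent_view(a, INHERENT)) == toy_digest(inherent_view(b, INHERENT)))
print(toy_digest(a) == toy_digest(b))
```

The first comparison prints `True` because only `names` and `sequences` feed the digest. The second prints `False` because digesting the whole object also picks up the differing accessions.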