ave-dcd / mave_vocabulary

JSON Schema representation of MAVE controlled vocabulary terms
Creative Commons Attribution 4.0 International

MAVE Minimum Information Model

JSON Schema for validating MAVE experiment metadata

Purpose: To provide an overarching organization and definitions for terms relevant to technology development and data repositories associated with the Atlas of Variant Effects Alliance. This controlled vocabulary and standard is intended to give structure to the minimum required information for data and metadata sharing by scientists using variant effect mapping technology.

How to use this repository

This repository contains an implementation of the schema described in the Atlas of Variant Effects Alliance minimum information model for describing a multiplexed assay experiment.

Minimum standard reporting should include information on (1) the means and characteristics of genetic perturbation, (2) the phenotypic assay employed to identify variant effects, (3) the cellular and environmental context(s) in which the assays were carried out, and (4) the sequencing strategy for variant-effect associations.

The schema defines a set of required and optional fields and possible values that can be used to validate a minimum information document. The implementation is found in the schema directory.

In addition to the structure of the minimum information model, the schema also defines controlled vocabulary terms for describing one of these experiments.

The examples directory contains example documents describing real experiments, as well as a simple Python script that runs the schema validation using jsonschema. Implementations of the JSON Schema standard are also available in many other languages.

Please note that although we are using the JSON Schema standard, the schema source file is written in YAML so that it is easier for humans to read and write. It is converted to JSON using the provided Makefile.

Reading the schema

The schema directory contains JSON and YAML representations of the minimum information standard and controlled vocabulary expressed as JSON Schema. There are multiple levels of required information that can be browsed hierarchically. Most fields include a description that details the intention of that field and the type of information that is to be provided.

For many fields, there is an enumerated list of valid values corresponding to the controlled vocabulary terms that must be used to describe the experiment.
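As a rough illustration of browsing the enumerated terms programmatically, the sketch below walks a JSON Schema dictionary and collects every enum list by property path. It assumes a schema built only from nested properties; it does not follow $ref, items, or combinators such as oneOf, which the real schema may use. The function name collect_enums and the toy_schema fragment (including its placeholder terms) are hypothetical and are not taken from the repository.

```python
from typing import Any, Dict, List


def collect_enums(schema: Dict[str, Any], path: str = "") -> Dict[str, List[Any]]:
    """Map each property path to its enumerated values, if it has any."""
    found: Dict[str, List[Any]] = {}
    if "enum" in schema:
        found[path or "<root>"] = list(schema["enum"])
    # Recurse into nested object properties only (no $ref or items handling).
    for name, subschema in schema.get("properties", {}).items():
        child = f"{path}.{name}" if path else name
        found.update(collect_enums(subschema, child))
    return found


# Hypothetical fragment for illustration only. The enum values are
# placeholders, not terms from the real schema.
toy_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "phenotypicAssay": {
            "type": "object",
            "properties": {
                "replication": {
                    "type": "object",
                    "properties": {
                        "type": {"enum": ["placeholder term A", "placeholder term B"]}
                    },
                }
            },
        },
    },
}

for field, terms in collect_enums(toy_schema).items():
    print(field, "->", terms)
```

With the real schema, the dictionary could instead be loaded from the JSON file in the schema directory using json.load.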

The general schema structure and terms are also described below. The YAML documents in the schema directory should be considered the authoritative structure and source of information where discrepancies exist.

Applying the schema to your datasets

Unless you are an experienced YAML user and can read the schema/experiment.yml file yourself, we recommend that you choose the most closely related example file as a starting point and then modify it as needed.

The repository currently contains three examples:

The schema starts with some descriptive metadata, such as the title and abstract. These should describe the experimental dataset reported in a study, which may optionally reference a published document with a different title. The title and abstract are required properties.

The next section (document) describes a publication associated with the experiment (if any). This part of the schema is optional, but if used, must minimally include a ref property with a URI (such as a DOI) linking to the publication.

The variantLibrary and phenotypicAssay sections describe the experiment that was performed, and both are required. Each has several subsections that provide structure for detailing the important experimental design decisions captured by the schema. We refer users to the examples and the list of controlled vocabulary terms below to help complete these sections, as they will be different for each experiment.
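Before running full schema validation, it can help to check the top-level layout of a draft document. The sketch below is only a pre-check of the rules stated in this section (four required properties, and a ref inside document when document is present); it is not a substitute for validating against the schema. The function name top_level_problems, the draft contents, and the placeholder DOI are hypothetical.

```python
from typing import Any, Dict, List

# Required top-level properties, as described in this section.
REQUIRED_TOP_LEVEL = ("title", "abstract", "variantLibrary", "phenotypicAssay")


def top_level_problems(doc: Dict[str, Any]) -> List[str]:
    """Return human-readable problems with the top-level layout of a document."""
    problems = [
        f"missing required property: {key}"
        for key in REQUIRED_TOP_LEVEL
        if key not in doc
    ]
    # "document" is optional, but if present it must carry a "ref" URI.
    if "document" in doc:
        publication = doc["document"]
        if not isinstance(publication, dict) or not publication.get("ref"):
            problems.append("document is present but has no ref")
    return problems


# Placeholder draft; the nested sections are left empty on purpose.
draft = {
    "title": "Example variant effect map",
    "abstract": "Placeholder abstract.",
    "document": {"ref": "https://doi.org/10.0000/placeholder"},
    "variantLibrary": {},
    "phenotypicAssay": {},
}

print(top_level_problems(draft))
```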

Note: We anticipate that the standard will be adopted by established resources such as MaveDB that will provide users with the ability to download a minimum information file after data deposition.

Generating sequence identifiers

Some examples (e.g. examples/Seuma_2022.yml) include target sequence identifiers and hashes. These values were generated according to the GA4GH VRS v1.3 and refGet standards.

Generating these stable identifiers is not required but is recommended, particularly for in-vitro construct libraries.
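For orientation, the truncated digest used by these standards can be sketched with the Python standard library: a SHA-512 digest, truncated to 24 bytes and base64url-encoded. The function names below are hypothetical, and upper-casing the sequence before hashing is an assumption about normalization that should be checked against the standards. For real submissions, an established implementation of the standards is the safer choice.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_identifier(sequence: str) -> str:
    """Build an 'SQ.'-prefixed identifier for a sequence string.

    Assumption: the sequence is upper-cased and ASCII-encoded before hashing.
    """
    return "SQ." + sha512t24u(sequence.upper().encode("ascii"))


print(sequence_identifier("ACGT"))
```

Because 24 bytes encode to exactly 32 base64 characters, the digest never carries padding characters.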

Controlled vocabulary terms

Overview of ontologies and identifiers

Concept codes used by the schema follow the Coding model, which describes concepts as objects with a code and label used by a system (or version of a system).

For describing assay readouts, we recommend the use of terms from the Ontology for Biomedical Investigations.

For describing human diseases relevant to the assay, we recommend using terms from OMIM or the Mondo Disease Ontology.

For describing human cell lines, we use terms from the Cell Line Ontology, where available.

We encourage users to provide an NCBI Taxonomy ID that specifically denotes the organism (including strain, where applicable).

Variant Library

This section describes the scope and characteristics of a variant library: a collection of sequence variants for a MAVE experiment that are derived from a common target sequence.

Target sequences

A collection of sequences used as references from which all variants in the library are defined. This collection is defined as a set of ReferenceSequence objects, each defined by the following properties:

id: an identifier for the sequence.
sha512t24u: the GA4GH SQ. identifier.
sequence: the literal sequence as a string of IUPAC single-character codes.
sequenceAlphabet: one of na (nucleic acids) or aa (amino acids), used to interpret the IUPAC character codes in the sequence.
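The relationship between sequence and sequenceAlphabet can be illustrated with a small consistency check. This is a sketch under assumptions: the character sets below (IUPAC nucleotide codes, and the 20 standard amino acids plus B, J, O, U, X, Z and "*") are a guess at what a checker might accept and may differ from what the schema allows. The function name is hypothetical.

```python
from typing import Any, Dict, List

# Assumed character sets; the real schema may accept a different set.
ALPHABETS = {
    "na": set("ACGTURYSWKMBDHVN"),
    "aa": set("ACDEFGHIKLMNPQRSTVWYBJOUXZ*"),
}


def reference_sequence_problems(ref: Dict[str, Any]) -> List[str]:
    """Check that a ReferenceSequence's sequence matches its declared alphabet."""
    alphabet = ref.get("sequenceAlphabet")
    if alphabet not in ALPHABETS:
        return ["sequenceAlphabet must be 'na' or 'aa'"]
    sequence = str(ref.get("sequence", "")).upper()
    unexpected = sorted(set(sequence) - ALPHABETS[alphabet])
    if unexpected:
        return [f"characters not in the '{alphabet}' alphabet: {''.join(unexpected)}"]
    return []


# Placeholder target sequence for illustration.
target = {"id": "target1", "sequence": "ACGTN", "sequenceAlphabet": "na"}
print(reference_sequence_problems(target))
```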

Library scope

The variant library should be defined by the functional scope of DNA elements introduced into the library. DNA elements can have known or unknown functions. Example functions include a gene, an exon or set of exons included in a transcript, a set of enhancers, a set of repressors, etc.

We define the scope type using the following controlled vocabulary terms:

Libraries may be further described using the description property. The description field must be populated for any library of type non-coding or other (e.g. tRNA libraries).

Library generation method

The methods used to generate the library. A library may create and integrate an in vitro construct or directly edit an endogenous locus. The library generation method is defined by its type, which may be one of:

In-vitro construct library method

A methodology for generating and integrating an exogenous variant library.

For in-vitro constructs, system is one of the following controlled vocabulary terms:

In addition, integration refers to the mechanism for integration or expression of an exogenous construct and is one of the following controlled vocabulary terms:

system and integration are required properties. description may be used to further describe the generation method system and integration parameters, and is required if the system is set to other.

Endogenous locus library method

A methodology for generating a variant library at an endogenous locus.

For endogenous editing, system refers to the CRISPR/Cas system used, and is one of the following controlled vocabulary terms:

In addition, mechanism is used to define the functional mechanism of the method, and is one of the following controlled vocabulary terms:

system and mechanism are required properties. description may be used to further describe the generation method system and mechanism parameters.
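The required properties for the two generation methods differ only in their second field, plus one value-dependent rule for in-vitro constructs. The sketch below encodes just those rules. The labels "in_vitro" and "endogenous" are local names invented for this example and are not the schema's type values, and the system and integration values in the usage lines are placeholders.

```python
from typing import Any, Dict, List

# Local labels for this sketch only; not the schema's controlled terms.
REQUIRED_BY_KIND = {
    "in_vitro": ("system", "integration"),
    "endogenous": ("system", "mechanism"),
}


def generation_method_problems(kind: str, method: Dict[str, Any]) -> List[str]:
    """Check required properties for a library generation method."""
    problems = [
        f"missing required property: {key}"
        for key in REQUIRED_BY_KIND[kind]
        if not method.get(key)
    ]
    # For in-vitro constructs, description becomes required when system is "other".
    if (
        kind == "in_vitro"
        and method.get("system") == "other"
        and not method.get("description")
    ):
        problems.append("description is required when system is 'other'")
    return problems


print(generation_method_problems("in_vitro", {"system": "other", "integration": "placeholder"}))
print(generation_method_problems("endogenous", {"system": "placeholder"}))
```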

Delivery method

The delivery method specifies how the variant induction machinery and/or construct was delivered to the cell/organism (e.g. viral transduction, electroporation, transfection and MOI).

The delivery method is specified by the type property and must be one of the following controlled vocabulary terms:

The type property is required. Additional detail about the delivery method may be provided with the description property.

Phenotypic assay

A physical adjudication of a model system that allows for systematic interrogation of a functional read-out for a large number of genetic variants. Examples include cell size and its mode of adjudication, action potential characteristic(s) and their mode of measurement, expression of a particular factor and its mode of measurement (FACS, sc-RNA-seq), or transcript expression (bulk RNA-seq).

Dimensionality

Dimensionality defines how many phenotypes and of what complexity are included in the map.

Dimensionality is primarily defined by its type, which must be one of the following controlled vocabulary terms:

where single-dimensional data refers to experiments with a single dimension (e.g. FACS fluorescence from a single protein was used), high-dimensional data refers to experiments with multiple dimensions (e.g. ML/AI enabled cell imaging/classification), and combined functional data refers to experiments where multiple phenotypic assays were combined to make a map.

The type property is required. Additional information about the dimensionality of an experiment may be provided using the description property.

Replication

The replication performed for the assay is defined by its type, which must be one of the following controlled vocabulary terms:

The type property is required. Additional detail about the replication method may be provided with the description property.

Method

The assay method, defining the molecular properties interrogated by the experiment. Terms are derived from the OBI subtree with root OBI_0000070 (“assay”) where appropriate. Term mappings to OBI concept identifiers are available in the concept vocabulary TSV file. The method is specified by the type property, which must be one of the following controlled vocabulary terms:

Relevance

The diseases or biological processes the assay is relevant to. Relevance is specified by an array of Coding objects (see note). We recommend describing relevance using terms from OMIM or the Mondo Disease Ontology.

Model system

The model system context that influences expression of the phenotype. The model system is specified by the type property and must be one of the following controlled vocabulary terms:

We recommend that cell lines are further described by relevant concepts using the codings array of Coding objects (see note). We recommend that cell lines are described using the Cell Line Ontology where applicable. Some commonly used cell lines and model systems are listed below:

Cell                        CLO Term        NCBI Taxonomy ID
Yeast                       n/a             4932
HEK293T                     37372 or 37373  9606
HAP1                        missing         9606
HeLa                        3684            9606
E. coli                     n/a             562
iPSC-derived                37308           9606
C. elegans                  n/a             6239
C. savignyi                 n/a             51511
D. melanogaster             n/a             7227
HepG2                       3704            9606
Human hepatocytes           182             9606
K562                        7050            9606
Mouse embryonic stem cells  37317           10090
NIH3T3                      missing         10090
Bacteriophage               n/a             38018
Cell-free                   n/a             n/a

The type property is required. Additional detail about the model system may be provided with the description property.
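To show how the table might feed the codings array, the sketch below turns a few of its rows into Coding-like dictionaries with system, code and label keys. The system strings and the bare numeric code format are assumptions made for illustration; the schema may expect URIs or prefixed identifiers instead. The MODEL_SYSTEMS mapping and codings_for function are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

# (CLO term, NCBI Taxonomy ID) for a few rows of the table above.
# None stands for "n/a" or "missing".
MODEL_SYSTEMS: Dict[str, Tuple[Optional[str], Optional[str]]] = {
    "HeLa": ("3684", "9606"),
    "K562": ("7050", "9606"),
    "HepG2": ("3704", "9606"),
    "Yeast": (None, "4932"),
    "Cell-free": (None, None),
}


def codings_for(cell: str) -> List[Dict[str, str]]:
    """Build Coding-like dictionaries for a model system listed in the table."""
    clo_term, taxonomy_id = MODEL_SYSTEMS[cell]
    codings: List[Dict[str, str]] = []
    if clo_term is not None:
        # Assumed system name and code format.
        codings.append({"system": "Cell Line Ontology", "code": clo_term, "label": cell})
    if taxonomy_id is not None:
        codings.append({"system": "NCBI Taxonomy", "code": taxonomy_id})
    return codings


print(codings_for("HeLa"))
print(codings_for("Yeast"))
```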

Profiling strategy

The variant profiling strategy used to capture the variant frequency associated with the outcome of the phenotypic assay. The profiling strategy must be one of the following controlled vocabulary terms:

profilingStrategy is a required property.

Sequencing read type

The sequencing read type used in the assay. The read type must be one of the following controlled vocabulary terms:

sequencingReadType is a required property.