GMOD / Chado

the GMOD database schema
http://gmod.org/wiki/Chado
Artistic License 2.0

feature_cvterm vs feature_dbxref vs featureprop for feature annotations #74

Open bradfordcondon opened 5 years ago

bradfordcondon commented 5 years ago

Hello,

@mpoelchau and I have been discussing how GFF files for feature annotations are stored via Tripal. As an example, consider a gene that has been annotated with GO terms, KEGG terms, proposed PFAM domains, and InterProScan family annotations.

My understanding of the Chado tables (which I want to emphasize is up for debate) is:

I'll add that this is the most definitive guidance I found in my search of the Chado wiki, in the sequence module manual:

Detailed annotations, such as associations to Gene Ontology (GO) terms or Cell Ontology terms, can be attached to features using the feature_cvterm linking table. This allows multiple ontology terms to be associated with each feature. Provenance data can be attached with the feature_cvtermprop and feature_cvterm_dbxref higher-order linking tables. It is up to the curation policy of each individual Chado database instance to decide which kinds of features will be linked using feature_cvterm. Some may link terms to gene features, others to the distinct gene products (processed RNAs and polypeptides) that are linked to the gene features. Annotations for existing features can also go into the featureprop table using the Chado feature_property ontology (defined in chado/load/etc/feature_property.obo) and the comment or description terms as appropriate. The purpose of the feature property ontology (and the related chado/load/etc/genbank_feature_property.obo file) is to capture terms that are likely to appear in GFF or GenBank sequence files. In theory there is no overlap between these ontologies and the Sequence Ontology.

As for the GFF file holding the annotations:

The GFF3 spec states: Two reserved attributes, Ontology_term and Dbxref, can be used to establish links between a GFF3 feature and a data record contained in another database. Ontology_term is reserved for associations to ontologies, such as the Gene Ontology. Dbxref is used for all other cross references. While there is no firm boundary line between these two concepts, curators tend to treat ontology associations differently and hence ontology terms have been given their own reserved attribute label.

Similarly, NCBI calls most things dbxrefs, using a much broader definition than the one I use above.

Here's the conflict. KEGG terms, for example, are not ontology terms. But when we read the GFF file, we parse Ontology_term attributes into feature_cvterm, Dbxref attributes into feature_dbxref, and everything else into featureprop. So for the annotations to go into feature_cvterm, they would need to be listed in the GFF under Ontology_term.
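
To make the routing described above concrete, here is a minimal Python sketch. It is not Tripal's actual loader code: the function names `parse_attributes` and `route_attributes` are hypothetical, the accessions in the usage note are illustrative values only, and a real loader would likely handle structural tags such as ID and Parent separately rather than treating them as properties.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse GFF3 column 9 ("tag=v1,v2;tag2=v3") into {tag: [values]}."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # GFF3 percent-encodes reserved characters inside values.
        attrs[tag] = [unquote(v) for v in value.split(",")]
    return attrs


def route_attributes(attrs):
    """Assign each attribute value to the Chado table named in the thread.

    Assumption: the routing is decided purely by the GFF3 tag, as the
    post describes, not by the database the accession comes from.
    """
    routed = {"feature_cvterm": [], "feature_dbxref": [], "featureprop": []}
    for tag, values in attrs.items():
        if tag == "Ontology_term":
            routed["feature_cvterm"].extend(values)
        elif tag == "Dbxref":
            routed["feature_dbxref"].extend(values)
        else:
            # Everything else is kept as (property name, value) pairs.
            routed["featureprop"].extend((tag, v) for v in values)
    return routed
```

With a line such as `Ontology_term=GO:0003676;Dbxref=KEGG:K00001`, the KEGG accession lands in feature_dbxref only because of the tag it was written under, which is exactly the conflict: the GFF author's choice of tag decides the table.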

As Monica phrased her doubts:

With GO, I get it - a GO term refers to a formal, accessioned description of a gene function (e.g. http://amigo.geneontology.org/amigo/term/GO:0003676). A GO term does not also refer to a protein sequence - you annotate the protein sequence with the GO term. An InterPro accession is an accessioned ‘signature’ (which is a combo of HMMs, profiles, position-specific scoring matrices or regular expressions), which is annotated by curators with free-text descriptions from the literature. (And they can also be associated with a GO term). As such, I view InterPro domain accessions more as entries within a very authoritative database, rather than a controlled vocabulary. Although perhaps the domain name is enough to call it a controlled vocabulary at this point?

The consequence of these decisions is that we display featureprops, feature_cvterms, and feature_dbxrefs to end users in different locations and in different ways.

childers commented 5 years ago

In my past experience, cvterms and dbxrefs have each been a pain point in Chado implementations. The flexibility is great, until you need to figure out how someone else decided to store the information, and in which tables.

I'm totally on board with having some more guidance and standards, if only to make it easier for all of us to work together.

spficklin commented 5 years ago

@bradfordcondon I agree with your initial bullet list of what each table is meant to store. While KEGG terms and InterPro domains lack a formal OWL or OBO file (although there have been past attempts to create these, at least for KEGG as far as I remember), in my mind they serve the same purpose as an ontology, and I am inclined to store those associations with a genomic feature in the feature_cvterm table. A cvterm association is a "property" of a feature, so technically it could go into the featureprop table, but the existence of the feature_cvterm table implies to me that these types of "properties" should be handled separately. I would therefore put GO/KEGG/InterPro annotations of a genomic feature all in the feature_cvterm table.
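
As a rough illustration of that policy (a sketch only, not how Tripal or any existing loader behaves), the target table could be chosen by the accession's database prefix rather than by the GFF tag it appeared under. The set `CV_LIKE_DBS` and the function `target_table` are hypothetical names, and the set's contents are a per-site curation decision.

```python
# Hypothetical policy: database prefixes whose accessions are treated as
# controlled-vocabulary terms even when no formal OBO/OWL file exists.
CV_LIKE_DBS = {"GO", "KEGG", "InterPro"}


def target_table(accession, cv_like_dbs=CV_LIKE_DBS):
    """Pick a Chado linking table for a "DB:accession" string."""
    db, sep, _ = accession.partition(":")
    if sep and db in cv_like_dbs:
        return "feature_cvterm"
    # Anything without a recognized prefix stays a plain cross-reference.
    return "feature_dbxref"
```

Under this assumption a KEGG accession written as a Dbxref in the GFF would still be stored in feature_cvterm, so the display location no longer depends on which tag the GFF author chose.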

scottcain commented 5 years ago

I further agree with @spficklin. I would say that @bradfordcondon's first two bullet points are spot on, and while I agree that it can be somewhat difficult in some cases to tell the difference, I would posit that we generally "feel" the right answer (documenting feelings is admittedly difficult). Finally, featureprop is obviously a catch-all for "everything else". I imagine there will be cases where things that get put into featureprop should end up being moved elsewhere when a human looks at them.