Open bradfordcondon opened 5 years ago
In my past experience, cvterms and dbxrefs have each been a pain point in Chado implementations. The flexibility is great, until you need to figure out how someone else decided to store the information and in which tables.
I'm totally on board with having some more guidance and standards, if only to make it easier for all of us to work together.
@bradfordcondon I agree with your initial bullet list of what each table is meant to store. While KEGG terms and InterPro domains lack a formal OWL or OBO file (although there have been past attempts to create these, at least for KEGG as far as I remember), in my mind they serve the same purpose as an ontology, and I am inclined to store their associations with a genomic feature in the feature_cvterm table. A cvterm association is a "property" of a feature, so technically it could go into the featureprop table, but the existence of the feature_cvterm table implies to me that these types of "properties" should be handled separately. I would therefore put GO/KEGG/InterPro annotations for a genomic feature all in the feature_cvterm table.
I further agree with @spficklin. I would say that @bradfordcondon's first two bullet points are spot on, and while I agree that it can be somewhat difficult in some cases to tell the difference, I would posit that we generally "feel" the right answer (documenting feelings is admittedly difficult). Finally, featureprop is obviously a catch-all for "everything else". I imagine there will be cases where things that get put into featureprop should end up being moved elsewhere once a human looks at them.
Hello,
@mpoelchau and I have been discussing how Tripal stores feature annotations loaded from GFF files. We are considering a gene that has perhaps been annotated with GO terms, KEGG terms, proposed PFAM domains, and InterProScan family annotations.
My understanding of the Chado tables (which I want to emphasize is up for debate) is:
- feature_cvterm is for annotating features with all of the cases I described above (GO, KEGG, PFAM), because some decision was made, based on computational evidence, to associate the feature with that annotation. The feature_cvtermprop table exists to store evidence codes, qualifiers, etc.
- feature_dbxref is for storing references to that record, itself, in another database. So it should only be used to link back to the feature itself on a different site. Gene families it's a part of, for example, wouldn't belong here.
- featureprop: it's hard for me to distinguish when a term annotation is better suited as a featureprop. Props can have pubs for evidence, but there's no featurepropprop table for evidence codes. Also, the "value" field seldom makes sense when tagging with an annotation.

I'll add that this is the most definitive guidance I found in my search of the Chado wiki, in the sequence module manual
As for the GFF file holding the annotations:
Similarly, NCBI calls most things dbxrefs, using a much broader definition than the one I use above.
Here's the conflict. KEGG terms, for example, are not ontology terms. But when we read the GFF file, we parse Ontology_term values into feature_cvterm, Dbxref values into feature_dbxref, and everything else into featureprop. So for the annotations to go into feature_cvterm, they would need to be listed in the GFF under Ontology_term.
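To make that routing concrete, here is a minimal sketch of the behavior described above. It is not Tripal's actual loader code: the function name `route_attributes`, the `ROUTES` and `SKIPPED` tables, and the example line are all hypothetical, and I am assuming that tags such as ID, Name, and Parent are handled elsewhere and never become props.

```python
from urllib.parse import unquote

# Hypothetical routing table: GFF3 attribute tag -> Chado linking table.
ROUTES = {
    "Ontology_term": "feature_cvterm",
    "Dbxref": "feature_dbxref",
}

# Assumption: these tags describe the feature itself and are handled
# elsewhere by a loader, so this sketch simply skips them.
SKIPPED = {"ID", "Name", "Parent"}


def route_attributes(column9):
    """Split a GFF3 attributes column into per-table lists of (tag, value)."""
    routed = {"feature_cvterm": [], "feature_dbxref": [], "featureprop": []}
    for field in column9.strip().split(";"):
        if not field:
            continue
        tag, _, raw = field.partition("=")
        if tag in SKIPPED:
            continue
        # Anything that is not Ontology_term or Dbxref falls through to props.
        table = ROUTES.get(tag, "featureprop")
        # GFF3 allows several comma-separated values per tag.
        for value in raw.split(","):
            routed[table].append((tag, unquote(value)))
    return routed


# Made-up attributes column, for illustration only.
example = (
    "ID=gene0001;Ontology_term=GO:0005524,GO:0016301;"
    "Dbxref=NCBI_gene:12345;kegg=K00844;Note=hexokinase"
)
routed = route_attributes(example)
for table, pairs in routed.items():
    print(table, pairs)
```

In this sketch the KEGG identifier ends up in featureprop simply because it was written under a custom `kegg` tag. Moving it into feature_cvterm would require the GFF author to list it under Ontology_term, which is exactly the conflict.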
As Monica phrased her doubts:
The consequence of these decisions is that we display featureprops, feature_cvterms, and feature_dbxrefs in different locations and in different ways to end users.