Annotating a feature with multiple types of cvterms

almasaeed2010 commented 5 years ago

Hello all,

We are working on a way to add additional information to a feature_cvterm association. We would like to specify an analysis_id for a feature_cvterm record. We would also like to specify whether this cvterm is a relationship term.

What we are trying to create is something similar to this page: https://www.arabidopsis.org/servlets/Search?action=search&type=annotation&tair_object_id=2025391&locus_name=AT1G01620

Where the columns relationship type and keyword are cvterms and evidence is an analysis.

Currently chado provides a way for us to associate a publication to a feature but it doesn't offer a way to link analyses. In our case, some annotations don't necessarily have a publication to link to but do have analyses (BLAST and InterProScans for example).

Does chado offer a way to do this?

Thanks!

bradfordcondon commented 5 years ago

Hi @almasaeed2010, I would look at the analysisfeature and analysisfeatureprop tables. Rather than associating the cvterm with the feature, you would associate it with the analysisfeature record.

That said, I don't think that would be enough to help you with the relationship type component. For that you might need analysisfeature_cvterm and analysisfeature_cvtermprop instead of analysisfeatureprop.

Alternatively, you could associate each analysisfeature with two props: one for the annotation, and one that is a "relationship type" cvterm.
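A rough sketch of the two-prop idea, using Python's sqlite3 purely as a stand-in for a real Chado database. The tables are heavily simplified (real Chado tables have more columns and constraints, and cvterm points at a cv table rather than holding a cv name), and the ids, the "relationship" cv name, and the choice to pair the two props by rank are assumptions made only for this illustration:

```python
import sqlite3

# Simplified stand-ins for Chado tables; NOT the full schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    cv TEXT NOT NULL,          -- assumption: cv name inlined for brevity
    name TEXT NOT NULL
);
CREATE TABLE analysisfeature (
    analysisfeature_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL,
    analysis_id INTEGER NOT NULL
);
CREATE TABLE analysisfeatureprop (
    analysisfeatureprop_id INTEGER PRIMARY KEY,
    analysisfeature_id INTEGER NOT NULL
        REFERENCES analysisfeature (analysisfeature_id),
    type_id INTEGER NOT NULL REFERENCES cvterm (cvterm_id),
    value TEXT,
    rank INTEGER NOT NULL DEFAULT 0
);
""")

conn.executemany("INSERT INTO cvterm VALUES (?, ?, ?)", [
    (1, "biological_process", "response to water deprivation"),
    (2, "relationship", "involved in"),  # hypothetical relationship cv
])

# Feature 100 was annotated by analysis 7 (both ids are made up).
conn.execute("INSERT INTO analysisfeature VALUES (1, 100, 7)")

# Two props on the same analysisfeature: the annotation term and the
# relationship term. Sharing rank 0 is this sketch's way of pairing them.
conn.executemany(
    "INSERT INTO analysisfeatureprop (analysisfeature_id, type_id, rank) "
    "VALUES (?, ?, ?)",
    [(1, 1, 0), (1, 2, 0)],
)

rows = conn.execute("""
    SELECT af.feature_id, af.analysis_id, t.cv, t.name
    FROM analysisfeature af
    JOIN analysisfeatureprop p
      ON p.analysisfeature_id = af.analysisfeature_id
    JOIN cvterm t ON t.cvterm_id = p.type_id
    ORDER BY p.analysisfeatureprop_id
""").fetchall()
```

Note that the only way to tell the annotation prop from the relationship prop here is the cv each term belongs to.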

spficklin commented 5 years ago

Hi @almasaeed2010. Did you see the comments for the feature_cvtermprop table? It can be used for evidence codes and other metadata. Perhaps you could store the relationship type there as well.

I'm really glad you're looking at storing evidence codes. It is often ignored when adding annotations to features.

spficklin commented 5 years ago

But if you want to associate those evidence codes on a per-analysis basis, then I think @bradfordcondon's suggestion is a good way to go, but you'd need to make some custom tables.

mestato commented 5 years ago

Analysisfeature specifically indicates it's only for use when an analysis generates that feature, whereas we are talking about annotation. However, I guess we could ignore the table description particulars, as it's not too big a leap.

Prop tables seem to be an obvious answer, but they only work if we have a single ontology that will always be used for relationships and another single ontology that will always be used for evidence codes (then you could deduce which column of the table we are dealing with). Is that likely? What if we actually end up needing multiple ontologies for relationships and/or evidence codes? I admit that does not sound appealing, but judging from our EBI OBO ontology lookup searches, relationships are going to be a tough one. We stumbled on this old GO relationship ontology that looked promising, but it leads to dead links: http://wiki.geneontology.org/index.php/Relationships_between_annotation_objects_and_ontology_terms

Maybe it is unreasonable to try to support multiple ontologies for these two categories, but just to explore that a bit more... we would actually need two type_ids. For a particular feature_cvterm, we need an id to tell us we are storing a "relationship" and an id to tell us what that relationship is ("upregulated by" or "expressed during"). And for the same feature_cvterm, we need an id to tell us we are storing an "evidence code" and an id to tell us what that evidence code is ("inferred by electronic annotation" or "gene_expression_experiment").
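To make the "two ids" idea concrete, here is a minimal sketch using Python's sqlite3 as a stand-in database. The cvalue_id column on feature_cvtermprop is hypothetical (it is what this comment wishes for, not something the thread says Chado provides), the tables are simplified, and all ids are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Hypothetical variant of feature_cvtermprop:
--   type_id   says WHICH KIND of property the row holds,
--   cvalue_id says WHAT the value is (a cvterm from any ontology).
CREATE TABLE feature_cvtermprop (
    feature_cvtermprop_id INTEGER PRIMARY KEY,
    feature_cvterm_id INTEGER NOT NULL,
    type_id INTEGER NOT NULL REFERENCES cvterm (cvterm_id),
    cvalue_id INTEGER REFERENCES cvterm (cvterm_id),
    rank INTEGER NOT NULL DEFAULT 0
);
""")

conn.executemany("INSERT INTO cvterm VALUES (?, ?)", [
    (1, "relationship"),
    (2, "evidence code"),
    (3, "expressed during"),
    (4, "inferred by electronic annotation"),
])

# Two props for the same (made-up) feature_cvterm record 1.
conn.executemany(
    "INSERT INTO feature_cvtermprop (feature_cvterm_id, type_id, cvalue_id) "
    "VALUES (?, ?, ?)",
    [(1, 1, 3), (1, 2, 4)],
)

rows = conn.execute("""
    SELECT kind.name, val.name
    FROM feature_cvtermprop p
    JOIN cvterm kind ON kind.cvterm_id = p.type_id
    JOIN cvterm val ON val.cvterm_id = p.cvalue_id
    WHERE p.feature_cvterm_id = ?
    ORDER BY p.feature_cvtermprop_id
""", (1,)).fetchall()

# Category name -> value term name for this feature_cvterm.
props = dict(rows)
```

With this layout the category no longer has to be deduced from the ontology a value term comes from, so the value terms could be drawn from several ontologies.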

I guess this is like our closed cvterm prop value issue #26 - I want another cvalue_id. Or does everyone just think this is totally crazy and we need to define the columns by single ontologies?

ekcannon commented 5 years ago

This sounds a bit like post-composed terms. If I understand correctly, a sentence might look like:

[gene:feature] [role/relationship:cvterm] [trait:cvterm] [ontology:db] [evidence code:cvterm]

e.g.: AT1G01620 | involved in | response to water deprivation | biological process | inferred from expression pattern

Or more simply: [feature] [post-composed-term][evidence-code]
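As a small illustration of that sentence shape, the five slots could be modeled as a plain record. This is only a sketch in Python; the class name and field names are invented here and do not correspond to any Chado table or Tripal API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationSentence:
    """One annotation 'sentence'; field names are hypothetical."""

    feature: str        # [gene:feature]
    relationship: str   # [role/relationship:cvterm]
    term: str           # [trait:cvterm]
    ontology: str       # [ontology:db]
    evidence_code: str  # [evidence code:cvterm]

    def render(self) -> str:
        # Join the slots in the same order as the example in the thread.
        return " | ".join([
            self.feature,
            self.relationship,
            self.term,
            self.ontology,
            self.evidence_code,
        ])


sentence = AnnotationSentence(
    feature="AT1G01620",
    relationship="involved in",
    term="response to water deprivation",
    ontology="biological process",
    evidence_code="inferred from expression pattern",
)
rendered = sentence.render()
```

The relationship and term slots together play the role of the post-composed term in the shorter form.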

A few years back Nama and I developed a proposal for storing EQ statements, a specific form of post-composed terms. We used the phenotype table as the "glue" to hold a sentence together. This approach may not work for this case, but might be worth looking at for ideas. It would be good if all post-composed terms are constructed in the same, or at least similar ways.

http://gmod.org/wiki/Chado_Post-Composed_Phenotypes

bradfordcondon commented 5 years ago

> However, I guess we could ignore the table description particulars, as it's not too big a leap.

Good point. For what it's worth this is what some of the Tripal functional annotation modules do: for example the Tripal Analysis Blast module stores the blast annotations in analysisfeature, and the features being annotated weren't generated by the blast analysis.

almasaeed2010 commented 5 years ago

I think the Blast module also defines a custom table, blast_hit_data, which makes it easier to handle odd issues.

If we insert everything into prop tables, our queries won't be efficient enough to handle large datasets. We'll also need to do some in-memory data processing, which means loading all the data for a given feature, and that makes pagination and memory management impossible. Neither issue may be a problem if we don't expect a large number of annotations per feature.

spficklin commented 5 years ago

Yes, I concur with what @mestato says:

> Analysisfeature specifically indicates it's only for use when an analysis generates that feature, whereas we are talking about annotation. However, I guess we could ignore the table description particulars, as it's not too big a leap.

And, yes Bradford is right, Tripal extension modules are violating this:

> For what it's worth this is what some of the Tripal functional annotation modules do: for example the Tripal Analysis Blast module stores the blast annotations in analysisfeature, and the features being annotated weren't generated by the blast analysis.

I think GBrowse may be confused by Tripal extension modules commandeering the analysisfeature table for analysis results, but I don't recall. @scottcain may remember. So, perhaps the correct thing to do is to fix those Tripal extension modules to no longer store analysis results there. But that's a different issue....

A simple solution would be to add a type_id column to the feature_cvterm table that is NULLable. This would allow you to specify the relationship term, and it's backwards compatible. Then the evidence code goes in the feature_cvtermprop table.
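A minimal sketch of what that change might look like, run against an in-memory SQLite database from Python rather than a real Chado instance. The tables are simplified (for example, feature_cvterm's pub_id, is_not and rank columns are omitted), and the ids and the "kinase activity" row are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE feature_cvterm (
    feature_cvterm_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL,
    cvterm_id INTEGER NOT NULL REFERENCES cvterm (cvterm_id)
);
CREATE TABLE feature_cvtermprop (
    feature_cvtermprop_id INTEGER PRIMARY KEY,
    feature_cvterm_id INTEGER NOT NULL
        REFERENCES feature_cvterm (feature_cvterm_id),
    type_id INTEGER NOT NULL REFERENCES cvterm (cvterm_id),
    value TEXT,
    rank INTEGER NOT NULL DEFAULT 0
);
""")
conn.executemany("INSERT INTO cvterm VALUES (?, ?)", [
    (1, "response to water deprivation"),
    (2, "involved in"),
    (3, "inferred from expression pattern"),
    (4, "kinase activity"),
])

# A row written before the change: it has no relationship term.
conn.execute("INSERT INTO feature_cvterm VALUES (1, 100, 4)")

# The proposed change: a NULLable type_id on feature_cvterm.
conn.execute(
    "ALTER TABLE feature_cvterm "
    "ADD COLUMN type_id INTEGER REFERENCES cvterm (cvterm_id)"
)

# A row written after the change: relationship in type_id,
# evidence code as a feature_cvtermprop row.
conn.execute(
    "INSERT INTO feature_cvterm "
    "(feature_cvterm_id, feature_id, cvterm_id, type_id) "
    "VALUES (2, 100, 1, 2)"
)
conn.execute(
    "INSERT INTO feature_cvtermprop (feature_cvterm_id, type_id) "
    "VALUES (2, 3)"
)

rows = conn.execute("""
    SELECT rel.name, term.name, ev.name
    FROM feature_cvterm fc
    JOIN cvterm term ON term.cvterm_id = fc.cvterm_id
    LEFT JOIN cvterm rel ON rel.cvterm_id = fc.type_id
    LEFT JOIN feature_cvtermprop p
      ON p.feature_cvterm_id = fc.feature_cvterm_id
    LEFT JOIN cvterm ev ON ev.cvterm_id = p.type_id
    WHERE fc.feature_id = ?
    ORDER BY fc.feature_cvterm_id
""", (100,)).fetchall()
```

The older row comes back with NULL (None in Python) for both relationship and evidence, which is the backwards-compatible behaviour described above.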

If you like that idea it may be something that could easily be implemented in the CodeFest... maybe make it into v1.4 release??

almasaeed2010 commented 5 years ago

It would be nice to consider the type_id column. It's a simple addition that would solve problems for multiple modules.

scottcain commented 5 years ago

@spficklin says "I may remember" with regard to the GBrowse Chado adaptor. Good one. I haven't used the GBrowse Chado adaptor in quite a while, and it's been even longer since I looked at the code. Given that nobody has complained about it I can safely assume one of two things: 1) it works fine as is, or B) nobody uses the GBrowse Chado adaptor anymore. I have no idea which is correct :-)

GMOD / Chado

Annotating a feature with multiple types of cvterms #75