GMOD / Chado

the GMOD database schema
http://gmod.org/wiki/Chado
Artistic License 2.0
38 stars 25 forks source link

organism_analysis linker table #59

Open bradfordcondon opened 6 years ago

bradfordcondon commented 6 years ago

I think there was some request for an organism_analysis linker table in the schema? We plan on creating one at HWG anyway.

-- ================================================
-- TABLE: organism_analysis
-- ================================================

create table organism_analysis (
       organism_analysis_id bigserial not null,
       primary key (organism_analysis_id),
       organism_id bigint not null,
       foreign key (organism_id) references organism (organism_id) on delete cascade INITIALLY DEFERRED,
       analysis_id bigint not null,
       foreign key (analysis_id) references analysis (analysis_id) on delete cascade INITIALLY DEFERRED,
       constraint organism_analysis_c1 unique (organism_id,analysis_id)
);
create index organism_analysis_idx1 on organism_analysis (organism_id);
create index organism_analysis_idx2 on organism_analysis (analysis_id);

COMMENT ON TABLE organism_analysis IS 'Linking table for associating organisms and analyses.';
ekcannon commented 6 years ago

It's difficult to say no to a linker table - the more the merrier - but I am curious how this would be used.

bradfordcondon commented 6 years ago

We (meaning HWG) typically want to list the analyses performed on an organism on the respective organism's page. this let's users visit that analysis to download the data or the associated entities (features, biosamples, what have you).

Edit here is an example organism. The Transcriptome and Analysis tabs make use of the proposed linker field to display all analyses associated with this organism.

laceysanderson commented 6 years ago

I would worry a bit about this causing inconsistencies in the database. For example, what if your transcripts for a given analysis are for species X and the linker table points to species Y? I think this might be better implemented as a materialized view that pulls through the analysis -> feature -> organism connection. From an application standpoint this ends up being the same thing but from the database level it will help you keep consistency. Would that work for HWG @bradfordcondon?

spficklin commented 6 years ago

If I remeber correctly this table came up as a suggestion during Chado v1.3 discussions and it was shot down because it breaks the highly normalized structure of Chado for the reason @laceysanderson specified.

I know of some Tripal sites that use this table but they add it as a custom table. But the MView is probably a better idea.

ekcannon commented 6 years ago

I have the same concerns as @laceysanderson and @spficklin, but analyses can also be attached to non-features, like phylotree records (which would typically be multi-species so an organism_analysis record may not make sense). Still, it seems it should be possible to create a view that identifies all analyses associated with an organism without adding this table.

bradfordcondon commented 6 years ago

@laceysanderson we originally had an mview that did that, but i have analyses that are not associated via features. Then, adding things to that view manually would be lost if the mview was repopulated. (not to mention writing a field that inserted into the mview broke because it didnt have a primary key...)

ekcannon commented 6 years ago

@bradfordcondon are these non-feature-related analyses connected to organisms by any other means (however circuitous), or only via organism_analysis? If not, what is an example?

bradfordcondon commented 6 years ago

Great question @ekcannon . Both, I'd say.

I suppose the feature-less analyses could have a single dummy feature loaded to associate them, if i, say, unpublished that entity...

spficklin commented 6 years ago

Those are valid use cases. I can see now the need for the table. There can be other analyses that don't generate genomic features. I'm not sure what to do in this case. I would need someone much smarter than I am on highly normalized dbs to weigh in.

IMO I would suggest not to store dummy features...

bradfordcondon commented 6 years ago

Is there any sane way to have both the linker table and the mview, and have the mview update the linker table without overwritting non-feature related entries? I recognize the importance of avoiding inconsistencies in the database.

laceysanderson commented 6 years ago

The only way I can think of to ensure database integrity is to have a trigger on analysisfeature that ensures if there is a record in organism_analysis then the organism matches the current feature being inserted/updated. However, the performance hit during loading could be substantial since it would run a query per feature :-(

bradfordcondon commented 6 years ago

ensures if there is a record in organism_analysis then the organism matches the current feature being inserted/updated

but what if you have multiple organisms associated with a single analysis?

FYI I am going ahead with the linker table on HWG but I won't necessarily expect it to be part of Chado.

bradfordcondon commented 6 years ago

If Tripal users are looking for this, here is the tripal 3 module that defines the linker field to link stuff via organism_analysis.

https://github.com/statonlab/tripal_manage_analyses

laceysanderson commented 6 years ago

Good point on the multiple organisms. Perhaps the trigger could check if the organism for the current feature being inserted is in the organism_analysis table and if not, adds it?

bradfordcondon commented 6 years ago

I imagine that how Tripal ensures the integrity is an issue for my linker module itself, so I've moved it there: https://github.com/statonlab/tripal_manage_analyses/issues/32.

I dont think there's a way to force the integrity at the Chado level?

laceysanderson commented 6 years ago

You can do it at the Chado level using triggers: https://www.postgresql.org/docs/9.4/static/rules-triggers.html This can be more reliable then ensuring integrity at the application level.

bradfordcondon commented 6 years ago

Very cool thank you

bradfordcondon commented 6 years ago

This is still pseudocode but something like this?

Are there cases where an analysis would have features from multiple organisms? If so I'd need to check that first...

rule for insert is much simpler...

CREATE RULE organism_analysis_update AS ON UPDATE TO feature.organism_id
    DO UPDATE(organism_id), (NEW.organism_id) FROM organism_analysis WHERE organism_id = OLD.organism_id AND analysis_id = 
    (SELECT analysis_id FROM analysis A
    INNER JOIN analysisfeature AF ON A.analysis_id = AF.analysis_id
     INNER JOIN feature F ON F.feature_id = AF.feature_id
     INNER JOIN organism O ON O.organism_id = F.organism_id
    WHERE F.feature_id = NEW.feature_id);

puzzling it out here, also tried a trigger which seems more complicated... https://gist.github.com/bradfordcondon/366c3ca47e93d78badcaba2469885fd9

laceysanderson commented 6 years ago

It looks like the best practices recommendation is to use triggers instead of rules "unless you really know what you are doing or the performance hit from them is unacceptable, as rules can work in unexpected ways." https://stackoverflow.com/questions/5894142/what-are-postgresql-rules-good-for

Ideally you would support cases where an analysis would have features from multiple organisms since that's supported by the database. For example, someone could have aligned an EST set against the current genome where the ESTs are from multiple organisms for comparative analysis.

I looked quickly at the trigger in the gist you linked to. It's looking good so far :-) You would select the analysis_id for the feature using

SELECT analysis_id FROM analysis A
    INNER JOIN analysisfeature AF ON A.analysis_id = AF.analysis_id
     INNER JOIN feature F ON F.feature_id = AF.feature_id
     INNER JOIN organism O ON O.organism_id = F.organism_id
WHERE F.feature_id = OLD.feature_id
scottcain commented 5 years ago

@bradfordcondon what's the status of this? I'm not convinced it should go into 1.4, but if trigger is working well, we can at least consider it.

bradfordcondon commented 5 years ago

the status is that people are using the tripal module to link organisms to analysis, but i never got around to figuring out how to implement lacey's trigger suggestion :(