GMOD / Chado

the GMOD database schema
http://gmod.org/wiki/Chado
Artistic License 2.0

analysis.uniquename (v1.4) #40

Open bradfordcondon opened 6 years ago

bradfordcondon commented 6 years ago

http://gmod.827538.n3.nabble.com/Analysis-uniqename-td4049078.html

The suggestion was raised that there needs to be some sort of unique constraint on analysis. To be consistent, the idea was proposed of adding a uniquename field to the analysis table. Stephen says (4/24/2015): I agree the current unique constraint on the analysis table is a bit hard to deal with. Also, it makes sense to me that a feature or stock would have a human-readable name that doesn’t need to be unique and a separate unique name, but I don’t think an analysis needs two names. I think the analysis name would be the best thing to make unique (no need for a unique name). But since a unique constraint has never existed on the name, folks could have duplicate names, and an upgrade that included a unique constraint on the name would break things. I’m not sure how to resolve this… marking for v1.4.

bradfordcondon commented 6 years ago

maybe best addressed for Chado 1.5 ;) Unless there are fresh ideas on how to resolve this

ekcannon commented 6 years ago

I too would like to see a unique constraint on the analysis table.

scottcain commented 5 years ago

The unique constraint on analysis is constraint analysis_c1 unique (program,programversion,sourcename). Is the problem with this constraint that it is not "obvious", or is there some other problem? Do people want to be able to refer to analyses by a human-readable name? Because "program+programversion+sourcename" does work as a uniquename, even if it's ugly.

bradfordcondon commented 5 years ago

The constraint makes sense intellectually, but I think in practice it's annoying. Generally this comes up when we are adding BLAST annotations. We run BLAST against several databases, so the program, program version (blast, whatever the BLAST version), and the source name (if you're thinking that the source is the assembly; maybe source is supposed to be assembly_name vs blast DB_name) are the same in each run. We end up naming our programs something like "organism X BLAST against trembl".

In such cases, IMO, it makes sense for the program, program version, and sourcename to be non-unique, and the analysis name to be unique.

There are probably other cases I'm not thinking of where I find myself cursing at this constraint.
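
To make the BLAST case concrete, here is a minimal sketch of the collision. It is an illustration only: it uses SQLite through Python's sqlite3 module rather than PostgreSQL, the table is a cut-down stand-in for Chado's analysis table, and the version string and assembly name are made up.

```python
import sqlite3

# Simplified, hypothetical stand-in for Chado's analysis table
# (SQLite here, not PostgreSQL; most columns omitted).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE analysis (
        analysis_id    INTEGER PRIMARY KEY,
        name           TEXT,
        program        TEXT NOT NULL,
        programversion TEXT NOT NULL,
        sourcename     TEXT,
        CONSTRAINT analysis_c1 UNIQUE (program, programversion, sourcename)
    )
""")

insert = ("INSERT INTO analysis (name, program, programversion, sourcename) "
          "VALUES (?, ?, ?, ?)")

# First BLAST run against one target database: accepted.
conn.execute(insert, ("organism X BLAST against trembl",
                      "blast", "2.2.31", "organism_X_assembly"))

# Second run against a different target database. Only the name differs,
# and name is not part of analysis_c1, so the insert is rejected.
try:
    conn.execute(insert, ("organism X BLAST against swissprot",
                          "blast", "2.2.31", "organism_X_assembly"))
    second_run_rejected = False
except sqlite3.IntegrityError as err:
    second_run_rejected = True
    print("rejected:", err)

row_count = conn.execute("SELECT COUNT(*) FROM analysis").fetchone()[0]
```

Only the first run ends up in the table, which is why one of the three constrained fields has to be fudged today.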

spficklin commented 5 years ago

I like the idea of making the analysis name field unique but it's not backwards compatible. Folks who may have analyses with the same names would suddenly find they can't upgrade Chado without fixing their data. However, if someone does have duplicate names, would that just be a mistake, or is there an actual use case for two analyses having the same name but a different unique tuple (program, program version, and sourcename)? If there is no good reason to have a duplicate name then we could do folks a favor by making the name field unique and forcing them to fix their data integrity.
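
For anyone wanting to check whether their own data would block a unique constraint on name, a rough sketch of the check follows. It assumes a cut-down analysis table in SQLite (via Python's sqlite3) with made-up rows; against a real Chado database only the SELECT would be relevant. Rows with a NULL name are skipped, since a unique constraint generally permits multiple NULLs.

```python
import sqlite3

# Simplified, hypothetical stand-in for Chado's analysis table, with sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE analysis (
        analysis_id    INTEGER PRIMARY KEY,
        name           TEXT,
        program        TEXT NOT NULL,
        programversion TEXT NOT NULL,
        sourcename     TEXT,
        CONSTRAINT analysis_c1 UNIQUE (program, programversion, sourcename)
    )
""")
conn.executemany(
    "INSERT INTO analysis (name, program, programversion, sourcename) "
    "VALUES (?, ?, ?, ?)",
    [
        # Same name, different tuples: legal today, illegal under unique (name).
        ("genome annotation", "maker", "2.31", "assembly_A"),
        ("genome annotation", "maker", "3.01", "assembly_A"),
        ("blast vs trembl", "blast", "2.2.31", "assembly_A"),
        (None, "interproscan", "5.0", "assembly_A"),
    ],
)

# Names that occur more than once would have to be fixed before upgrading.
duplicates = conn.execute("""
    SELECT name, COUNT(*) AS n
    FROM analysis
    WHERE name IS NOT NULL
    GROUP BY name
    HAVING COUNT(*) > 1
    ORDER BY name
""").fetchall()

for name, n in duplicates:
    print(f"{name!r} is used by {n} analyses")
```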

ekcannon commented 5 years ago

The reason the current constraint, constraint analysis_c1 unique (program,programversion,sourcename) is problematic is that I have multiple genomes from the same source assembled with the same program and program version. I have to fudge the name of one of these to meet the constraints.

I agree with @spficklin that adding a unique constraint on name would cause problems. As well, adding a uniquename field with a unique constraint is also not backward compatible.

We seem to be at an impasse on this one.

bradfordcondon commented 5 years ago

could we simply remove the unique constraint?

edit: obviously that goes against all of the requests for a unique constraint.

what if we added name to the existing unique constraint?

constraint analysis_c1 unique (program,programversion, name, sourcename)

This would be backwards compatible. It would provide the name-based unique constraint that we are asking for.

ekcannon commented 5 years ago

Hmm. That might be the easiest solution. Any implications to having no unique constraints on the table? (Aside from analysis_id, of course)

bradfordcondon commented 5 years ago

@ekcannon i just edited my comment right before you posted (sorry!): what about adding name to the existing unique constraint?

ekcannon commented 5 years ago

Adding name to the constraint was recommended above, and @spficklin made a good point about it not being backward compatible, while also wondering whether it would be too horrible to force people to modify their data if there are duplicate names.

It seems the choices are: 1) add name to constraint and possibly force data adjustments, 2) remove the constraint altogether, or 3) leave things as they are.

I'm not sure what my vote is, maybe weakly 1), 2), then 3).

bradfordcondon commented 5 years ago

I think we need to clarify the difference between adding name as its own unique constraint, and adding name to the existing constraint.

# case 1
constraint analysis_c1 unique (name)

# case 2
constraint analysis_c1 unique (program,programversion, name, sourcename)

I believe case 1 is what is being argued for above.

I believe that the second case is fully backwards compatible, but allows us to do what we want: create analyses with the same program/programversion/sourcename but different names.

To be super clear: if I have analyses with the same name but unique program/programversion/sourcename combinations (and, by definition, I DO, since that's the existing constraint), case 2 does not break my data.
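
The difference between the two cases can be checked with a small sketch. It assumes a cut-down analysis table in SQLite (via Python's sqlite3), a hypothetical helper called rows_fit, and made-up rows; the legacy rows are chosen to satisfy the existing constraint while sharing a name.

```python
import sqlite3

# Rows that are valid under the existing constraint
# unique (program, programversion, sourcename) but share a name.
LEGACY_ROWS = [
    ("annotation", "blast", "2.2.31", "assembly_A"),
    ("annotation", "blast", "2.2.31", "assembly_B"),
]

def rows_fit(unique_columns, rows):
    """Return True if every row can be inserted under UNIQUE (unique_columns)."""
    conn = sqlite3.connect(":memory:")
    # Simplified, hypothetical stand-in for Chado's analysis table.
    conn.execute(f"""
        CREATE TABLE analysis (
            analysis_id    INTEGER PRIMARY KEY,
            name           TEXT,
            program        TEXT NOT NULL,
            programversion TEXT NOT NULL,
            sourcename     TEXT,
            CONSTRAINT analysis_c1 UNIQUE ({unique_columns})
        )
    """)
    try:
        conn.executemany(
            "INSERT INTO analysis (name, program, programversion, sourcename) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()

CASE_1 = "name"
CASE_2 = "program, programversion, name, sourcename"

# Case 1 rejects the legacy data; case 2 accepts it.
case1_ok = rows_fit(CASE_1, LEGACY_ROWS)
case2_ok = rows_fit(CASE_2, LEGACY_ROWS)

# Case 2 also accepts a new row that repeats an existing
# program/programversion/sourcename tuple under a different name.
new_rows = LEGACY_ROWS + [("blast vs trembl", "blast", "2.2.31", "assembly_A")]
case2_accepts_new = rows_fit(CASE_2, new_rows)

print(case1_ok, case2_ok, case2_accepts_new)
```

The reasoning behind the result: any set of rows that is distinct on three columns stays distinct when a fourth column is added to the key, so widening the constraint cannot invalidate existing data.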

ekcannon commented 5 years ago

Ah. Correct. Now I have a definite preference for your case #2.

spficklin commented 5 years ago

I support case #2. It keeps the spirit of the original constraint which I'm sure had some reason for it, and improves flexibility.

bradfordcondon commented 5 years ago

with no objections, i'm going forward and updating the unique constraint to:

constraint analysis_c1 unique (program,programversion, name, sourcename)

OK'd by @scottcain