Closed jcmatese closed 2 years ago
OK, if this is something that people vote for. You may have to do the following:
name_fk
foreign key to Compound
, and a new database sequence id
as the primary key of compound_synonym
compound.name
into compound_synonym
, relating back to the compound, and also add the new inserted name_fk
python app_code/manage.py makemigrations --empty --name transfer_compound_names DataRepo
Compound.name
columnCompound.name_fk
to Compound.name
I am uncertain exactly how migration number 2 (data migration) happens in practice, because it is not model generated.
I also don't know exactly how Django transactions will handle the reciprocal non-nullable foreign keys between compound
and compound_synonym
. They would both have to be inserted within the same transaction, and hope they don't run afoul of a non-nullable check. Alternatively, modify that model too, somehow (reverse the relationship direction/cardinality, or something).
Personally, I prefer to have the model remain simple and the systematic/preferred name be intrinsic to the compound model, but if the majority want to embark on the above, we can try it.
@lparsons @hepcat72 @fkang-pu
I want to make sure I understand the issue correctly. Here is what I think in my mind with the following example. How would you handle this case?
example: Compound.id=20 Compound.name=glucose
two synonyms: b-Glucose Glucose
I would:
do nothing to Compound model/table
load two rows to CompoundSynonym: e.g. id=1, name='b-Glucose', compound_id=20 id=2, name='Glucose', compound_id=20
if Glucose is used in data loading files, replace it with "glucose" during data loading process (based on synonym and compound.name mapping): e.g. in database Animal.tracer=glucose Peakgroup.name=glucose
You're example makes sense Fan, however, I see the issue as ensuring that we don't create some inconsisency down the line (accidentally). Following you example, someone might provide a new list of compounds and synonyms that include:
Compound.id=20 Compound.name=a-Glucose
synonym: glucose
Since there is no record of glucose
in the synonym table, to catch this, we would have to ensure that in addition to making sure that the list of synonyms is unique (easily done with a database constraint), that no synonym matches a name in the Compound table (not difficult, but an extra consistency check that is required every time). I'd like to change the wording of this issue to state that this is a requirement of the model/code and then propose some solutions (I can see many ways to handle take care of this).
@hepcat72 and @jcmatese Am I capturing the concerns correctly here? I'd like to make sure I capture everyone's concerns. If so, I'll reword the issue and start a list of options.
We could just ensure that every compound name is also loaded as a synonym to itself, with both exact and various permutations. The thing is, because the individual lab members are the source of the names, there will likely always be the edits required (either to database or file) to normalize their spelling, cases, etc. [else, append onto the list of existing synonyms]. I understand the concerns, but I am not too troubled by them.
We could just ensure that every compound name is also loaded as a synonym to itself
Nice idea @jcmatese. Seems like a good compromise, in fact, I've added it as "Option 1" above.
Nice idea.
Just saw this. I don't have a particular stake in how its done. As long as consistency is maintained and overhead is reduced. So what's the mechanism for ensuring every compound name is a synonym to itself? That's only a loading mechanism, correct? There's no database enforcement. Personally, I prefer a database mechanism, but as long as the potential inconsistency is mitigated, then I'm fine with a loading mechanism to ensure consistency. I don't think that the circular reference concern is terribly bad. We have lots of circular references in HTSEQ. A circular reference would be a concern if they are traversed backwards and forwards to accomplish the same goal, but in this case, it's a different goal. Compound->synonym is to get the "primary" name. Synonym->Compound is to get all names (secondary or primary). The relationship in this scenario is enforced by the database. The downside is that the links could be improperly set, so there's an overhead concern in any case. So again, I don't have a stake in the way we choose to do it.
Here's an as-yet novel idea that could be the best of both worlds: in the synonym table, we could set a priority value and ensure that synonym_id and synonym_priority are unique. Then we can grab the name with the highest priority when we want the primary compound name.
Names for Compounds are currently stored in both the Compound.name field and the Synonym.name field. Thus, it is possible for a particular name to refer to two compounds
Example:
Some possible options to avoid this inconsistency:
Option 1 - Ensure Compound.name is also Synonym.name [This was the chosen initial implementation]
By overriding the
save()
method, we can ensure thatCompound.name
is always also aSynonym.name
. If there was a conflict, we can throw aValidationError()
though I'm not sure if that should happen insave()
orclean()
or both.Option 2 - Store all names in the Synonym table
We could store all names in the
Synonym
table, with a unique constraint. An additional boolean field ofis_primary_name
could be used to track the "canonical name". Overriding theclean()
method would allow us to ensure exactly one name is primary for each compound. TheCompound.name
could then be a property that looks up the primary name from the Synonym table. See the State and Capital example in the article Don't be Circular. Downside: Removal ofName
from Compound also removes the only non-nullable independent indicator of an intrinsic/unique compound (formula
is not unique,HMDB
is slated to become optional, etc.).Option 3 - Circular references between Compound and Synonym
It's unclear to me if this will work in Django, but I believe it could be made to work.
Compound.name
would be a foreign key toSynonym
. All names would be Synonyms. The issue here is that this type of circular reference could cause some issues in implementation due to "chicken and egg" issues.Initial implementation will do Option 1.
Q. Shouldn't we either remove
Compound.name
or else make it a "primary synonym" link to the synonym table? Otherwise, this creates a bit of overhead to make sure the name and synonyms are copacetic._Originally posted by @hepcat72 in https://github.com/Princeton-LSI-ResearchComputing/tracebase/pull/230#discussion_r739347012_