Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License
4 stars 1 forks source link

Ensure that a given name refers to exactly one Compound record #234

Closed jcmatese closed 2 years ago

jcmatese commented 2 years ago

Names for Compounds are currently stored in both the Compound.name field and the Synonym.name field. Thus, it is possible for a particular name to refer to two compounds

Example:

Compound.id=1
Compound.name=glucose
Compound.id=2
Compound.name=sugar
Synonym.id=1
Synonym.name=glucose
Synonym.compound.id=2

Some possible options to avoid this inconsistency:

Option 1 - Ensure Compound.name is also Synonym.name [This was the chosen initial implementation]

By overriding the save() method, we can ensure that Compound.name is always also a Synonym.name. If there was a conflict, we can throw a ValidationError() though I'm not sure if that should happen in save() or clean() or both.

Option 2 - Store all names in the Synonym table

We could store all names in the Synonym table, with a unique constraint. An additional boolean field of is_primary_name could be used to track the "canonical name". Overriding the clean() method would allow us to ensure exactly one name is primary for each compound. The Compound.name could then be a property that looks up the primary name from the Synonym table. See the State and Capital example in the article Don't be Circular. Downside: Removal of Name from Compound also removes the only non-nullable independent indicator of an intrinsic/unique compound (formula is not unique, HMDB is slated to become optional, etc.).

Option 3 - Circular references between Compound and Synonym

It's unclear to me if this will work in Django, but I believe it could be made to work. Compound.name would be a foreign key to Synonym. All names would be Synonyms. The issue here is that this type of circular reference could cause some issues in implementation due to "chicken and egg" issues.


Initial implementation will do Option 1.


Q. Shouldn't we either remove Compound.name or else make it a "primary synonym" link to the synonym table? Otherwise, this creates a bit of overhead to make sure the name and synonyms are copacetic.

_Originally posted by @hepcat72 in https://github.com/Princeton-LSI-ResearchComputing/tracebase/pull/230#discussion_r739347012_

jcmatese commented 2 years ago

OK, if this is something that people vote for. You may have to do the following:

  1. New migration, adding nullable name_fk foreign key to Compound, and a new database sequence id as the primary key of compound_synonym
  2. migrate
  3. New empty migration, with python code to insert all the current values of compound.name into compound_synonym , relating back to the compound, and also add the new inserted name_fk python app_code/manage.py makemigrations --empty --name transfer_compound_names DataRepo
  4. migrate
  5. new migration, making name_fk required, and the dropping the initial Compound.name column
  6. migrate
  7. new migration, rename the Compound.name_fk to Compound.name
  8. migrate
  9. Ensure all loading and viewing code deals with the above changes

I am uncertain exactly how migration number 2 (data migration) happens in practice, because it is not model generated.

I also don't know exactly how Django transactions will handle the reciprocal non-nullable foreign keys between compound and compound_synonym. They would both have to be inserted within the same transaction, and hope they don't run afoul of a non-nullable check. Alternatively, modify that model too, somehow (reverse the relationship direction/cardinality, or something).

Personally, I prefer to have the model remain simple and the systematic/preferred name be intrinsic to the compound model, but if the majority want to embark on the above, we can try it.

@lparsons @hepcat72 @fkang-pu

fkang-pu commented 2 years ago

I want to make sure I understand the issue correctly. Here is what I think in my mind with the following example. How would you handle this case?

example: Compound.id=20 Compound.name=glucose

two synonyms: b-Glucose Glucose

I would:

lparsons commented 2 years ago

You're example makes sense Fan, however, I see the issue as ensuring that we don't create some inconsisency down the line (accidentally). Following you example, someone might provide a new list of compounds and synonyms that include:

Compound.id=20 Compound.name=a-Glucose

synonym: glucose

Since there is no record of glucose in the synonym table, to catch this, we would have to ensure that in addition to making sure that the list of synonyms is unique (easily done with a database constraint), that no synonym matches a name in the Compound table (not difficult, but an extra consistency check that is required every time). I'd like to change the wording of this issue to state that this is a requirement of the model/code and then propose some solutions (I can see many ways to handle take care of this).

@hepcat72 and @jcmatese Am I capturing the concerns correctly here? I'd like to make sure I capture everyone's concerns. If so, I'll reword the issue and start a list of options.

jcmatese commented 2 years ago

We could just ensure that every compound name is also loaded as a synonym to itself, with both exact and various permutations. The thing is, because the individual lab members are the source of the names, there will likely always be the edits required (either to database or file) to normalize their spelling, cases, etc. [else, append onto the list of existing synonyms]. I understand the concerns, but I am not too troubled by them.

lparsons commented 2 years ago

We could just ensure that every compound name is also loaded as a synonym to itself

Nice idea @jcmatese. Seems like a good compromise, in fact, I've added it as "Option 1" above.

fkang-pu commented 2 years ago

Nice idea.

hepcat72 commented 2 years ago

Just saw this. I don't have a particular stake in how its done. As long as consistency is maintained and overhead is reduced. So what's the mechanism for ensuring every compound name is a synonym to itself? That's only a loading mechanism, correct? There's no database enforcement. Personally, I prefer a database mechanism, but as long as the potential inconsistency is mitigated, then I'm fine with a loading mechanism to ensure consistency. I don't think that the circular reference concern is terribly bad. We have lots of circular references in HTSEQ. A circular reference would be a concern if they are traversed backwards and forwards to accomplish the same goal, but in this case, it's a different goal. Compound->synonym is to get the "primary" name. Synonym->Compound is to get all names (secondary or primary). The relationship in this scenario is enforced by the database. The downside is that the links could be improperly set, so there's an overhead concern in any case. So again, I don't have a stake in the way we choose to do it.

Here's an as-yet novel idea that could be the best of both worlds: in the synonym table, we could set a priority value and ensure that synonym_id and synonym_priority are unique. Then we can grab the name with the highest priority when we want the primary compound name.