Resource Model: SimpleAllele.canonicalAllele

cbizon commented 9 years ago

In the resource model our simple allele resource has a canonicalAllele element.

But, if there are multiple canonicalizers, it is possible for there to be more than one canonical allele associated with a simple allele. We should either remove this or make it 1..*.

larrybabb commented 9 years ago

My perspective is that no two canonicalizing authorities will produce the same results, and therefore their results cannot be mixed.

The canonicalAllele concept is really not a physical thing, it is more accurately described as a grouping identifier as defined by a given authority which may or may not follow a specification for grouping alleles that are the "same" thing.

I think it is becoming clearer to more and more groups that there needs to be a way to reliably and confidently compare allele representations. The canonical identifier is a means to that end. But it will take a significant effort to get a single specification developed and then enough groups to develop a repository for it which systems will depend on.

So, we should continue to document the notion that the canonical allele "identifier" from an external authority must be acceptable to a system that chooses to use it. But when dealing with multiple sources of canonical identifiers, it will be the responsibility of the receiving system to transform each one to a single authority's proper value (which may end up splitting or merging the simple allele set related to it). It may also be a point of negotiation when setting up interfaces between systems that transfer canonical identifiers to settle on a single authority. I suppose a third option, and probably the one that will be most common, is for each system that receives, curates and stores canonical alleles to have its own identifier.

What to do with "canonical identifiers" from external systems that do not reflect the canonicalization choices required by a given implementation? These would (IMO) become simple allele identifiers that would be tied to each individual simple allele from that external system.

I was thinking at one point that we should simply have a special "canonical identifier" on the simple allele which would be a mechanism for grouping simple alleles, but I think keeping it as a resource is still useful in order to deal with complex alleles. We really need a complex allele example to help clarify this vision.

I've been starting to think about these allele registries as something analogous to patient registries. Identifying patients is a very challenging problem, and health care has a fairly well developed structure for it. Patients can be identified by many different systems. Most health care systems have something equivalent to an MPI (Master Patient Index), which "canonicalizes" patient information. It is a bit different in that there are not necessarily multiple patient records with different representations of a patient. But there is some similarity in that all systems that need to identify a patient have a set of attributes or rules to assure that they identify the right one.

I was thinking that if we present a "Master Allele Index" approach then you could imagine that the actual equivalent to an MPI identifier for a patient would be a canonical identifier for an allele. And, just like you need 3 PHI attributes to identify a person, such as (name, dob, gender) or (ssn, dob, name) or whatever, we could define the sets of attributes that would reasonably locate the canonical allele and return an MAI identifier.
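To make the "Master Allele Index" idea a bit more concrete, here is a minimal sketch of a lookup keyed on alternative attribute sets. Everything in it is an assumption for illustration: the class name, the attribute names in `KEY_SETS`, and the `MAI000001` identifier format are made up, and which attribute sets are actually sufficient to locate an allele is exactly the open question.

```python
from typing import Dict, Iterator, Optional, Tuple

# Hypothetical attribute sets, any one of which is treated as enough
# to locate an allele (analogous to (name, dob, gender) for a patient).
KEY_SETS = [
    ("reference_sequence", "start", "end", "allele_state"),
    ("hgvs",),
]


class MasterAlleleIndex:
    """Toy in-memory index mapping attribute sets to one MAI identifier."""

    def __init__(self) -> None:
        self._index: Dict[Tuple, str] = {}
        self._next = 1

    @staticmethod
    def _keys(attributes: dict) -> Iterator[Tuple]:
        # Yield one lookup key per attribute set that is fully present.
        for key_set in KEY_SETS:
            if all(name in attributes for name in key_set):
                yield (key_set, tuple(attributes[name] for name in key_set))

    def lookup(self, attributes: dict) -> Optional[str]:
        # Return the MAI identifier if any complete attribute set matches.
        for key in self._keys(attributes):
            if key in self._index:
                return self._index[key]
        return None

    def register(self, attributes: dict) -> str:
        # Reuse an existing identifier when possible, otherwise mint one,
        # then index the allele under every attribute set it satisfies.
        mai_id = self.lookup(attributes)
        if mai_id is None:
            mai_id = f"MAI{self._next:06d}"
            self._next += 1
        for key in self._keys(attributes):
            self._index[key] = mai_id
        return mai_id
```

With this sketch, an allele registered with both coordinates and an HGVS-style name can later be found by either attribute set alone, and both lookups return the same identifier.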

Of course these are all just ideas that I am throwing out in the hopes that something resonates and makes sense to others.

In the end, we should leave the simple allele with a reference to one and only one canonical allele, and then document the notion that only a single authority should capture and group simple alleles for a given repository.

I suppose it would be reasonable to think you might be able to segregate allele types to different canonicalizing authorities (like structured vs. sequence nucleotide vs. sequence amino acid). But we won't really get into those weeds until further down the road.

cbizon commented 9 years ago

So your point of view is that a given database will only allow a single canonicalizer? Admittedly, I think that is the main case, but I'm not sure that it's realistically going to cover all the edges/corners. What happens when a canonicalizer needs to be updated? What if it changes a few canonicalizations but not all of them? What if an update to a canonicalizer breaks up a canonical allele that is attached to an assertion of some sort?

You also lose the ability to track things like potential differences between canonicalizers in a single database. (Though I guess the approach of stuffing that information into allele name could work, it is sort of a misappropriation of that data field, since names of that type have a special significance that the field does not really support semantically.)

I feel like we're making decisions for the implementors there that may not be warranted. If the model allows multiple canonicalizers, that doesn't force an implementation to use more than one, but if we only allow one, then we're constraining them, perhaps unduly.
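A rough sketch of what I mean, assuming the reference is stored per authority. The class shape, the `canonical_alleles` mapping, and the `allowed_authority` parameter are hypothetical names for illustration, not anything in the current resource model.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SimpleAllele:
    id: str
    # Hypothetical: authority name -> canonical allele identifier.
    # Many entries overall, but at most one per authority.
    canonical_alleles: Dict[str, str] = field(default_factory=dict)

    def set_canonical(
        self,
        authority: str,
        canonical_id: str,
        allowed_authority: Optional[str] = None,
    ) -> None:
        # An implementation that wants a single canonicalizer passes
        # allowed_authority; the model itself does not require it.
        if allowed_authority is not None and authority != allowed_authority:
            raise ValueError(
                f"only {allowed_authority!r} may canonicalize in this repository"
            )
        self.canonical_alleles[authority] = canonical_id
```

A repository that accepts several canonicalizers just calls `set_canonical` once per authority, while a single-authority repository always passes its own `allowed_authority` and gets the stricter behavior without any change to the model.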

larrybabb commented 9 years ago

My assumption regarding the scenarios of a single authority changing canonicalization methods resulting in splitting and merging of historical canonical identifiers was to use the "replaced-by" and "replaces" related canonical identifier relationship that we added for just that purpose.

The replaced identifiers may end up mapping to multiple new identifiers (in the case of a split) or be combined with a previously existing canonical identifier (in the case of a merge). In either situation the identifier being replaced would become inactive (which we have a flag for), and this would still allow external systems to find any previously published identifier and discover its new canonical form.

I suppose a single canonical allele could even become obsolete without any mapping to one or more active ones (but I am not 100% positive).
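Here is a minimal sketch of how an external system might resolve a previously published identifier under that scheme. The field names `active` and `replaced_by` and the function `resolve_active` are my own stand-ins for the inactive flag and the "replaced-by" relationship, so treat them as assumptions rather than the actual element names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CanonicalAllele:
    id: str
    active: bool = True
    # Identifiers that replace this one; more than one means a split.
    replaced_by: List[str] = field(default_factory=list)


def resolve_active(registry: Dict[str, CanonicalAllele], start_id: str) -> Set[str]:
    """Follow replaced-by links from start_id and return the active identifiers reached."""
    active: Set[str] = set()
    seen: Set[str] = set()
    stack = [start_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the replacement links
        seen.add(current)
        allele = registry[current]
        if allele.active:
            active.add(current)
        else:
            stack.extend(allele.replaced_by)
    return active
```

A split shows up as more than one identifier in the result, a merge as two retired identifiers resolving to the same active one, and an obsolete identifier with no mapping as an empty set.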

This is sort of how SNOMED and UMLS IDs work each time they publish a new version of the database.

We should consider documenting this paradigm I suppose, because in the end it will be very important for systems to know which version of a canonicalization engine implementation was used to generate a set of canonical ids.

No one said this was simple. ;)

I am open to other perspectives. This is simply how I currently see things working.

cbizon commented 9 years ago

To be fair, this was pretty much the way I had envisioned things as well: you would have a single canonicalizer related to a repository, but now I'm wondering if that's overly restrictive.

One question: would you consider the referenced canonical allele to be part of the data of the simple allele? Is it ok for a simple allele in system A (pointing to canonical allele CA) to be identified with a simple allele in system B (pointing to a different canonical allele CB, say created using a different canonicalizer)? Or are references like this outside the scope for what is considered identical?

larrybabb commented 9 years ago

Great point.

I think you are starting to reveal some considerable issues with the CanonicalAllele design. Conceptually speaking it is awesome and I get the value, but practically speaking there is a bit of work to be done in order to get the design right (IMO). I do believe we should focus in on this as a group and see if we can come up with some better options.

I think it would be good if you would do a quick read through of the Related Patient discussion and design solutions implemented in the FHIR spec: 5.1.3 Patient id's and Patient resource id's; 5.1.4 Linking Patients; 5.1.5 Patient vs Person vs Patient.Link; 5.1.7 Merging records.

I know this is not a perfect match with our situation, but I do think it provokes ideas on how we should frame and design our solution to grouping, relating, canonicalizing,... alleles.

The biggest hurdle I am having is that the CanonicalAllele is not something you can physically point to in relation to the other resources we have. It really is a simpleAllele and it is the simpleAlleles that are linked or related to each other as the "same" thing.

I'm starting to waver on our canonical allele concept. Maybe we can have an impromptu call on Monday to discuss (and you can bring me back from the edge).

cbizon commented 9 years ago

Yes, let's discuss on Monday. I don't see anything here that's making me waver on the concept, though if somebody has another approach I'd be happy to hear about it...

clingen-data-model / allele

Resource Model: SimpleAllele.canonicalAllele #57