ga4gh / vrs

Extensible specification for representing and uniquely identifying biological sequence variation
https://vrs.ga4gh.org
Apache License 2.0
80 stars 32 forks source link

Make a better term for Genotype.count #398

Closed ahwagner closed 2 years ago

ahwagner commented 2 years ago

@larrybabb and @rrfreimuth agree that Genotype.count is not an ideal term due to confusion with GenotypeMember.count

Other terms considered include ploidy and somy but no clear term exists for this measure. Blocking issue for #394.

larrybabb commented 2 years ago

How about total or _totalcount?

rrfreimuth commented 2 years ago

I think the name needs to include a noun if it does not refer to the class that contains it. For example, member_count. I don’t know that I have a good alternative at this time.

larrybabb commented 2 years ago

It's really a count of the occurrences of the molecules in the genome (or system) that contain the genotypeMembers. It could be seen as a "member count" but i think that takes the focus off of the primary definition which is more of the systemic count or a max potential member count.

rrfreimuth commented 2 years ago

That's a good clarification, and it exemplifies why I think it is important to put more meaning into the name (which is something I try to minimize). In this case, the class is Genotype but the count refers to molecules in a system, so Genotype.count isn't quite right because it sounds like a count of genotypes. I would offer an idea if I had one, but to be honest this is a very difficult thing to try to convey within this structure and I'm having a little trouble wrapping my head around it. I'm gravitating towards something related to "locus" and "ploidy" but I'm not sure either of those words are quite right.

ahwagner commented 2 years ago

homologous_members_at_locus​? Wordy, but descriptive.

larrybabb commented 2 years ago

i don't think it should have a reference to "members". A single occurrence of a genotypeMember exists on one and only one copy of the molecule that it's parent Genotype is defined by. The total possible copies of the genotype's molecule is the genotype count. Each copy of the molecule may or may not have an explicit GenotypeMember defined in the Genotype definition. These total copies of the molecules for the Genotype are essential in understanding the level of specificity a given Genotype is conveying.

rrfreimuth commented 2 years ago

The total possible copies of the genotype's molecule is the genotype count.

I don't quite follow the idea of a "genotype's molecule" (since genotype is a slice across molecules), but I know what you're trying to convey. I don't want to over complicate this, but I think part of the issue is that there are multiple concepts represented by the same structure. Without getting too caught up in the names, the following concepts are part of this object:

Note: The locus isn't specifically defined at the level of Genotype, but must be derived from its members. There may be a need for business rules or implementation guidance to ensure all members have consistent Locations.

Some concepts are attributes of the system, some of a reference molecule, some of observed molecules with homology to the reference. This is where I tend to get tangled up a bit.

ahwagner commented 2 years ago

This is a complex object, though I do not think that is the challenge in finding an appropriate term. Genotype is a well-understood concept, and many of the referenced attributes of this object have been previously characterized in VRS. For example, the Haplotype and Allele members capture:

And the Genotype inherits from Systemic Variation, which describes:

So the only new pieces to consider are:

In my view, the point of having a description for all classes and slots in our schema and documentation is to capture these definitions and keep the variables themselves simple. The challenge for "the total in-trans homologous molecules at the genomic locus" concept is that I don't think there is a way to describe this in one or two words. I still think count is best.

When considering variables such as Allele.state, what it means is only clear in the context of the parent object, by its definition. I am unsure why we are holding the Genotype.count slot to a higher standard than Allele.state, though I am beginning to think this issue might just be a matter of clearer definitions rather than descriptive slot names.

ahwagner commented 2 years ago

In discussion with @rrfreimuth no better variable name was found and it was agreed that the associated Genotype.count description attribute sufficiently conveys the intent of this field.