Closed MillironX closed 1 year ago
I've never seen these terms actually defined (until now), so I can only share the way I've seen people using them. People who use "haplotype" tend to define haplotypes based on observed patterns of inheritance, i.e. a haplotype is a collection of alleles which empirically clusters in actual populations. To me, that does not align with the use here. I would expect a HaploType struct to be defined in PopGen.jl.
Part of the problem is that nomenclature is field-specific. "Variant" is pretty well understood among virologists[1], but I have no idea what e.g. botanists think it means. It's interesting (and sad) that Ensembl defined variant to mean something different.
I can't think of any terminology which is clear and unambiguous across fields. Perhaps "Genotype" for what is currently called "Variant"... but then again, I'm the one who originally settled on "Variant"/"Variation", so no wonder I can't come up with anything better. :)
[1] Wikipedia: "a subtype of a microorganism that is genetically distinct from a main strain, but not sufficiently different to be termed a distinct strain"
Well, I was hoping for more input than that, but...
I can see the issue with the name "haplotype," as it is typically associated with populations in mammals. My understanding is that a "haplotype" is the "genotype" of a single-ploidy organism (e.g. viruses), while the "genotype" of a multi-ploidy organism consists of multiple "haplotypes." Since this package only deals with single-ploidy references, it makes sense to use the more specific "haplotype," but I can see either term working.
"Variant" will still, in my vocabulary, mean a specific site (e.g. Single Nucleotide Variant), but enough papers refer to a single site as a "variation" that I would be content keeping "variation". Based on asking people in my department, it seems the term "variant" only came to have a strain-like connotation in the wake of SARS-CoV-2 (we use the term "lineage" or "clade" for what news anchors call "variant"), so it's clear to me that the term "variant" is ambiguous enough that it probably should be removed entirely.
Side note: workflows similar to those in the paper "A beginner’s guide for FMDV quasispecies analysis: sub-consensus variant detection and haplotype reconstruction using next-generation sequencing" were what sparked the initial name choices.
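To make the haplotype/genotype relationship described above concrete, here is a minimal sketch in Julia. All type and field names are hypothetical and chosen only for illustration (they borrow the "Variant = single site" reading); they are not the definitions used in SequenceVariation.jl, and the positions and alleles are made up.

```julia
# Illustrative types only; these are NOT SequenceVariation.jl's definitions.

# A change at one site relative to the reference (e.g. a single nucleotide variant).
struct Variant
    pos::Int       # 1-based position on the reference
    ref::String    # reference allele
    alt::String    # observed allele
end

# All the changes carried together on one copy of the genome.
struct Haplotype
    variants::Vector{Variant}
end

# One haplotype per genome copy; the ploidy is the number of haplotypes.
struct Genotype
    haplotypes::Vector{Haplotype}
end

ploidy(g::Genotype) = length(g.haplotypes)

# Haploid case (e.g. a virus): the genotype is a single haplotype.
# Positions and alleles below are made up.
haploid = Genotype([Haplotype([Variant(5, "C", "T"), Variant(12, "G", "A")])])

# Diploid case: two haplotypes, one of which matches the reference.
diploid = Genotype([Haplotype([Variant(5, "C", "T")]), Haplotype(Variant[])])
```

Under this reading, a package that only handles single-ploidy references never needs the outer `Genotype` wrapper, which is the argument for preferring the more specific "haplotype".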
I propose one of the following changes then
| Option 1 | Option 2 | Option 3 | Option 4 |
|---|---|---|---|
| Variant -> Haplotype | Variant -> Haplotype | Variant -> Genotype | Variant -> Genotype |
| Variation -> Variant | Variation -> Variation | Variation -> Variant | Variation -> Variation |
Feedback, anyone?
Expected behavior
TL;DR: I want a (mostly) unambiguous pair of terms to rename the types `Variant` and `Variation` to. I chose "Haplotype" and "Variant," but want feedback from other biologists before writing the code to change it.
First, let's clarify what `Variant` and `Variation` do in the context of the package.
`Variant`
`Variation`
My point of view (veterinary diagnostics) is different, but aligns pretty closely with the Ensembl glossary. Here are some entries from the glossary that I think are pertinent:
The big pain point for me comes from the fact that "Variant" refers to a single locus in most places, but in the package refers to a collection of loci. That disconnect even taints glue functions trying to parse `Variation`s from VariantCallFormat.jl's `VCF.Record`.
I propose renaming `Variant` to `Haplotype`, and `Variation` to `Variant`. These seem like the least ambiguous terms that apply from the glossary.
I would like feedback from others on these terms. Specifically @jakobnissen, since I know you also work(ed) on viral genomes, and @rasmushenningsson, since there's some overlap between the terminology in VariantCallFormat and SequenceVariation. Anyone else with an opinion, please also jump in.
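As a rough illustration of why the proposed names would line up better with glue code, the sketch below maps one single-locus record to one `Variant` and a set of records to one `Haplotype`. Everything here is an assumption made for illustration: `SiteRecord` is a hypothetical stand-in for one line of a VCF file (it is not VariantCallFormat.jl's `VCF.Record`), and the struct layouts and constructors are not the package's actual definitions.

```julia
# Hypothetical sketch of the proposed names; not the package's real API.

# Stand-in for one line of a VCF file (NOT VariantCallFormat.jl's VCF.Record).
struct SiteRecord
    pos::Int
    ref::String
    alt::String
end

# Proposed `Variant` (currently `Variation`): a change at a single locus.
struct Variant
    pos::Int
    ref::String
    alt::String
end

# Proposed `Haplotype` (currently `Variant`): a collection of single-locus changes.
struct Haplotype
    variants::Vector{Variant}
end

# One record describes one locus, so it maps to exactly one `Variant`.
Variant(r::SiteRecord) = Variant(r.pos, r.ref, r.alt)

# Collecting the per-locus variants, ordered by position, gives a `Haplotype`.
Haplotype(records::Vector{SiteRecord}) =
    Haplotype(sort!(Variant.(records); by = v -> v.pos))

# Made-up records, deliberately out of order.
records = [SiteRecord(12, "G", "A"), SiteRecord(5, "C", "T")]
hap = Haplotype(records)
```

With these names, "variant" means a single locus both in the VCF-style record and in the type that represents it, and the collection gets its own distinct name.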
Current behavior
Why did I implement issue forms?
Possible implementation
Again, why?