Open lh3 opened 3 years ago
Hey, this is really interesting that the provenance for 1% frequency has gone missing. I can find several references to it, but not the source yet. Here is a paper from 2001 that refers to it but does not reference where this wisdom came from. There are others, some as recent as 2016.
Leonid Kruglyak & Deborah A. Nickerson, "Variation is the spice of life", published March 2001.
https://www-nature-com.ezproxy.lib.utah.edu/articles/ng0301_234 "A more reasonable basis for comparison is obtained if we restrict our consideration to SNPs with both alleles occurring in the population at or above a minimal frequency. The traditional definition of 'polymorphism' sets this frequency at 1%."
I think what we need to convey here is that there is a difference between SNV and SNP: with SNPs, multiple (at least two) versions are circulating in a population. I'm not so sure about including the word 'normal', as phenotypes can be observed even in a heterozygous state.
This is a great place to start a conversation though and now I am determined to get to the bottom of the 1% mystery.
We will get back to you after we have had time to dive into the literature this weekend and see if we can formulate a definition that makes everyone happy.
sincerely, --Karen
Note that the source of this 1% threshold is just one problem. The other, bigger problem is that "population" is not defined, which makes it impractical to apply the frequency threshold.
SNP is germline only. SNV may represent somatic substitutions. That is their main difference. Actually when SNV was first used in the field, it was mostly used for somatic substitutions. Generalizing SNV to germline substitutions is a more recent thing.
Sure, SNP is germline and SNV can be either. They are not synonyms though.
We need two terms: one to say this is a difference that was observed (i.e., a de novo variant in a sick kid in the NICU), and the other to say that this is a difference that is observed at some frequency in some population (i.e., the variant that causes red hair). We are not in the business of defining population in SO. That would be a local implementation.
Removing the hard threshold from the definition is fine - it seems quite arbitrary, but I would like to understand the history of that so I can clarify in the comment section.
> Hey, this is really interesting that the provenance for 1% frequency has gone missing.
I recall either 5% or 1% MAF thresholds for classifying SNPs as common or rare variants. For example: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3260649/ "Rare variants were defined as SNPs with minor allele frequency (MAF) < 0.01"
Note that doesn't preclude using SNP for rare variants, it just says rare variants are a subset of SNPs.
Plus dbSNP, which I believe started in 2001, has never applied MAF thresholds for saying a variant is in scope for a dbSNP record. It seems a bit awkward (and amusing) to have an SO definition of SNP that doesn't include the majority of entries in dbSNP.
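To make the MAF threshold discussed above concrete, here is a minimal sketch of how a "MAF < 0.01" rule might be applied, and how the answer can flip depending on which population is used. The function names, cohort labels, and allele counts are all hypothetical and only for illustration; they do not come from SO, dbSNP, or any of the cited papers.

```python
def minor_allele_frequency(alt_count: int, total_alleles: int) -> float:
    """Frequency of the less common allele at a biallelic site."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    if not 0 <= alt_count <= total_alleles:
        raise ValueError("alt_count must be between 0 and total_alleles")
    alt_freq = alt_count / total_alleles
    return min(alt_freq, 1.0 - alt_freq)


def is_rare(maf: float, threshold: float = 0.01) -> bool:
    """Apply a 'rare variant' cutoff; 0.01 mirrors the MAF < 0.01 rule quoted above."""
    return maf < threshold


# Hypothetical counts for one variant: (alternate allele count, total alleles sampled).
cohorts = {
    "cohort_A": (30, 2000),  # 1.5% in this sample
    "cohort_B": (4, 2000),   # 0.2% in this sample
}
# Pooling the two cohorts gives 34 / 4000 = 0.85%.
cohorts["pooled"] = (
    sum(alt for alt, _ in cohorts.values()),
    sum(total for _, total in cohorts.values()),
)

for name, (alt, total) in cohorts.items():
    maf = minor_allele_frequency(alt, total)
    label = "rare" if is_rare(maf) else "not rare"
    print(f"{name}: MAF = {maf:.4f} ({label})")
```

With these made-up numbers, the same variant is above 1% in one cohort and below it in the other and in the pooled sample, which is the practical difficulty with tying the term itself to a threshold when "population" is left undefined.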
What is the SO term name and accession?
SNP (SO:0000694)
Describe what you would like to change.
Drop the frequency requirement and change the definition to
There could be more accurate phrasing but let's leave it to another issue.
Relevant Publications
The dbSNP paper in 1999, the HGP paper in 2001 and the 1000 Genomes paper in 2015 use SNP without requiring a frequency threshold. As is pointed out by Matthew Hahn, one of the first uses of the terminology doesn't mention a frequency threshold, either. In these papers, a SNP simply refers to a germline substitution. I also wrote a blog post to explain why a frequency threshold is impractical to test or apply. Aylwyn Scally has a similar concern.
When this 1% threshold was added to the definition of SNP is a mystery. Matthew and a few others have tried but couldn't identify the source. The current SO definition is not backed up by publications.