Closed: ahwagner closed this issue 8 months ago
In short, we have different sequence expressions that allow you to express a tandemly repeated sequence with varying levels of precision. To represent any tandem repeat (of either a precise or an approximate sequence, contiguous on a molecule), we use an Allele with a RepeatedSequenceExpression state.
The RepeatedSequenceExpression has another SequenceExpression as its subject. If it is important to represent an exact repeated sequence precisely, no matter how large, that is done using a LiteralSequenceExpression as the subject. If you want to express an approximate reference sequence, you would use a DerivedSequenceExpression, where the sequence is considered approximately matching (derived from) the sequence found at a specified SequenceLocation.
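As a rough illustration of the two choices described above, the Python sketch below models the objects as plain dataclasses. The class and field names (state, subject, count, sequence_id, start, end) follow the wording of this comment and are assumptions for illustration only, not the normative VRS schema; the sequence identifier, coordinates, and repeat count are made-up placeholders.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class SequenceLocation:
    sequence_id: str  # placeholder identifier, not a real accession
    start: int
    end: int


@dataclass(frozen=True)
class LiteralSequenceExpression:
    sequence: str  # the exact repeated unit, spelled out explicitly


@dataclass(frozen=True)
class DerivedSequenceExpression:
    location: SequenceLocation  # unit approximately matches the sequence found here


@dataclass(frozen=True)
class RepeatedSequenceExpression:
    # "subject" is the term used in this thread; the field name is an assumption.
    subject: Union[LiteralSequenceExpression, DerivedSequenceExpression]
    count: int  # hypothetical field: number of tandem copies


@dataclass(frozen=True)
class Allele:
    location: SequenceLocation
    state: RepeatedSequenceExpression


site = SequenceLocation("example-seq-1", 100, 130)

# Precise: the repeated unit is given as a literal sequence.
precise_repeat = Allele(
    location=site,
    state=RepeatedSequenceExpression(
        subject=LiteralSequenceExpression("CAG"), count=10
    ),
)

# Approximate: the repeated unit is derived from a reference location.
approximate_repeat = Allele(
    location=site,
    state=RepeatedSequenceExpression(
        subject=DerivedSequenceExpression(SequenceLocation("example-seq-1", 100, 103)),
        count=10,
    ),
)
```

Both objects describe a tandem repeat at the same site; they differ only in whether the repeated unit is stated exactly or by reference to a location.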
If you want to specify the number of copies of a specific Allele, or a broader characterization of a Gene across a system, but not necessarily in tandem, that can be done using CopyNumber with an Allele (a type of Molecular Variation) or a Gene (a type of Feature) as the CopyNumber subject.
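A minimal Python sketch of the copy-number case, under the same caveats: the class shapes, the field names (subject, copies, gene_id, description), and all values are hypothetical placeholders chosen to mirror the wording above, not the VRS schema itself.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Gene:
    gene_id: str  # placeholder identifier, not a real gene ID


@dataclass(frozen=True)
class Allele:
    description: str  # stand-in for a full Allele (location plus state)


@dataclass(frozen=True)
class CopyNumber:
    # The subject is either an Allele (Molecular Variation) or a Gene (Feature).
    subject: Union[Allele, Gene]
    copies: int  # copies across the system; says nothing about where they sit


gene_copies = CopyNumber(subject=Gene("example-gene-id"), copies=3)
allele_copies = CopyNumber(subject=Allele("example allele"), copies=3)
```

Unlike a repeated sequence expression on an Allele, nothing in this structure asserts that the copies are adjacent on a molecule.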
Finally, we get to your primary question and the title of this thread: what is the artificial or accepted boundary between CNVs and tandem repeats?
The short answer is we do not impose a boundary. We provide a mechanism where systems can specify precise sequences (explicitly) or approximate sequences (by Location-derived reference) as needed for purpose. In practice, we think that this boundary generally exists at one of:
@bheale I hope that you find this helpful, and I would welcome any clarifying questions you may have. Your questions have demonstrated the need for us to create a writeup in our VRS Appendices / FAQ to help other VRS newcomers, and any suggestions you may have on improving the above explanation would be welcome in creating that documentation!
Yep, I agree with the above. "We provide a mechanism where systems can specify precise sequences (explicitly) or approximate sequences (by Location-derived reference) as needed for purpose": perfect. It acknowledges the fuzziness of CNVs (e.g. just saying a region is repeated based on microarray (probe-based) data without sequence data) and also allows one to fully describe a CNV as a repeat at the sequence level. Systems will need to be able to handle both representations. Thanks! Bret
@bheale Be careful with this:
just saying a region is repeated based on microarray
... uses "repeated", which assumes the copies are in situ. In reality, the extra copies in a CNV may sit anywhere in the genome, possibly at several different locations. I would use "repeated" only if the copies are at the same location (tandem repeats).
This issue was marked stale due to inactivity.
This was resolved in systemic / molecular variation modeling of VRS 1.2.
Bringing in the below questions from #363, which were off-topic for the containing issue but worth discussing:
Originally posted by @bheale in https://github.com/ga4gh/vrs/issues/363#issuecomment-988309394