ga4gh / vrs

Extensible specification for representing and uniquely identifying biological sequence variation
https://vrs.ga4gh.org
Apache License 2.0

What is the artificial (or generally acceptable) boundary between CNV and tandem repeats? #364

Closed by ahwagner 8 months ago

ahwagner commented 2 years ago

Bringing in the below questions from #363, which were off-topic for the containing issue but worth discussing:

What is the artificial (or generally acceptable) boundary between CNV versus tandem repeats? Or could the structure here also cover communication of a CNV where the sequence of each copy of X is known? E.g. a gene duplication event where the duplicates do not share the exact same sequence?

Originally posted by @bheale in https://github.com/ga4gh/vrs/issues/363#issuecomment-988309394

ahwagner commented 2 years ago

In short, we have different sequence expressions that allow you to express tandem repeating sequence with varying levels of precision. To represent any tandem repeats (of either a precise or approximate sequence and contiguous on a molecule), we use Allele with a RepeatedSequenceExpression state.

The RepeatedSequenceExpression has another SequenceExpression as its subject. If it is important to represent an exact repeated sequence, no matter how large, use a LiteralSequenceExpression as the subject. If you want to express a sequence that only approximately matches the reference, use a DerivedSequenceExpression, where the sequence is considered approximately matching (derived from) the sequence found at a specified SequenceLocation.
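To make the nesting concrete, here is a minimal sketch in plain Python dataclasses. It only mirrors the terms used in this comment; the field names (`subject`, `count`, `sequence_id`, `start`, `end`) and the identifier `example-seq-1` are assumptions for illustration and should not be read as the published VRS schema or the vrs-python API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class SequenceLocation:
    sequence_id: str  # placeholder identifier, not a real accession
    start: int
    end: int


@dataclass(frozen=True)
class LiteralSequenceExpression:
    sequence: str  # the exact repeated unit, spelled out


@dataclass(frozen=True)
class DerivedSequenceExpression:
    location: SequenceLocation  # the unit approximately matches the sequence here


@dataclass(frozen=True)
class RepeatedSequenceExpression:
    # "subject" follows the wording of this thread; it is an assumed field name.
    subject: Union[LiteralSequenceExpression, DerivedSequenceExpression]
    count: int  # number of tandem copies


@dataclass(frozen=True)
class Allele:
    location: SequenceLocation
    state: RepeatedSequenceExpression


# Precise: 40 tandem copies of the literal unit "CAG".
precise = Allele(
    location=SequenceLocation("example-seq-1", 1000, 1060),
    state=RepeatedSequenceExpression(LiteralSequenceExpression("CAG"), count=40),
)

# Approximate: 3 tandem copies of a unit derived from a 50 kb reference span.
region = SequenceLocation("example-seq-1", 200_000, 250_000)
approximate = Allele(
    location=region,
    state=RepeatedSequenceExpression(DerivedSequenceExpression(region), count=3),
)
```

Both objects have the same outer shape (an Allele whose state is a RepeatedSequenceExpression); only the subject changes with the level of precision.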

If you want to specify the number of copies of a specific Allele, or a broader characterization of a Gene, across a system, but not necessarily in tandem, that can be done using CopyNumber with an Allele (a type of Molecular Variation) or Gene (a type of Feature) as the CopyNumber subject.
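A similarly hedged sketch of the CopyNumber case follows. The class and field names (`subject`, `copies`, `gene_id`, `description`) and the identifiers are illustrative assumptions, not the VRS schema; the point is only that the subject can be either an Allele or a Gene and that nothing here says where the copies sit.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Allele:
    description: str  # stand-in for a full Allele object


@dataclass(frozen=True)
class Gene:
    gene_id: str  # placeholder identifier


@dataclass(frozen=True)
class CopyNumber:
    subject: Union[Allele, Gene]
    copies: int  # total copies across the system, in tandem or not


# Three copies of a gene, with no claim about their locations.
gene_gain = CopyNumber(subject=Gene("example:GENE1"), copies=3)

# Two copies of a specific allele.
allele_count = CopyNumber(subject=Allele("hypothetical allele"), copies=2)
```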

Finally, we get to your primary question, and the title of this thread: what is the artificial / accepted boundary between CNV and tandem repeats?

The short answer is we do not impose a boundary. We provide a mechanism where systems can specify precise sequences (explicitly) or approximate sequences (by Location-derived reference) as needed for purpose. In practice, we think that this boundary generally exists at one of:

  1. the size beyond which your assay cannot confidently express a contiguous, in-cis sequence
  2. the size beyond which the exact sequence is not important for the interpretation of the CNV / tandem repeat
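One possible reading of these two criteria, written as a decision helper. The function `choose_representation`, its parameters, and the returned labels are hypothetical and not part of VRS; the mapping of the first criterion to CopyNumber is an interpretation, since VRS itself imposes no boundary.

```python
def choose_representation(
    span_length: int,
    assay_max_in_cis_length: int,
    exact_sequence_matters: bool,
) -> str:
    """Pick a representation for a repeated region (illustrative only).

    span_length: total length of the repeated region, in bases.
    assay_max_in_cis_length: largest span the assay can confidently
        report as one contiguous, in-cis sequence (an assumed input).
    exact_sequence_matters: whether interpretation needs the exact sequence.
    """
    # Criterion 1: the assay cannot confirm a contiguous, in-cis sequence,
    # so only a copy count is claimed.
    if span_length > assay_max_in_cis_length:
        return "CopyNumber"
    # Criterion 2: tandem is supported; choose the precision of the subject.
    if exact_sequence_matters:
        return "RepeatedSequenceExpression with LiteralSequenceExpression"
    return "RepeatedSequenceExpression with DerivedSequenceExpression"


# A short repeat read end to end, where the exact unit matters.
print(choose_representation(120, 10_000, True))
# A large duplication seen only as a probe-level gain.
print(choose_representation(2_000_000, 10_000, False))
```

The thresholds are properties of the assay and the use case, which is why they are arguments rather than constants.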

@bheale I hope that you find this helpful, and I would welcome any clarifying questions you may have. Your questions have demonstrated the need for us to create a writeup in our VRS Appendices / FAQ to help other VRS newcomers, and any suggestions you may have on improving the above explanation would be welcome in creating that documentation!

bheale commented 2 years ago

Yep. I agree with the above. "We provide a mechanism where systems can specify precise sequences (explicitly) or approximate sequences (by Location-derived reference) as needed for purpose": perfect. It acknowledges the CNV's fuzziness (e.g. just saying a region is repeated based on microarray (probe-based) data, without sequence data) and allows one to fully describe a CNV as a repeat at the sequence level. Systems will need to be able to handle both representations. Thanks! Bret

mbaudis commented 2 years ago

@bheale Be careful with this:

just saying a region is repeated based on microarray

... uses "repeated", which implies in situ. In reality, the extra copies in a CNV may sit anywhere in the genome, possibly at several different locations. I would use "repeated" only if the copies are in the same location (tandems ...).

github-actions[bot] commented 10 months ago

This issue was marked stale due to inactivity.

ahwagner commented 8 months ago

This was resolved in systemic / molecular variation modeling of VRS 1.2.