Total Copy/Genomic Count concept and HGVS

larrybabb commented 5 years ago

This is a copy and paste from an email thread initiated by Peter Causey-Freeman...

==== From 9/26/2018 ====

My colleague Raymond Dalgleish and I have been asked to advise regarding HGVS descriptions in which the term dup (duplication) is being used to describe increases in the copy number of a specified range of genomic sequence. The problem is that dup is intended as the description of a tandem duplication of sequence, so its use in these circumstances is inappropriate because the additional copies of the described genomic region may be alleles at different genomic loci. If I remember correctly, you have views on this issue relating to the VMC model.

Raymond is a full member of the HGVS SVN working group. He recently attended a meeting concerning the best practice for variant reporting in DMD. At the meeting he was asked whether there was an HGVS compliant way to describe the total count for individual exons, for example DMD gene exons identified by MLPA (https://en.wikipedia.org/wiki/Multiplex_ligation-dependent_probe_amplification ). When Raymond and I discussed the issue, I remembered the VarRep discussions about the potential use of a total genome count object within the VMC model. Raymond and I want to propose a similar object/variant description for adoption by the HGVS and it would make sense to use similar terminology. What we need is a 3 letter description for the variant type e.g. dup = duplication, ins = insertion. Have you chosen a term for use in VMC? If not, we were thinking about proposing tcc = total copy count or tgc = total genomic count.

Do you have any comments or thoughts on the issue?

Best wishes,

Peter Causey-Freeman

larrybabb commented 5 years ago

=== Response from LBabb later that day on 9/26/2018 ====

Thanks for the email and the ideas.

We have been wrestling with the copies issue in relation to differentiating tandem vs total copies as you outlined below. I’m not sure we’ve exactly settled on a single term or even if the VR group feels these should be represented differently. Tristan may have better insights as he has been spearheading those discussions.

My perception is that we do need a term for the concept that you have specified below. So far it has shown up as a “copies” attribute within our “state” component (aka ALT in VCF lingo). But pulling out a “total ? Copies” term makes a lot of sense to me, as “copies” by itself can mean “in addition to the original”, which can be confusing. Also, our “copies” has not been clear on distinguishing tandem vs random (I think).

Let us shoot this around but we don’t want to deter your efforts. I will raise it with the group and use it to influence that work as we get further along.

larrybabb commented 5 years ago

=== Peter's response the next day 9/27/2018 ====

Happy to discuss this with Reece and Tristan as well.

It seems that your thoughts on the issue are in line with ours. Raymond and I also thought that a total count term would be more appropriate for the reasons you mentioned, i.e. it is explicitly inclusive of all copies of a specified region.

There were concerns at the last meeting I was able to attend as to whether the state object was appropriate because, as you say, it potentially implies that any increased copy number belongs to the identified Allele. For tandem copy number increases this is correct, but not if the increased copy number is inserted into a different genomic locus, or if the site of this insertion is unknown.

A suggestion made by Tristan was that it might be appropriate to use a separate object outside of the state object to define a copy count, and that object could be linked to multiple alleles if the location of additional copies of the originating sequence are identified. This model certainly feels more comfortable to myself and Raymond, hence I thought it best to see where we were up to before making a proposal to the HGVS.

To us, it makes sense to align efforts on this issue, especially since the VR group are discussing it so thoroughly.

We are in no rush to push ahead with a proposal, and it would make sense to hang fire until a consensus is reached by the VR group. I'd also be interested to hear what Reece and Tristan think.

Thanks for getting back to us so quickly. Best wishes.

Pete

larrybabb commented 5 years ago

=== Reece Hart's comments later that day 9/27/2018 ====

We're still discussing how to model copy number, including the cases you mentioned.

Some questions that we're mulling over:

Relative versus absolute copy number (or both)? If relative, then relative to what?
How to handle copy number ranges (e.g., 3-7 copies)
"Locatable" and non-locatable. We may or may not know where the cnv occurs, and we must handle both cases.
Should small-scale repeats (e.g., trinucleotides) be modeled the same way as larger repeats (e.g., genes)? If not, how do we define the cases?
Uncertainty: Some assays report ranges, such as "more than 50 repeats". How do we capture that?
What's the interplay between CNVs and other phenomena like translocations?
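As one way to make the questions about absolute counts, ranges, and open-ended assay results concrete, here is a minimal Python sketch. The `CopyCount` class and its field names are hypothetical, invented for illustration only; they are not part of the VMC/VR model or of HGVS. It assumes counts are stored as absolute totals with an optional baseline, so that "3-7 copies" and "more than 50 repeats" can share one shape.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CopyCount:
    """Hypothetical total copy count; field names are illustrative assumptions."""
    min_copies: Optional[int] = None  # None: no known lower bound
    max_copies: Optional[int] = None  # None: no known upper bound
    baseline: Optional[int] = None    # expected copies (e.g. 2 for diploid), if known

    def __post_init__(self):
        # Reject inverted ranges such as 7-3.
        if (self.min_copies is not None and self.max_copies is not None
                and self.min_copies > self.max_copies):
            raise ValueError("min_copies must not exceed max_copies")

    def describe(self) -> str:
        lo, hi = self.min_copies, self.max_copies
        if lo is None and hi is None:
            text = "unknown copy number"
        elif lo is not None and lo == hi:
            text = f"{lo} copies"            # exact count
        elif hi is None:
            text = f">={lo} copies"          # open-ended, e.g. "more than 50"
        elif lo is None:
            text = f"<={hi} copies"
        else:
            text = f"{lo}-{hi} copies"       # bounded range
        if self.baseline is not None:
            text += f" (baseline {self.baseline})"
        return text


print(CopyCount(3, 7).describe())                # range
print(CopyCount(min_copies=51).describe())       # "more than 50 repeats"
print(CopyCount(0, 0, baseline=2).describe())    # homozygous loss on diploid
```

Storing an absolute total plus an optional baseline keeps the relative change derivable when the baseline is known, without forcing every record to state one. Whether that is the right trade-off is exactly what the questions above leave open.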

mbaudis commented 5 years ago

Here are some of my opinions (1/several coming up):

The problem is that dup is intended as the description of a tandem duplication of sequence,

This is plainly wrong, since it doesn't follow common use in cytogenetic nomenclature, VCF etc., where "DUP" or "dup" just indicates a quantitative increase of a genomic region compared to "baseline" - usually 2n, but it can be against any base ploidy for the genome or (in the case of X, Y) the chromosome in question. So /DUP/i should be used for any quantitative increase.

In principle, the use of trp could be fine - as a subset of dup for the gain of one additional copy on a 2n ploidy, such as trp_ch21; but would be rather limited in its added benefit. The general syntax for copy number abnormalities should include the

directionality of the change compared to baseline (DUP | DEL)
- this is sufficient for many use cases, and in contrast to a tgc (total count) can be interpreted w/o the optional information below
optional: an absolute or relative indication of the quantitative change
- the numeric tgc - which will frequently not be available - or a class (corresponding e.g. to "high-level amplification") - for which a vocabulary has to be set
optional: the base ploidy

In an object model, this can be easily done through a prototype (in the example homozygous deletion on a diploid baseline):

  "state" : {
    "termId" : "SO:0001743",
    "termLabel": "copy_number_loss",
    "info" : {
      "ploidy" : "2n",
      "alleleCount" : 0
    }
  }

In a stringified shorthand GRCh38::chr9:21967753-21975098::DEL_tgc0_bgc_2 or such would cover this scenario (and could be used to generate HGVS annotation).
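To illustrate how the object form and the stringified shorthand could relate, here is a small Python sketch that builds the shorthand from a `state` object like the one above. The function name `to_shorthand` is hypothetical, and two things are assumptions rather than anything settled in this thread: that `alleleCount` corresponds to the `tgc` value, and that a ploidy of `"2n"` maps to `bgc_2`. The shorthand itself was only offered "or such", so the exact layout is illustrative.

```python
# Sequence Ontology term IDs mapped to the DUP/DEL direction of the shorthand.
DIRECTION_BY_TERM = {
    "SO:0001742": "DUP",  # copy_number_gain
    "SO:0001743": "DEL",  # copy_number_loss
}


def to_shorthand(assembly, chrom, start, end, state):
    """Build an illustrative shorthand string; the format is an assumption."""
    parts = [DIRECTION_BY_TERM[state["termId"]]]
    info = state.get("info", {})
    # Optional: total genomic count (assumed to be what alleleCount holds).
    if "alleleCount" in info:
        parts.append(f"tgc{info['alleleCount']}")
    # Optional: base ploidy, written as the number before the trailing "n".
    if "ploidy" in info:
        parts.append(f"bgc_{info['ploidy'].rstrip('n')}")
    return f"{assembly}::{chrom}:{start}-{end}::" + "_".join(parts)


state = {
    "termId": "SO:0001743",
    "termLabel": "copy_number_loss",
    "info": {"ploidy": "2n", "alleleCount": 0},
}
shorthand = to_shorthand("GRCh38", "chr9", 21967753, 21975098, state)
print(shorthand)  # GRCh38::chr9:21967753-21975098::DEL_tgc0_bgc_2
```

Because both the count and the ploidy are optional, a record that only knows the direction of change still yields a usable string (for example ending in `::DUP`), which matches the point that directionality alone is sufficient for many use cases.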

mbaudis commented 5 years ago

@larrybabb @reece @PeteCausey-Freeman Picking up now

=== Reece Hart's comments later that day 9/27/2018 ====

... with some comments from my side:

Some questions that we're mulling over:

Relative versus absolute copy number (or both)? If relative, then relative to what?

How to handle copy number ranges (e.g., 3-7 copies)

See the example in my post above. There is an open question of where, e.g., the base ploidy would be expressed in a complete model (allele/variant vs. genotype ...). Wherever the reference is being pointed at from? Those "CN range classes" also still need to be developed.

"Locatable" and non-locatable. We may or may not know where the cnv occurs, and we must handle both cases.

Yes. There has to be a mechanism to link variants. It seems preferable to, e.g., encode BRKs as separate events and link them to the ends of DUP events for placing them. Please see the example.

Should small-scale repeats (e.g., trinucleotides) be modeled the same way as larger repeats (e.g., genes)? If not, how do we define the cases?

Uncertainty: Some assays report ranges, such as "more than 50 repeats". How do we capture that?

While in principle a repeat expansion could be reported as a DUP compared to a reference number of repeats, those are very different beasts: in essence "precise" sequence modifications (even if the repeat number may be a range and carry uncertainty). So they should have a separate type of representation. A DUP representation would be "logically" correct (we don't assume any type of mechanism behind a DUP event), but certainly not optimal.

What's the interplay between CNVs and other phenomena like translocations?

All CNVs are flanked by 2 pairs of breakpoints (BRK in VCF speak); the inner ones are part of the CNV definition, the outer ones are either known or not, and may be on the same chromosome or not.

So one needs:

a definition for BRK variation events (position, state)
a general method to reference between variation events

See again the example.
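A rough Python sketch of what "BRK events plus a general reference mechanism" might look like follows. It is not the linked example referred to in this comment; the class names, the `placed_at` field, the IDs, and the chr12 coordinate are all hypothetical placeholders chosen for illustration. The idea shown is that a CNV carries only its own (inner) extent and refers by ID to separate breakpoint events when its placement is known.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Breakpoint:
    """A BRK variation event: just an identifier and a position."""
    id: str
    chrom: str
    position: int


@dataclass
class CopyNumberVariant:
    """A CNV defined by its inner extent; placement is optional."""
    id: str
    chrom: str
    start: int
    end: int
    direction: str  # "DUP" or "DEL"
    # IDs of outer Breakpoint events where the copies sit; empty if not locatable.
    placed_at: List[str] = field(default_factory=list)


def resolve_placement(cnv: CopyNumberVariant,
                      registry: Dict[str, Breakpoint]) -> List[Breakpoint]:
    """Follow the CNV's references to the breakpoint events it points at."""
    return [registry[ref] for ref in cnv.placed_at]


# Placeholder data: a DUP of the chr9 region whose extra copy sits on chr12.
brk = Breakpoint(id="brk1", chrom="chr12", position=1000)
registry = {brk.id: brk}
placed = CopyNumberVariant("cnv1", "chr9", 21967753, 21975098, "DUP", ["brk1"])
unplaced = CopyNumberVariant("cnv2", "chr9", 21967753, 21975098, "DUP")

print(resolve_placement(placed, registry))    # one breakpoint on chr12
print(resolve_placement(unplaced, registry))  # empty list: location unknown
```

Keeping the reference list empty for non-locatable CNVs means the same object shape covers both the "locatable" and "non-locatable" cases without a separate type.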

Alternative to "placed CNV by referencing"

Not my preferred option; however (especially for knowledge collections), one could also create some kind of "bundle" of CNV and BRK events (or, for that matter, of the 2 BRK events constituting a translocation ... but then it becomes crazy if one wants to represent complex events...).

Peter-J-Freeman commented 5 years ago

Hi Michael,

Unfortunately, within HGVS the term dup is reserved for tandem duplication of a specified range within a reference sequence. Refer to the following link. http://varnomen.hgvs.org/recommendations/DNA/variant/duplication/

Similarly, deletions in HGVS are reserved for the absence of a specified range from a reference sequence.

So it looks like there will be a need to address some disparity between HGVS and other models such as VCF and VMC. I suspect that the HGVS working group would prefer to create additional variant types other than dup or del to specify a gain or loss of copies.

Have you any suggestions as to how your example might be used to determine whether a DUP is a copy number change or a tandem insertion of the specified range?

Also, do we need further discussion with the HGVS SVN working group with respect to this issue?

Thanks

Pete

Peter-J-Freeman commented 5 years ago

Hi Michael,

I spotted the duplication in your example. I had previously missed the link.

Can you provide a shorthand example for a duplication similar to the copy number loss above?

Also, can I just ask for clarification on the alleleCount? Am I correct to assume that this represents the number of observed copies of the allele, or have I totally missed the point? Is it also worth having an option to state the number of tandem copies of the specified range that are present within the current allele?

I have a few ideas, but it's worth making sure I'm understanding your examples correctly before commenting further.

Thanks

Pete

reece commented 5 years ago

Please use #46 for a consolidated discussion of CNV requirements.

ga4gh / vrs

Total Copy/Genomic Count concept and HGVS #42
