Open reece opened 6 years ago
C/P from #51, closed as duplicate:
A goal for this issue is to write a document that includes fusion use cases and a proposed model.
This should also serve as an introductory exercise for handling ambiguous representations of fusions (e.g. only one fusion partner specified or only gene names specified) alongside particular representations of fusions (defined transcript regions present / absent).
NTRK Fusions curation elements example (initial draft, WIP, from Angshumoy Roy and Gordana Raca, ClinGen Somatic WG): https://drive.google.com/file/d/18EEeIadChFwh79vEBz2knphKsYsfpONu/view?usp=sharing
From Subha (ClinGen Somatic WG), a paper with a few interesting data elements to capture on fusions: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6329466/
From Marilyn Li (of AMP/ASCO/CAP guidelines) on fusions in HGVS:
Here are my suggestions to them: I like the proposed nomenclature, which reflects well the nature of fusion genes at the RNA level. Since the nomenclature is at the RNA level, there is no need to separate out translocation fusion vs. deletion fusion, as the nomenclature itself does not indicate the mechanism of fusion formation. The same is true for duplication fusions and inversion fusions. I very much like the way a linker between the two fusion partners is indicated. However, there are a few practical issues in applying this nomenclature to our daily work:
- Most labs use short-fragment sequencing. Although the sequences from the 5' end to the fusion point and from the fusion point to the 3' end of the fusion gene could be inferred, we know there are exceptions such as potential alternative splicing.
- Although RNA numbers (r.123) accurately represent the fusion location, it is very hard for people to remember which exons these numbers correspond to. In many cases, different exons involved in fusions between the same two partner genes may carry different clinical significance, much like different mutations in the same gene. Therefore, it is important to indicate the exon number in the fusion nomenclature. For these reasons, I would like to propose adding the exon number to the nomenclature, and suggest a short form of RNA nomenclature to complement the long form proposed by your group:
Long form: NM_002354.2:r.-358_555(exon8)::NM_000251.2:r.212(exon11)_*279
Short form (simple exon-to-exon fusion): NM_152263.2:exon8::NM_002609.3:exon11
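To make the proposed short form concrete, here is a minimal parsing sketch. It assumes the short form is always `transcript:exonN::transcript:exonM` with versioned RefSeq-style accessions; the names `FusionPartner` and `parse_short_fusion` are hypothetical and not part of any proposal in this thread.

```python
import re
from typing import NamedTuple, Tuple


class FusionPartner(NamedTuple):
    transcript: str  # versioned accession, e.g. "NM_152263.2"
    exon: int        # exon number on that transcript


# One side of the short form, e.g. "NM_152263.2:exon8" (assumed grammar).
_PARTNER = re.compile(r"^(?P<tx>[A-Z]{2}_\d+\.\d+):exon(?P<exon>\d+)$")


def parse_short_fusion(text: str) -> Tuple[FusionPartner, FusionPartner]:
    """Split 'TX:exonN::TX:exonM' into its 5' and 3' partners."""
    sides = text.split("::")
    if len(sides) != 2:
        raise ValueError(f"expected exactly one '::' separator: {text!r}")
    partners = []
    for side in sides:
        match = _PARTNER.match(side.strip())
        if match is None:
            raise ValueError(f"unrecognised fusion partner: {side!r}")
        partners.append(FusionPartner(match["tx"], int(match["exon"])))
    return partners[0], partners[1]
```

Calling `parse_short_fusion("NM_152263.2:exon8::NM_002609.3:exon11")` yields the 5' partner first and the 3' partner second. The long form would need a richer grammar (r. positions, UTR offsets, linkers) that this sketch does not attempt.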
I've some time to review the proposal and I can see a number of issues with SV support, many of which are also problems for VCF. Here is my take on (DNA) SV support:
1) The data model must support three SV primitives:
Fundamentally, three primitives are required to fully support arbitrary genomic rearrangements:
SequenceLocation with an orientation. NestedInterval looks like it can handle imprecision in both CN segments and breakends, which is nice.
Other primitives are possible but unnecessary. Whilst VCF does support promiscuous breakpoints, in practice nobody uses these. If there is any demand for representing such ambiguities, additional annotation on a set of breakends is able to represent this (single breakends were a late addition to VCF).
2) SV events are composed of one or more primitives
One critical limitation of VCF is the lack of definition of what a symbolic allele actually represents. If I have a VCF, I don't know whether the <DEL> call is a) a CN call (microarray call), b) a breakpoint call (DNA sequencing SV caller), or c) a claim that a simple deletion event occurs at that position. The difference between b) and c) is subtle but extremely important.
Unlike SNVs and small indels, where the primitive corresponds directly to the event, there is a one-to-many relationship between events and primitives for SVs. A simple deletion event requires both a breakpoint and a DNA loss/CNV to co-occur on the same chromatid. Chromothripsis results in many deletion-like events, as do retrocopied processed transcripts. The lack of separation between the detection of the basic primitives and the biological interpretation of the events producing those primitives is a critical flaw in the current VCF specifications.
I've been working a lot on this lately and, if your CN and SV calls are good enough, path traversal and event classification can be done, even for events composed of 50+ breakpoints. I'll provide a link to our bioRxiv preprint as soon as it goes up (~1-2 weeks). Even something as simple as a gene fusion can (spoiler: and frequently does) have multiple underlying breakpoints in the DNA.
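The separation between primitives and events described above could be sketched roughly as follows. All class and field names are hypothetical (they come from neither VCF nor the VR spec), and the "left"/"right" side convention for breakend orientation is an assumption made for this illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Breakpoint:
    """Caller-level claim: two breakends are joined to each other."""
    chrom1: str
    pos1: int   # interbase position of the first breakend
    side1: str  # "left" or "right": side of pos1 where retained sequence lies
    chrom2: str
    pos2: int   # interbase position of the second breakend
    side2: str  # same convention as side1 (assumed, not a standard)


@dataclass(frozen=True)
class CopyNumberSegment:
    """Caller-level claim: a genomic segment has a given copy number."""
    chrom: str
    start: int
    end: int
    copy_number: float


@dataclass
class Event:
    """Interpretation-level claim that groups primitives into one event."""
    label: str  # free text; no closed vocabulary is assumed
    primitives: List[Union[Breakpoint, CopyNumberSegment]] = field(default_factory=list)

    def count(self, kind: type) -> int:
        """Number of primitives of the given class in this event."""
        return sum(isinstance(p, kind) for p in self.primitives)


# A simple deletion: one breakpoint plus one CN loss over the same interval.
simple_deletion = Event(
    label="simple deletion",
    primitives=[
        Breakpoint("chr1", 1_000, "left", "chr1", 2_000, "right"),
        CopyNumberSegment("chr1", 1_000, 2_000, copy_number=1.0),
    ],
)
```

Under this layout a `<DEL>`-like breakpoint reported by a caller is just a `Breakpoint`; only wrapping it in an `Event` together with the matching copy number loss makes the stronger claim that a simple deletion occurred. A chromothripsis event would use the same `Event` class with many primitives of both kinds.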
3) Aneuploid-aware phasing
This is extremely important in the cancer setting. Phasing events to paternal/maternal is of marginal utility to event determination and downstream interpretation when aneuploidy is ubiquitous. To be useful in a cancer setting, we need to be able to specify whether or not two events are adjacent on the same chromatid. I'm not even sure what to call this. It's not quite phasing but, as far as I know, the community hasn't come up with a term for this.
This capability is important even for SNVs and small indels, as determining whether a gene has any functional copies when aneuploidy is present requires finding a deleterious event on every copy of that gene.
Fortunately, this is almost handled by the Haplotypes class in the Future Plans section. Unfortunately, it's not quite enough: the definition requires the same reference sequence, which is incorrect when inter-chromosomal events are present, and in the presence of breakage-fusion-bridge (BFB), one can have multiple copies of an allele on the same chromatid, each of which is 'phased' to a different event. For example, a simple deletion amplified by BFB to 4 copies could have 3 copies adjacent to one particular SV, and the other copy adjacent to a different one. A wording change in the definition, and a way to specify adjacency (and adjacency ambiguity), should be enough to represent aneuploid phasing.
4) Normalization
A recommended SV/CNV normalisation scheme should be included. I'm a big fan of centre-aligning, that is, calling the middle of any interval of ambiguity or micro-homology, as a) it maximises the chance of SV and CNV calls having matching coordinates, and b) depending on the orientation, left/right aligning one side of a breakpoint will result in the other side being aligned the opposite way.
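As a rough illustration of centre-aligning, the sketch below assumes the interval of ambiguity for a breakend is already known as inclusive lower and upper interbase positions. The function name `centre_align` and the round-down tie-break for odd-width intervals are assumptions, not part of any specification.

```python
def centre_align(lo: int, hi: int) -> int:
    """Return the midpoint of the ambiguity interval [lo, hi].

    When the interval has no exact middle position, this sketch rounds
    towards `lo`; a real normalisation scheme would need to fix that choice.
    """
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    return lo + (hi - lo) // 2


# A left-aligning caller reports position 100 and a right-aligning caller
# reports 106 for the same breakend with 6 bp of micro-homology. Both calls
# describe the interval [100, 106], so both normalise to the same position.
left_aligned_call = (100, 106)
right_aligned_call = (100, 106)
normalised = {centre_align(*left_aligned_call), centre_align(*right_aligned_call)}
```

The point of the example is that the normalised position depends only on the interval, not on which end a particular caller happened to report.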
Note that NCBI’s Variant Overprecision Correction Algorithm only works for isolated SNVs and indels and fails to resolve ambiguity on more complex events. For example, a del-with-ins event with a 50bp interval replaced by 5bp of random sequence will be incorrectly represented, as S/W alignment of such events almost universally results in them being represented as a set of SNVs and smaller indels, and the overprecision correction algorithm does not enforce a valid haplotype interpretation. The SNVs can be adjusted into one of the deletion calls which, given they are phased, doesn't make sense.
Similarly, tandem duplications can be represented as insertions. Benchmarking SV callers on the CHM1/CHM13 data set was problematic due to the different representations. For example, a 3xSINE sequence being expanded into a 4xSINE sequence had the long read callers report an INS at the first SINE element, and the short read callers report a DUP of the final SINE element. They both result in the same sequence, but they're reported as different event types and their positions are nowhere near overlapping.
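The INS-versus-DUP equivalence can be checked on a toy sequence. In this sketch the `SINE` string is a short placeholder rather than a real SINE element, coordinates are interbase offsets into a made-up reference, and both helper functions are hypothetical.

```python
def apply_insertion(ref: str, pos: int, inserted: str) -> str:
    """Insert `inserted` at interbase position `pos`."""
    return ref[:pos] + inserted + ref[pos:]


def apply_tandem_dup(ref: str, start: int, end: int) -> str:
    """Duplicate ref[start:end] in tandem, directly after the original copy."""
    return ref[:end] + ref[start:end] + ref[end:]


SINE = "GGCCGGGCGC"               # placeholder repeat unit, not a real SINE
ref = "AAAA" + SINE * 3 + "TTTT"  # reference carrying 3 copies of the unit

# Long-read style call: INS of one unit at the start of the first copy.
ins_result = apply_insertion(ref, 4, SINE)

# Short-read style call: DUP of the final copy.
dup_start = 4 + 2 * len(SINE)
dup_end = 4 + 3 * len(SINE)
dup_result = apply_tandem_dup(ref, dup_start, dup_end)
```

Both calls produce the same 4-copy sequence, yet the INS is reported at offset 4 and the DUP at offsets 24 to 34, which is why position-based matching of the two callsets fails without a shared normalisation rule.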
5) Sub-clonality
As with aneuploidy-aware phasing, sub-clonality complicates the model but is absolutely required for the specifications to be useful in a somatic setting. The proposed Haplotype class does not support any relationship between multiple related copies of the same germline sequence, either through sub-clonality or aneuploidy.
On the plus side, the decision to use an interbase coordinate system was absolutely the correct decision, and it avoids a whole host of problems that are a pain to deal with in VCF.
How is group membership for this specification determined? I'm more than happy to join and draft up the changes required to support SVs.
@d-cameron Thanks for the interest and willingness to contribute! We'd love to have you join the calls (Mon 1600 UTC). I realize that this is a terrible time for you. We currently have 10-20 people on the call reliably, most in the US and Europe.
We can pick up the substance of your proposals at a later date (and after some of the current construction dust has settled).
We recently discussed how structural variants are annotated on knowledgebases at one of our oncology annotation meetings. Here's what we concluded:
1) We would prefer structural variants/fusions to be searchable by either 5' or 3' partner. For example, there are distinct reasons why someone might want to look up "RUNX1 fusions" or "ETV6 fusions". We would like the capability to search in either direction to allow modeling of "junctions".
2) CIViC structure could be cleaned up by depositing all fusions into one "bucket" per gene. For example, ABL1 fusions --> (drop down to All ABL1 fusions, ABL1-BCR, ABL1-NUP214, ABL1 fusion not otherwise specified (NOS)...), then group resistance mutations associated with each group. This will allow for easier searches and a cleaner look to the web page. Not sure if this is completely in-scope.
3) Not sure if this is in-scope with this group, but can we talk about standardizing the vocabulary used to describe copy number losses/gains? For some genes, there are variants under "loss", "loss-of-function" and "deletion". All would be relevant to my search. Example: PTEN in CIViC - a loss-of-function group is established in the variant list. "Deletion" and "loss" have the same genetic result and may be better consolidated as "loss" and placed in that box.
4) Agree with Dr. Li's proposal for RNA-based fusion annotation if a transcript is specified - due to exon numbering inconsistencies between gene transcripts.
5) I would love to help brainstorm annotation of structural variants that do NOT result in fusions (position effects). Maybe annotate as structural variant resulting in a change in expression. Either way, sometimes it is difficult to know if we are looking at something novel since these annotations are so inconsistent - we are often looking to conventional cytogenetic results.
Thank you for all the work you do!
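Point 1 above (lookup by either partner) could be served by indexing each fusion under both of its genes. This is only a sketch of the idea: the class `FusionIndex` and its methods are hypothetical and do not reflect how CIViC or any other knowledgebase is implemented, and the gene pairs are used purely as familiar examples.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Fusion = Tuple[str, str]  # (5' partner gene, 3' partner gene)


class FusionIndex:
    """Index fusions so they can be found from either partner."""

    def __init__(self) -> None:
        self._by_five_prime: Dict[str, Set[Fusion]] = defaultdict(set)
        self._by_three_prime: Dict[str, Set[Fusion]] = defaultdict(set)

    def add(self, five_prime: str, three_prime: str) -> None:
        fusion = (five_prime, three_prime)
        self._by_five_prime[five_prime].add(fusion)
        self._by_three_prime[three_prime].add(fusion)

    def as_five_prime(self, gene: str) -> List[Fusion]:
        return sorted(self._by_five_prime.get(gene, set()))

    def as_three_prime(self, gene: str) -> List[Fusion]:
        return sorted(self._by_three_prime.get(gene, set()))

    def with_partner(self, gene: str) -> List[Fusion]:
        """All fusions involving `gene`, regardless of its position."""
        found = self._by_five_prime.get(gene, set()) | self._by_three_prime.get(gene, set())
        return sorted(found)


index = FusionIndex()
for five, three in [("BCR", "ABL1"), ("NUP214", "ABL1"),
                    ("ETV6", "RUNX1"), ("RUNX1", "RUNX1T1")]:
    index.add(five, three)
```

Here `index.with_partner("RUNX1")` returns both the fusion where RUNX1 is the 3' partner and the one where it is the 5' partner, while `index.as_three_prime("ABL1")` gives the per-gene "bucket" described in point 2. Fusions with an unspecified partner (NOS) would need an explicit placeholder, which this sketch leaves out.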
From a conversation with CIViC team today on fusion representation, capturing a few points shared by @obigriffith:
I think for now users should be aware that if they see a simple GeneA-GeneB fusion in CIViC with a representative exon-exon junction for coordinates that the coordinates may represent a common/dominant isoform or be a fairly arbitrary representation of a diverse set. If nothing else the variant summary could include these details
In practice I think most fusions are actually being detected by somewhat non-specific means that don't correspond to any one isoform, nor should they. Most FISH, Archer assays, Foundation etc are able to detect multiple isoforms.
However, it sounded like if CIViC needed to represent a specific exon-exon junction between two transcripts (versus one of a set of exon-exon junctions that are collectively the subject of an evidence statement), this would be captured in the variant name, e.g. FGFR3(exon17)-TACC3(exon11) (as opposed to FGFR3-TACC3). From Marilyn Li's comments above, it sounds like we will still want to distinguish between the CIViC-style ambiguous fusion representation (seems like a candidate use for GeneLocation objects?) and the more concrete junction-level representation proposed by Dr. Li.
I'm a bit confused as to what is actually being represented by a gene fusion variant. Are we referring to the transcript-level products, or the breakpoint(s) in the underlying DNA enabling one or more fusion transcripts? If we're talking about both (or have use cases for both), then we need to distinguish between these two claims as there is a many to one mapping between fusion transcripts, and underlying rearranged DNA.
there is no need to separate out translocation fusion vs. deletion fusion
Note how I did not say breakpoints. The majority of TMPRSS2-ERG fusions are chromothriptically derived gene fusions with potentially hundreds of breakpoints involved in the originating event - it's not a simple translocation/deletion dichotomy. Complicating things further, a non-trivial percentage (30%+ for TMPRSS2-ERG) of these involve more than one breakpoint in the fusion itself. That is, TMPRSS2-(elsewhere)-ERG. It's not just single DNA fragments either - I've seen multiple cases of driver gene fusions involving 3+ underlying breakpoints.
@d-cameron I encourage you (and other interested participants on this thread) to join in on our VR call tomorrow, where we will be discussing the varying levels of specificity our model will need to capture, including your comments above.
Things to be aware of when designing model:
Living slide deck with notes on curation and strawman model: http://bit.ly/VR-SVdeck
Per @d-cameron's statement above regarding complex rearrangements: if we want to further clarify chromothripsis, I think there is potential under this model:
1) List genomic region(s) of complexity (per chromosome start-stop of total complex segment including copy number losses, gains, and structural variants) and within that total region of complexity:
a) List genomic regions of copy number loss within complex segment
i) potential to call out tumor suppressor genes within regions of loss
b) List genomic regions of copy number gain within complex segment
ii) potential to call out oncogenes in regions of copy number gain if potentially significant.
c) List coordinates/fusion partners signifying important structural variants within chromothriptic region.
I'm open to criticism on this. This is something our group put together for potentially classifying chromothripsis for mate pair sequencing and then abandoned, since we are not planning on doing clinical exploratory whole genome analysis using that assay anytime soon.
Apologies that I was not able to make the call.
From my perspective, we need a two-level model (I'm working on one for VCFv4.3) that separates the low-level claims made by the callers (i.e. break-junctions and CN segments) from the higher-level claims that associate them into events.
Take the following relatively simple example of chromothripsis: [image] Here we can fully resolve the chromothriptic event. There are a whole lot of regions of CN gain/loss and a whole lot of SVs (NB: all SVs are 'important' when reconstructing derivative chromosomes), but it's really just a single event happening.
Another relatively simple event is the breakage-fusion-bridge cycle on the COLO829T cell line that looks like the following: [image]
It's just a few fold-back inversions amplifying chr3, but the fold-back breakpoints involve chromosomes 6, 10 and 12 as well. Again, this is a relatively simple event with just a handful of breakpoints and CN segments involved. I have samples where we have chromothripsis with 500+ breakpoints and can't fully reconstruct the derivative chromosome, but I still need to be able to represent the partial reconstructions that we can do.
Another example of why it's important to separate the breakpoint/CN claims from the event claims is retrocopied genes. They look like the following: [image]
Note how the SV breakpoints and CN changes all line up with the intron/exon boundaries. There are about 15 of these prevalent in the population but not in the reference, and if you look at variant databases such as dbVar, they claim that the gene in question has an intronic deletion at almost every intron. At one level this is correct, as the extra copy lacks introns, but at the same time, those DEL claims are all incorrect, as there are no actual deletions in the gene itself, and having a database full of them is completely misleading to users who want to check the prevalence of mutations in their gene of interest.
@bpitel12
a) List genomic regions of copy number loss within complex segment
I don't see the need for a separate definition of a complex segment in the specifications themselves.
i) potential to call out tumor suppressor genes within regions of loss
Biallelic inactivation is really what's important for most tumor suppressors, and that's commonly a combination of a SNV/indel and a (partial) CN loss. I do see value in annotating the impact of variants, but it's not always as straightforward as a simple CN loss.
b) List genomic regions of copy number gain within complex segment
Again, I don't see the need for a complex segment definition. We're possibly talking about similar things with different terminology. I'm proposing the grouping of a set of SV/CN into an 'event'. Such a model would not need to differentiate between simple and complex events, but it definitely needs to be more complex than a 'start/end complex region' style of interval. A simple deletion would merely have a single breakpoint and CN associated with it, whereas chromothripsis would have many of both.
The temporal association and overlaps between events would have to be clarified though. For example, if a simple DEL is just a SV+CN segment, nested deletions would be problematic as there would be 3 CN segments relevant to the outer deleted region.
ii) potential to call out oncogenes in regions of copy number gain if potentially significant.
Driver gene annotation is indeed useful, and amplified drivers are usually contained within a single CN segment, so that would usually work.
c) List coordinates/fusion partners signifying important structural variants within chromothriptic region.
A non-trivial number of expressed gene fusions involve multiple breakpoints. It's not just geneA->SV->geneB. It's rearrangements such as geneA->SV->other location->SV->another location->SV->geneB that still produce functional fusion products. Again, we're looking at a set of SV/CN that are responsible for a gene fusion (the complex ones typically form part of a larger chromothriptic rearrangement).
Hi Daniel- Can you please describe what these figures are showing? I think I understand, but it seems to me that we should capture this background info for a future requirements doc (a Google doc).
Generally, VR does anticipate the two-level model that you describe. I've historically (in VMC) typically described these as observed and representative variation, where observed variation is precise (to the limits of the assay), whereas representative variation generalizes observations. We're also considering rule-based variation, which is a kind of representative variation, so this language likely needs to be revisited (or at least agreed upon).
I'm glad to have your experience guiding our SV modeling.
-Reece
On Thu, Aug 15, 2019 at 7:07 AM Daniel Cameron notifications@github.com wrote:
[image: chromothripsis example] https://user-images.githubusercontent.com/6036536/63098082-6706de80-bfb5-11e9-8913-00b66b02db31.png
[image: COLO829T breakage-fusion-bridge example] https://user-images.githubusercontent.com/6036536/63098342-0b892080-bfb6-11e9-8eb0-a76cd6af23a0.png
[image: retrocopied gene example] https://user-images.githubusercontent.com/6036536/63098654-b00b6280-bfb6-11e9-8aa9-04643184beca.png
NCI internal fusion specification (Thomas Coard, Anna Lu, Jennifer Lee):
We now have our preprint up: https://www.biorxiv.org/content/10.1101/781013v1
@reece we now have proper documentation for our circos plots https://github.com/hartwigmedical/hmftools/blob/master/sv-linx/README_VIS.md#circos-panel as well as our event classifier/DNA-Seq fusion caller https://github.com/hartwigmedical/hmftools/tree/master/sv-linx
@ahwagner The NCI fusion specifications only work for RNA. They don't work for DNA for the following reasons:
16% of driver fusions in our cohort of ~4,000 WGS metastatic tumour samples (see above preprint) involve multiple breakpoints. That is, we find functional TMPRSS2-ERG fusions (validated by RNA-seq) that go TMPRSS2 -> other genomic location [-> yet another genomic location, ...] -> ERG. 3-, 4- and even 5-break fusions are also possible. The max we've seen is 15 breakpoints, but I'm somewhat dubious about that one since it's such an extreme outlier.
DNA breakpoint position coordinates do not uniquely determine transcript product. The nomenclature does not specify what transcript/exon skipping occurs.
The mapping from DNA breakpoints to gene fusions is decidedly non-trivial (see https://github.com/hartwigmedical/hmftools/blob/master/sv-linx/src/main/resources/readme/fusion_configurations.png). It's entirely possible to have a functional gene product when the breakpoint does not occur in the downstream gene at all: does the proposed format support negative 5' gene breakpoint positions? For example, a breakpoint upstream of the promoter would result in a gene fusion with the first exon being skipped.
Here is a simple real-world example of what I'm talking about: TMPRSS2 exon1 -> TMPRSS2 intron 1 in the wrong orientation -> ERG intron 1 = TMPRSS2 exon 1 to ERG exon 2 gene fusion. Either of these breakpoints in isolation does not give a gene fusion product, but combined (we also do phasing of SVs so we can tell they are cis SVs), you get a gene fusion.
NB: It might be even messier than we have found as it's theoretically possible for n-way fusions in which more than two genes are involved (ie the other genome locations contain exons). We haven't found any in our data set but that's only because we haven't looked since our software only supports 2-way fusions. Another edge case we did not investigate is read-through transcription resulting from two genes being brought in close proximity to each other (but not fused per se).
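The compound-fusion example above can be sketched as a walk along phased (cis) joins. This is a deliberately simplified illustration: segments are plain labels, each segment has at most one outgoing join, and orientation, splicing and reading frame are ignored. The function `walk_chain` is hypothetical and is not how the linked caller works.

```python
from typing import Dict, List, Optional


def walk_chain(joins: Dict[str, str], start: str, target: str) -> Optional[List[str]]:
    """Follow joins from `start`; return the path if `target` is reached."""
    path = [start]
    seen = {start}
    while path[-1] != target:
        nxt = joins.get(path[-1])
        if nxt is None or nxt in seen:
            return None  # dead end or cycle: no chain links start to target
        path.append(nxt)
        seen.add(nxt)
    return path


# Two cis breakpoints, as in the TMPRSS2-ERG example in this comment.
both_breakpoints = {
    "TMPRSS2 exon 1": "TMPRSS2 intron 1 (inverted)",
    "TMPRSS2 intron 1 (inverted)": "ERG intron 1",
}
first_only = {"TMPRSS2 exon 1": "TMPRSS2 intron 1 (inverted)"}
second_only = {"TMPRSS2 intron 1 (inverted)": "ERG intron 1"}

full_path = walk_chain(both_breakpoints, "TMPRSS2 exon 1", "ERG intron 1")
```

With both joins present the walk reaches ERG through two breakpoints (`len(full_path) - 1`), whereas either join alone returns `None`, mirroring the statement that neither breakpoint in isolation gives a fusion product. An n-way fusion would correspond to intermediate segments that themselves contain exons.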
Here's another real-world example that I don't think the proposed design handles:
In this sample, we see a chromoplex event causing a TMPRSS2-ERG fusion via PTEN intron 1 (follow the blue line) whilst also breaking PTEN (since it was put back together partly on the yellow and partly on the purple chromatid), as well as a loss of PPP2R2A.
We have a single event resulting in ~20 breakpoints with three outcomes, all clinically relevant.
At minimum, a usable genomic rearrangement (I use this term due to the fundamentally interconnected relationship between SVs and CNVs) representation format needs to unambiguously allow (partial) derivative chromosomes to be defined and, preferably, be able to explicitly represent complex events.
We are currently using custom csv files to represent all this. We could do it in VCF but we'd have to define a whole lot of custom fields to do it since the spec-defined fields don't cover any of this.
These are beautiful illustrations, @d-cameron! I wanted to chime in and say that the last example above is what we see fairly frequently in our lab through mate pair sequencing. To report these variants, we have been using a hybrid of HGVS nomenclature and cytogenetic ISCN detailed system nomenclature, similar to (but not exactly) what is described here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4067557/
Here are a few publications from our group that may add to the wealth of information above on how complex variants can be described in human-readable(ish) form. I think we're all looking for ways to describe this a little more easily: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6822454/pdf/13039_2019_Article_455.pdf https://www.tandfonline.com/doi/pdf/10.1080/10428194.2018.1480774?needAccess=true
Thanks, team.
@bpitel12 how do you find that notation scaling?
From the looks of it, the detailed notation requires ~40-60 characters per breakpoint (rearrangements using the nomenclature of the linked paper). A 'simple' 100-break chromothripsis event requires ~5000 characters to specify, and if you follow it by breakage-fusion-bridge, it looks like you'd have to define the full derivative chromatid structure and specify the 10-fold amplified regions 10 times, since you need to walk all the way through the derivative chromosome.
@d-cameron I appreciate your comment. We have never tried to annotate chromothripsis in this manner for that very reason. It is very difficult to be concise enough yet descriptive enough to effectively communicate the structure in text. I suppose if we had to we could provide the 5000 character long version as supplemental - but not sure who has the time/energy to sift through 5000 characters of structural description. In this case, we would likely supply a visual like you so nicely did in the string above. For communicating complex SVs, we often will provide some nomenclature for just the parts of the rearrangement that are clinically relevant.
For example - imagine you have IGH inserted into the MYC region as part of a chromothriptic event on chromosome 8 in addition to MYC amplification. Maybe we can provide a visual similar to what you have shown above as well as some nomenclature that looks something like this (I apologize for not providing exact coordinates - in a time crunch for some other projects today, but wanted to respond and see if maybe you or others had some ideas on making this better):
ins(14q32.33)(q24.1)(MYC+) ; 8pter-->cth8(q23qter)...(8q24.1(123,456,789)::14q32.33(89,123,456-->90,234,456)::8q24.1(123,567,890)...
Thank you!
I have trouble understanding the value - but also the consistency - of describing CTLPs* completely, as single events; especially since they won’t ever happen again in the exact same form.
But then, one can provide the elements in a format that allows reconstruction, right? I mean, such a Circos plot is drawn from individual events, isn’t it?
A representation of a CTLP has to:
So in a variant DB you would use a reference to the same callset/sample/experiment, whatever your flavour, with each fusion, CNV.
In an evidence DB, you’d list the individual events with your ID’d variant.
————-
I mean, such a Circos plot is drawn from individual events, isn’t it?
It's from individual breakpoints and (allele-specific) copy numbers. 'Event' is a loaded term. In my work, I define an 'event' as a transformation from one stable genomic configuration to another. Using this definition, a genome can have zero or more chromothripsis events, each of which will have many breakpoints. It also results in chromothripsis + breakage-fusion-bridge being defined as a single event and, given that BFB occurs over multiple cell divisions, others may disagree with our event definition.
In an evidence DB, you’d list the individual events with your ID’d variant.
That approach won't handle compound fusion events where a fusion is the result of multiple breakpoints. Each breakpoint on its own won't make the fusion, but traversing the derivative chromosome across multiple breakpoints results in a functional fusion product.
It is also possible to have a 3-way fusion in which part of a third gene is between the 5' and 3' fusion partners.
resulting CNVs
The CNVs are the aggregate result across all derivative chromosomes. The CN and CN delta from a chromothripsis event is the sample copy number. Are you proposing deconvolution of the copy number into the constituent 'events', or breaking it down by derivative chromosome? AFAIK, there is currently no software that does either of these things.
a method to retrieve all events occurring in the same callset (i.e. analysis of a biosample) - this is essential anyway for genotype descriptions
Phasing of SVs is extremely important (~15% of driver fusions are compound fusions - mostly TMPRSS2-*-ERG fusions).
a categorical annotation with each event (fusion, CNV...), labeling it as part of chromothripsis, CTLP, whatever..., for querying/association/attribution
I am indeed advocating for this and am currently working on incorporating this into VCF vNext. We're not at the point where we can use a closed vocabulary for the events, but being able to tie related breakpoints/CN changes together is important.
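As a rough illustration of tying related calls together without a closed vocabulary, the sketch below groups breakpoints and CN changes by a shared free-text event tag. The `Call` type, its fields, and the tag values are assumptions made for this example and do not reflect any actual VCF vNext design.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Call(NamedTuple):
    """One call from a callset (hypothetical record layout)."""
    call_id: str
    kind: str                 # e.g. "breakpoint" or "cn_change"
    event_id: Optional[str]   # open-vocabulary tag linking related calls; None if untagged


def group_by_event(calls: List[Call]) -> Dict[str, List[str]]:
    """Map each event tag to the IDs of the calls that carry it, in input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for call in calls:
        if call.event_id is not None:
            groups[call.event_id].append(call.call_id)
    return dict(groups)


# Made-up callset: two breakpoints and one CN change share an event tag.
calls = [
    Call("bp1", "breakpoint", "ev1"),
    Call("bp2", "breakpoint", "ev1"),
    Call("cn1", "cn_change", "ev1"),
    Call("bp3", "breakpoint", None),
]
grouped = group_by_event(calls)
```

A query for everything belonging to one event then becomes a lookup on the tag, and a categorical label (chromothripsis, CTLP, and so on) could be attached to the tag later rather than to each call.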
Relevant meta-thread from hts-specs: https://github.com/samtools/hts-specs/pull/465
Hi all! As a medical scientist, I would find it beneficial to have two forms of fusion nomenclature: a short one and a long one. I think the short form should include the information that is crucial for the clinical utility of the fusion. The long form would include all the details that help to identify the exact location of the fusion in the genome. I would expect the long form to go in the supplementary data of the medical report, and the short form in the main report comment together with the clinical utility of the fusion. Looking at the NTRK fusions curation elements table: the long form of a fusion description could include all the information presented in that table. The short form could perhaps be limited to: genome version, RefSeq transcripts, gene names, their positions (3' or 5'), exons, and functional domain information. It should be clearly stated whether the fusion causes loss or gain of function, as this information is crucial for the treatment decision. Additional information about resistance mutations should be added if applicable. I’m really interested in your thoughts on this idea.
Whether we should comment on whether the fusion is “in-frame” depends on the context. In my experience, when dealing with DNA-seq results, I would not report a structural variant as a fusion (especially for novel fusion genes) if it was out of frame and I could not confirm it by RNA methods. However, reading-frame information could be beneficial, for example when the DNA-seq result was out of frame but RNA results showed that the final product was in frame (for instance, through exon skipping).
Moving forward on this, we should distinguish DNA-focused and RNA-focused nomenclature. Some labs do not have RNA sequencing methods in place. Also, for some technologies it is hard to determine the exact breakpoint of a fusion. When a breakpoint is in an intron, the most important question is which exons are fused to each other; the exact genomic position has less clinical utility. Nevertheless, when the breakpoint is in exonic sequence, the exact nucleotide position is crucial for determining the functionality of the fusion.
When it comes to RNA nomenclature, Li's proposal looks nice and practical, and I would definitely like to add the gene name to both forms. The suggestions of Ordulu et al. (2014) are good but complex, and when I reviewed this nomenclature in my lab, this level of complexity did not seem applicable to clinical reporting.
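For what it's worth, the short form is simple enough to generate mechanically. The sketch below builds the exon-to-exon short form quoted at the top of this thread (`NM_152263.2:exon8::NM_002609.3:exon11`). The optional gene-name prefix `GENE(...)` is purely an assumed layout for illustration, as are the function names; nobody in this thread has specified where the gene name should go.

```python
from typing import Optional


def partner(transcript: str, exon: int, gene: Optional[str] = None) -> str:
    """Format one fusion partner as transcript:exonN, optionally wrapped with a gene name.

    The GENE(...) wrapper is a hypothetical layout, not part of any proposal in this thread.
    """
    core = f"{transcript}:exon{exon}"
    return f"{gene}({core})" if gene else core


def short_form(five_prime: str, three_prime: str) -> str:
    """Join the 5' and 3' partners with the '::' separator used in the quoted short form."""
    return f"{five_prime}::{three_prime}"


# Reproduces the short-form example quoted earlier in the thread.
plain = short_form(partner("NM_152263.2", 8), partner("NM_002609.3", 11))

# Same fusion with placeholder gene names added (layout is an assumption).
named = short_form(
    partner("NM_152263.2", 8, gene="GENE5"),
    partner("NM_002609.3", 11, gene="GENE3"),
)
```

Keeping the gene name as an optional wrapper means the plain output stays identical to the form already proposed, so either variant could be adopted without changing the other.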
Shouldn't the downstream impacts be within the scope of Variant Annotation and not Variant Representation?
@d-cameron +1 @TanskaAnnna These points are all related to annotation/attribution, not to the representation of sequence alterations per se.
I think that @TanskaAnnna intended to write these comments in the context of some of the comments earlier in this thread (1, 2, 3, 4) pertaining to data elements relevant to the characterization of gene fusions. Like me, Ania is thinking about how categorical notions of fusions (e.g. fusions that are defined by hyperactivity of a specified functional domain) should be represented as subjects of annotations. This is a fuzzy line between VR and VA that we are still trying to work through, but categorical variation is a key use case for the VICC driver project (which primarily works with aggregated concepts like this) which we are working to support in VRS.
As a Requirements tagged thread, I think it is valid and reasonable to put these thoughts here, especially as they are a direct response to 1 and 2 above.
However, as meta-threads like this can get a little busy, I am going to capture Ania's comments over at cancervariants/fusions to continue the discussion on short-form vs. long-form nomenclatures, and bring back the key findings from that effort to inform building of these concepts in VRS and VA.
Apologies for the confusion, folks, and thanks @ahwagner; that's exactly what I had in mind. I will continue on the "Salient elements of gene fusions" issue.
Shirley Li mentioned COSMIC on the VICC General Call today; not seeing the link on this thread, so adding it here to track: https://cancer.sanger.ac.uk/cosmic/fusion
Also mentioned gnomAD fusions, which we have not yet looked at.
This issue was marked stale due to inactivity.
VR needs to have a path for representing translocations.
Allele is currently defined as a contiguous sequence change at a single location. Translocations and junctions are unlikely to fit in that model.
See also #23 and #51.