Missing Structural Variant Model Properties

n1zea144 commented 4 years ago

This issue is part of an effort to reconcile the current structural_variant database model with the discussions of the structural variant / fusion file format that took place during the 2019 cBioPortal Hackathon.

Fields in the hackathon document that are missing from the model:

Strand 1
Strand 2
Status (Germline or Somatic)
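As a rough illustration of the three fields listed above, here is a minimal Python sketch. The class name, attribute names, and allowed values (`+`/`-` for strand, `GERMLINE`/`SOMATIC` for status) are assumptions made for illustration; they are not taken from the actual structural_variant schema.

```python
from dataclasses import dataclass

# Assumed vocabularies; the real model may encode these differently.
ALLOWED_STRANDS = {"+", "-"}
ALLOWED_STATUSES = {"GERMLINE", "SOMATIC"}


@dataclass(frozen=True)
class MissingSvFields:
    """Hypothetical container for the fields missing from the model."""
    strand_1: str
    strand_2: str
    status: str

    def __post_init__(self):
        # Reject strand values outside the assumed vocabulary.
        for name in ("strand_1", "strand_2"):
            value = getattr(self, name)
            if value not in ALLOWED_STRANDS:
                raise ValueError(f"{name} must be '+' or '-', got {value!r}")
        # Reject status values outside the assumed vocabulary.
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(
                f"status must be GERMLINE or SOMATIC, got {self.status!r}"
            )


record = MissingSvFields(strand_1="+", strand_2="-", status="SOMATIC")
```

Validating at construction time is only one option; the same constraints could equally live in the importer or the database layer.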

To be discussed:

Do we want to define the consequence (currently event_info) of a structural variant in more granular terms?
To what level, if any, do we support the MAVIS tool? We support input to the tool, but not the following output fields (glossary):
- spanning reads (the number of spanning reads which support the event)
- flanking read pairs (A pair of reads where one read maps to one side of a set of breakpoints and its mate maps to the other)
- split reads (Number of split reads that align to both breakpoints)
- Half-mapped reads (A read whose mate is unaligned. Generally this refers to reads in the evidence stage that are mapped next to a breakpoint)
- Event type (The classification of the event. Is this the same as CLASS?)
- Fusion protein hgvs (Describes the fusion protein in HGVS notation. Will be None if the change is not an indel or is synonymous)
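To make the question concrete, the output fields above could be carried as optional attributes alongside each structural variant. The sketch below is only an illustration: the attribute names and the row keys are hypothetical, derived from the glossary wording rather than from MAVIS's actual column names or the cBioPortal schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SvEvidence:
    """Hypothetical read-support fields, named after the glossary entries."""
    spanning_reads: Optional[int] = None
    flanking_read_pairs: Optional[int] = None
    split_reads: Optional[int] = None
    half_mapped_reads: Optional[int] = None
    event_type: Optional[str] = None
    fusion_protein_hgvs: Optional[str] = None


def _is_missing(raw):
    # Treat absent keys, empty strings, and the literal "None" as missing.
    return raw is None or raw.strip() in ("", "None")


def _opt_int(raw):
    return None if _is_missing(raw) else int(raw)


def _opt_str(raw):
    return None if _is_missing(raw) else raw.strip()


def evidence_from_row(row):
    """Build SvEvidence from a dict of strings (keys are assumed names)."""
    return SvEvidence(
        spanning_reads=_opt_int(row.get("spanning_reads")),
        flanking_read_pairs=_opt_int(row.get("flanking_read_pairs")),
        split_reads=_opt_int(row.get("split_reads")),
        half_mapped_reads=_opt_int(row.get("half_mapped_reads")),
        event_type=_opt_str(row.get("event_type")),
        fusion_protein_hgvs=_opt_str(row.get("fusion_protein_hgvs")),
    )


example = evidence_from_row({
    "spanning_reads": "4",
    "split_reads": "",
    "event_type": "deletion",
    "fusion_protein_hgvs": "None",
})
```

Keeping every field optional means files produced without MAVIS would still load unchanged.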

migbro commented 4 years ago

Hi! This is Miguel, from D3b at CHOP. We talked a little about this today and had a couple of questions to start:

1) Can some comments/descriptions be added to the planned DB schema? It'd help us to see what is being loaded and whether anything is missing.
2) What are the plans on the front end for letting users query SVs? SVs rarely share the same start and end, and can encompass tens to thousands of genes, depending on their size. Is there a mechanism in place to show which SVs overlap when a user searches a list of genes/regions?
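The overlap question can be sketched as a plain interval test. This is a minimal illustration under stated assumptions, not an existing cBioPortal mechanism: the `Interval` type, the function names, the inclusive coordinates, and the example data are all hypothetical, and events that join two chromosomes have no single span and are not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


def overlaps(a, b):
    # Two intervals overlap when they share a chromosome and neither
    # one ends before the other begins.
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def svs_overlapping(svs, regions):
    """Return names of SVs whose span overlaps any queried region."""
    return [
        name
        for name, span in svs.items()
        if any(overlaps(span, region) for region in regions)
    ]


# Made-up example data.
svs = {
    "del_A": Interval("chr1", 1_000, 250_000),
    "dup_B": Interval("chr1", 900_000, 950_000),
    "del_C": Interval("chr2", 5_000, 6_000),
}
query = [Interval("chr1", 200_000, 210_000)]
hits = svs_overlapping(svs, query)
```

A linear scan like this is fine for a sketch; a real query over many SVs would more likely use an interval tree or an indexed range query in the database.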

jjgao commented 4 years ago

@migbro thanks for the comments.

Please see @n1zea144's link above.
We are mostly interested in supporting query and visualization of gene fusions instead of copy number events if I understand your question correctly.

migbro commented 4 years ago

Hi @jjgao, thanks for your quick response.

1) I have seen that link; I was thinking of something more along the lines of: https://github.com/cBioPortal/cbioportal/blob/87e5a228d80a31515bf1a3e762bd2d70db89d2fb/db-scripts/src/main/resources/cgds.sql#L425. For instance, what is CONNECTION_TYPE?
2) Yeah, I realize it sounds like I was describing CNVs, as DNA SVs can be described similarly, but they’re not quite the same. The model definitely seems to fit what is needed to describe RNA fusions, where most people are concerned with which two gene transcripts are fused together and where. DNA structural variants, however, can be massive insertions, deletions, or chromosomal region swaps: not necessarily a change in copy number, but a change in sequence content. The model might capture these at a basic level by describing the borders of the event, but not which genes within those borders are affected.

If the real purpose here is to move RNA gene fusions out of the mutations and into their own thing, and perhaps deal with DNA SVs at another time, then this will probably do it. I think the only issue we had earlier was that you could not search by the 3' end of a gene fusion, only by the 5' end, unless you represented the fusion twice. Hopefully the new model will fix that. Thanks for your work!
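One way to picture searching by either partner without storing a fusion twice is an index keyed on both genes. The function name and the `(five_prime, three_prime)` tuple layout below are assumptions for illustration, not the planned cBioPortal implementation.

```python
from collections import defaultdict


def build_partner_index(fusions):
    """Map each gene to the fusions it takes part in, at either end."""
    index = defaultdict(list)
    for five_prime, three_prime in fusions:
        fusion = (five_prime, three_prime)
        index[five_prime].append(fusion)
        # Avoid listing a same-gene event twice under one key.
        if three_prime != five_prime:
            index[three_prime].append(fusion)
    return index


# Each fusion is stored once as (5' partner, 3' partner).
fusions = [("TMPRSS2", "ERG"), ("EML4", "ALK"), ("BCR", "ABL1")]
index = build_partner_index(fusions)

by_three_prime = index["ERG"]
by_five_prime = index["EML4"]
```

Here a query for `ERG` (a 3' partner) and a query for `EML4` (a 5' partner) both find their fusion, while each fusion appears only once in the source list.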

migbro commented 4 years ago

Ah, a quick update, so my colleague @kgaonkar6 spotted this note in the hackathon outline:

Fusion / rearrangement file format (Benjamin, Pieter, Niki, Tali)
Agreed that we need an annotation service that can annotate all events based on the following attributes

So, that might address most of my concerns regarding the DNA SVs 😅

jjgao commented 4 years ago

thanks, @migbro .

The data model is exactly what we would like to finalize :) - it would be great if we can get more input from you and others.
I would say our format is really just about the breakpoints, e.g. when there is a deletion, we would like to model the new join in the genome (e.g. a fusion) instead of the deleted genes. When there is a complex rearrangement, we would like to look at each breakpoint separately for its effect on proteins. I understand we may miss the complete picture by doing that, but we thought it's better to cover SVs with a fusion effect first before tackling more complex cases.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

jjgao commented 3 years ago

moving to icebox for now. Will revisit.

cBioPortal / icebox

Missing Structural Variant Model Properties #119