PapenfussLab / gridss

GRIDSS: the Genomic Rearrangement IDentification Software Suite
Other
254 stars 71 forks source link

Downstream analysis. What to do with a VCF ? #318

Closed ptranvan closed 4 years ago

ptranvan commented 4 years ago

Hi @d-cameron.

Thanks for your software. I got now a vcf file but it's difficult to interpret when you have only breakpoints.

I have used the StructuralVariantAnnotation package and the code found in the 2nd post here (https://github.com/PapenfussLab/gridss/issues/74) to classify the variant. Few questions:

1) Is your code still valid or you have an up to date way to classify variants (since the post is 3 years old).

2) I am interested to have the genotype information (0/0, 0/1 or 1/1). Is there any way you recommend to infer it from your vcf ?

Thanks for your answers.

d-cameron commented 4 years ago

Is your code still valid or you have an up to date way to classify variants (since the post is 3 years old).

We now have an event classifier that runs downstream of GRIDSS (see https://www.biorxiv.org/content/10.1101/781013v1.full). It's designed for somatic analysis as that has been our use case and you get way more complex rearrangments in cancer than you do in the germline.

I am interested to have the genotype information (0/0, 0/1 or 1/1). Is there any way you recommend to infer it from your vcf ?

GRIDSS doesn't (yet) genotype as a) it makes no assumptions regarding ploidy, and b) once you have structural variation, you are no longer diploid in that region.

That said, if you do have strong diploid constraints (e.g. you're looking at germline variants), then you can genotype variants based on REF, REFPAIR, and VF.

Note that you shouldn't include REFPAIR for small variants as they provide support for both the ref and the varaint (e.g. RP spanning a 50bp del is consistent with both the reference allele and the variant allele). See libgridss.gridss_bp_af() and gridss_be_af() for our breakpoint/breakend allele fraction calculations.

As an aside:

I got now a vcf file but it's difficult to interpret when you have only breakpoints.

almost every SV caller^1, reports breakpoints without regard for copy number. When a caller reports a <DEL>, it's merely indicating that a breakpoint consistent with a deletion exists^2. It doesn't actually check that the "deleted" region has actually been deleted. GRIDSS reports all variants in BND notation to make it explicit what claim is actually being made. Essentially, the only difference in a <DEL> call made by GRIDSS or DELLY is in the notation - neither actually check that a <DEL> call is actually a deletion.

^1 Lumpy is the only exception I can think of as that incorporates the CN flanking the SV, but still doesn't check for CN consistency over the entire SV interval. ^2 Unless you run a CN caller in which case it's making a claim about the read depth/copy number, but not about the breakpoint.

d-cameron commented 4 years ago

Is your code still valid or you have an up to date way to classify variants (since the post is 3 years old).

FYI: I've just added simpleEventType() and simpleEventLength() to StructuralVariantAnnotation as there are a number of scenarios in which they are indeed useful (e.g. the code I also just added that matches the DUP and INS representations of the same variant).

Internal dev branch so you'll need to install via devtools::install_github("PapenfussLab/StructuralVariantAnnotation") instead of BioConductor at for now.

ptranvan commented 4 years ago

Thanks for your answer.

Can I use LINX (https://github.com/hartwigmedical/hmftools/tree/master/sv-linx) directly with the .vcf generated with GRIDSS without running the whole package ?

d-cameron commented 4 years ago

No. LINX is design for somatic analysis, and the raw GRIDSS output also include germline events. You'll need to run the gridss_somatic_filter.R script to filter to high confidence somatic events, PURPLE for CN calling that is base-pair consistent with the SV calls, then LINX.

Unforuntately LINX is currently somatic only. Extending this is on our TODO list, but do not yet have an ETA.

ptranvan commented 4 years ago

Thanks @d-cameron. I have 2 questions. I am trying to understand the relation between REF, REFPAIR and VF.

1) VF could be either DP or SR, and REF is "Count of reads mapping across this breakend" ( = SR ?) So the number of DP supporting the alternative allele is REF-VF right ?

I am interested only in INVERSION and I have performed a joint calling with a normal sample and 4 tumor samples. This is the kind of result I have: (for each sample the numbers correspond to REF, REFPAIR and VF respectively)

#CHROM  POS ID  REF ALT QUAL    FILTER  SVLEN (pb)  Normal  Tumor1  Tumor2  Tumor3  Tumor4
Scaffold03  11457402    gridss3bb_3693h A   [Scaffold03:8096392[GTCTATATATATACACATAATAAATATATTTTTCCATGATATACATATATTAATTTCCTATATAACATTTATCATATGTA    2064.47 PASS    SVLEN=3361088   10, 6, 0    10, 0, 7    10, 0, 14   4, 0, 16    15, 0, 17

Scaffold03  8096392 gridss3bb_3693o G   [Scaffold03:11457402[ACATATGATAAATGTTATATAGGAAATTAATATATGTATATCATGGAAAAATATATTTATTATGTGTATATATATAGACG   2064.47 PASS    SVLEN=3361088   21, 3, 0    10, 8, 7    18, 14, 14  14, 6, 16   16, 3, 17

These are the two complementary inversions.

The first one is what I expected (in term of REF, REFPAIR and VF) because there are 0 REFPAIR for each of the tumor samples. But in the second one, I don't understand why I found a certain numbers of REFPAIR. I would expect something symmetrical to the first inversion.

Any explanations ?

Thanks !

d-cameron commented 4 years ago

These are the two complementary inversions.

No, these are the two sides a single breakpoint. An actual inversion requires two breakpoints: one in ++ orientation and another in -- orientation. For a perfect inversion, these positions will differ by 1 bp.

So the number of DP supporting the alternative allele is REF-VF right ?

No. VF is the count of fragments supporting the breakpoint. If a read pair is both part of a discordant read pair, as well as a split read (e.g. the actual breakpoint occurs in the middle of the read), then VF prevents that fragment being double-counted as both a SR and a DP.

REF and REFPAIR are defined as breakpoint occuring within the read or between the reads so are mutually exclusive and directly comparable with VF. The variant allele fraction is VF/(REF+REFPAIR+VF)^.

I don't understand why I found a certain numbers of REFPAIR. I would expect something symmetrical to the first inversion

REF and REFPAIR are breakend attributes. On one side, you have the # reference-supporting reads/pairs spanning Scaffold03:11457402, on the other side it's Scaffold03:8096392. VF is a breakpoint attribute so it's always the same on both sides.

^ Exception: small ins/del/dups can have read pairs containing the variant. When calculating the VAF of small events, REFPAIR should be excluded. CLEVER (the caller) handles these correctly and it's on my TODO list to add this capability to GRIDSS.

for each sample the numbers correspond to REF, REFPAIR and VF respectively

Those numbers seem to indicate the variant is not present in the normal, and is heterzygous in the all 4 tumours. The fact that there are no refpair reads on lower breakend is a bit suspicious - I'd be checking copy number, for additional flanking variants.

d-cameron commented 4 years ago

If you're looking a tumour samples, be aware that inversion-like events can actually be part of more complex rearrangements (such as chromothripsis).

You might want to consider using gridss/purple/linx if you're interested in complex somatic rearrangements (https://www.biorxiv.org/content/10.1101/781013v1)

ptranvan commented 4 years ago

You might want to consider using gridss/purple/linx

I don't really have a normal-tumor samples. I am working on a non model species that don't recombine in a special region: there are 2 haplotypes that have big inversions.

I have 3 genotypes: 2 homozygotes AA, BB and 1 heterozygote AB. My reference genome is a AA and I use a AA sample as a "normal" and 4 samples (BB) as "tumor", so I expect homozygous variations.

Do you see any problem here ? I mean, I think gridss/purple/linx has been mostly developed for human and will not work in my case right? However gridss only is not a problem right ? I am doing a joint calling,then gridss_somatic_filter.Rand simple_event_annotation.R, then use only SVTYPE=INV and PASS.

No, these are the two sides a single breakpoint. An actual inversion requires two breakpoints: one in ++ orientation and another in -- orientation. For a perfect inversion, these positions will differ by 1 bp.

So I always found two sides of a single breakpoint and never two breakpoints (coordinates with position differing by 1bp). What does that mean ? could it be complex structural variant ?

And if I find two close breakpoints but >1bp, it is likely an inversion-indel right ?

REF and REFPAIR are breakend attributes. On one side, you have the # reference-supporting reads/pairs spanning Scaffold03:11457402, on the other side it's Scaffold03:8096392. VF is a breakpoint attribute so it's always the same on both sides.

Now I am a bit confused. What is your definition of breakend and breakpoint. I have found this paper but now I am not sure: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3249479/

Breakpoints are usually identified by comparing the structure of an experimental genome to that of the reference genome, and breakpoint positions are reported based on the coordinate system of the reference (Figure 2). This can cause some confusion since the number of sites may be different depending on which genome one is referring to. For example, a deletion produces a single junction in the experimental genome, but this junction is defined by two coordinates in the reference; the term “breakpoint” has in various studies been used to describe either, or both points of view. The Variant Call Format (VCF) definition resolves this ambiguity by using the terms “novel adjacency” and “breakend” to refer to sites in the experimental and reference genomes, respectively [22].

d-cameron commented 4 years ago

Now I am a bit confused. What is your definition of breakend and breakpoint.

I use the VCF definitions (https://samtools.github.io/hts-specs/VCFv4.3.pdf) for breakpoint and breakend.

So I always found two sides of a single breakpoint and never two breakpoints (coordinates with position differing by 1bp).

Section 5.4.7 (and figure 8) of the VCF specifications have a good example of a clean inversion: 1 simple inversion = 2 breakpoint = 4 breakends. Note that in the VCF spec example, W and V differ by 1bp, and U and X differ by 1bp, and the breakpoints are ++ and -- in orientation.

What does that mean ? could it be complex structural variant ?

It means your events are is not 'simple' inversions but something more complicated.

In our cancer cohort we found that inversions are almost never canonical simple inversions - there's always some loss or overlap at the break positions.

And if I find two close breakpoints but >1bp, it is likely an inversion-indel right ?

If you look at our LINX classification system at https://github.com/hartwigmedical/hmftools/blob/master/sv-linx/README.md#reciprocal-events you'll see that we don't model RECIP_INV, we model RECIP_INV_DUPS, RECIP_INV_DELS, RECIP_INV_DEL_DUP as we almost always have overlapping or lost sequence at the breaks.

I think gridss/purple/linx has been mostly developed for human and will not work in my case right?

It's been designed for human but if your organisms are diploid, then it should still work if you can get together the appropriate reference files. It doesn't support non-haploid genomes or circular chromosomes.

carolynzy commented 3 years ago

Thanks a lot for the discussion. It's very helpful to me. I have an additional question based on your answer @d-cameron .

It's been designed for human but if your organisms are diploid, then it should still work if you can get together the appropriate reference files. It doesn't support non-haploid genomes or circular chromosomes.

Do you mean GRIDSS doesn't work on prokaryotic genomes? But I have tried gridss.sh on MTB sequencing data from Illumina Hiseq. And it indeed produced the output files as expected. So do you mean the following somatic filtering step etc doesn't work on circular chromosomes? Please help me to clarify. Thanks!

d-cameron commented 3 years ago

Do you mean GRIDSS doesn't work on prokaryotic genomes?

GRIDSS is general purpose (I've even done SV detection in Plasmodium Falciparum with it).

The gridss somatic filtering script has been testing and tuned on human data. You'd have to create your own panel of normals if you're using other than hg19 or hg38 and possibly change some thresholds in gridss.config.R if they don't suit your particular data.

PURPLE/LINX were specifically designed for human tumour data to the point where there's hardcoded assumptions (such as only processing 1-22,X,Y) so may not work at all on other species.