PapenfussLab / gridss

GRIDSS: the Genomic Rearrangement IDentification Software Suite
Other
254 stars 71 forks source link

VIRUSBreakend VCF detail. #489

Closed byeongill closed 3 years ago

byeongill commented 3 years ago

Hi, I am using VIRUSBreakend for detecting viral integration in human WGS(30X) samples.

I used the default parameters in the virusbreakend.sh script.

I got the output file in .vcf format and converted to excel for examples.(provided in the preview).

I have the following question regarding the output file:

1: What is the critera of FILTER? I found that the values are enough high in the QUAL(684.91, 357.52, 276.53), but they regarded as LOW-QUAL in the FILTER column. And, is it possible to change the filtering criteria and make it more strict?

2: I need to the supporting viral read count and human read count at breakpoint. I wonder RP,SR,REF,RF,VF FORAMT(or INFO) vaues are corresponding (representing) to these breakpoint values?

image

d-cameron commented 3 years ago

Apart from BEALN and

The FILTER, FORMAT, and INFO fields are the outputs from running GRIDSS on the viral genome. VIRUSBreakend identifies viral single breakends that go to non-viral sequence then identifies the integration locations in the host genome. The INFO and FORMAT annotations are single breakend viral annotations.

1: What is the critera of FILTER? I found that the values are enough high in the QUAL(684.91, 357.52, 276.53), but they regarded as LOW-QUAL in the FILTER column.

They are just the default GRIDSS FILTERs. GRIDSS has a higher threshold for single breakends than breakpoints hence the LOW_QUAL filter being applied to these integration sites.

And, is it possible to change the filtering criteria and make it more strict?

In the Hartwig cohort, I've found that if an integration site was detected at all (regardless of FILTER), then it's highly likely to be a true positive. Is this not the case for your data?

I wonder RP,SR,REF,RF,VF FORAMT(or INFO) vaues are corresponding (representing) to these breakpoint values?

No, they correspond to the support for the single breakend from a viral genome perspective. This means that RP and SR are always zero, REF/REFPAIR are viral reference supporting read counts.

2: I need to the supporting viral read count and human read count at breakpoint.

What do you mean by "supporting viral read count" and "human read count"?

To prevent doubling-counting of breakpoint-supporting read pairs that also have one read with a split read alignment, GRIDSS uses a supporting fragment approach. For a breakpoint, you have the following supporting fragment counts:

When the location of the viral integration is ambiguous (e.g. integration into a centromere), then it's not really possible to determine the number of reads/fragments supporting the reference allele on the host side.

The simplest way to get the human ref support is to run gridss.AnnotateReferenceCoverage but that's not going to work with a VIRUSBreakend VCF since the viral reference contig doesn't exist in the host-aligned input bam so it'll throw an error. Your options are:

It'll be much faster to only process the relevant subset of reads. The relevant subset is the union of the reads extracted by VIRUSBreakend (already in the virusbreakend working directory), and the reads surrounding the breakpoint. The latter can be extracted by gridss_Extract_overlapping_fragments.sh. You'll need to make sure you don't include the same reads twice and mess up your counts.

Note that this only works for viral integration where the integration sites occurs in mappable sequence. Integration sites in unmappable sequence (such as centromeres, telomeres, and some repetative or low complexy sequences) will not be called by GRIDSS.