VIRUSBreakend VCF detail.

Apart from BEALN and

The FILTER, FORMAT, and INFO fields are the outputs from running GRIDSS on the viral genome. VIRUSBreakend identifies viral single breakends that go to non-viral sequence then identifies the integration locations in the host genome. The INFO and FORMAT annotations are single breakend viral annotations.

1: What is the critera of FILTER? I found that the values are enough high in the QUAL(684.91, 357.52, 276.53), but they regarded as LOW-QUAL in the FILTER column.

They are just the default GRIDSS FILTERs. GRIDSS has a higher threshold for single breakends than breakpoints hence the LOW_QUAL filter being applied to these integration sites.

And, is it possible to change the filtering criteria and make it more strict?

In the Hartwig cohort, I've found that if an integration site was detected at all (regardless of FILTER), then it's highly likely to be a true positive. Is this not the case for your data?

I wonder RP,SR,REF,RF,VF FORAMT(or INFO) vaues are corresponding (representing) to these breakpoint values?

No, they correspond to the support for the single breakend from a viral genome perspective. This means that RP and SR are always zero, REF/REFPAIR are viral reference supporting read counts.

2: I need to the supporting viral read count and human read count at breakpoint.

What do you mean by "supporting viral read count" and "human read count"?

To prevent doubling-counting of breakpoint-supporting read pairs that also have one read with a split read alignment, GRIDSS uses a supporting fragment approach. For a breakpoint, you have the following supporting fragment counts:

fragments aligned to the human reference overlapping the breakpoint position that support the reference allele (VIRUSBreakend does not report this)
fragments aligned to the viral reference overlapping the breakpoint position that support the reference allele (VIRUSBreakend reports this as: REF+REFPAIR)
fragments partially aligned to both human and viral sequence that support the breakpoint (VIRUSBreakend reports this as: BVF)

When the location of the viral integration is ambiguous (e.g. integration into a centromere), then it's not really possible to determine the number of reads/fragments supporting the reference allele on the host side.

The simplest way to get the human ref support is to run gridss.AnnotateReferenceCoverage but that's not going to work with a VIRUSBreakend VCF since the viral reference contig doesn't exist in the host-aligned input bam so it'll throw an error. Your options are:

hacking the VIRUSBreakend to make the VCF
- e.g. filter to breakends with host coordinates, change ALT to N[[chr1:1[[, run gridss.AnnotateReferenceCoverage, merge just REF/PAIR into the original VIRUSBreakend VCF INFO fields.
- host aligned bam can be subset to only the regions around the integration sites to make it run faster (it does a linear pass on the entire bam)
Re-calling with GRIDSS against a host+virus reference

It'll be much faster to only process the relevant subset of reads. The relevant subset is the union of the reads extracted by VIRUSBreakend (already in the virusbreakend working directory), and the reads surrounding the breakpoint. The latter can be extracted by gridss_Extract_overlapping_fragments.sh. You'll need to make sure you don't include the same reads twice and mess up your counts.

Run samtools view host_aligned.bam $region_of_interest (where $region_of_interest is +- 2kbp around all the host integration sites)and append to the virusbreakend readnames file
- Alternatively: subset the VIRUSBreakend to the host breakend records, run gridss_extract_overlapping_fragments.sh on it, and get the read names from that subset bam file
run gridsstools extractFragmentsToFastq on larger readnames file
create a reference genome containing the host + viral genomes
realign the relevant reads to the host+viral reference with bwa; sort
run GRIDSS
Filter to host <-> viral breakpoints

Note that this only works for viral integration where the integration sites occurs in mappable sequence. Integration sites in unmappable sequence (such as centromeres, telomeres, and some repetative or low complexy sequences) will not be called by GRIDSS.

PapenfussLab / gridss

VIRUSBreakend VCF detail. #489