Questions about GRIDSS command and output

enes-ak commented 3 years ago

First question: If I'm not mistaken, GRIDSS uses a blacklist BED file to exclude certain regions, such as telomeric and centromeric ones.

Can I use another BED file, in addition to the blacklist, to define target regions (regions of interest)? If so, how?

Second question: I saw the R packages for annotating VCF files, but is there any other way to annotate the output VCF? I want to use this tool as part of a larger project that is written in Python, so I would rather not mix R and Python.

Third question: GRIDSS detects breakpoints from exome data (genome data is preferred, but exome data can be used too). How can I find fusion genes from the GRIDSS output? Can fusion genes be identified directly from the VCF file, or is an annotation step required to find fusions?

Thanks!

DarioS commented 3 years ago

For regions of interest, see F.A.Q.: How do I process only my region of interest? Fusion genes can be identified by PURPLE (pre-requisite) and LINX. Before debating R or Python programming languages, decide what kind of annotation you want to have made to your VCF file. It's unlikely that the kind of functions in StructuralVariantAnnotation would be what you are seeking.

d-cameron commented 3 years ago

@DarioS is correct.

If you're doing WES, the simplest approach is just to run GRIDSS with the default blacklist and do the filtering downstream of GRIDSS (this is the approach we use on our targeted panel clinical cancer data). gridss_extract_overlapping_fragments is designed for targeted recalling of the output of another caller, or calling in small regions of interest. There's no need to run it on WES data.
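To illustrate the "filter downstream" approach, here is a minimal Python sketch that keeps only VCF records whose own position falls inside a target BED file. It is not part of GRIDSS; the function names (`load_bed`, `in_targets`, `filter_vcf`) and the `padding` parameter are made up for this example. It assumes a standard 3-column BED (0-based, half-open) and tab-separated VCF lines, and it checks each breakend record on its own, so a breakpoint whose partner lies off-target keeps only its on-target side.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into sorted, merged (start, end) intervals per chromosome."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in regions.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:  # overlapping or adjacent: extend the last interval
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_targets(regions, chrom, pos, padding=0):
    """True if the 1-based VCF position lies within `padding` bp of a target interval."""
    intervals = regions.get(chrom, [])
    p0 = pos - 1  # convert to the 0-based coordinate system used by BED
    # Last interval whose start is <= p0 + padding; intervals are merged, so
    # it is also the candidate with the largest end.
    i = bisect.bisect_right(intervals, (p0 + padding, float("inf"))) - 1
    return i >= 0 and p0 - padding < intervals[i][1]


def filter_vcf(lines, regions, padding=0):
    """Yield header lines plus records whose own CHROM/POS is on target."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        if len(fields) >= 2 and in_targets(regions, fields[0], int(fields[1]), padding):
            yield line
```

For exome data a non-zero `padding` is probably worth considering, since breakpoints often fall in introns just outside the captured exons.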

is there any other way to annotate the output VCF?

Depends on what annotations you want. gridss_annotate_vcf_repeatmasker and gridss_annotate_vcf_kraken2 are part of GRIDSS and do not require R. gridss_somatic_filter does require R, but a Java reimplementation by the Hartwig Medical Foundation is available (https://github.com/hartwigmedical/hmftools/blob/master/gripss/README.md). Similarly, the Hartwig Medical Foundation tool LINX is not R based and is currently the only tool that can report complex fusions from DNA sequencing data (https://www.biorxiv.org/content/10.1101/2020.12.03.410860v1). Hartwig tools are optimised for high coverage WGS data, so it's possible that they won't work with your WES data.

GRIDSS is fully compliant with the VCF specifications, so if you really don't want to use R, you're free to write your own annotation tools in Python for whatever annotations you need.
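As a starting point for a home-grown Python annotator, the sketch below parses the ALT field of a record using the breakend notation from the VCF specification: the four bracketed forms `t[p[`, `t]p]`, `]p]t`, `[p[t` for breakpoints, and a leading or trailing `.` for single breakends. The function name `parse_breakend_alt` and the returned dictionary keys are illustrative choices, not an existing API.

```python
import re

# One bracketed mate locus "chrom:pos" with sequence either before or after it.
# The greedy chrom group lets contig names that themselves contain ':' still parse.
_BREAKPOINT = re.compile(
    r"^(?P<pre>[^\[\]]*)(?P<bracket>[\[\]])(?P<chrom>.+):(?P<pos>\d+)"
    r"(?P=bracket)(?P<post>[^\[\]]*)$"
)


def parse_breakend_alt(alt):
    """Classify a VCF ALT string as a breakpoint, a single breakend, or neither.

    Returns a dict describing the breakend, or None for any other ALT
    (SNVs, indels, symbolic alleles such as <DEL>, or the missing value '.').
    """
    match = _BREAKPOINT.match(alt)
    if match:
        return {
            "type": "breakpoint",
            "mate_chrom": match["chrom"],
            "mate_pos": int(match["pos"]),
            "bracket": match["bracket"],
            # True for t[p[ and t]p] (local sequence written first),
            # False for ]p]t and [p[t (local sequence written last).
            "local_sequence_first": bool(match["pre"]),
        }
    if len(alt) > 1 and (alt.startswith(".") or alt.endswith(".")):
        return {
            "type": "single_breakend",
            "local_sequence_first": alt.endswith("."),
        }
    return None
```

For example, `parse_breakend_alt("G]17:198982]")` yields a breakpoint whose mate is at position 198982 on contig `17`. Pairing the two records of a breakpoint and interpreting orientation is left to the caller.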

Can fusion genes be identified directly from the VCF file, or is an annotation step required to find fusions?

GRIDSS itself is purely a structural variant (breakpoint/single breakend) caller. It has no concept of a gene model and thus cannot report fusions. Fusion annotation must be done downstream of the raw GRIDSS VCF output. Annotation tools are free to add their annotations to the VCF, or to output in another format. This choice is up to the tool author.
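To make the "needs a gene model" point concrete, here is a deliberately naive Python sketch that looks up which genes contain each side of a breakpoint. The gene table, the gene names, and both function names are hypothetical placeholders. It only lists candidate gene pairs: it ignores strand, breakend orientation, exon structure, and reading frame, all of which a real fusion caller such as LINX has to consider.

```python
def genes_at(genes, chrom, pos):
    """Names of genes whose 1-based inclusive [start, end] span contains pos."""
    return [name for c, start, end, name in genes if c == chrom and start <= pos <= end]


def candidate_gene_pairs(breakpoints, genes):
    """Return sorted, de-duplicated pairs of distinct genes joined by a breakpoint.

    breakpoints: iterable of (chrom1, pos1, chrom2, pos2) tuples
    genes:       iterable of (chrom, start, end, name) tuples
    """
    pairs = set()
    for chrom1, pos1, chrom2, pos2 in breakpoints:
        for gene1 in genes_at(genes, chrom1, pos1):
            for gene2 in genes_at(genes, chrom2, pos2):
                if gene1 != gene2:  # both sides in the same gene is not a candidate pair
                    pairs.add(tuple(sorted((gene1, gene2))))
    return sorted(pairs)


# Hypothetical gene model and breakpoints, for illustration only.
example_genes = [("chr1", 1000, 2000, "GENE_A"), ("chr2", 500, 900, "GENE_B")]
example_breakpoints = [
    ("chr1", 1500, "chr2", 600),   # GENE_A joined to GENE_B
    ("chr1", 1500, "chr1", 1600),  # both sides inside GENE_A
    ("chr1", 5000, "chr2", 600),   # first side is intergenic
]
```

The linear scan in `genes_at` is fine for a handful of genes; a whole-genome annotation would call for an interval index instead.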

See #517 for likely exome performance.

PapenfussLab / gridss

Questions about GRIDSS command and output #527