PapenfussLab / gridss

GRIDSS: the Genomic Rearrangement IDentification Software Suite
Other
258 stars 71 forks source link

GeneratePonBedpe Has Index of Normal Parameter #210

Closed DarioS closed 5 years ago

DarioS commented 5 years ago

The function GeneratePonBedpe has parameter documentation

@Argument(shortName="NO", doc="VCF index of normal", optional=true)
public int NORMAL_ORDINAL = 1;

Since all of the inputs are normals, so I can't imagine what would be output that is not normal that is stored in the other columns. For subsequent variant calling, INPUT is required to be one or more BAM files, so I'm confused how the BEDPE can be used a baseline to subtract non-somatic variants during variant calling using an input cancer sample.

d-cameron commented 5 years ago

Since all of the inputs are normals, so I can't imagine what would be output that is not normal that is stored in the other columns

It's the sample/genotype index of the normal sample in the input file, not the BEDPE output.

I generated the Hartwig Medical Foundation hg19 GRIDSS PON (publicly available at https://resources.hartwigmedicalfoundation.nl) from ~2500 WGS pair tumour/normal (30x normal, 100x tumour) samples. GRIDSS joint calling was performed on each sample so each VCF had both germline and somatic variants in it. When generating a PON, only the variants that are in the germline should be added to the PON - that is what the NORMAL_ORDINAL parameter specifies.

I'm confused how the BEDPE can be used a baseline to subtract non-somatic variants during variant calling using an input cancer sample

The BEDPE contains all germline variant calls found in multiple samples in the reference cohort used to generate the PON. Removing these variants from your cancer sample should remove the germline and (systematic false positive) variants with sufficiently high penetrance in your reference cohort.

See https://github.com/hartwigmedical/scripts/blob/master/gridss/libgridss.R#L64 for an example of how this is done: any variant calls that overlap any of the BEDPE PON breakpoints have FILTER of PON set.

d-cameron commented 5 years ago

If you're using hg19 (and are happy to use a Dutch reference cohort), then I recommend using the hartwig PON.

d-cameron commented 5 years ago

I used BEDPE for the PON since, unlike the raw GRIDSS calls, there's not much annotation needed for the PON. Whether a variant is stored in VCF or BEDPE doesn't really matter to StructuralVariantAnnotation since it can transform both formats to the common breakpoint GRanges format used by the package.

DarioS commented 5 years ago

Thanks for the detailed explanation My understanding of it is:

  1. For each particular patient using their pair of samples (1 healthy, 1 (or more) disease), call variants, generating 1 VCF file per patient.
  2. Input the set of VCF for all patients to GeneratePonBedpe to generate a panel-of-normals BEDPE file.

Is it correct? The remaining mystery for me is how to use the derived BEDPE file in future analyses. I imagine it be used in a scenario where the next batch of patients has some patients with a pair of samples and some with only a tumour sample because the healthy sample could not be extracted during surgery. I don't see a parameter to gridss.CallVariants that could accept a BEDPE file as the baseline to analyse those tumour-only patients. It is also unexpected to me that analysing a set of 2500 patients that all had pairs of samples available would benefit from a panel-of-normals kind of analysis. I thought it was a poor-man's pseudo-normal sample, when a biological healthy sample is not available, according to the opinion of the GATK project leader which I read.

d-cameron commented 5 years ago

Correct. The filtering is performed in the downstream analysis after GRIDSS has been run. Whilst it can be used for tumour-only, we have found it useful as for filtering out variants that are not eliminated by the matched normal. Since there is cross-contamination of the normal with the tumour if they're both done in the same sequencing run, as well as tumour contamination of the matched blood sample, using a threshold of QUAL=0 in the normal will remove high ploidy somatic driver events (e.g. at 30x normal, you will expect a read from a 100x amplified HER2 somatic mutation to show up in the normal with blood contamination as low as 0.05%). To ensure these driver variants are kept, you need to have a relatively high contamination threshold (we use <3% QUAL from normal). This high threshold then gives you a higher FDR of germline/artefact variants which we use a large PON to filter out.

DarioS commented 5 years ago

It quickly gets quite complicated. For other readers' understanding, the filtering described is done by gridss_somatic_filter.R in the Hartwig Medical Foundation repository. It takes as input a panel-of-normals which is generated by create_gridss_pon.R. That script depends on a bunch of utility functions in libgridss.R which in turn sources gridss.config.R to define a variety of constants, one of which is gridss.allowable_normal_contamination=0.03. That's the "<3% QUAL from normal" described. Before running the scripts, search for "library(" to find the package dependencies and install each one.

I hope that there will be a guide like GATK Best Practices in the future. It would be enlightening for a novice to see what GRIDSS experts recommend for different scenarios, such as an individual lab project with 50 patients - 40 which have a pair of samples and 10 of which are tumour-only - and another consortium project that has 2000 patients, each having a pair of samples sequenced. The workflows would probably have subtle but important differences. If the standard VCF output has lenient filtering, I also wonder about using GRIDSS for germline structural variant calling with normals.

d-cameron commented 5 years ago

Better documentation is definitely on the TODO list. I just saw a paper released (PMID 30759232) where GRIDSS did terribly. It turns out they were running GRIDSS on ultra high depth (80,000x coverage) targeted sequence data, without changing the max coverage from it's default of 25,000x. I'm hoping to get a protocols paper out later this year that does exactly that: provide best practice guidance on how to run GRIDSS across different use cases and outline potential pitfalls and how to avoid them.

On Mon, Apr 29, 2019 at 3:00 PM Dario Strbenac notifications@github.com wrote:

It quickly gets quite complicated. For other readers' understanding, the filtering described is done by gridss_somatic_filter.R in the Hartwig Medical Foundation repository. It takes as input a panel-of-normals which is generated by create_gridss_pon.R. That script depends on a bunch of utility functions in libgridss.R which in turn sources gridss.config.R to define a variety of constants, one of which is gridss.allowable_normal_contamination=0.03. That's the "<3% QUAL from normal" described. Before running the scripts, search for "library(" to find the package dependencies and install each one.

I hope that there will be a guide like GATK Best Practices https://software.broadinstitute.org/gatk/best-practices/ in the future. It would be enlightening for a novice to see what GRIDSS experts recommend for different scenarios, such as an individual lab project with 50 patients

  • 40 which have a pair of samples and 10 of which are tumour-only - and another consortium project that has 2000 patients, each having a pair of samples sequenced. The workflows would probably have subtle but important differences. If the standard VCF output has lenient filtering, I also wonder about using GRIDSS for germline structural variant calling with normals.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/PapenfussLab/gridss/issues/210#issuecomment-487452179, or mute the thread https://github.com/notifications/unsubscribe-auth/ABOBYOEOQCHOJCHAG3NA77TPSZ6FTANCNFSM4HIJERFA .