Exome Analysis - Githubissues

jasper1918 commented 7 years ago

Hi Daniel,

Having had good success with WGS I am now working on WES data.

On a large batch of exome data gridss is generally overcalling SVs at about .5-1 Million calls per sample. Other methods are calling less than 5000 on average. The exome data is from intact (non-FFPE) material.

I'm running gridss as follows: java -jar gridss-0.11.6-jar-with-dependencies.jar I=sample.bam BL=ENCFF001TDO.bed R=GRCh37.p13.fa O=sample.Gridss.bam.sv.vcf THREADS=10

Does gridss work for WES data and any parameters to pay special attention to?

d-cameron commented 7 years ago

We've been using it on targeted cancer panel data (exome + tiled introns on known fusion genes) with success. There does appear to be a consistent false positive rates between particular genes and their homologs (eg every single sample in that capture panel has a supposed fusion between NOTCH2 and an untargetted pseudogene homolog). As this is in the cancer context we handle these using a panel of normals, and also using the IHOMPOS fields.

IHOMPOS give the span of the inexact homology between the reference sequence at each breakend. A large IHOMPOS such as [-300,300] indicates that the sequence homology extends 300bp (or more, we stop at 300bp) out in either direction thus indicating that the two sequence are basically the same and the call is likely to be a false positive.

Other filters we use are:

variant allele frequency (calculated from REF, REFPAIR, RP, RSR & SR) (this is messy since you have VAF from reads, VAF from read pairs, for both sides (aneuploidy can make an SV het on one side, and hom on the other)
VCF filters (by default we ignore ~90% of GRIDSS output as it doesn't pass the default filters)
ENCODE DAC blacklist of known bad regions (applies to WGS data)
capture region (at least one breakend of each breakpoint should be close to, or within, the capture region as defined by the BED file)

On a large batch of exome data gridss is generally overcalling SVs at about .5-1 Million calls per sample.

GRIDSS is intentionally spammy as it includes many low quality calls but that is indeed unexpectedly high. What is the quality score distribution of the calls? The default minimum quality score threshold has been set such that it just slightly exceeds the expected maximum score for a single read. If your particular data allows a single read to exceed that score (scoring is, in part, based on the aligner MAPQ scores and and the empirical distribution in the input library), then GRIDSS will spam out a variant call for every single discordant read pair or split read which would be consistent with your 1 million calls.

If you create a gridss.properties file containing the following line:

variantcalling.minScore = 100

and set CONFIGURATION_FILE=gridss.properties that will up the default 3 reads (possibly only 2 if you data has very high-scoring reads). If you're applying the default filter, then you could up this to 250 or 500 since those call would get removed by the filter step anyway.

jasper1918 commented 7 years ago

Thanks Daniel. This is very helpful info on how gridss is designed to work. Just as you mentioned, the majority of the events had very low VAF and a QUAL score in the 40s. The IHOMPOS was also helpful in identifying artifact calls.

The QUAL distribution for all calls: q-05 40.2300 q-25 42.2700 q-75 106.4700 q-95 211.3900 median 46.8700

In addition to QUAL filtering, I am also filtering out the DAC blacklist, and including only positions where at least one breakpoint is within the exome definition. I would like to also filter on the vcf FILTER values. I remove the LOW_QUAL values outright but it appears other categories should be handled differently.

My breakdown of these after applying the filters above is as follows: 482 . 174 ASSEMBLY_ONLY 114 NO_ASSEMBLY 1008 SINGLE_ASSEMBLY

The QUAL values for SINGLE_ASSEMBLY, NO_ASSEMBLY, ASSEMBLY_ONLY are quite high and I'm curious of your thoughts on how to approach these. Would you use them with high qual or under what scenarios would you consider them to be true calls? Use SR, RSR and RP values to qualify them?

Thanks for your help!

d-cameron commented 7 years ago

What to do with these variants really depends on how interested you are in repetitive sequence. Unless caused by low coverage, I'd be suspicious of NO_ASSEMBLY variants. Generally speaking, variants with SINGLE_ASSEMBLY support have a repeat element on one side, but are uniquely mappable on the side that could be assembled.

That said, variants with those filters aren't necessarily false positives (they just have a higher FDR) and SR/RSR/RP filtering is not necessarily what you want to do. As an extreme example, in one of my cancer data sets GRIDSS made a correct SINGLE_ASSEMBLY call that was supported by no split reads or read pairs. GRIDSS was able to correctly assemble a 316bp SINE contig that uniquely mapped to the correct location, even though every single 100bp read from that SINE element was incorrectly placed by the aligner.

As for additional filtering, the ENCODE Duke blacklist works well to remove high-scoring false positive variants ('true' being defined as PacBio read support in NA12878) not only for GRIDSS, but also for a number of other calls (Pindel, BreakDancer, etc).

I'm actually in the process of developing a structural variant quality score recalculation package which will take variant calls from any of 10 different callers (GRIDSS, Pindel, BreakDancer, DELLY, CREST, LUMPY, manta, etc) and write a meaningful QUAL score which will actually be the calibrated phred score of the variant (based on GiaB training data). My aim is for this package to to be publicly available by the of the year (although you'll probably see an early version appearing on my github later this month).

WinterLi1993 commented 6 years ago

@d-cameron I am doing tumor-normal sequencing for somatic calling . I have checked your example/somatic.R and I found you just keep these records that their FILTER flag value is '.' and 'PASS' ,but I feel 'ASSEMBLY_ONLY' ,NO_ASSEMBLY,SINGLE_ASSEMBLY also should not be removed . what's your thoughts about that ?

WinterLi1993 commented 6 years ago

@jasper1918 HI, could you share your filter script for somatic calling ? I do not how to filter gridss result is good . somatic.R will remove my true variants in the sample ,because it just consider '.' and 'PASS' for filtering . my data is from tumor target panel sequencing ,sample is tissue DNA or cfDNA from blood .

any help will be welcome !

d-cameron commented 6 years ago

See #118

PapenfussLab / gridss

Exome Analysis #41