Detail of GRIDSS paper figures

ctsa commented 6 years ago

Glad to see GRIDSS out in print @d-cameron , this is an innovative method bringing many interesting new ideas to the field.

I'm following through on the supplementary material from your paper which describes the src/test/sim and src/test/r subdirectories as containing the details of (at least some of) the published analyses. Specifically I'm interested in recreating the NA12878 analysis to see how various tools were run, how each tool's existing FILTER fields were handled and how calls were ranked for the analysis in figure 3. If there is an overview of this process in the GRIDSS repo somewhere or if I missed this in the paper, I'd appreciate your help in pointing out these details.

d-cameron commented 6 years ago

If there is an overview of this process in the GRIDSS repo

The closest I have a step-by-step tutorial for regenerating figure 3 is the outline in the header comments of src/test/r/na12878.R. The other figures also have corresponding reproduction instructions in their corresponding header files.

Overall, the general approach was to have separate shell scripts for each step. Every step has a MD5.metadata containing the relevant information (in bash environment variant assignment format). All shell variables with a CX_ prefix are written to the .metadata file. Input files are taken based on filename suffixes

Further notes on these instructions:

CX_REFERENCE and CX_BLACKLIST in the metadatashould match your reference data locations
CX_BLACKLIST uses the hg19 ENCODE blacklist
You'll need to download the callers to a tools/ directory and put them on path.
run call_*.sh. E.g. call_gridss.sh na12878
- These will spit out a data.na12878/MD5.metadata and data.na12878/MD5.vcf for the caller
- I just noticed call_manta.sh is missing from the GRIDSS source tree (I've been refactoring the comparisons into their own project)
- It can be found at https://github.com/PapenfussLab/sv_benchmark/blob/master/call_manta.sh
- The older manta version used in the GRIDSS paper is https://github.com/PapenfussLab/sv_benchmark/blob/dd1d0a91d11dbc4530ea5556585b7a007b0f7ed0/call_manta.sh (it was 20 months between the GRIDSS manuscript being written and final publication).

how each tool's existing FILTER fields were handled and how calls were ranked for the analysis in figure 3

Once the VCFs were generated, the analysis switches to the R code in src/test/r/na12878.R (https://github.com/PapenfussLab/sv_benchmark/R for more comprehensive comparisons than what were done in the GRIDSS paper). The comparison with other callers finishes at na12878.R:122

Only calls with "PASS" or "." in the FILTER field are included in the high confidence call set. The high and low confidence call set includes all calls regardless of FILTER. Calls are ranked by QUAL (descending), with total read count used as a proxy when no QUAL is written (withqual() function at src/test/r/libvcf.R:573). See src/test/r/libvcf.R:401.

The scripts are generally of the following form:

Load reference data
Load *.metadata into a metadata data frame metadata <- LoadMetadata()
Load *.vcf into a list vcfs <- LoadVcfs(metadata, existingVcfs=vcfs)`
Clean vcfs vcfs <- ...
Convert to data frame of calls CalculateTruthSummary()
Generate ROC data point TruthSummaryToROC()
Generate plots using ggplot

Let me know if you have any further questions.

Cheers Daniel

d-cameron commented 6 years ago

You will get very different lumpy results with the current version of bwa - I think one of the bwa updates made in the last 18 months broke lumpy.

ctsa commented 6 years ago

Thank you for the very detailed answer. I will try to dig further into these scripts tomorrow.

The main item I'm confused by is whether figure 3 represents each caller's FILTER=(PASS|.) calls, which are then ranked by QUAL, or if this represents all calls (regardless of FILTER) ranked by QUAL. I should be able to extract this from your scripts but if you're able to describe this detail that would be helpful.

Secondly, it looks from the scripts like the CX_BLACKLIST is provided to GRIDSS and lumpy at runtime, but I assume it is set to the same blacklist file that is used to filter calls prior to scoring tp/fp based on QUAL rankings. Can you confirm this is the case?

d-cameron commented 6 years ago

The main item I'm confused by is whether figure 3 represents each caller's FILTER=(PASS|.) calls, which are then ranked by QUAL, or if this represents all calls (regardless of FILTER) ranked by QUAL. I should be able to extract this from your scripts but if you're able to describe this detail that would be helpful.

Both. The solid lines are for the subset of calls with FILTER=(PASS|.), the dotted lines are the full call sets. na12878.R:65-77 runs the matching calculations for both versions (the only difference being ignoreFilters=TRUE|FALSE) then merges them into a single data frame with the Filter column representing indicating whether it was the full or pass only call set.

"High confidence only" <-> FILTER=(PASS|.) <-> Solid line "High & Low Confidence" <-> all calls <-> Dotted line

Secondly, it looks from the scripts like the CX_BLACKLIST is provided to GRIDSS and lumpy at runtime, but I assume it is set to the same blacklist file that is used to filter calls prior to scoring tp/fp based on QUAL rankings. Can you confirm this is the case?

This is indeed the case. Both truth and caller variants are ignored if either breakend within 2.5 fragment lengths of a blacklisted region (libvcf.R:296).

ctsa commented 6 years ago

Thanks for the additional detail. Those all sound like sensible choices. I will go ahead and close this item as answered and let you know if there are any followup questions from the linked scripts.

d-cameron commented 6 years ago

Manta: manta-data.na12878.zip GRIDSS: (LOW_QUAL filtered calls have been excluded due to file upload size limits) gridss-bwa-data.na12878.zip

Note that the latest version of both tools should perform better than the results from these versions.

One of my collaborators was working on a large somatic WGS cohort using both manta and GRIDSS. They found similar results from both tools with the exception that manta required an additional filter for some 'obvious' false positives that manta was systematically calling with high QUAL scores. Would you like me to follow up with them to find out what filter logic they are using?

d-cameron commented 6 years ago

longread.bed.zip

Long read .bed files used for the GRIDSS figure 3 (libna12878.R:46) is attached.

PapenfussLab / gridss

Detail of GRIDSS paper figures #100