bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Scalpel InDel calling support #428

Closed mjafin closed 10 years ago

mjafin commented 10 years ago

Looks like VCF support has been added to Scalpel recently: http://sourceforge.net/p/scalpel/code/ci/master/tree/

Opening this ticket while I look into testing Scalpel and integrating it into bcbio; bear with me.

lbeltrame commented 10 years ago

Quite interesting. It looks like it might be another good candidate for somatic variant calling.

mjafin commented 10 years ago

My thoughts exactly. We could implement it as a separate tool or bundle it with MuTect.

lbeltrame commented 10 years ago

Given that MuTect doesn't handle indels and Scalpel fits exactly that niche, IMO it makes sense to bundle the two together.

Brad, what do you think about it?

chapmanb commented 10 years ago

Miika; Thanks for heading this up, Scalpel support would be very welcome. Please let me know how I can help as you get into it.

A focus on somatic calling to start with makes a lot of sense, but it would be great if you could make this general so we could inject it into different tools as an indel supplement. A lot of somatic callers focus on SNPs, so I'd imagine it would be generally useful (as long as it benchmarks well). Thanks again.

mjafin commented 10 years ago

Cool, I'll look into this in the next few weeks.

mjafin commented 10 years ago

OK so I've started work on Scalpel integration here: https://github.com/mjafin/bcbio-nextgen/commit/3d6ff8172c6ae4779e70725d4ad3331738a272f9

I reckon we can make calls to scalpel.py from other callers like mutect.py fairly easily and combine the variants as I'm currently doing for mutect+SID.

I've logged a few tickets about Scalpel here: https://sourceforge.net/p/scalpel/bugs/3/, https://sourceforge.net/p/scalpel/tickets/2/ and https://sourceforge.net/p/scalpel/tickets/1/, but in the meantime I've fixed some of the showstoppers myself in https://github.com/mjafin/scalpel

I've also been testing Scalpel on real data, and in single-sample mode it catches indels well. However, in somatic mode it always produces empty VCF output. My test cases are very obvious indels that are clearly there in the somatic sample and not in the normal. I've checked that coverage is good in both, and in single-sample mode Scalpel calls the indel for the somatic sample fine. I'll see if I can spot something obvious in the code.

chapmanb commented 10 years ago

Miika -- this is great, thank you for doing this work. Definitely keep us up to date on the bugs and tumor/normal results. I'll be happy to integrate this and the scalpel installation whenever you think things are in a useful state. I'm excited to test this out.

mjafin commented 10 years ago

@chapmanb I'm looking to get this working in the next 24h or so. Is there any chance you could introduce an insertion and a deletion into the cancer test data? I think it currently only has SNPs (maybe I'm wrong).

chapmanb commented 10 years ago

Miika; Good idea. I've been wanting to update this with data from the ICGC-TCGA Dream challenge. Synthetic dataset 3 looks like it would be good for our purposes:

https://www.synapse.org/#!Synapse:syn312572/wiki/62018

I'll download and give this a go on subsetting for a more useful test case. Unfortunately I don't believe they've released a truth VCF but we could bug them for this now that the challenge is complete.

lbeltrame commented 10 years ago

On Monday 26 May 2014 at 11:32:11, Brad Chapman wrote:

I'll download and give this a go on subsetting for a more useful test case. Unfortunately I don't believe they've released a truth VCF but we could bug them for this now that the challenge is complete.

Agreed! The more test datasets we have for this, the better. Miika: do you think it would be OK for me to pull from your github clone, or should I wait? I have a few datasets I could use to test it with.

mjafin commented 10 years ago

@chapmanb @lbeltrame Great ideas, both. I know one of my colleagues is taking part in the Dream challenge and has some data, but I don't know if I'm allowed to pass it on. Let me know what you think.

@lbeltrame it would be awesome if you could test it out! You need to clone https://github.com/mjafin/scalpel and run make in the root dir, then add the root dir to your PATH. On the bcbio side you'd need to pull this branch: https://github.com/mjafin/bcbio-nextgen/tree/scalpel. I'll do a pull request against bcbio master soon, but there are still some rough edges.

If you see any bugs lemme know!

mjafin commented 10 years ago

Oh, and scalpel supports min_allele_fraction.

mjafin commented 10 years ago

The way I'm pulling the scalpel calls together is by combining the inferred somatic and common InDels and setting the FILTER field for the common InDels to REJECT (I just made a commit here: https://github.com/mjafin/bcbio-nextgen/commit/3b8aea914e0700bad7c3b67e4ea9199fa7ba5d32). Another way would be to have an INFO field value for somatic or common, I suppose. In tumor-only mode I'm just reporting all discovered indels (obviously not optimal).
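A minimal text-level sketch of that combining step (the function and file names are made up for illustration; this is not the code in the commit above):

def combine_scalpel_calls(somatic_vcf, common_vcf, out_vcf):
    """Merge scalpel somatic and common indel calls, marking common calls REJECT."""
    with open(somatic_vcf) as somatic, open(common_vcf) as common, open(out_vcf, "w") as out:
        for line in somatic:
            if line.startswith("#CHROM"):
                # declare the REJECT filter just before the column header line
                out.write('##FILTER=<ID=REJECT,Description="Indel shared by tumor and normal">\n')
            out.write(line)
        for line in common:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            fields[6] = "REJECT"  # FILTER column
            out.write("\t".join(fields) + "\n")
# Note: the combined output still needs coordinate sorting (and bgzip/tabix) downstream.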

Anyway, @chapmanb, would you like me to do a pull request from my scalpel branch, or would you be able to run the basic tests by cloning it?

chapmanb commented 10 years ago

Miika; Thanks for all this. Understood about the rough edges, but I'm happy to have the basic support merged in so long as it doesn't break existing pipelines. This should make it easier for others to test and get us to a useful set of calls.

I added better tumor/normal test cases using chr22 data from the ICGC-TCGA Dream challenge BAMs, which include indels. I'm not sure if it has a somatic-specific indel (VarScan and FreeBayes disagree on one potential call), but it definitely has more complex regions for assessment, so it can hopefully test Scalpel better.

Thanks again for all the work.

lbeltrame commented 10 years ago

somatic and common InDels, and setting the FILTER field for the common

How do you define "common"?

199fa7ba5d32). Another way would be to have an INFO field value for somatic

I think you should add SOMATIC to putative somatic indels. See my code in freebayes.py to get an idea of how to do so. The reasoning is that these calls will then also be flagged correctly in GEMINI if we use it, which makes selecting somatic variants trivial.

(VarScan 2 needs more adjustments, as I'd love to flag calls that are germline, because IIRC it also adds SOMATIC to those - but I'll get around to looking at it eventually.)

mjafin commented 10 years ago

@lbeltrame By common I mean InDels that are both in the tumour and normal, and have the same inferred zygosity.

@chapmanb Big thanks - how do I trigger bcbio to download/run the chr22 test data?

lbeltrame commented 10 years ago

@lbeltrame By common I mean InDels that are both in the tumour and normal, and have the same inferred zygosity.

OK, thanks for the explanation: indeed, marking REJECT is probably the best choice at the moment.

I'm hoping to get scalpel to run on my data later this week, no promises though (I have to attend a conference next weekend).

mjafin commented 10 years ago

@lbeltrame I'll look into the SOMATIC info field when I get the chance, thanks for the pointer.

I ran FreeBayes previously for the ERP002442 study sample 10-497-T. However, for whatever reason, the VT (somatic, LOH, germline) tag is only present in 2 out of the 600 indels. Scalpel is identifying 300 indels it thinks are somatic.

lbeltrame commented 10 years ago

I ran FreeBayes previously for the ERP002442 study sample 10-497-T. However, for whatever reason, the VT (somatic, LOH, germline) tag is only present in 2 out of the 600 indels. Scalpel is identifying 300 indels it thinks are

Are you running FreeBayes using bcbio-nextgen? Look for SOMATIC tags if so, I made the code strip the original VT tags.

mjafin commented 10 years ago

Are you running FreeBayes using bcbio-nextgen? Look for SOMATIC tags if so, I made the code strip the original VT tags.

Ah, I must be looking at a run that finished quite some time ago then. I'll need to rerun FreeBayes, or better yet, identify a proper dataset like the DREAM one.

chapmanb commented 10 years ago

Miika; The chr22 subset is now part of the test data so will download automatically if you do:

cd bcbio-nextgen/tests
run_tests.sh cancer

It'll be in tests/data/tcga_benchmark.

mjafin commented 10 years ago

Thanks @chapmanb, the output definitely contains (scalpel) indels now!

mjafin commented 10 years ago

Closing, as per https://github.com/chapmanb/bcbio-nextgen/pull/432, thanks!

mjafin commented 10 years ago

Actually, I don't think scalpel is producing any InDels at all. Looking at the logs:

sh: /apps/bcbio-nextgen/latest-devel/rhel6-x64/Cellar/scalpel/0.1.1-beta-2014-05-27/Microassembler/Microassembler: No such file or directory

Looks like the Microassembler part hasn't compiled.

mjafin commented 10 years ago

As an aside, I ran the NA12878 exome benchmark and Scalpel isn't doing too well (even when it's working! I used my own compiled version). HaplotypeCaller is catching some 6400 indels, FreeBayes around 6k and scalpel 3600 indels using default settings. Disappointing, to be honest.

lbeltrame commented 10 years ago

HaplotypeCaller is catching some 6400 indels, FreeBayes around 6k and scalpel 3600 indels using default settings. Disappointing, to be honest.

Did you have a go at somatic indels, e.g. Scalpel compared to SomaticIndelDetector?

mjafin commented 10 years ago

@lbeltrame I'm running the Dream challenge chr19 data now and doing a shootout of FreeBayes, SomaticIndelDetector, Scalpel and an internal variant caller. It should finish any minute now; I'll let you know how the results look.

As another aside, it would be easier to do these comparisons in bcbio if FreeBayes REJECTed non-SOMATIC variants.

lbeltrame commented 10 years ago

As another aside, it would be easier to do these comparisons in bcbio if FreeBayes REJECTed non-SOMATIC variants.

I can patch up the FreeBayes VCF handling if you want me to. I could probably do the same for VarScan 2 for everything that's not SS=2 (0=Reference, 1=Germline, 2=Somatic, 3=LOH, 4=?, 5=Unknown).
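A rough sketch of the SS-based cleanup described above (plain-text VCF handling with hypothetical file names; not the actual varscan.py or freebayes.py code):

def reject_non_somatic(in_vcf, out_vcf):
    """Set FILTER to REJECT for VarScan records whose SS value is not 2 (somatic).
    Assumes SS is reported in the INFO column, as in VarScan's --output-vcf output."""
    with open(in_vcf) as in_handle, open(out_vcf, "w") as out:
        for line in in_handle:
            if line.startswith("#"):
                if line.startswith("#CHROM"):
                    out.write('##FILTER=<ID=REJECT,Description="Not a somatic (SS=2) call">\n')
                out.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
            if info.get("SS") != "2":
                fields[6] = "REJECT"  # keep the record but mark it rejected
            out.write("\t".join(fields) + "\n")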

Brad, what do you think? Would you accept such a PR?

mjafin commented 10 years ago

I also noticed in the chr20 results (SNP only, synthetic I believe) that MuTect beats everything else hands down with its super-low false positive rate. I suspect it's something to do with the panel of normals (PON) it uses for further filtering. It would be a nice downstream step for FreeBayes and any other somatic caller. I need to investigate further whether it's indeed the PON that's the reason.

But back on topic: Scalpel (the Microassembler part) doesn't seem to compile properly for me when I let bcbio do the install - I wonder if I'm the odd one out?

chapmanb commented 10 years ago

Miika -- thanks for catching the scalpel run problems. It's frustrating that it fails silently: it looks like we could patch their Perl wrapper's calls to system to add a check and die on failure. Practically, I fixed the Microassembler compilation issue and a second issue where system points to /bin/sh but uses bash-isms, and it's working cleanly for me on the test dataset. Let me know if this doesn't work on your machine and we can patch more if needed.

I'm bummed about the comparison results. It might be worth checking back with Giuseppe to see if there are tweaks we can do to improve sensitivity. I know a lot of folks have promoted scalpel, so it may be a matter of tuning.

Miika and Luca -- more generally, harmonizing output approaches from FreeBayes and VarScan makes good sense. Sometimes callers are 99% of the way there and adding the last custom filtering steps bridges the gap to make them competitive. Based on my experience with germline calling, having 2 or 3 equally sensitive/specific callers for cancer makes it easiest to keep moving forward and improving, since they push each other.

lbeltrame commented 10 years ago

Miika and Luca -- more generally, harmonizing output approaches from FreeBayes and VarScan makes good sense. Sometimes callers are 99% of the

My plan would be:

VarScan needs a slight adjustment because it puts SOMATIC even on records other than SS=2 (like SS=3, which is LOH), so adding REJECT would help here, IMO.

Timeline: unknown, hopefully next week (way too busy with the aftermath - a good one! - of the latest analyses ;).

mjafin commented 10 years ago

@chapmanb I agree with you there; I'll email Giuseppe about the results to see if there are any tuning opportunities for germline calling. On a positive note, it's doing better in tumour/normal calling. Here are some ballpark results for the Dream chr19 data:

caller      SNP/TP  InD/TP  SNP/FN  InD/FN  SNP/FP  InD/FP
vardict     851     848     7       20      2466    717
scalpel     0       728     858     140     0       1058
mutect      846     NA      12      NA      102     NA
freebayes   857     711     0       157     1678    2024
SID         NA      496     NA      372     NA      1343

Based on these very early results, MuTect looks superb for SNPs. For InDels, SID, with some very rudimentary filtering for allele frequency difference, is pants. Scalpel isn't too bad with InDels, but our (fairly simple) in-house caller (VarDict) is calling more true positives and fewer false positives for indels. FreeBayes isn't bad in terms of true positives but produces quite a lot of false calls.
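To put the indel columns above on a common scale (sensitivity = TP / (TP + FN), precision = TP / (TP + FP)), a quick back-of-the-envelope calculation from the table:

# Indel TP/FN/FP taken from the chr19 table above
indel_counts = {
    "vardict": (848, 20, 717),
    "scalpel": (728, 140, 1058),
    "freebayes": (711, 157, 2024),
    "SID": (496, 372, 1343),
}
for caller, (tp, fn, fp) in sorted(indel_counts.items()):
    sens = float(tp) / (tp + fn)
    prec = float(tp) / (tp + fp)
    print("%-10s indel sensitivity %.2f  precision %.2f" % (caller, sens, prec))

This works out to roughly 0.98/0.54 for VarDict, 0.84/0.41 for Scalpel, 0.82/0.26 for FreeBayes and 0.57/0.27 for SID (sensitivity/precision) on indels.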

Here's the table for the EdgeBio NA12878 sample (true positives only, the false calls are fairly uniform):

caller          SNP/TP  InD/TP
vardict         72291   5304
gatk-haplotype  76733   6455
freebayes       76443   5989
scalpel         0       3625
gatk            76766   5097

lbeltrame commented 10 years ago

SID, with some very rudimentary filtering for allele frequency difference, is pants. Scalpel isn't too bad with InDels, but our (fairly simple)

Miika, would you mind doing some tests with VarScan 2 if you have the chance? When I did the analyses I'm working on now, that was the only tool I could use to extract somatic indels, and I'm curious about its performance.

mjafin commented 10 years ago

@lbeltrame sure, should've added it to the list from the beginning! Edit: Which variants should I PASS for VarScan (LOH, SOMATIC?)?

lbeltrame commented 10 years ago

@lbeltrame sure, should've added it to the list from the beginning!

Thanks! The reason is that, alongside "regular" indels, I've found some medium-to-large insertions/deletions that look like artifacts at first sight, and I'm curious how it performs compared to other specialized tools.

(Sooner or later this will warrant thinking about specialized ensemble calling for somatic variants, I think.)

mjafin commented 10 years ago

@lbeltrame ensemble variant calling for tumours is at the top of our list of priorities, but it requires some thinking about which features to use. It should happen sooner rather than later!

@chapmanb thanks for the fixes, I managed to brew uninstall/reinstall scalpel now!

chapmanb commented 10 years ago

Miika; Awesome, thanks for these numbers. For indels, have you tried avoiding low complexity regions with remove_lcr: true in the configuration (https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html?highlight=lcr#variant-calling)?

Low complexity regions bias the results, and I think at this point it's best just to accept that we are not good at calling indels there and evaluate the other 98% of the genome (http://bcbio.wordpress.com/2014/05/12/wgs-trio-variant-evaluation/). It would be interesting to see if that skews the results at all here.

It looks like we need better filtering to turn down the indel false positive rate for all the approaches. For FreeBayes specifically, this is similar to how germline results looked before applying QUAL and DP filters, so doing something similar here might help.

Thanks again, brilliant work.

mjafin commented 10 years ago

@chapmanb Haven't tried remove_lcr: true yet, thanks for the reminder/pointer! I'll do a new run from scratch using the setting (I presume I must wipe out all variant calls for remove_lcr: true to take effect?)

mjafin commented 10 years ago

This is slightly off-topic, but I'm getting these errors with VarScan:

[2014-06-05 10:23] ukapdlnx117: Genotyping with varscan: [u'19', 23986716, 27961482] 1_2014-06-04_tumor-paired-sort-19_23986716_27961482-prep.bam
[2014-06-05 10:23] ukapdlnx115: Unexpected error
Traceback (most recent call last):
  File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/distributed/ipythontasks.py", line 40, in _setup_logging
    yield config
  File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/distributed/ipythontasks.py", line 160, in variantcall_sample
    return ipython.zip_args(apply(genotype.variantcall_sample, *args), config)
  File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/variation/genotype.py", line 186, in variantcall_sample
    region, call_file)
  File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/variation/varscan.py", line 27, in run_varscan
    assoc_files, region, out_file)
  File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/variation/samtools.py", line 37, in shared_variantcall
    or not all(realign.has_aligned_reads(x, region) for x in align_bams)):
  File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/variation/samtools.py", line 37, in <genexpr>
    or not all(realign.has_aligned_reads(x, region) for x in align_bams)):
  File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/variation/realign.py", line 137, in has_aligned_reads
    for item in cur_bam.fetch(region[0], int(region[1]), int(region[2])):
  File "csamtools.pyx", line 1059, in pysam.csamtools.Samfile.fetch (pysam/csamtools.c:12490)
  File "csamtools.pyx", line 989, in pysam.csamtools.Samfile._parseRegion (pysam/csamtools.c:11668)
  File "csamtools.pyx", line 923, in pysam.csamtools.Samfile.gettid (pysam/csamtools.c:10827)
  File "csamtools.pyx", line 57, in pysam.csamtools._force_bytes (pysam/csamtools.c:3393)
TypeError: Expected bytes, got unicode

..but only when running in the queue. Seems to be running fine in local single core mode. Odd.
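The TypeError above is pysam's fetch receiving a unicode contig name under Python 2 (likely because the region round-trips through IPython as unicode in distributed runs, which would explain why local runs are fine). One plausible local workaround -- not necessarily the fix bcbio ended up using -- is to coerce the region to native str/int before fetching:

import pysam

def has_aligned_reads_py2safe(bam_file, region):
    """Return True if the BAM has any reads in region = (chrom, start, end).
    str()/int() coercion avoids 'Expected bytes, got unicode' from older pysam
    on Python 2 when the region arrives as unicode."""
    chrom, start, end = region
    cur_bam = pysam.Samfile(bam_file, "rb")
    try:
        for _ in cur_bam.fetch(str(chrom), int(start), int(end)):
            return True
        return False
    finally:
        cur_bam.close()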

chapmanb commented 10 years ago

For LCRs, the fastest way is to keep the variant calls and remove the validate directory; the LCR exclusion will then be applied to the validation instead of needing to re-do and filter the calls. Restarting will also work, so whatever is easiest.

Thanks for the VarScan report. I saw that at scale for HaplotypeCaller as well and fixed it by removing the empty-file check, since GATK no longer has issues with empty BAMs/no reads. I'm not sure about the underlying issue with pysam, but we might be able to do something similar here. Let me look.

lbeltrame commented 10 years ago

This is slightly off-topic, but I'm getting these errors with VarScan: ..but only when running in the queue. Seems to be running fine in local single core mode. Odd.

Hm, fishy. It looks, however, like some other code path. VarScan 2 ran fine the last time I tried it, but I admit I haven't had time to keep up with bcbio-nextgen's development (that's what you get for being almost the only bioinformatics person in the whole institution...).

mjafin commented 10 years ago

Well, VarScan is officially also pants :) (worse than SID on InDels):

caller   SNP/TP  InD/TP  SNP/FN  InD/FN  SNP/FP  InD/FP
varscan  837     407     21      461     4256    930

lbeltrame commented 10 years ago

I assume this includes the strand bias filter? (If you ran it through bcbio-nextgen, yes, it was included.) Grim results indeed...

chapmanb commented 10 years ago

Miika; Thanks for this. That matches what I found for VarScan in germline calling, although I was holding out hope that paired calling might be better:

http://bcbio.wordpress.com/2013/02/06/an-automated-ensemble-method-for-combining-and-evaluating-genomic-variants-from-multiple-callers/

From these results it seems like the best calling approach right out of the box is MuTect plus an indel caller. If we can tweak and improve Scalpel, that would be ideal; Pindel is also an option. I also have faith we can tweak FreeBayes calls to improve them, based on success with germline calling. It would be good to have a couple of choices, and I'd also like to have a non-licensed option.

Thanks again, that's a huge help to focus future effort. Much appreciated.

mjafin commented 10 years ago

@chapmanb Which LCR BED file would you recommend pulling? (I'm not using bcbio to manage my genomes)

chapmanb commented 10 years ago

Miika; Here are the download URL and post-processing steps to get a bgzipped, indexed BED file. This also includes some grep magic to convert to hg19 if you prefer that over GRCh37:

https://github.com/chapmanb/cloudbiolinux/blob/master/cloudbio/biodata/dbsnp.py#L170
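For reference, once the LCR BED file is in place, excluding those regions from an existing VCF outside of bcbio is a couple of lines with pybedtools (file names here are made up; bcbio's remove_lcr option handles this for you internally):

import pybedtools

# Drop calls overlapping low complexity regions; -header keeps the VCF header intact
calls = pybedtools.BedTool("scalpel-calls.vcf")
calls.intersect("LCR.bed.gz", v=True, header=True).saveas("scalpel-calls-nolcr.vcf")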

lbeltrame commented 10 years ago

I wonder if, given these results, we can split out some bullet points of things to do from here, so we can potentially divide up the work ahead.

mjafin commented 10 years ago

@chapmanb Thanks, I'll give that a try!

@lbeltrame

The chr19 Dream data is synthetic, so we could possibly use it as a basic benchmark dataset. I took the BAM files, used samtools to extract the reads into fastq (discarding singletons) and started from there.

mjafin commented 10 years ago

Just a quick update: excluding the low complexity regions reduced the false positive InDel calls for both Scalpel (1058) and VarDict (717) to around 350. Although I think Scalpel was advertised in the paper as being immune to repetitive regions...?

mjafin commented 10 years ago

OK, another quick update: with the latest fixes to FreeBayes and excluding the low complexity regions, its indel false positives have dropped to 185! The SNP FP rate is still relatively high, but it's something to work from.

I'll close the ticket for now.