Quite interesting. It looks like it might be another good candidate for somatic variant calling.
My thoughts exactly. Could implement it as a separate tool or bundle it with MuTect.
Given that MuTect doesn't handle indels and Scalpel fits that niche exactly, IMO it makes sense to bundle the two together.
Brad, what do you think about it?
Miika; Thanks for heading this up, Scalpel support would be very welcome. Please let me know how I can help as you get into it.
A focus on somatic calling to start with makes a lot of sense, but it would be great if you could make this general so we could inject it into different tools as an indel supplement. A lot of somatic callers focus on SNPs, so I'd imagine it would be generally useful (as long as it benchmarks well). Thanks again.
cool, I'll look into this in the next few weeks
OK so I've started work on Scalpel integration here: https://github.com/mjafin/bcbio-nextgen/commit/3d6ff8172c6ae4779e70725d4ad3331738a272f9
I reckon we can make calls to scalpel.py from other callers like mutect.py fairly easily and combine the variants as I'm currently doing for mutect+SID.
I've logged a few tickets about Scalpel here: https://sourceforge.net/p/scalpel/bugs/3/, https://sourceforge.net/p/scalpel/tickets/2/ and https://sourceforge.net/p/scalpel/tickets/1/, but in the meantime I fixed some of the showstoppers myself in https://github.com/mjafin/scalpel
I've also been testing Scalpel on real data, and in single-sample mode it catches indels well. However, in somatic mode it always produces empty VCF output. My test cases are very obvious indels that are clearly present in the somatic sample and not in the normal. I've checked that coverage is good in both, and in single-sample mode Scalpel reports the indel for the somatic sample fine. I'll see if I can spot anything obvious in the code.
Miika -- this is great, thank you for doing this work. Definitely keep us up to date on the bugs and tumor/normal results. I'll be happy to integrate this and scalpel installation whenever you think things are in a useful state. I'm excited to test this out.
@chapmanb looking to get this working in the next 24h or so. Is there any chance you could introduce an insertion and a deletion into the cancer test data? I think it currently only has SNPs (maybe I'm wrong).
Miika; Good idea. I've been wanting to update this with data from the ICGC-TCGA Dream challenge. Synthetic dataset 3 looks like it would be good for our purposes:
https://www.synapse.org/#!Synapse:syn312572/wiki/62018
I'll download and give this a go on subsetting for a more useful test case. Unfortunately I don't believe they've released a truth VCF but we could bug them for this now that the challenge is complete.
On Monday 26 May 2014 at 11:32:11, Brad Chapman wrote:
I'll download and give this a go on subsetting for a more useful test case. Unfortunately I don't believe they've released a truth VCF but we could bug them for this now that the challenge is complete.
Agreed! The more test datasets we have for this, the better. Miika: do you think it would be OK for me to pull from your github clone, or should I wait? I have a few datasets I could use to test it with.
@chapmanb @lbeltrame Great ideas both - I know one of my colleagues is partaking in the Dream challenge and has some data but I don't know if I'm allowed to pass it on. Let me know what you think.
@lbeltrame would be awesome if you could test it out! You need to clone https://github.com/mjafin/scalpel and run make in the root dir, then add the root dir to your PATH. In terms of bcbio you'd need to pull this: https://github.com/mjafin/bcbio-nextgen/tree/scalpel. I'll do a pull request against bcbio master soon, but there are still some rough corners. If you see any bugs, let me know!
Edit: you'll also need samtools in your PATH. Even though this hack speeds scalpel up quite a bit (it avoids loading the full genome into memory for every call), it's still somewhat slow compared to the other tools.
Edit 2: Oh, and scalpel supports min_allele_fraction.
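(For context, a rough sketch of the kind of region-extraction hack mentioned in the edit above; the helper name and the use of samtools faidx are assumptions, not necessarily what the fork actually does.)

```python
import subprocess

def extract_region_fasta(ref_fasta, region, out_fasta):
    """Pull a single region out of an indexed reference with samtools faidx,
    so the assembler works from a small FASTA instead of the whole genome
    (hypothetical helper illustrating the hack described above)."""
    # region is a samtools-style string, e.g. "22:20000000-21000000";
    # requires ref_fasta to already have a .fai index (samtools faidx ref_fasta)
    with open(out_fasta, "w") as out_handle:
        subprocess.check_call(["samtools", "faidx", ref_fasta, region],
                              stdout=out_handle)
    return out_fasta
```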
So the way I'm pulling the scalpel calls together is by combining inferred somatic and common InDels, and setting the FILTER field for the common InDels to REJECT (just made a commit here https://github.com/mjafin/bcbio-nextgen/commit/3b8aea914e0700bad7c3b67e4ea9199fa7ba5d32). Another way would be to have an INFO field value for somatic or common I suppose. In tumor-only mode I'm just reporting all discovered indels (obviously not optimal).
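(A minimal sketch of the re-tagging half of that combining step, assuming pysam is available; the function and file handling are illustrative only, and the real logic lives in the commit linked above.)

```python
import pysam

def mark_common_as_reject(common_vcf, out_vcf):
    """Re-tag scalpel 'common' indels (present in tumour and normal) with a
    REJECT filter; the somatic VCF is left as-is and the two files get
    concatenated and sorted afterwards (illustrative sketch only)."""
    vcf_in = pysam.VariantFile(common_vcf)
    # declare the filter in the header so it can be assigned to records
    vcf_in.header.filters.add("REJECT", None, None,
                              "Indel present in both tumour and normal")
    with pysam.VariantFile(out_vcf, "w", header=vcf_in.header) as vcf_out:
        for rec in vcf_in:
            rec.filter.clear()
            rec.filter.add("REJECT")
            vcf_out.write(rec)
    return out_vcf
```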
Anyways, @chapmanb would you like me to do a pull request of my scalpel branch, or would you be able to run the basic tests by cloning my scalpel branch?
Miika; Thanks for all this. Understood about the rough edges, but I'm happy to have the basic support merged in so long as it doesn't break existing pipelines. This should make it easier for others to test and get us to a useful set of calls.
I added better tumor/normal test cases using chr22 data from the ICGC-TCGA dream challenge BAMs that includes indels. I'm not sure if it has a somatic specific indel (VarScan and FreeBayes disagree on one potential) but it definitely has more complex regions for assessment so can hopefully test Scalpel better.
Thanks again for all the work.
somatic and common InDels, and setting the FILTER field for the common
How do you define "common"?
Another way would be to have an INFO field value for somatic
I think you should add SOMATIC to putative positive somatic indels. See my code in freebayes.py to get an idea of how to do so. The reasoning is that by doing so we'll also get these calls flagged correctly in GEMINI if we use it, and that makes selecting somatic variants trivial.
(VarScan 2 needs more adjustments as I'd love to flag calls that are germline, because IIRC it adds SOMATIC to those as well - but I'll get to look at it eventually.)
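(A minimal sketch of what adding such a flag could look like with pysam; the helper and the is_somatic placeholder are hypothetical, and the actual freebayes.py code referred to above will differ.)

```python
import pysam

def add_somatic_flag(in_vcf, out_vcf, is_somatic):
    """Add a SOMATIC flag to the INFO field of records that pass a
    tumour-vs-normal test, so tools like GEMINI can select them directly.
    is_somatic is a placeholder callable; this is a sketch, not bcbio's code."""
    vcf_in = pysam.VariantFile(in_vcf)
    # declare the flag in the header so it can be set on records
    vcf_in.header.info.add("SOMATIC", 0, "Flag",
                           "Variant called as somatic (tumour-specific)")
    with pysam.VariantFile(out_vcf, "w", header=vcf_in.header) as vcf_out:
        for rec in vcf_in:
            if is_somatic(rec):
                rec.info["SOMATIC"] = True
            vcf_out.write(rec)
    return out_vcf
```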
@lbeltrame By common I mean InDels that are both in the tumour and normal, and have the same inferred zygosity.
@chapmanb Big thanks - how do I trigger bcbio to download/run the chr22 test data?
@lbeltrame By common I mean InDels that are both in the tumour and normal, and have the same inferred zygosity.
Ok, thanks for the explanation: indeed, marking them REJECT would probably be the best choice at the moment.
I'm hoping to get scalpel to run on my data later this week, no promises though (I have to attend a conference next weekend).
@lbeltrame I'll look into the SOMATIC info field when I get the chance, thanks for the pointer.
I ran FreeBayes previously for the ERP002442 study sample 10-497-T. However, for whatever reason, the VT (somatic, LOH, germline) tag is only present in 2 out of the 600 indels. Scalpel is identifying 300 indels it thinks are somatic.
I ran FreeBayes previously for the ERP002442 study sample 10-497-T. However, for whatever reason, the VT (somatic, LOH, germline) tag is only present in 2 out of the 600 indels. Scalpel is identifying 300 indels it thinks are somatic.
Are you running FreeBayes using bcbio-nextgen? Look for SOMATIC tags if so, I made the code strip the original VT tags.
Are you running FreeBayes using bcbio-nextgen? Look for SOMATIC tags if so, I made the code strip the original VT tags.
Ah, I must be looking at a run finished quite some time ago then. Will need to rerun FreeBayes. Or better yet, identify a proper dataset like the DREAM one.
Miika; The chr22 subset is now part of the test data so it will download automatically if you do:
cd bcbio-nextgen/tests
run_tests.sh cancer
It'll be in tests/data/tcga_benchmark.
Thanks @chapmanb, the output definitely contains (scalpel) indels now!
Closing, as per https://github.com/chapmanb/bcbio-nextgen/pull/432, thanks!
Actually, I don't think scalpel is producing any InDels at all. Looking at the logs:
sh: /apps/bcbio-nextgen/latest-devel/rhel6-x64/Cellar/scalpel/0.1.1-beta-2014-05-27/Microassembler/Microassembler: No such file or directory
Looks like the Microassembler part hasn't compiled.
As an aside, I ran the NA12878 exome benchmark and Scalpel isn't doing too well (even when it's working! I used my own compiled version). HaplotypeCaller is catching some 6400 indels, FreeBayes around 6k and scalpel 3600 indels using default settings. Disappointing, to be honest.
HaplotypeCaller is catching some 6400 indels, FreeBayes around 6k and scalpel 3600 indels using default settings. Disappointing, to be honest.
Did you have a go at somatic indels, e.g. Scalpel compared to the Somatic Indel Detector?
@lbeltrame running the Dream challenge chr19 data now and doing a shootout of FreeBayes, SomaticIndelDetector, Scalpel and an internal variant caller. It should finish any minute now; I'll let you know how the results look.
As another aside, it would be easier to do these comparisons in bcbio if FreeBayes REJECTed non-SOMATIC variants.
As another aside, it would be easier to do these comparisons in bcbio if FreeBayes REJECTed non-SOMATIC variants.
I can patch up the FreeBayes VCF handling if you want me to. I could probably do the same for VarScan 2 for everything that's not SS=2 (0=Reference, 1=Germline, 2=Somatic, 3=LOH, 4=?, 5=Unknown).
Brad, what do you think? Would you accept such a PR?
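(Roughly the kind of adjustment being proposed for VarScan, as a sketch only; it assumes the SS somatic status is exposed in the INFO field, and the real patch would need to fit bcbio's varscan.py handling.)

```python
import pysam

def reject_non_somatic_varscan(in_vcf, out_vcf):
    """Soft-filter VarScan 2 paired calls whose somatic status is not SS=2
    (0=Reference, 1=Germline, 2=Somatic, 3=LOH, 5=Unknown) with REJECT
    (sketch only; assumes SS is present in the INFO field)."""
    vcf_in = pysam.VariantFile(in_vcf)
    vcf_in.header.filters.add("REJECT", None, None,
                              "VarScan call with somatic status other than SS=2")
    with pysam.VariantFile(out_vcf, "w", header=vcf_in.header) as vcf_out:
        for rec in vcf_in:
            ss = rec.info.get("SS")
            if str(ss) != "2":  # anything that is not flagged as somatic
                rec.filter.add("REJECT")
            vcf_out.write(rec)
    return out_vcf
```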
I also noticed in the chr20 (SNP only, synthetic I believe) results that MuTect beats everything else hands down with its super-low false positive rate. I suspect it has something to do with the panel of normals (PON) it uses for further filtering. It would be a nice downstream step for FreeBayes and any other somatic callers. I need to investigate further whether it's indeed the PON that's the reason.
But, back on-topic, Scalpel (the microassembler part) doesn't seem to compile properly for me when letting bcbio do the install - I wonder if I'm the odd one out?
Miika -- thanks for catching the scalpel run problems. It's frustrating that it fails silently: it looks like we could patch their perl wrapper calls to system to add a check and die on failure. Practically, I fixed the Microassembler compile issue and a second issue where system points to /bin/sh but uses bash-isms, and it's working cleanly for me on the test dataset. Let me know if this doesn't work on your machine and we can patch more if needed.
I'm bummed about the comparison results. It might be worth checking back with Giuseppe to see if there are tweaks we can do to improve sensitivity. I know a lot of folks have promoted scalpel, so it may be a matter of tuning.
Miika and Luca -- more generally, harmonizing output approaches from FreeBayes and VarScan makes good sense. Sometimes callers are 99% of the way there and adding final custom filtering steps bridges the gap to make them competitive. Having 2 or 3 equally sensitive/specific callers for cancer makes it easiest to move forward and keep improving since they push each other, based on my experience with germline calling.
Miika and Luca -- more generally, harmonizing output approaches from FreeBayes and VarScan makes good sense.
My plan would be:
VarScan needs a slight adjustment because it puts SOMATIC even on records other than SS=2 (like SS=3, which is LOH), so putting REJECT would help here, IMO.
Timeline: unknown, hopefully next week (way too busy with the aftermath - a good one! - of the latest analyses ;).
@chapmanb agree with you there, I'll email Giuseppe about the results to see if there are any tuning opportunities for germline. On a positive note, it's doing better in tumour/normal calling. Here are some ballpark results for the Dream chr19 data:
caller SNP/TP InD/TP SNP/FN InD/FN SNP/FP InD/FP
vardict 851 848 7 20 2466 717
scalpel 0 728 858 140 0 1058
mutect 846 NA 12 NA 102 NA
freebayes 857 711 0 157 1678 2024
SID NA 496 NA 372 NA 1343
Based on these very early results MuTect looks superb for SNPs. For InDels, SID, with some very rudimentary filtering for allele frequency difference, is pants. Scalpel isn't too bad with InDels, but our (fairly simple) in-house caller (VarDict) is calling more true positives and fewer false positives in terms of indels. FreeBayes isn't bad in terms of true positives but produces quite a lot of false calls.
Here's the table for the EdgeBio NA12878 sample (true positives only, the false calls are fairly uniform):
caller SNP/TP InD/TP
vardict 72291 5304
gatk-haplotype 76733 6455
freebayes 76443 5989
scalpel 0 3625
gatk 76766 5097
SID, with some very rudimentary filtering for allele frequency difference, is pants. Scalpel isn't too bad with InDels.
Miika, mind doing some tests with VarScan 2 if you have the chance? When I did the analyses I'm working on now, that was the only tool I could use to extract somatic indels, and I'm curious about its performance.
@lbeltrame sure, should've added it to the list from the beginning! Edit: Which variants should I PASS for VarScan (LOH, SOMATIC?)?
@lbeltrame sure, should've added it to the list from the beginning!
Thanks! The reason is that, along with "regular" indels, I've found some medium-large insertions/deletions that look like artifacts at first sight, and I'm curious how it performs compared to other specialized tools.
(Sooner or later this will warrant thinking about doing a specialized ensemble calling for somatic variants, I think)
@lbeltrame ensemble variant calling in tumour is at the top of our list of priorities, but it requires some thinking about which features to use. Should happen sooner rather than later!
@chapmanb thanks for the fixes, I managed to brew uninstall/reinstall scalpel now!
Miika;
Awesome, thanks for these numbers. For indels, have you tried avoiding low complexity regions with remove_lcr: true in the configuration (https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html?highlight=lcr#variant-calling)?
Low complexity regions bias the results, and I think at this point it's best just to accept that we are not good at calling indels there and evaluate the other 98% of the genome (http://bcbio.wordpress.com/2014/05/12/wgs-trio-variant-evaluation/). It would be interesting to see if that skews the results at all here.
It looks like we need better filtering to turn down the indel false positives for all the approaches. For FreeBayes specifically, this is similar to how the germline results looked before applying QUAL and DP filters, so doing something similar here might help.
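(As an illustration of that kind of hard filter, a sketch driving bcftools from Python; the QUAL/DP thresholds are placeholders rather than values bcbio actually applies.)

```python
import subprocess

def hard_filter_freebayes(in_vcf, out_vcf, min_qual=20, min_depth=10):
    """Soft-filter FreeBayes calls failing simple QUAL and depth cutoffs
    using bcftools; thresholds are illustrative and would need tuning
    against the DREAM truth set."""
    expr = "QUAL<{qual} || INFO/DP<{depth}".format(qual=min_qual, depth=min_depth)
    # -s names the soft filter, -m + appends it to any existing FILTER values
    subprocess.check_call(["bcftools", "filter", "-e", expr,
                           "-s", "LowQualDepth", "-m", "+",
                           "-O", "v", "-o", out_vcf, in_vcf])
    return out_vcf
```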
Thanks again, brilliant work.
@chapmanb Haven't tried remove_lcr: true yet, thanks for the reminder/pointer! I'll do a new run from scratch using the setting (I presume I must wipe out all variant calls for remove_lcr: true to take effect?)
This is slightly off-topic, but I'm getting these errors with VarScan:
[2014-06-05 10:23] ukapdlnx117: Genotyping with varscan: [u'19', 23986716, 27961482] 1_2014-06-04_tumor-paired-sort-19_23986716_27961482-prep.bam
[2014-06-05 10:23] ukapdlnx115: Unexpected error
Traceback (most recent call last):
File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/distributed/ipythontasks.py", line 40, in _setup_logging
yield config
File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/distributed/ipythontasks.py", line 160, in variantcall_sample
return ipython.zip_args(apply(genotype.variantcall_sample, *args), config)
File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/variation/genotype.py", line 186, in variantcall_sample
region, call_file)
File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/variation/varscan.py", line 27, in run_varscan
assoc_files, region, out_file)
File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/variation/samtools.py", line 37, in shared_variantcall
or not all(realign.has_aligned_reads(x, region) for x in align_bams)):
File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/variation/samtools.py", line 37, in <genexpr>
or not all(realign.has_aligned_reads(x, region) for x in align_bams)):
File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/bcbio_nextgen-0.8.0a-py2.7.egg/bcbio/variation/realign.py", line 137, in has_aligned_reads
for item in cur_bam.fetch(region[0], int(region[1]), int(region[2])):
File "csamtools.pyx", line 1059, in pysam.csamtools.Samfile.fetch (pysam/csamtools.c:12490)
File "csamtools.pyx", line 989, in pysam.csamtools.Samfile._parseRegion (pysam/csamtools.c:11668)
File "csamtools.pyx", line 923, in pysam.csamtools.Samfile.gettid (pysam/csamtools.c:10827)
File "csamtools.pyx", line 57, in pysam.csamtools._force_bytes (pysam/csamtools.c:3393)
TypeError: Expected bytes, got unicode
...but only when running in the queue. It seems to run fine in local single-core mode. Odd.
For LCRs, the fastest way is to keep the variant calls and remove the validate directory; it will then apply the filter to the validation instead of needing to re-do and filter the calls. Restarting will also work, so whatever is easiest.
Thanks for the VarScan report. I saw that at scale for HaplotypeCaller as well and fixed it by removing the empty-file check, since GATK no longer has issues with empty BAMs/no reads. I'm not sure about the underlying issue with pysam, but we might be able to do something similar here. Let me look.
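(For reference, one possible caller-side workaround for the bytes/unicode error is to coerce the region values before the fetch; this sketch mirrors the has_aligned_reads helper in the traceback and is not necessarily the fix that went in.)

```python
import pysam

def has_aligned_reads(bam_file, region):
    """Return True if the BAM has any aligned reads in region, coercing the
    chromosome name to a plain (byte) string first so pysam under Python 2
    does not trip over unicode contig names (workaround sketch)."""
    chrom, start, end = str(region[0]), int(region[1]), int(region[2])
    cur_bam = pysam.Samfile(bam_file, "rb")
    try:
        for _ in cur_bam.fetch(chrom, start, end):
            return True
        return False
    finally:
        cur_bam.close()
```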
This is slightly off-topic, but I'm getting these errors with VarScan: ..but only when running in the queue. Seems to be running fine in local single core mode. Odd.
Hm, fishy. It looks however like some other code path. VarScan 2 ran fine last time I did it, but I admit I didn't have time to keep up with bcbio-nextgen's development (what you get for being almost the only bioinfo person in the whole institution...).
Well, varscan is officially also pants :) (Worse than SID in InDels)
caller SNP/TP InD/TP SNP/FN InD/FN SNP/FP InD/FP
varscan 837 407 21 461 4256 930
I assume this does include the strand bias filter? (If you ran it through bcbio-nextgen, yes, it was included.) Grim results indeed...
Miika; Thanks for this. That matches what I found for VarScan in germline calling, although I was holding out hope that paired calling might be better:
From these results it seems like the best calling approach right out of the box is MuTect plus an indel caller. If we can tweak and improve scalpel that would be ideal. Pindel is also an option. I also have faith we can tweak FreeBayes calls to improve them, based on success with germline calling. It would be ideal to have a couple of choices, and I'd also like to have a non-licensed option.
Thanks again, that's a huge help to focus future effort. Much appreciated.
@chapmanb Which LCR bed file would you recommend pulling? (I'm not using bcbio to manage my genomes)
Miika; Here is the download URL and the post-processing steps to get a bgzipped, indexed BED file. It also includes grep magic to convert to hg19 if you prefer that over GRCh37:
https://github.com/chapmanb/cloudbiolinux/blob/master/cloudbio/biodata/dbsnp.py#L170
I wonder if, given these results, we can split out some bullet points of things that can be done from here? So we can, potentially, split the work ahead.
@chapmanb Thanks I'll give that a try!
@lbeltrame
The chr19 Dream data is synthetic so we could possibly use that as a basic benchmark data set. I took the BAM files, used samtools to extract the reads into fastq (discarding singletons) and started from there.
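(A rough sketch of that extraction step, assuming a reasonably recent samtools; the exact commands used at the time may have differed.)

```python
import subprocess

def bam_to_paired_fastq(bam_file, out_prefix):
    """Name-sort a BAM and write properly paired reads to FASTQ, sending
    singletons and unpaired reads to /dev/null (sketch; assumes a samtools
    release with the 'sort -o' and 'fastq' subcommands)."""
    namesorted = out_prefix + "-namesorted.bam"
    subprocess.check_call(["samtools", "sort", "-n", "-o", namesorted, bam_file])
    subprocess.check_call(["samtools", "fastq",
                           "-1", out_prefix + "_1.fastq",
                           "-2", out_prefix + "_2.fastq",
                           "-0", "/dev/null", "-s", "/dev/null",  # drop odd/singleton reads
                           namesorted])
    return out_prefix + "_1.fastq", out_prefix + "_2.fastq"
```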
Just a quick update, excluding the low complexity regions reduced the false positive InDel calls for Scalpel (1058) and VarDict (717) to around 350. Although I think Scalpel was advertised in the paper to be immune to repetitive regions..?
OK, another quick update: with the latest fixes to FreeBayes and excluding the low complexity regions, its indel false positive rate has dropped to 185! The SNP FP rate is still relatively high but it's something to work from.
I'll close the ticket for now.
Looks like vcf support has been added to Scalpel recently: http://sourceforge.net/p/scalpel/code/ci/master/tree/
Opening this ticket while I'm looking into testing Scalpel and integrating it within bcbio, bear with me