Closed pdiakumis closed 5 years ago
Peter; Thanks much for this. I know you've had good success with PURPLE and would be keen to get it automated within bcbio. I'm agreed that the main work is trying to avoid extra re-work on variant calling, coverage calculations and segmentation.
For AMBER, it seems like we should be able to extract these directly from a somatic caller than includes germline calls like VarDict, and then convert directly into the AMBER output format. We do this for TitanCNA inputs (https://github.com/bcbio/bcbio-nextgen/blob/6245de333e994da60a069269a84456600e2bce96/bcbio/structural/titancna.py#L126).
We could also use the pre-calculated CNVkit coverage bins and segmentation approach although that looks trickier to convert into the format PURPLE needs.
Do you have examples of PURPLE runs with the formats for the input files for PURPLE? That might help figure out how close we are and design a conversion approach.
If conversion from existing files worked, then the first pass would be just running PURPLE on these and comparing with the full pipeline run including AMBER and COBALT to check equivalency on some samples. They we could decide if we need to back into re-running those as part of PURPLE itself or could use the general calling, coverage and segmentation/normalization approaches.
How does that sound for an initial path forward? Thanks again for starting this discussion and helping with the validations.
If I remember correctly Amber does not use the variant calls as a starting point, but a list of likely heterozygous positions from the 1000G project. That was part of the reason that step is so fast - trying this with 3+ SNVs from the germline calls took a lot longer.
Peter and Oliver; Thanks much for this discussion. I've started to add some initial plumbing for PURPLE in bcbio so we can begin to generate and test the outputs. So far I've converted germline calls into AMBER format, which should work with a caller like vardict. For the next step I'd like to covert existing CNVkit binning and counts into COBALT. Are you able to share some example files from the cobalt directory in an existing run? I should be able to use this as a baseline to generate these.
I'd like to at least have basic support for running and then can move forward with testing and tuning like deciding how to handle the number of variant positions and how COBALT binning/segmentation/normalization compares to CNVkit.
Thanks again for the push to integrate and all the helpful pointers.
Wow, thanks heaps Brad, you're already condarising and integrating stuff! I can point you to a COBALT/AMBER/PURPLE run on Spartan (will DM you). I was in the middle of expanding a bit on the previous points. I'll post this, just for reference - apologies about the length.
As Oliver mentioned, they feed sambamba a panel of ~800,000 common SNPs (16M unzipped BED) and get a pileup for both the somatic and normal BAMs. Command is (run for tumor + normal):
sambamba mpileup -t 6 \
-L /path/to/CytoScanHD_hg19_SNPs_sorted.bed \
/path/to/normal.bam \
--samtools -q 1 -f /path/to/GRCh37.fa \
> /path/to/mpileup/normal.mpileup
==> normal.mpileup <== (Note: have shortened columns)
1 882803 A 84 GGGGgGgGgggggGGgggggGgggGGGggg :FFFFFFFF8FFFFFFFFFF
1 888659 T 95 cCcCCccCCcCCcccccCCCCccccCCccc FFFFFFFFFFFFFFFFFFFF
1 903426 C 93 .,,,...,.,,.....,,..,...,...., FFFFFFFFFFFFFFFFFFFF
1 904355 C 86 ,$.,,.,.,..,,,,,,,,..,,,,.,.,. ;BBFBFFFFFFFFFFFFFFF
1 918270 C 107 .$,$...,,,.,,...,.,..,,.,.,... EEFFFFFFFFFFFFFFFFFF
1 918573 A 88 ,$g.GgGGGg.,gg,G..g.Ggg,g,,.g. CFFFFFFFFFFFFFkFFFFF
java -jar amber.jar \
-sample tumor_name \
-output_dir /run_dir/amber \
-reference /path/to/mpileup/normal.mpileup \
-tumor /path/to/mpileup/tumor.mpileup
==> amber.version <==
version=1.6
build.date=2018-08-08 04:28
==> tumor_name.amber.baf <==
Chromosome Position TumorBAF TumorModifiedBAF TumorDepth NormalBAF NormalModifiedBAF NormalDepth
1 918573 0.5288 0.5288 104 0.5000 0.5000 88
1 959155 0.5969 0.5969 129 0.4375 0.5625 96
1 1130093 0.6180 0.6180 89 0.5763 0.5763 59
1 1131441 0.6316 0.6316 95 0.4881 0.5119 84
1 1759054 0.4719 0.5281 90 0.6000 0.6000 70
==> tumor_name.amber.baf.pcf <==
sampleID chrom arm start.pos end.pos n.probes mean
TumorModifiedBAF 1 p 918573 121230790 7173 0.6012
TumorModifiedBAF 1 q 144957815 196843711 3427 0.5981
TumorModifiedBAF 1 q 196847411 224198354 1926 0.7524
TumorModifiedBAF 1 q 224240313 238118749 992 0.601
TumorModifiedBAF 1 q 238124366 249170711 1022 0.7485
==> tumor_name.amber.baf.pcf1 <==
chrom pos TumorModifiedBAF
1 918573 0.6012
1 959155 0.6012
1 1130093 0.6012
1 1131441 0.6012
1 1759054 0.6012
==> tumor_name.amber.qc <==
QCStatus PASS
MeanBAF 0.4915
COBALT runs independently from the previous step - it simply takes the BAMs as input, plus a GC content file (1kb bins).
java -jar cobalt.jar \
-reference normal_name \
-reference_bam /path/to/normal.bam \
-tumor tumor_name \
-tumor_bam /path/to/tumor.bam \
-output_dir /run_dir/cobalt \
-threads 24 \
-gc_profile /path/to/GC_profile.1000bp.cnp
==> normal_name.cobalt.gc.median <==
#SampleMean SampleMedian
498 489
#GCBucket Median
20 377
21 402
22 439
==> normal_name.cobalt.ratio.pcf <==
sampleID chrom arm start.pos end.pos n.probes mean
S1 1 p 755001 2582001 838 -0.0033
S1 1 p 2583001 2583001 1 3.3491
S1 1 p 2695001 4998001 1661 0.0117
S1 1 p 4999001 4999001 1 1.382
S1 1 p 5000001 5726001 701 1e-04
==> normal_name.cobalt.ratio.pcf1 <==
chrom pos S1
1 755001 -0.0033
1 756001 -0.0033
1 757001 -0.0033
1 758001 -0.0033
1 805001 -0.0033
==> tumor_name.chr.len <==
1 249250621
2 243199373
3 198022430
4 191154276
5 180915260
6 171115067
==> tumor_name.cobalt <==
Chromosome Position ReferenceReadCount TumorReadCount ReferenceGCRatio TumorGCRatio ReferenceGCDiploidRatio
1 1 -1 -1 -1 -1 -1
1 9001 2 11 -1 -1 -1
1 10001 910 2163 -1 -1 -1
1 12001 245 376 -1 -1 -1
1 13001 804 1336 -1 -1 -1
==> tumor_name.cobalt.gc.median <==
#SampleMean SampleMedian
963 937
#GCBucket Median
20 810
21 832
22 901
==> tumor_name.cobalt.ratio.pcf <==
sampleID chrom arm start.pos end.pos n.probes mean
S1 1 p 755001 905001 33 0.0537
S1 1 p 918001 1358001 173 -0.2833
S1 1 p 1359001 1952001 325 -0.498
S1 1 p 1954001 2582001 307 -0.2381
S1 1 p 2583001 2583001 1 2.2202
==> tumor_name.cobalt.ratio.pcf1 <==
chrom pos S1
1 755001 0.0537
1 756001 0.0537
1 757001 0.0537
1 758001 0.0537
1 805001 0.0537
==> cobalt.version <==
version=1.4
build.date=2018-04-10 04:31
PURPLE feeds off the AMBER/COBALT outputs:
java -jar purple.jar \
-run_dir ${out_dir} \
-output_dir ${out_dir}/purple \
-ref_sample normal_name \
-tumor_sample tumor_name \
-threads 2 \
-gc_profile /path/to/GC_profile.1000bp.cnp \
-somatic_vcf /path/to/snvs.vcf \
-structural_vcf /path/to/svs.vcf \
-circos /path/to/circos
==> tumor_name.purple.cnv <==
#chromosome start end copyNumber bafCount observedBAF actualBAF segmentStartSupport segmentEndSupport method
1 1 906000 3.76484848484848 0 0.0 0.79554164595544 TELOMERE NONE BAF_WEIGHTED
1 906001 1953000 2.1585153428312127 10 0.5754400000000001 0.6433869571528602 NONE NONE BAF_WEIGHTED
1 1953001 6421000 2.8629465974423733 274 0.6018700729927007 0.673614312727703 NONE NONE BAF_WEIGHTED
1 6421001 6844000 1.911575952197064 7 0.6374 0.7811557318526419 NONE NONE BAF_WEIGHTED
1 6844001 7310000 2.828429691551997 25 0.6078 0.6840260418365561 NONE NONE BAF_WEIGHTED
==> tumor_name.purple.fitted <==
#chromosome start end germlineStatus fittedPloidy bafCount observedBAF modelBAF bafDeviation observedTumorRatio observedNo
rmalRatio modelTumorRatio cnvDeviation deviation tumorCopyNumber segmentTumorCopyNumber segmentBAF refNormalisedCopyNumber ratioSupport supp
ort observedTumorRatioCount tumorBAF gcContent svCluster ploidyPenalty
1 1 755000 UNKNOWN 1 0 0.0 0.0 0.0 0.0 0.0 0.5475000000000001 0.10950000000000003 0.0 0.0 3.76484848
484848 0.79554164595544 0.0 true TELOMERE 0 0.0 0.0 false 2.0
1 755001 906000 DIPLOID 1 0 0.0 0.0 0.0 1.0520848484848482 1.0557484848484848 0.5475000000000001 0.1009169696969696
2 0.0 3.76484848484848 3.76484848484848 0.79554164595544 3.6533515151515106 true NONE 33 0.0 0.604109756097561
false 2.0
1 906001 1359000 DIPLOID 3 4 0.60745 0.6000000000000001 0.007449999999999957 0.8271450867052027 1.0046566473988439 0.91250000
00000005 0.017070982658959566 0.014895270916184963 2.5323018449600125 2.1585153428312127 0.6433869571528602 2.5229885501623253 true
NONE 173 0.6923135009399498 0.6068381642512085 false 2.0
1 1359001 1953000 DIPLOID 2 6 0.5541 0.5300068289904385 0.024093171009561543 0.7134516923076926 0.9918015384615391 0.73000000
00000003 0.0033096615384615503 0.015183909514859598 1.9093243414120125 2.1585153428312127 0.6433869571528602 1.925721264488934 true
NONE 325 0.6107692612948005 0.5407372093023256 false 1.0
1 1953001 2583000 DIPLOID 3 17 0.6048 0.6000000000000001 0.0047999999999999154 0.8558159609120523 1.0043159609120522 0.91250000
00000005 0.011336807817589657 0.009759541368078174 2.689402525545489 2.8629465974423733 0.673614312727703 2.680770603721385 true
NONE 307 0.6827355258311459 0.5901624790619758 false 2.0
==> tumor_name.purple.gene.cnv <==
Chromosome Start End Gene MinCopyNumber MaxCopyNumber MeanCopyNumber SomaticRegions GermlineHomRegions GermlineHet2HomRegions TranscriptId TranscriptVersion ChromosomeBand MinRegions MinRegionStart MinRegionEnd MinRegionStartSupport MinRegionEndSupport MinRegionMethod NonsenseBiallelicCount NonsenseNonBiallelicCount NonsenseNonBiallelicPloidy SpliceBiallelicCount SpliceNonBiallelicCount SpliceNonBiallelicPloidy
MissenseBiallelicCount MissenseNonBiallelicCount MissenseNonBiallelicPloidy MinMinorAllelePloidy ExonicBases
1 11869 14409 DDX11L1 3.76484848484848 3.76484848484848 0 1 0 0 ENST00000456328 2 p36.33 1 1 906000 TELOMERE NONE BAF_WEIGHTED 0 0 0.0 0 0 0.0 0 0 0.0 0.7697547244392758
1 14363 29370 WASH7P 3.76484848484848 3.76484848484848 0 1 0 0 ENST00000438504 2 p36.33 1 1 906000 TELOMERE NONE BAF_WEIGHTED 0 0 0.0 0 0 0.0 0 0 0.0 0.7697547244392758
1 29554 31097 MIR1302-10 3.76484848484848 3.76484848484848 0 1 0 0 ENST00000473358 1 p36.33 1 1
906000 TELOMERE NONE BAF_WEIGHTED 0 0 0.0 0 0 0.0 0 0 0.0 0.7697547244392758
1 34554 36081 FAM138A 3.76484848484848 3.76484848484848 0 1 0 0 ENST00000417324 1 p36.33 1 1 906000 TELOMERE NONE BAF_WEIGHTED 0 0 0.0 0 0 0.0 0 0 0.0 0.7697547244392758
1 69091 70008 OR4F5 3.76484848484848 3.76484848484848 0 1 0 0 ENST00000335137 3 p36.33 1 1 906000 TELOMERE NONE BAF_WEIGHTED 0 0 0.0 0 0 0.0 0 0 0.0 0.7697547244392758
==> tumor_name.purple.germline.cnv <==
#chromosome start end copyNumber bafCount observedBAF actualBAF segmentStartSupport segmentEndSupport method
1 24257001 24259000 0.04812876712328765 0 0.0 0.6643998708939189 NONE UNKNOWN GERMLINE_HOM_DELETION
1 24259001 24261000 0.46649013976232906 0 0.0 0.6643998708939189 NONE UNKNOWN GERMLINE_HET2HOM_DELETION
1 24261001 24264000 0.00147945205479452 0 0.0 0.6643998708939189 NONE UNKNOWN GERMLINE_HOM_DELETION
1 24265001 24363000 0.02100503275759379 0 0.0 0.6643998708939186 NONE UNKNOWN GERMLINE_HOM_DELETION
1 109573001 109574000 0.49815834126557074 0 0.0 0.6813939584040009 NONE UNKNOWN GERMLINE_HET2HOM_DELETION
==> tumor_name.purple.purity <==
#Purity NormFactor Score DiploidProportion Ploidy Gender Status PolyclonalProportion MinPurity MaxPurity MinPloidy MaxPloidy
MinDiploidProportion MaxDiploidProportion Version
0.5000000000000002 0.7300000000000003 0.03164964627984315 0.18083207615324154 3.570137536019887 MALE NORMAL 0.3594812925170068 0.47000000000000025 0.5200000000000002 3.539350525509694 3.6021811592039565 0.17203535686207877 0.18452393600068023 2.14
==> tumor_name.purple.purity.range <==
#Purity NormFactor Score DiploidProportion Ploidy
0.5000000000000002 0.7300000000000003 0.03164964627984315 0.18083207615324154 3.570137536019887
0.49000000000000027 0.7300000000000003 0.03197506803144911 0.17641246839343022 3.6021811592039565
0.5100000000000002 0.7300000000000003 0.032370958658219706 0.18452393600068023 3.539350525509694
0.48000000000000026 0.7400000000000003 0.03304675959015634 0.1800193357839494 3.557151465786201
0.5200000000000002 0.7200000000000003 0.03381291805949315 0.18245224486326855 3.584135151301441
==> tumor_name.purple.qc <==
QCStatus FAIL_SEGMENT
SegmentPass false
GenderPass true
DeletedGenesPass true
SegmentScore 852
UnsupportedSegments 3043
==> purple.version <==
version=2.14
build.date=2018-05-11 04:20
Also outputs circos
and plot
directories, with the circos templates/inputs and generated plots, respectively.
circos/:
normal_name.ratio.circos
tumor_name.baf.circos
tumor_name.circos.conf
tumor_name.circos.error
tumor_name.circos.out
tumor_name.cnv.circos
tumor_name.indel.circos
tumor_name.input.conf
tumor_name.input.error
tumor_name.input.out
tumor_name.link.circos
tumor_name.map.circos
tumor_name.ratio.circos
tumor_name.snp.circos
gaps.txt
ideogram.conf
ticks.conf
plot/:
tumor_name.circos.png
tumor_name.copyNumber.png
tumor_name.input.png
tumor_name.minor_allele.png
tumor_name.variant.png
Very interesting. I'll try to run a few tests with my samples once the plumbing is there.
Luca,
keen to get your thoughts. In our hands Purple seems to consistently do better than cnvkit, Facets and others (but is limited to WGS tumor/normal, and does seem to struggle once cellularity hits 20%). It also did surprisingly well with FFPE samples even though we try to avoid them.
Peter, Oliver and Luca; Thanks much for the discussion. I did some additional exploration of PURPLE and unfortunately we won't be able to use CNVkit read count inputs. PURPLE requires 1000bp even bins across the genome, so the flexible target/anti-target approach CNVkit uses won't work. I'll work on integrating COBALT directly instead. My hope was to be able to compare different binning/counting/normalization approaches and plug in and out with the actual ploidy/purity/SV calls, but it doesn't look like that can happen right now with the current state of PURPLE. If we find COBALT/PURPLE useful, we could generalize it as an alternative input to CNVkit and do the comparison that way, so some longer term work to do. The COBALT normalizing is unique in terms of how it uses GC bins, so would be interesting to see what this contributes versus the CNVkit approach.
I wanted to keep you up to date on the current status and will keep pushing on this. Thanks again for all the feedback and discussion.
I think we stalled on support for this, am I right about that? Is this still something that would be good to add support for?
Thank you for all the incredible work supporting this. I think we're good to close this for now. The Hartwig folks have improved the speed of the pre-processing steps significantly since we first looked at this, which allows us to run the newer version within half an hour or so in our downstream pipeline. But we might revisit the whole integration if we want to use the newer GRIDSS version which is believed to help refine the PURPLE calls. Will keep you posted.
Thank you for following up!
Hi Brad and everyone -
Is there a purity caller like ABSOLUTE or others mentioned here part of the BCBIO default pipeline in v1.6 or v1.9?
Thanks for the question. bcbio has some preliminary work on purity calling but nothing productionized that is used by default. If you're looking for a finished implementation, unfortunately that is not available. If you're interested in developing on this there is some support for preparing for different purity methods and we'd welcome work to help validate and improve it. Hope this helps.
Hi folks,
We've been trying out a few somatic CNV callers for a while now and we're quite happy with PURPLE (PUrity & PLoidy Estimator) by the Hartwig Medical Foundation. Some helpful links:
There's a comprehensive description of the pipeline in the preprint. In summary:
sambamba/samtools
and AMBERIt also outputs some nice circos plots (via http://circos.ca/), summarising total/minor allele CN, BAF, RD ratios and optionally somatic SNV AFs + SV links.
The three main tools (AMBER, COBALT and PURPLE) are written in Java, and the whole pipeline generally takes around 1 hour for a typical tumor/normal sample, which is pretty impressive I believe (the most time-consuming step is for the pileup generation with sambamba/samtools).
We'd like to try and incorporate it into bcbio, but first it would be good to figure out if certain steps in the process can be substituted by other bcbio prep modules, instead of duplicating functionality. For instance, are there files generated from/for CNVkit that could be used instead of running the COBALT/AMBER steps? So just thought of starting this discussion here so we can hopefully move ahead with prioritising things to work on.
Kind regards - Peter