PURPLE: purity & ploidy estimator support in bcbio

pdiakumis commented 6 years ago

Hi folks,

We've been trying out a few somatic CNV callers for a while now and we're quite happy with PURPLE (PUrity & PLoidy Estimator) by the Hartwig Medical Foundation. Some helpful links:

PURPLE GitHub repo: https://github.com/hartwigmedical/hmftools/tree/master/purity-ploidy-estimator
Preprint detailing PURPLE method (in section 7): https://www.biorxiv.org/content/early/2018/09/20/415133

There's a comprehensive description of the pipeline in the preprint. In summary:

Step 1: Calculate BAF in tumor using sambamba/samtools and AMBER
Step 2. Calculate Read Depth Ratios using COBALT
Step 3. Segmentation with bioconductor-copynumber
Step 4. Purity and Ploidy Fitting with PURPLE
Step 5. Smoothing with PURPLE

It also outputs some nice circos plots (via http://circos.ca/), summarising total/minor allele CN, BAF, RD ratios and optionally somatic SNV AFs + SV links.

The three main tools (AMBER, COBALT and PURPLE) are written in Java, and the whole pipeline generally takes around 1 hour for a typical tumor/normal sample, which is pretty impressive I believe (the most time-consuming step is for the pileup generation with sambamba/samtools).

We'd like to try and incorporate it into bcbio, but first it would be good to figure out if certain steps in the process can be substituted by other bcbio prep modules, instead of duplicating functionality. For instance, are there files generated from/for CNVkit that could be used instead of running the COBALT/AMBER steps? So just thought of starting this discussion here so we can hopefully move ahead with prioritising things to work on.

Kind regards - Peter

chapmanb commented 6 years ago

Peter; Thanks much for this. I know you've had good success with PURPLE and would be keen to get it automated within bcbio. I'm agreed that the main work is trying to avoid extra re-work on variant calling, coverage calculations and segmentation.

For AMBER, it seems like we should be able to extract these directly from a somatic caller than includes germline calls like VarDict, and then convert directly into the AMBER output format. We do this for TitanCNA inputs (https://github.com/bcbio/bcbio-nextgen/blob/6245de333e994da60a069269a84456600e2bce96/bcbio/structural/titancna.py#L126).

We could also use the pre-calculated CNVkit coverage bins and segmentation approach although that looks trickier to convert into the format PURPLE needs.

Do you have examples of PURPLE runs with the formats for the input files for PURPLE? That might help figure out how close we are and design a conversion approach.

If conversion from existing files worked, then the first pass would be just running PURPLE on these and comparing with the full pipeline run including AMBER and COBALT to check equivalency on some samples. They we could decide if we need to back into re-running those as part of PURPLE itself or could use the general calling, coverage and segmentation/normalization approaches.

How does that sound for an initial path forward? Thanks again for starting this discussion and helping with the validations.

ohofmann commented 6 years ago

If I remember correctly Amber does not use the variant calls as a starting point, but a list of likely heterozygous positions from the 1000G project. That was part of the reason that step is so fast - trying this with 3+ SNVs from the germline calls took a lot longer.

chapmanb commented 6 years ago

Peter and Oliver; Thanks much for this discussion. I've started to add some initial plumbing for PURPLE in bcbio so we can begin to generate and test the outputs. So far I've converted germline calls into AMBER format, which should work with a caller like vardict. For the next step I'd like to covert existing CNVkit binning and counts into COBALT. Are you able to share some example files from the cobalt directory in an existing run? I should be able to use this as a baseline to generate these.

I'd like to at least have basic support for running and then can move forward with testing and tuning like deciding how to handle the number of variant positions and how COBALT binning/segmentation/normalization compares to CNVkit.

Thanks again for the push to integrate and all the helpful pointers.

pdiakumis commented 6 years ago

Wow, thanks heaps Brad, you're already condarising and integrating stuff! I can point you to a COBALT/AMBER/PURPLE run on Spartan (will DM you). I was in the middle of expanding a bit on the previous points. I'll post this, just for reference - apologies about the length.

sambamba/AMBER

As Oliver mentioned, they feed sambamba a panel of ~800,000 common SNPs (16M unzipped BED) and get a pileup for both the somatic and normal BAMs. Command is (run for tumor + normal):

sambamba mpileup -t 6 \
    -L /path/to/CytoScanHD_hg19_SNPs_sorted.bed \
    /path/to/normal.bam \
    --samtools -q 1 -f /path/to/GRCh37.fa \
    > /path/to/mpileup/normal.mpileup

Output pileup looks like:

==> normal.mpileup <== (Note: have shortened columns)
1       882803  A       84      GGGGgGgGgggggGGgggggGgggGGGggg :FFFFFFFF8FFFFFFFFFF
1       888659  T       95      cCcCCccCCcCCcccccCCCCccccCCccc FFFFFFFFFFFFFFFFFFFF
1       903426  C       93      .,,,...,.,,.....,,..,...,...., FFFFFFFFFFFFFFFFFFFF
1       904355  C       86      ,$.,,.,.,..,,,,,,,,..,,,,.,.,. ;BBFBFFFFFFFFFFFFFFF
1       918270  C       107     .$,$...,,,.,,...,.,..,,.,.,... EEFFFFFFFFFFFFFFFFFF
1       918573  A       88      ,$g.GgGGGg.,gg,G..g.Ggg,g,,.g. CFFFFFFFFFFFFFkFFFFF

AMBER then takes the two pileups:

java -jar amber.jar \
  -sample tumor_name \
  -output_dir /run_dir/amber \
  -reference /path/to/mpileup/normal.mpileup \
  -tumor /path/to/mpileup/tumor.mpileup

Output:

==> amber.version <==
version=1.6
build.date=2018-08-08 04:28

==> tumor_name.amber.baf <==
Chromosome      Position        TumorBAF        TumorModifiedBAF        TumorDepth      NormalBAF       NormalModifiedBAF       NormalDepth
1       918573  0.5288  0.5288  104     0.5000  0.5000  88
1       959155  0.5969  0.5969  129     0.4375  0.5625  96
1       1130093 0.6180  0.6180  89      0.5763  0.5763  59
1       1131441 0.6316  0.6316  95      0.4881  0.5119  84
1       1759054 0.4719  0.5281  90      0.6000  0.6000  70

==> tumor_name.amber.baf.pcf <==
sampleID        chrom   arm     start.pos       end.pos n.probes        mean
TumorModifiedBAF        1       p       918573  121230790       7173    0.6012
TumorModifiedBAF        1       q       144957815       196843711       3427    0.5981
TumorModifiedBAF        1       q       196847411       224198354       1926    0.7524
TumorModifiedBAF        1       q       224240313       238118749       992     0.601
TumorModifiedBAF        1       q       238124366       249170711       1022    0.7485

==> tumor_name.amber.baf.pcf1 <==
chrom   pos     TumorModifiedBAF
1       918573  0.6012
1       959155  0.6012
1       1130093 0.6012
1       1131441 0.6012
1       1759054 0.6012

==> tumor_name.amber.qc <==
QCStatus        PASS
MeanBAF 0.4915

COBALT

COBALT runs independently from the previous step - it simply takes the BAMs as input, plus a GC content file (1kb bins).

Command run:

java -jar cobalt.jar \
  -reference normal_name \
  -reference_bam /path/to/normal.bam \
  -tumor tumor_name \
  -tumor_bam /path/to/tumor.bam \
  -output_dir /run_dir/cobalt \
  -threads 24 \
  -gc_profile /path/to/GC_profile.1000bp.cnp

Output:

==> normal_name.cobalt.gc.median <==
#SampleMean     SampleMedian
498     489
#GCBucket       Median
20      377
21      402
22      439

==> normal_name.cobalt.ratio.pcf <==
sampleID        chrom   arm     start.pos       end.pos n.probes        mean
S1      1       p       755001  2582001 838     -0.0033
S1      1       p       2583001 2583001 1       3.3491
S1      1       p       2695001 4998001 1661    0.0117
S1      1       p       4999001 4999001 1       1.382
S1      1       p       5000001 5726001 701     1e-04

==> normal_name.cobalt.ratio.pcf1 <==
chrom   pos     S1
1       755001  -0.0033
1       756001  -0.0033
1       757001  -0.0033
1       758001  -0.0033
1       805001  -0.0033

==> tumor_name.chr.len <==
1       249250621
2       243199373
3       198022430
4       191154276
5       180915260
6       171115067

==> tumor_name.cobalt <==
Chromosome      Position        ReferenceReadCount      TumorReadCount  ReferenceGCRatio        TumorGCRatio    ReferenceGCDiploidRatio
1       1       -1      -1      -1      -1      -1
1       9001    2       11      -1      -1      -1
1       10001   910     2163    -1      -1      -1
1       12001   245     376     -1      -1      -1
1       13001   804     1336    -1      -1      -1

==> tumor_name.cobalt.gc.median <==
#SampleMean     SampleMedian
963     937
#GCBucket       Median
20      810
21      832
22      901

==> tumor_name.cobalt.ratio.pcf <==
sampleID        chrom   arm     start.pos       end.pos n.probes        mean
S1      1       p       755001  905001  33      0.0537
S1      1       p       918001  1358001 173     -0.2833
S1      1       p       1359001 1952001 325     -0.498
S1      1       p       1954001 2582001 307     -0.2381
S1      1       p       2583001 2583001 1       2.2202

==> tumor_name.cobalt.ratio.pcf1 <==
chrom   pos     S1
1       755001  0.0537
1       756001  0.0537
1       757001  0.0537
1       758001  0.0537
1       805001  0.0537

==> cobalt.version <==
version=1.4
build.date=2018-04-10 04:31

PURPLE

PURPLE feeds off the AMBER/COBALT outputs:

Command:

java -jar purple.jar \
    -run_dir ${out_dir} \
    -output_dir ${out_dir}/purple \
    -ref_sample normal_name \
    -tumor_sample tumor_name \
    -threads 2 \
    -gc_profile /path/to/GC_profile.1000bp.cnp \
    -somatic_vcf /path/to/snvs.vcf \
    -structural_vcf /path/to/svs.vcf \
    -circos /path/to/circos

Output:

==> tumor_name.purple.cnv <==
#chromosome     start   end     copyNumber      bafCount        observedBAF     actualBAF       segmentStartSupport     segmentEndSupport       method
1       1       906000  3.76484848484848        0       0.0     0.79554164595544        TELOMERE        NONE    BAF_WEIGHTED
1       906001  1953000 2.1585153428312127      10      0.5754400000000001      0.6433869571528602      NONE    NONE    BAF_WEIGHTED
1       1953001 6421000 2.8629465974423733      274     0.6018700729927007      0.673614312727703       NONE    NONE    BAF_WEIGHTED
1       6421001 6844000 1.911575952197064       7       0.6374  0.7811557318526419      NONE    NONE    BAF_WEIGHTED
1       6844001 7310000 2.828429691551997       25      0.6078  0.6840260418365561      NONE    NONE    BAF_WEIGHTED

==> tumor_name.purple.fitted <==
#chromosome     start   end     germlineStatus  fittedPloidy    bafCount        observedBAF     modelBAF        bafDeviation    observedTumorRatio      observedNo
rmalRatio     modelTumorRatio cnvDeviation    deviation       tumorCopyNumber segmentTumorCopyNumber  segmentBAF      refNormalisedCopyNumber ratioSupport    supp
ort observedTumorRatioCount tumorBAF        gcContent       svCluster       ploidyPenalty
1       1       755000  UNKNOWN 1       0       0.0     0.0     0.0     0.0     0.0     0.5475000000000001      0.10950000000000003     0.0     0.0     3.76484848
484848        0.79554164595544        0.0     true    TELOMERE        0       0.0     0.0     false   2.0
1       755001  906000  DIPLOID 1       0       0.0     0.0     0.0     1.0520848484848482      1.0557484848484848      0.5475000000000001      0.1009169696969696
2     0.0     3.76484848484848        3.76484848484848        0.79554164595544        3.6533515151515106      true    NONE    33      0.0     0.604109756097561
       false   2.0
1       906001  1359000 DIPLOID 3       4       0.60745 0.6000000000000001      0.007449999999999957    0.8271450867052027      1.0046566473988439      0.91250000
00000005      0.017070982658959566    0.014895270916184963    2.5323018449600125      2.1585153428312127      0.6433869571528602      2.5229885501623253      true
    NONE    173     0.6923135009399498      0.6068381642512085      false   2.0
1       1359001 1953000 DIPLOID 2       6       0.5541  0.5300068289904385      0.024093171009561543    0.7134516923076926      0.9918015384615391      0.73000000
00000003      0.0033096615384615503   0.015183909514859598    1.9093243414120125      2.1585153428312127      0.6433869571528602      1.925721264488934       true
    NONE    325     0.6107692612948005      0.5407372093023256      false   1.0
1       1953001 2583000 DIPLOID 3       17      0.6048  0.6000000000000001      0.0047999999999999154   0.8558159609120523      1.0043159609120522      0.91250000
00000005      0.011336807817589657    0.009759541368078174    2.689402525545489       2.8629465974423733      0.673614312727703       2.680770603721385       true
    NONE    307     0.6827355258311459      0.5901624790619758      false   2.0

==> tumor_name.purple.gene.cnv <==
Chromosome      Start   End     Gene    MinCopyNumber   MaxCopyNumber   MeanCopyNumber  SomaticRegions  GermlineHomRegions      GermlineHet2HomRegions  TranscriptId    TranscriptVersion       ChromosomeBand  MinRegions      MinRegionStart  MinRegionEnd    MinRegionStartSupport   MinRegionEndSupport     MinRegionMethod NonsenseBiallelicCount  NonsenseNonBiallelicCount       NonsenseNonBiallelicPloidy      SpliceBiallelicCount    SpliceNonBiallelicCount SpliceNonBiallelicPloidy
        MissenseBiallelicCount  MissenseNonBiallelicCount       MissenseNonBiallelicPloidy      MinMinorAllelePloidy    ExonicBases
1       11869   14409   DDX11L1 3.76484848484848        3.76484848484848        0       1       0       0       ENST00000456328 2       p36.33  1       1       906000  TELOMERE        NONE    BAF_WEIGHTED    0       0       0.0     0       0       0.0     0       0       0.0     0.7697547244392758
1       14363   29370   WASH7P  3.76484848484848        3.76484848484848        0       1       0       0       ENST00000438504 2       p36.33  1       1       906000  TELOMERE        NONE    BAF_WEIGHTED    0       0       0.0     0       0       0.0     0       0       0.0     0.7697547244392758
1       29554   31097   MIR1302-10      3.76484848484848        3.76484848484848        0       1       0       0       ENST00000473358 1       p36.33  1       1
       906000  TELOMERE        NONE    BAF_WEIGHTED    0       0       0.0     0       0       0.0     0       0       0.0     0.7697547244392758
1       34554   36081   FAM138A 3.76484848484848        3.76484848484848        0       1       0       0       ENST00000417324 1       p36.33  1       1       906000  TELOMERE        NONE    BAF_WEIGHTED    0       0       0.0     0       0       0.0     0       0       0.0     0.7697547244392758
1       69091   70008   OR4F5   3.76484848484848        3.76484848484848        0       1       0       0       ENST00000335137 3       p36.33  1       1       906000  TELOMERE        NONE    BAF_WEIGHTED    0       0       0.0     0       0       0.0     0       0       0.0     0.7697547244392758

==> tumor_name.purple.germline.cnv <==
#chromosome     start   end     copyNumber      bafCount        observedBAF     actualBAF       segmentStartSupport     segmentEndSupport       method
1       24257001        24259000        0.04812876712328765     0       0.0     0.6643998708939189      NONE    UNKNOWN GERMLINE_HOM_DELETION
1       24259001        24261000        0.46649013976232906     0       0.0     0.6643998708939189      NONE    UNKNOWN GERMLINE_HET2HOM_DELETION
1       24261001        24264000        0.00147945205479452     0       0.0     0.6643998708939189      NONE    UNKNOWN GERMLINE_HOM_DELETION
1       24265001        24363000        0.02100503275759379     0       0.0     0.6643998708939186      NONE    UNKNOWN GERMLINE_HOM_DELETION
1       109573001       109574000       0.49815834126557074     0       0.0     0.6813939584040009      NONE    UNKNOWN GERMLINE_HET2HOM_DELETION

==> tumor_name.purple.purity <==
#Purity NormFactor      Score   DiploidProportion       Ploidy  Gender  Status  PolyclonalProportion    MinPurity       MaxPurity       MinPloidy       MaxPloidy
       MinDiploidProportion    MaxDiploidProportion    Version
0.5000000000000002      0.7300000000000003      0.03164964627984315     0.18083207615324154     3.570137536019887       MALE    NORMAL  0.3594812925170068      0.47000000000000025     0.5200000000000002      3.539350525509694       3.6021811592039565      0.17203535686207877     0.18452393600068023     2.14

==> tumor_name.purple.purity.range <==
#Purity NormFactor      Score   DiploidProportion       Ploidy
0.5000000000000002      0.7300000000000003      0.03164964627984315     0.18083207615324154     3.570137536019887
0.49000000000000027     0.7300000000000003      0.03197506803144911     0.17641246839343022     3.6021811592039565
0.5100000000000002      0.7300000000000003      0.032370958658219706    0.18452393600068023     3.539350525509694
0.48000000000000026     0.7400000000000003      0.03304675959015634     0.1800193357839494      3.557151465786201
0.5200000000000002      0.7200000000000003      0.03381291805949315     0.18245224486326855     3.584135151301441

==> tumor_name.purple.qc <==
QCStatus        FAIL_SEGMENT
SegmentPass     false
GenderPass      true
DeletedGenesPass        true
SegmentScore    852
UnsupportedSegments     3043

==> purple.version <==
version=2.14
build.date=2018-05-11 04:20

Also outputs circos and plot directories, with the circos templates/inputs and generated plots, respectively.

circos/:
normal_name.ratio.circos
tumor_name.baf.circos
tumor_name.circos.conf
tumor_name.circos.error
tumor_name.circos.out
tumor_name.cnv.circos
tumor_name.indel.circos
tumor_name.input.conf
tumor_name.input.error
tumor_name.input.out
tumor_name.link.circos
tumor_name.map.circos
tumor_name.ratio.circos
tumor_name.snp.circos
gaps.txt
ideogram.conf
ticks.conf

plot/:
tumor_name.circos.png
tumor_name.copyNumber.png
tumor_name.input.png
tumor_name.minor_allele.png
tumor_name.variant.png

lbeltrame commented 6 years ago

Very interesting. I'll try to run a few tests with my samples once the plumbing is there.

ohofmann commented 5 years ago

Luca,

keen to get your thoughts. In our hands Purple seems to consistently do better than cnvkit, Facets and others (but is limited to WGS tumor/normal, and does seem to struggle once cellularity hits 20%). It also did surprisingly well with FFPE samples even though we try to avoid them.

chapmanb commented 5 years ago

Peter, Oliver and Luca; Thanks much for the discussion. I did some additional exploration of PURPLE and unfortunately we won't be able to use CNVkit read count inputs. PURPLE requires 1000bp even bins across the genome, so the flexible target/anti-target approach CNVkit uses won't work. I'll work on integrating COBALT directly instead. My hope was to be able to compare different binning/counting/normalization approaches and plug in and out with the actual ploidy/purity/SV calls, but it doesn't look like that can happen right now with the current state of PURPLE. If we find COBALT/PURPLE useful, we could generalize it as an alternative input to CNVkit and do the comparison that way, so some longer term work to do. The COBALT normalizing is unique in terms of how it uses GC bins, so would be interesting to see what this contributes versus the CNVkit approach.

I wanted to keep you up to date on the current status and will keep pushing on this. Thanks again for all the feedback and discussion.

roryk commented 5 years ago

I think we stalled on support for this, am I right about that? Is this still something that would be good to add support for?

pdiakumis commented 5 years ago

Thank you for all the incredible work supporting this. I think we're good to close this for now. The Hartwig folks have improved the speed of the pre-processing steps significantly since we first looked at this, which allows us to run the newer version within half an hour or so in our downstream pipeline. But we might revisit the whole integration if we want to use the newer GRIDSS version which is believed to help refine the PURPLE calls. Will keep you posted.

roryk commented 5 years ago

Thank you for following up!

audyavar commented 4 years ago

Hi Brad and everyone -

Is there a purity caller like ABSOLUTE or others mentioned here part of the BCBIO default pipeline in v1.6 or v1.9?

chapmanb commented 4 years ago

Thanks for the question. bcbio has some preliminary work on purity calling but nothing productionized that is used by default. If you're looking for a finished implementation, unfortunately that is not available. If you're interested in developing on this there is some support for preparing for different purity methods and we'd welcome work to help validate and improve it. Hope this helps.

bcbio / bcbio-nextgen