etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
545 stars 165 forks source link

Export Sequenza "seqz" format #204

Open etal opened 7 years ago

etal commented 7 years ago

Ahead of supporting Sequenza clonality estimation (#200) and segmentation (#198), let's just try exporting a .cnr and VCF file to Sequenza's input format, which is a gzipped TSV file with 14 columns:

See the sequenza-utils project for how to generate this.

Start as a separate script cnr2seqz.py for testing, then maybe include as a sub-command export seqz if it works well. (It's not guaranteed to work well on target panel data with uneven bin sizes.)

seedgeorge commented 6 years ago

Any luck with this? Was playing with GATK4 and feeding different .vcfs to CNVkit recently and was wondering what the best way to link up with sequenza would be.

etal commented 6 years ago

Ah, no action here yet. If you can get this working I'd be interested to hear how you did it.

For a starting point, consider cnvkit.py export nexus-ogt or cnvkit.py call Sample.cnr -v Sample.snv.vcf -m none ... to align SNP b-allele frequencies to .cnr bins, and then use R or pandas to align the .cnr file with the original reference.cnn to get GC percentages for each bin. Not sure about the rest, or which columns are optional.

For the GATK4 pipeline, you might see some trouble with using Mutect2 VCFs with CNVkit (or any other tumor clonality estimation tool), since it filters out germline variants automatically. As an alternative, here's a dodgy script to force-call tumor allele frequencies at predefined SNP sites, e.g. dbSNP, 1000 Genomes, or gnomAD: https://github.com/etal/cnvkit/blob/master/scripts/snpfilter.sh

seedgeorge commented 6 years ago

Ah, well I've run into the Mutect2 limitation - my solution (so far) is to run Haplotypecaller with a) both tumour and normal .bams and b) in genotype mode with 1000genomes high quality snps. This forces the software to output allele frequencies in both samples at all snp positions, even if haplotypecaller wouldn't normally call a heterozygous variant due to weird frequency in the tumour sample. (I ignored all the gvcf/variant database shenanigans for this purpose). Nice work on putting in tumour/normal parameters for multi-sample vcfs into cnvkit - works great, detects the relevant (heterozygous) variants from the normal sample and then calculates+uses the AF's from the tumour sample.

I'll try to hack together something for .seqz files, will take a look at how varscan2 snp/copynumber output is parsed into it I think.