fedarko / strainFlye

Pipeline for analyzing (rare) mutations in metagenome-assembled genomes
BSD 3-Clause "New" or "Revised" License
8 stars 1 forks source link

Make the naive calling methods output (indexed) BCF, instead of VCF #25

Closed fedarko closed 2 years ago

fedarko commented 2 years ago

I was hoping to avoid doing this for now, but at the very least it looks like we'll need to index the VCF in order to iterate over the calls for a certain specified contig (in downstream steps). And in order to index the VCF, we need to first bgzip it, so at that point we may as well go all in.

There are two main ways of doing this that I see:

  1. Everything stays mostly the same, but we just take the VCF output and convert it after the fact to an indexed BCF file.

    • As I understand it, the pipeline goes something like strainFlye -> vcf output -> bgzip -> vcf.gz output -> bcftools view -Ob -> bcf output -> bcftools index -> bcf and bcf.csi files we can load easily in pysam later.
  2. Rather than outputting a VCF file in the first place, we directly create a BCF file from Python using pysam. We can use the same header as for the VCF file, but rather than writing out text records we can write out actual records to the BCF using pysam's VariantFile.new_record() command (see here).

The second of these options will probably be faster, and requires less temporary storage. It will be a bit harder to implement, but I think it's feasible.

... Although, looking into it a bit, the API here seems subject to change -- see https://pysam.readthedocs.io/en/latest/api.html#pysam.VariantHeader.new_record. Maybe let's stick with option 1 for now, and make "direct BCF output" a nice-to-have feature.