PlantandFoodResearch / MCHap

Polyploid micro-haplotype assembly using Markov chain Monte Carlo simulation.
MIT License

Atomize VCF script #72

Closed timothymillar closed 2 months ago

timothymillar commented 3 years ago

See #18 for a description of wide vs. long format. Currently, the assemble program outputs wide-format VCF files, i.e. each line contains a full haplotype block. This is the most suitable output for the tool, as it gives posterior probabilities etc. for full haplotypes.

Long format VCF files (phased SNPs) would be useful and these can be generated by "atomizing" the haplotypes in the wide format VCF. This process will likely result in removal of some information relating to the full haplotype.
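To make the idea concrete, here is a minimal sketch of what "atomizing" one wide-format record could look like. It is not MCHap's implementation: the function name `atomize`, the tuple layout of the output, and the choice to use the block start position as the phase set (PS) are all assumptions for illustration. It also assumes every haplotype allele in the record has the same length (no indels), so each column of the block maps to one reference position.

```python
def atomize(pos, ref, alts, genotype):
    """Split one wide-format haplotype record into per-SNP phased records.

    Hypothetical helper, not part of MCHap.
    pos      -- 1-based position of the first base of the haplotype block
    ref      -- reference haplotype string
    alts     -- list of alternate haplotype strings (same length as ref)
    genotype -- list of allele indices into [ref, *alts]; None means missing
    Returns a list of (pos, ref_base, alt_bases, gt_string, phase_set).
    """
    haplotypes = [ref, *alts]
    if any(len(h) != len(ref) for h in haplotypes):
        raise ValueError("all haplotype alleles must have the same length")

    records = []
    for offset in range(len(ref)):
        # Collect the distinct bases in this column, reference base first.
        bases = []
        for hap in haplotypes:
            if hap[offset] not in bases:
                bases.append(hap[offset])
        if len(bases) == 1:
            continue  # not variable within this block, so no SNP record

        # Re-index each called haplotype by the base it carries here.
        calls = [
            None if allele is None else bases.index(haplotypes[allele][offset])
            for allele in genotype
        ]
        gt = "|".join("." if c is None else str(c) for c in calls)

        # Assumption: the block start position serves as the phase set ID.
        records.append((pos + offset, bases[0], bases[1:], gt, pos))
    return records


# A tetraploid call carrying haplotypes 0, 1, 1 and 2 of a 4 bp block.
for record in atomize(100, "ACGT", ["ATGA", "GCGT"], [0, 1, 1, 2]):
    print(record)
```

Haplotype-level fields such as posterior probabilities have no obvious per-SNP equivalent in this sketch, which is the information loss mentioned above.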

timothymillar commented 3 years ago

This can be more or less achieved with vcfallelicprimitives from vcflib. It drops a lot of metadata and doesn't make use of the PS tag, but it's probably a better option than writing another tool.

timothymillar commented 2 years ago

Reopening this as vcfallelicprimitives is a bit limited. It would be good to convert more of the metadata from wide to long format.

Some things are more difficult to carry over, like the MCMC QC metrics. It would also be nice if we could somehow carry over variant depths, but that would require storing an array of SNP depths in the original output, which would need to be optional if included at all. Perhaps there could be a general --snp-metrics option to include arrays of more detailed output on single SNP positions?

timothymillar commented 2 months ago

Done in v0.10.0