PlantandFoodResearch / MCHap

Polyploid micro-haplotype assembly using Markov chain Monte Carlo simulation.
MIT License

Indicate when the reference allele is reported only as a requirement of the VCF format #146

Closed timothymillar closed 2 years ago

timothymillar commented 2 years ago

In mchap assemble, if the reference allele is not assembled for any sample, it is still reported in the output because the VCF format requires it. This can have unexpected downstream side effects. For example, if those samples are then re-called using mchap call, the reference allele will be used as a valid haplotype. This can even lead to situations where the reference allele is the only input allele, producing genotypes that are homozygous for that allele even though there is no evidence of it being present in any sample.

A solution to this problem would be for mchap assemble to report, in an INFO field or flag, when the reference allele is included only as a requirement of the VCF format rather than actually being observed. It could then be excluded in downstream analysis. This may leave no input haplotypes for the downstream analysis, in which case all haplotypes would be reported as unknown.
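
As a rough illustration of the idea (not MCHap's actual implementation), a downstream tool could drop the reference allele whenever the record carries such a flag. The flag name `REF_NOT_OBSERVED` and the helper functions are hypothetical placeholders, and the record is parsed as plain tab-separated text rather than with a VCF library:

```python
def parse_info(info_field):
    """Parse a VCF INFO column into a dict; bare flags map to True."""
    if info_field in (".", ""):
        return {}
    entries = {}
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        entries[key] = value if sep else True
    return entries


def input_haplotypes(vcf_line, ref_flag="REF_NOT_OBSERVED"):
    """Return the alleles of one VCF record to use as input haplotypes.

    `ref_flag` is a hypothetical INFO flag meaning "REF is present only
    because the VCF format requires it"; the real name may differ.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    info = parse_info(fields[7])
    alts = [] if alt == "." else alt.split(",")
    if ref_flag in info:
        # Reference allele was never observed, so exclude it.
        return alts
    return [ref] + alts
```

If the flag is set and there are no alternate alleles, the function returns an empty list, which corresponds to the "no input haplotypes" case described above.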

timothymillar commented 2 years ago

Related to #145

timothymillar commented 2 years ago

The clearest way to signpost a failure to assemble any haplotypes at a locus may be to just use the FILTER column and skip filtered loci in downstream analysis by default.
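
A minimal sketch of what "skip filtered loci by default" could look like downstream, assuming plain tab-separated VCF text and treating only `PASS` or `.` in the FILTER column as unfiltered. The function name is hypothetical and this is not taken from the MCHap code base:

```python
def unfiltered_records(lines):
    """Yield the fields of VCF records whose FILTER column is PASS or '.'."""
    for line in lines:
        if line.startswith("#"):
            # Skip meta-information and header lines.
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[6] in ("PASS", "."):
            yield fields
```

Any record with some other value in the FILTER column, such as a filter code marking loci where no haplotypes were assembled, is skipped.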

timothymillar commented 2 years ago

Done in #147