PlantandFoodResearch / MCHap

Polyploid micro-haplotype assembly using Markov chain Monte Carlo simulation.
MIT License

Drop zero-count alleles #113

Closed timothymillar closed 2 years ago

timothymillar commented 3 years ago

MCHap sub-programs can report alleles with a count of 0 across all samples if the allele was identified by the assemble sub-program or supplied as input to the call programs. Reporting these alleles in the output VCF is useful because they are part of the domain of the posterior distribution even if they are not called in any sample. It also makes it easy to compare results between the assemble program and subsequent runs of the call programs. However, removing these alleles from the input of subsequent programs would be an effective way of reducing the parameter space with a low risk of biasing the results. This could be added as an optional argument for filtering the input to each sub-program, e.g. --drop-zero-count-alleles.
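A minimal sketch of the proposed filter, assuming the alleles at a locus and their counts across all samples are available as two parallel lists. The function name and the data layout are hypothetical and are not taken from the MCHap code base.

```python
from typing import List, Sequence, Tuple


def drop_zero_count_alleles(
    alleles: Sequence[str], counts: Sequence[int]
) -> Tuple[List[str], List[int]]:
    """Keep only alleles observed at least once across all samples.

    Hypothetical helper for illustration; not part of MCHap.
    """
    if len(alleles) != len(counts):
        raise ValueError("alleles and counts must have the same length")
    kept = [(a, c) for a, c in zip(alleles, counts) if c > 0]
    return [a for a, _ in kept], [c for _, c in kept]


# Example locus: the second haplotype was never called in any sample.
alleles = ["ACGT", "ACGA", "TCGT"]
counts = [5, 0, 3]
print(drop_zero_count_alleles(alleles, counts))
# (['ACGT', 'TCGT'], [5, 3])
```

Filtering the input rather than the output means the zero-count alleles can still be reported by the program that produced them.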

timothymillar commented 2 years ago

Alternatively, we could drop alleles at or below a specified frequency in the input VCF. This would allow filtering of low-frequency alleles that are likely to be incorrect in a mapping population. E.g. --drop-allele-frequency 0.01 would drop alleles with frequency <= 0.01. The default would be -1 so that even zero-count alleles are kept.
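The threshold semantics described here could look roughly like the sketch below. It assumes allele frequency is computed as the allele count divided by the total count at the locus; the function name is hypothetical and only the `<=` comparison and the default of -1 come from the proposal.

```python
from typing import List, Sequence


def drop_low_frequency_alleles(
    alleles: Sequence[str], counts: Sequence[int], threshold: float = -1.0
) -> List[str]:
    """Drop alleles whose frequency is <= threshold.

    Hypothetical helper for illustration; not part of MCHap.
    Frequency is assumed to be count / total count at the locus.
    """
    total = sum(counts)
    kept = []
    for allele, count in zip(alleles, counts):
        frequency = count / total if total > 0 else 0.0
        if frequency > threshold:  # i.e. drop when frequency <= threshold
            kept.append(allele)
    return kept


alleles = ["hapA", "hapB", "hapC", "hapD"]
counts = [60, 0, 39, 1]

print(drop_low_frequency_alleles(alleles, counts))        # default -1 keeps everything
print(drop_low_frequency_alleles(alleles, counts, 0.0))   # drops only zero-count alleles
print(drop_low_frequency_alleles(alleles, counts, 0.01))  # also drops hapD (frequency 0.01)
```

Because the comparison is `<=`, a threshold of 0 behaves like a zero-count filter, so a single frequency argument could cover both proposals.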

Related to #120 and #123

timothymillar commented 2 years ago

Another option would be to keep only the top n alleles. This may be useful when there is a logical limit to the maximum number of possible alleles (e.g. in a mapping population). This could work in combination with dropping low-frequency alleles.
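A rough sketch of a top-n filter follows. It assumes "top" means highest count across all samples, and that ties are broken in favour of the allele listed first; neither rule is specified in this thread, and the function name is hypothetical.

```python
from typing import List, Sequence


def keep_top_n_alleles(
    alleles: Sequence[str], counts: Sequence[int], n: int
) -> List[str]:
    """Keep the n alleles with the highest counts, preserving input order.

    Hypothetical helper for illustration; not part of MCHap.
    Ranking by count and first-listed tie-breaking are assumptions.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    # sorted() is stable, so equal counts keep their original relative order.
    ranked = sorted(range(len(alleles)), key=lambda i: -counts[i])
    keep = sorted(ranked[:n])  # restore the original allele order
    return [alleles[i] for i in keep]


alleles = ["H0", "H1", "H2", "H3", "H4"]
counts = [10, 42, 0, 37, 11]
print(keep_top_n_alleles(alleles, counts, 4))
# ['H0', 'H1', 'H3', 'H4']
```

To combine this with a frequency threshold, the threshold filter could be applied first and the top-n filter applied to whatever remains.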

timothymillar commented 2 years ago

Either mechanism should be able to remove the reference allele for the purpose of estimating posterior distributions. The reference allele can then be added back after estimation for compliance with the VCF standard.
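One way this could work is sketched below, assuming the reference allele sits at index 0 and that a re-added reference allele is reported with an estimate of 0.0. Both function names, the 0.0 fill value, and the example estimates are assumptions made for illustration rather than MCHap behaviour.

```python
from typing import List, Sequence, Tuple


def select_for_estimation(counts: Sequence[int], threshold: float) -> List[int]:
    """Return indices of alleles with frequency > threshold.

    Index 0 (the reference allele) gets no special treatment here,
    so it can be dropped like any other allele. Hypothetical helper.
    """
    total = sum(counts)
    kept = []
    for index, count in enumerate(counts):
        frequency = count / total if total > 0 else 0.0
        if frequency > threshold:
            kept.append(index)
    return kept


def restore_reference(
    alleles: Sequence[str], kept: Sequence[int], estimates: Sequence[float]
) -> Tuple[List[str], List[float]]:
    """Put the reference allele back at index 0 if it was dropped.

    The 0.0 value given to a re-added reference is an assumption.
    """
    if len(kept) != len(estimates):
        raise ValueError("kept and estimates must have the same length")
    names = [alleles[i] for i in kept]
    values = list(estimates)
    if 0 not in kept:
        names.insert(0, alleles[0])
        values.insert(0, 0.0)
    return names, values


alleles = ["REF", "ALT1", "ALT2"]
counts = [0, 7, 5]  # the reference allele was never called
kept = select_for_estimation(counts, threshold=0.0)
estimates = [0.6, 0.4]  # placeholder values standing in for real estimates
print(kept)
# [1, 2]
print(restore_reference(alleles, kept, estimates))
# (['REF', 'ALT1', 'ALT2'], [0.0, 0.6, 0.4])
```

Keeping the restore step separate from the filter means estimation only ever sees the reduced allele set, while the output still lists a reference allele first.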