PlantandFoodResearch / MCHap

Polyploid micro-haplotype assembly using Markov chain Monte Carlo simulation.
MIT License
18 stars 3 forks source link

Document an example workflow for calling haplotypes in a mapping population #123

Closed timothymillar closed 1 year ago

timothymillar commented 2 years ago

Edit: The parameters suggested here are for version 0.5.1, in newer versions the --base-error-rate and --ignore-base-phred-scores are not needed (new defaults)

I recently was asked about running MCHap on a tetraploid mapping population. This is an obvious use-case for a well documented example. The current default parameters aren't ideal because they result in too many alleles being reported.

The following is a draft example of how to approach a balanced mapping population. The main aim is to minimize the number of spurious allele calls. We can do this based on the assumption that all samples are closely related and all parental alleles are present in some of the progeny. Therefore, we can be much more restrictive when reporting alleles because every allele should be present in a reasonable number of 'good' samples.

The main parameters to adjust are:

Initial assembly

The initial haplotype assembly should look something like this:

mchap assemble
  --targets targets.bed \
  --variants snps.vcf.gz \
  --reference reference.fasta.gz \
  --bam-list bamfiles.txt \
  --ploidy 4 \
  --inbreeding 0.01 \
  --base-error-rate 0.0025 \
  --ignore-base-phred-scores \
  --mcmc-burn 1000 \
  --mcmc-steps 2000 \
  --haplotype-posterior-threshold 0.9 \
  | bgzip > assemble.vcf.gz

It's a good idea to check the results in assemble.vcf.gz before going any further. The next step using mchap call requires that a reasonable number of alleles have been called at each locus (hopefully all of the real alleles). If the majority of samples are mostly null alleles (e.g, 0/././.) then it is likely that some of the real alleles have been missed. This is usually a result of low read depth. If this happens then you may be able to improve the result by lowering --haplotype-posterior-threshold, or possibly by removing some SNPs from the input VCF. Alternatively you can filter these loci or just accept the risk of missing some alleles.

If many of the genotypes contain a mixture of nulls and alleles (e.g. 0/0/2/.) then this probably means that the real alleles have been called but that there wasn't enough data to resolve the allelic dosage. In this case it should be OK to continue to the next step. If just a small fraction of the genotypes are bad then it should also be OK to move to the next step because the real alleles should have been called in the good samples.

Recalling genotypes

Following the initial haplotype assembly we should re-call genotypes using mchap call which calls genotypes from a known set of haplotypes. This is a good idea for a mapping population because every sample is related to the other samples and hence they share alleles. We are more likely to get an accurate genotype as a result (even with low dosage) because we are re-calling genotypes with a much smaller set of "good" alleles. This is effectively using the "good" samples to help correct the bad samples. The other advantage of mchap call is that it always reports complete genotypes so there shouldn't be any null alleles (.) in the output.

The haplotype recalling should look something like this (note thatassemble.vcf.gz needs to be sorted an indexed first):

mchap call
  --haplotypes assemble.vcf.gz \
  --bam-list bamfiles.txt \
  --ploidy 4 \
  --inbreeding 0.01 \
  --base-error-rate 0.0025 \
  --ignore-base-phred-scores \
  --mcmc-burn 1000 \
  --mcmc-steps 2000 \
  | bgzip > recalled.vcf.gz

The genotypes in recalled.vcf.gz should be compleate (no missing alleles) but still may be of low quality for some samples. In a mapping population we would assume that the alleles themselves are generally correct but that the dosage may be inaccurate. One way to check this is using the GPM and PHPM tags in the output VCF. The GPM is the posterior probability for the genotype (including dosage) and the PHPM is the posterior probability for the "allelic phenotype", i.e. the alleles called ignoring dosage. If PHPM is high and GPM is low this indicates that MCHap is uncertain about the dosage of alleles. Unfortunately dosage errors can still be present with a high GPM if the read sequences are biased towards one allele (this is hard to identify).

Future improvements

One of the planned options that will be added to mchap call allows specification of prior population allele frequencies (see #120). This will be particularly useful for mapping populations because any 'bad' alleles called in the initial assembly should have low frequency compared to the 'good' alleles. So using those frequencies will lower the chance that they are re-called in the recalling step. These examples should be updated when this feature is available.