Document an example workflow for calling haplotypes in a mapping population

Edit: The parameters suggested here are for version 0.5.1, in newer versions the --base-error-rate and --ignore-base-phred-scores are not needed (new defaults)

I recently was asked about running MCHap on a tetraploid mapping population. This is an obvious use-case for a well documented example. The current default parameters aren't ideal because they result in too many alleles being reported.

The following is a draft example of how to approach a balanced mapping population. The main aim is to minimize the number of spurious allele calls. We can do this based on the assumption that all samples are closely related and all parental alleles are present in some of the progeny. Therefore, we can be much more restrictive when reporting alleles because every allele should be present in a reasonable number of 'good' samples.

The main parameters to adjust are:

--inbreeding: using the default value of 0.0 is unrealistic for most real autopolyploid populations. Even if the parents are unrelated there is probably some some level of inbreeding in each due to historic population structure and double reduction. Additionally mchap assemble assumes that all of the possible alleles are present in a natural population. It is probably better to slightly over estimate the inbreeding coefficients rather than to underestimate them. If you don't have any information about expected inbreeding coefficients it is probably still better to set a positive values (e.g. 0.01 to 0.1) rather than 0.0.
--base-error-rate: By default, MCHap uses the base phred scores from each bam file to estimate the base calling error rate. This generally underestimates the true error rate which means reads with errors in them are more likely to be incorrectly called as alleles. This should be changed in a future version of MCHap. If you don't have an estimate for your data then you can use one from the literature e.g. Pfeiffer et al (2018) who report around 0.0025 for Illumina short reads. When specifying a base error rate it is worth using --ignore-base-phred-scores which ensures that the error rate is a constant across all reads (this can speed up the assembly).
--haplotype-posterior-threshold: This parameter controls the posterior support required for an allele to be reported in the output VCF. A value of 0 means that all alleles will be reported (this can create a huge VCF file!) and a value of 1 means that alleles will only be reported if the are called with 100% probability in at least one sample. For a large mapping population we want only well supported alleles so this value should be at least 0.9. This will result in a lot of null alleles (.) in the output VCF but that doesn't matter because we will re-call the alleles across all samples to produce a second VCF without nulls.

Initial assembly

The initial haplotype assembly should look something like this:

mchap assemble
  --targets targets.bed \
  --variants snps.vcf.gz \
  --reference reference.fasta.gz \
  --bam-list bamfiles.txt \
  --ploidy 4 \
  --inbreeding 0.01 \
  --base-error-rate 0.0025 \
  --ignore-base-phred-scores \
  --mcmc-burn 1000 \
  --mcmc-steps 2000 \
  --haplotype-posterior-threshold 0.9 \
  | bgzip > assemble.vcf.gz

It's a good idea to check the results in assemble.vcf.gz before going any further. The next step using mchap call requires that a reasonable number of alleles have been called at each locus (hopefully all of the real alleles). If the majority of samples are mostly null alleles (e.g, 0/././.) then it is likely that some of the real alleles have been missed. This is usually a result of low read depth. If this happens then you may be able to improve the result by lowering --haplotype-posterior-threshold, or possibly by removing some SNPs from the input VCF. Alternatively you can filter these loci or just accept the risk of missing some alleles.

If many of the genotypes contain a mixture of nulls and alleles (e.g. 0/0/2/.) then this probably means that the real alleles have been called but that there wasn't enough data to resolve the allelic dosage. In this case it should be OK to continue to the next step. If just a small fraction of the genotypes are bad then it should also be OK to move to the next step because the real alleles should have been called in the good samples.

Recalling genotypes

Following the initial haplotype assembly we should re-call genotypes using mchap call which calls genotypes from a known set of haplotypes. This is a good idea for a mapping population because every sample is related to the other samples and hence they share alleles. We are more likely to get an accurate genotype as a result (even with low dosage) because we are re-calling genotypes with a much smaller set of "good" alleles. This is effectively using the "good" samples to help correct the bad samples. The other advantage of mchap call is that it always reports complete genotypes so there shouldn't be any null alleles (.) in the output.

The haplotype recalling should look something like this (note thatassemble.vcf.gz needs to be sorted an indexed first):

mchap call
  --haplotypes assemble.vcf.gz \
  --bam-list bamfiles.txt \
  --ploidy 4 \
  --inbreeding 0.01 \
  --base-error-rate 0.0025 \
  --ignore-base-phred-scores \
  --mcmc-burn 1000 \
  --mcmc-steps 2000 \
  | bgzip > recalled.vcf.gz

The genotypes in recalled.vcf.gz should be compleate (no missing alleles) but still may be of low quality for some samples. In a mapping population we would assume that the alleles themselves are generally correct but that the dosage may be inaccurate. One way to check this is using the GPM and PHPM tags in the output VCF. The GPM is the posterior probability for the genotype (including dosage) and the PHPM is the posterior probability for the "allelic phenotype", i.e. the alleles called ignoring dosage. If PHPM is high and GPM is low this indicates that MCHap is uncertain about the dosage of alleles. Unfortunately dosage errors can still be present with a high GPM if the read sequences are biased towards one allele (this is hard to identify).

Future improvements

One of the planned options that will be added to mchap call allows specification of prior population allele frequencies (see #120). This will be particularly useful for mapping populations because any 'bad' alleles called in the initial assembly should have low frequency compared to the 'good' alleles. So using those frequencies will lower the chance that they are re-called in the recalling step. These examples should be updated when this feature is available.

PlantandFoodResearch / MCHap