genepi / imputationserver

Michigan Imputation Server: A new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity
https://imputationserver.sph.umich.edu/
GNU Affero General Public License v3.0
77 stars 41 forks source link

Confirm minimac command for apps@1000g-phase3-low@1.0.0 (hg38) #141

Open andrew-slater opened 4 months ago

andrew-slater commented 4 months ago

I'm attempting to replicate the Server's imputation with the 1000G Phase 3 GRCh38 Reference Panel. I started by confirming my local execution of the Eagle pre-phasing matches that of the Server by running 2 jobs on the Server:

  1. Full pipeline (including pre-phasing) on unphased input genotypes
  2. Imputation-only on genotypes pre-phased locally using Eagle following the example command in the documentation

These 2 jobs produced identical results indicating my local execution of Eagle is the same as on the Server. I then used the same pre-phased genotypes from Eagle to run Minimac4 (v1.0.2) locally following the example command in the documentation but cannot get these results to match those from the Server.

I sourced my reference haplotypes from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ and, after removing the singletons, confirmed the variant-level contents match the results from the Server (no additional or missing variants) indicating they are almost certainly from the same source (although, possibly not the same samples? I'm using the full set of 2548 samples).

I generated the M3VCF files with parameter estimates using Minimac3 v2.0.1 with the default options:

Minimac3 --processReference --refHaps ALL.chr14.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingletons.vcf.gz --prefix ALL.chr14.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingletons

minimac4 --refHaps ALL.chr14.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingletons.m3vcf.gz --haps eagle_out/14.55000000-85000000.vcf.gz --start 60000000 --end 80000000 --window 500000 --prefix minimac_out/14.60000000-80000000 --cpus 1 --chr 14 --noPhoneHome --format GT,DS,GP,HDS --allTypedSites --meta --minRatio 0.00001

I thought perhaps the Server may be using a genetic recombination map file but could not locate geneticMapFile.b38.map.txt.gz and tried the map file from Eagle instead:

minimac4 --refHaps ALL.chr14.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.noSingletons.m3vcf.gz --haps eagle_out/14.55000000-85000000.vcf.gz --start 60000000 --end 80000000 --window 500000 --prefix minimac_out/14.60000000-80000000 --cpus 1 --chr 14 --noPhoneHome --format GT,DS,GP,HDS --allTypedSites --meta --minRatio 0.00001 --referenceEstimates --mapFile genetic_map_hg38_withX.tsv.gz

These results matched those from the Server a bit better than using the M3VCF parameter estimates but still some discordance.

I think it would help me try to troubleshoot if you could please:

  1. Confirm if the Server is using the full 2548-sample haplotype set from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL)/ ?
  2. Clarify how your M3VCF files were generated (e.g. did you use non-default values for --rounds or --states or a different version of Minimac3)?
  3. Confirm the precise minimac command (i.e. is it using the --mapFile and if so, which one)?