HALFpipe / ImputationProtocol

The ENIGMA Imputation Protocol as a container for your local workstation or high-performance compute cluster

Imputation Protocol (as a container)


This project is currently at an experimental stage and has not been validated by an experienced geneticist. Please use at your own risk.

With the advent of stricter data privacy laws in many jurisdictions, it has become impossible for some researchers to use the Michigan Imputation Server to phase and impute genotype data. This project allows you to use the open source code behind the server on your local workstation or high-performance compute cluster, together with the 1000 Genomes Phase 3 v5 reference and the rest of the ENIGMA Imputation Protocol.

System requirements

This document assumes that you have either Singularity or Docker installed on your system.
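
If you are unsure which container platform is available on your system, either of these commands will print a version string (a quick check; only the one for your platform needs to succeed):

```bash
singularity --version
docker --version
```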

Usage

The container comes with all the software needed to run the imputation protocol. You can either do this manually based on the official instructions, or use the step-by-step guide below.

The guide makes some suggestions for folder names (e.g. mds, raw, qc), but these can be chosen freely. The only exceptions to this rule are the folders cloudgene, hadoop, downloads, input and output inside the working directory (which is available at /data inside the container), whose names are fixed.
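
For orientation, this is roughly what the working directory might look like once you have worked through the guide; the listing is only a sketch, combining the suggested folder names with the fixed folders mentioned above:

```bash
ls ${working_directory}
# cloudgene  downloads  hadoop  input  mds  output  qc  raw
```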

  1. You need to download the container file using one of the following commands. This will use approximately one gigabyte of storage.

    | Container platform | Version | Command |
    | --- | --- | --- |
    | Singularity | 3.x | `wget http://download.gwas.science/singularity/imputation-protocol-latest.sif` |
    | Docker | | `docker pull gwas.science/imputation-protocol:latest` |
  2. You will now need to create a working directory that can be used for intermediate files and outputs. This directory should be empty and should have sufficient space available. We will store the path of the working directory in the variable working_directory, and then create the new directory and some subfolders.

    Usually, you will only need to do this once, as you can re-use the working directory for multiple datasets. Note that this variable will only exist for the duration of your terminal session, so you should re-define it if you exit and then resume later.

    ```bash
    export working_directory=/mnt/scratch/imputation
    mkdir -p -v ${working_directory}/{raw,mds,qc}
    ```
  3. Copy your raw data to the raw subfolder of the working directory. If you have multiple .bed file sets that you want to process, copy them all.

    ```bash
    cp -v my_sample.bed my_sample.bim my_sample.fam ${working_directory}/raw
    ```
  4. Next, start an interactive shell inside the container using one of the following commands.

    The --bind (Singularity) or --volume (Docker) parameters are used to make the working directory available inside the container at the path /data. This means that in all subsequent commands, we can use the path /data to refer to the working directory.

    | Container platform | Command |
    | --- | --- |
    | Singularity | `singularity shell --hostname localhost --bind ${working_directory}:/data --bind /tmp imputation-protocol-latest.sif` |
    | Docker | `docker run --interactive --tty --volume ${working_directory}:/data gwas.science/imputation-protocol /bin/bash` |
  5. Inside the container, we will first go to /data/mds, and run the script enigma-mds for your .bed file set. The script creates the files mdsplot.pdf and HM3_b37mds2R.mds.csv, which are summary statistics that you will need to share with your working group as per the ENIGMA Imputation Protocol.

    Note that this script will create all output files in the current folder, so you should use cd to change to the /data/mds/sample folder before running it.

    If you have multiple .bed file sets, you should run the script in a separate folder for each one. Otherwise the script may overwrite previous results when you run it again.

    If you have just one dataset:

    ```bash
    mkdir /data/mds/sample
    cd /data/mds/sample
    enigma-mds --bfile /data/raw/my_sample
    ```

    Alternatively, for multiple datasets:

    ```bash
    mkdir /data/mds/{sample_a,sample_b}

    cd /data/mds/sample_a
    enigma-mds --bfile /data/raw/sample_a

    cd /data/mds/sample_b
    enigma-mds --bfile /data/raw/sample_b
    ```
  6. Next, we will set up our local instance of the Michigan Imputation Server.

    The setup-hadoop command will start a Hadoop instance on your computer, which consists of four background processes. When you are finished processing all your samples, you can stop them with the stop-hadoop command. If you are using Docker, then these processes will be stopped automatically when you exit the container shell.

    The setup-imputationserver script will then verify that the Hadoop instance works and install the 1000 Genomes Phase 3 v5 genome reference that will be used for imputation (around 15 GB of data, so this may take a while).

    If you are resuming analyses in an existing working directory, and do not still have the Hadoop background processes running, then you should re-run the setup commands. If they are still running, then you can skip this step.

    ```bash
    setup-hadoop --n-cores 8
    setup-imputationserver
    ```

    If you encounter any warnings or messages while running these commands, you should consult with an expert to find out what they mean and if they may be important. However, processing will usually complete without issues, even if some warnings occur.

    If something important goes wrong, you will usually see a clear error message that contains the word "error". Note that if the setup commands take more than an hour to run, that may also indicate that an error occurred.

  7. Next, go to /data/qc, and run enigma-qc for your .bed file sets. This will drop any strand ambiguous SNPs, then screen for low minor allele frequency, missingness and *Hardy-Weinberg equilibrium*, then remove duplicate SNPs (if necessary), and finally convert the data to sorted .vcf.gz format for imputation.

    The script places intermediate files in the current folder, and the final .vcf.gz files in /data/input/my_sample where they can be accessed by the imputationserver script in the next step.

    Note that this script will create some output files in the current folder, so you should use cd to change to the /data/qc/sample folder (or similar) before running it.

    The input path is hard-coded, because the imputation server is quite strict and expects a directory with just the .vcf.gz files and nothing else, so to avoid any problems we create that directory automatically.

    If you have just one dataset:

    ```bash
    mkdir /data/qc/sample
    cd /data/qc/sample
    enigma-qc --bfile /data/raw/my_sample --study-name my_sample
    ```

    Alternatively, for multiple datasets:

    ```bash
    mkdir /data/qc/{sample_a,sample_b}

    cd /data/qc/sample_a
    enigma-qc --bfile /data/raw/sample_a --study-name sample_a

    cd /data/qc/sample_b
    enigma-qc --bfile /data/raw/sample_b --study-name sample_b
    ```
  8. Finally, run the imputationserver command with the correct sample population. The --population parameter can be set to mixed if the sample contains multiple populations or the population is unknown.

    If a specific population (not mixed) is specified, then the imputation server will compare the minor allele frequency of each variant in the sample to the population reference using a chi-squared test, and exclude outliers. If the population is mixed, this step is skipped.

    Regardless of the population parameter, the imputation step always uses the entire 1000 Genomes Phase 3 v5 genome reference.

    If you see any errors (such as obvious strand flips detected), you may need to follow one or more of the workarounds in the troubleshooting section.

    ```bash
    imputationserver --study-name my_sample --population mixed
    ```

    This process will likely take a few hours, and once it finishes for all your .bed file sets, you can exit the container using the exit command.

    All outputs can be found in the working directory created earlier. The quality control report can be found at ${working_directory}/output/my_sample/qcreport/qcreport.html (only if the population is not mixed), and the imputation results at ${working_directory}/output/my_sample/local. The .zip files are encrypted with the password password.

  9. To merge all output files into a compact and portable .zip archive, the container includes the make-archive command. It will create a single output file at ${working_directory}/my_sample.zip with all output files.

    ```bash
    make-archive --study-name my_sample
    ```

    Add the `--zstd-compress` option to the command to use a more efficient compression algorithm. This will take longer to run, but will create a smaller output file (around 70% smaller).

  10. Once you have finished processing all your datasets, you can stop all background processes of the imputation server with the stop-hadoop command. Then you can exit the container, move the output files to a safe location and delete the working directory (in case you need the disk space).

    ```bash
    stop-hadoop
    exit
    mv ${working_directory}/my_sample.zip /storage
    rm -rI ${working_directory}
    ```
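
For reference, the container-shell commands from steps 5 through 10 can be condensed into a single session. The following is only a minimal sketch for one dataset named my_sample with 8 cores and the mixed population setting; adjust folder names, study name and parameters to your data.

```bash
# inside the container shell (working directory mounted at /data)
mkdir -p /data/mds/my_sample /data/qc/my_sample

# summary statistics for the working group (step 5)
cd /data/mds/my_sample
enigma-mds --bfile /data/raw/my_sample

# set up the local imputation server (step 6)
setup-hadoop --n-cores 8
setup-imputationserver

# quality control and conversion to sorted .vcf.gz (step 7)
cd /data/qc/my_sample
enigma-qc --bfile /data/raw/my_sample --study-name my_sample

# phasing and imputation (step 8), then archive the outputs (step 9)
imputationserver --study-name my_sample --population mixed
make-archive --study-name my_sample

# shut down the background processes and leave the container (step 10)
stop-hadoop
exit
```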

Troubleshooting

My data is not in .bed file set format

You will need to convert your data before you start. Usually, this is straightforward with the plink command.

.ped file set format

```bash
plink --ped my_sample.ped --map my_sample.map --make-bed --out my_sample
```

Genome reference

Error: No chunks passed the QC step. Imputation cannot be started!

This error happens when your input data uses a different genome reference than the one the imputation server expects (hg19). In that case, your data needs to be converted ("lifted over") to hg19. You can do this manually with LiftOver, or use the check hg19 command included in the container, which is based on the RICOPILI command buigue.

```bash
check hg19 --bfile /data/raw/my_sample
```

The command will create a .bed file set at /data/raw/my_sample.hg19 which will have the correct genome reference. You should now use this corrected file to re-run the enigma-qc script, and then retry running the imputationserver command.
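
For example, assuming a single dataset and keeping the same study name as before (the folder name my_sample_hg19 is only illustrative):

```bash
mkdir /data/qc/my_sample_hg19
cd /data/qc/my_sample_hg19
enigma-qc --bfile /data/raw/my_sample.hg19 --study-name my_sample
imputationserver --study-name my_sample --population mixed
```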

Strand flips

Error: More than 100 obvious strand flips have been detected. Please check strand. Imputation cannot be started!

If the imputationserver command fails with this error, then you will need to resolve strand flips in your data. To do that automatically, the container comes with the check flip command, which is based on the RICOPILI commands checkflip4 and check-bim.

```bash
check flip --bfile /data/raw/my_sample
```

The command will create a .bed file set at /data/raw/my_sample.check-flip which will have all strand flips resolved. You should now use this corrected file to re-run the enigma-qc script, and then retry running the imputationserver command.

Velocity

Job execution failed: Velocity could not be initialized!

You likely have bad permissions in your working directory. You can either try to fix them, or start over with a fresh working directory.
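
One possible fix, assuming the problem is simply that files in the working directory are not readable or writable by your user (a sketch; your site may require different permissions), is to reset the permissions from outside the container and then retry:

```bash
# grant your user read/write access to everything in the working directory
chmod -R u+rwX ${working_directory}
```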

VCF violates the official specification

Warning: At least one VCF allele code violates the official specification; other tools may not accept the file. (Valid codes must either start with a '<', only contain characters in {A,C,G,T,N,a,c,g,t,n}, be an isolated '*', or represent a breakend.)

This message occurs for example when you have indels in your raw data. You can ignore this message, because the Michigan Imputation Server will remove these invalid variants in its quality control step.

The commands are stuck, for example at Init HadoopUtil null

This suggests that your Hadoop instance may not be accepting new jobs. The fastest way to solve this is to stop and delete the instance, and then to re-run the setup.

```bash
# stop and delete
stop-hadoop
rm -rf /data/hadoop

# re-run setup
setup-hadoop --n-cores 8
setup-imputationserver
```

Error Application or file imputationserver not found

Please delete the /data/cloudgene folder inside the container and run setup-imputationserver again.
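
From inside the container shell, that amounts to:

```bash
rm -rf /data/cloudgene
setup-imputationserver
```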