Refactor preprocessing pipeline and add github action

This PR adds example data for the preprocessing pipeline and instructions for testing the pipeline. There is also a small patch to the preprocessing script to support relative paths when excluding variants.

The simulated example data is located in example/preprocess.

To test the example data, follow the instructions in the README:

Run the preprocess pipeline with example data

The VCF files in the example data folder were generated using fake-vcf (with some manual editing) and hence do not contain real data.

cd into the preprocessing example dir

cd <path_to_repo>
cd example/preprocess

Download the fasta file

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz -P workdir/reference

Unpack the fasta file

gzip -d workdir/reference/GRCh38.primary_assembly.genome.fa.gz
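The reference download is the slow step, so a local rerun may want to skip it when the unpacked file is already present. A minimal sketch, assuming the file keeps the name produced by the two commands above; `reference_ready` is a hypothetical helper and not part of the repository:

```shell
# Hypothetical helper (not in the repo): succeeds only if the unpacked
# FASTA exists and is non-empty in the given directory.
reference_ready() {
    ref_dir="${1:-workdir/reference}"
    [ -s "$ref_dir/GRCh38.primary_assembly.genome.fa" ]
}

# Usage (shown as a comment so that nothing is downloaded here):
#   reference_ready || echo "reference missing: run the wget and gzip steps above"
```

The `-s` test also rejects a zero-byte file, which is what an interrupted download or unpack can leave behind.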

Run with the example config

snakemake -j 1 --snakefile ../../pipelines/preprocess.snakefile --configfile ../../pipelines/config/deeprvat_preprocess_config.yaml
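Before a full run, it can be useful to see which jobs Snakemake plans to execute. The sketch below wraps the same command with Snakemake's `-n` (dry-run) flag. The function name and the `SNAKEMAKE` override variable are assumptions made for illustration; they are not part of the repository.

```shell
# Dry run: list the jobs Snakemake would execute without running them.
# SNAKEMAKE is a hypothetical override (not defined by the repo) for
# pointing at a specific install or a stub.
preprocess_dry_run() {
    "${SNAKEMAKE:-snakemake}" -n -j 1 \
        --snakefile ../../pipelines/preprocess.snakefile \
        --configfile ../../pipelines/config/deeprvat_preprocess_config.yaml
}
```

Call `preprocess_dry_run` from example/preprocess, the same directory the relative paths in the original command assume.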

Enjoy the preprocessed data 🎉

ls -l workdir/preprocesed
total 48
-rw-r--r--  1 user  staff  6404 Aug  2 14:06 genotypes.h5
-rw-r--r--  1 user  staff  6354 Aug  2 14:06 genotypes_chr21.h5
-rw-r--r--  1 user  staff  6354 Aug  2 14:06 genotypes_chr22.h5
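The listing above can also be checked in a script. A minimal sketch, assuming the output layout shown in the listing (one combined `genotypes.h5` plus `genotypes_chr*.h5` files); `check_preprocess_outputs` is a hypothetical helper, and the default directory is spelled exactly as in the listing:

```shell
# Hypothetical helper (not in the repo): verify that the combined genotype
# file and at least one per-chromosome file exist and are non-empty.
check_preprocess_outputs() {
    out_dir="${1:-workdir/preprocesed}"
    if [ ! -s "$out_dir/genotypes.h5" ]; then
        echo "missing or empty: $out_dir/genotypes.h5" >&2
        return 1
    fi
    # If the glob matches nothing, $f is the literal pattern and -s fails.
    for f in "$out_dir"/genotypes_chr*.h5; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
    return 0
}
```

This only checks that the files exist and have content. It does not open the HDF5 files or validate what is in them.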

A new job is added to GitHub Actions that first runs a smoke test of the preprocessing pipeline and then the full preprocessing pipeline using the example data. The slowest part of running the example pipeline is downloading the FASTA file, so this step is cached in GitHub Actions.


You can view the actions here: https://github.com/PMBio/deeprvat/actions/runs/5741091552
