PMBio / deeprvat

Refactor preprocessing pipeline and add github action #12

Closed endast closed 1 year ago

endast commented 1 year ago

This PR adds example data for the preprocessing pipeline and instructions for testing the pipeline. There is also a small patch to the preprocessing script to support relative paths when excluding variants.

The simulated example data is located in example/preprocess.

To test the example data, follow the instructions in the README:

Run the preprocess pipeline with example data

The VCF files in the example data folder were generated using fake-vcf (with some manual editing) and therefore do not contain real data.

  1. cd into the preprocessing example dir
       cd <path_to_repo>
       cd example/preprocess
  2. Download the fasta file
       wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz -P workdir/reference
  3. Unpack the fasta file
       gzip -d workdir/reference/GRCh38.primary_assembly.genome.fa.gz
  4. Run with the example config
       snakemake -j 1 --snakefile ../../pipelines/preprocess.snakefile --configfile ../../pipelines/config/deeprvat_preprocess_config.yaml
  5. Enjoy the preprocessed data 🎉
       ls -l workdir/preprocesed
       total 48
       -rw-r--r--  1 user  staff  6404 Aug  2 14:06 genotypes.h5
       -rw-r--r--  1 user  staff  6354 Aug  2 14:06 genotypes_chr21.h5
       -rw-r--r--  1 user  staff  6354 Aug  2 14:06 genotypes_chr22.h5
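
The steps above could be collected into a small script along these lines. This is only a sketch and not part of the PR: the helper names `fetch_reference` and `run_example` and the skip-if-already-present check are assumptions added for illustration. The URL, paths, and snakemake invocation are the ones listed above.

```shell
# Sketch: helpers for the example preprocessing run.
# Intended to be used from example/preprocess.
# fetch_reference and run_example are hypothetical names, not from the PR.
set -eu

FASTA_URL="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz"
REF_DIR="workdir/reference"
FASTA="${REF_DIR}/GRCh38.primary_assembly.genome.fa"

# Download and unpack the fasta file unless the unpacked file already exists
# (assumed convenience check; the README steps always download).
fetch_reference() {
    if [ -f "${FASTA}" ]; then
        echo "reference already present: ${FASTA}"
        return 0
    fi
    wget "${FASTA_URL}" -P "${REF_DIR}"
    gzip -d "${FASTA}.gz"
}

# Run the preprocessing pipeline with the example config on a single core.
run_example() {
    snakemake -j 1 \
        --snakefile ../../pipelines/preprocess.snakefile \
        --configfile ../../pipelines/config/deeprvat_preprocess_config.yaml
}
```

After sourcing these definitions in `example/preprocess`, running `fetch_reference && run_example` would perform steps 2 to 4 in order and skip the download on repeated runs.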

A new job is added to the GitHub Actions workflow. It first runs a smoke test of the preprocessing pipeline and then runs the full preprocessing pipeline on the example data. The slowest part of running the example pipeline is downloading the fasta file, so this step is cached in GitHub Actions.
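
The two stages of the job can be approximated locally. The following is a minimal sketch that assumes the smoke test is a snakemake dry run (`-n`). The PR text does not say how the smoke test is implemented, and the function names are hypothetical.

```shell
# Sketch: smoke test first, full run only if the smoke test passes.
# ASSUMPTION: the smoke test is a snakemake dry run (-n), which builds the
# job plan without executing any jobs. The PR does not specify this.
SNAKEFILE="../../pipelines/preprocess.snakefile"
CONFIGFILE="../../pipelines/config/deeprvat_preprocess_config.yaml"

# Dry run: checks that the workflow and config resolve to a valid job plan.
smoke_test() {
    snakemake -n -j 1 --snakefile "${SNAKEFILE}" --configfile "${CONFIGFILE}"
}

# Full run on the example data, single core.
full_run() {
    snakemake -j 1 --snakefile "${SNAKEFILE}" --configfile "${CONFIGFILE}"
}

# Run both stages in order; stop at the first failure.
smoke_then_full() {
    smoke_test && full_run
}
```

Calling `smoke_then_full` from `example/preprocess` would mirror the order described above: a cheap check that fails fast, followed by the actual run.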

You can view the actions here: https://github.com/PMBio/deeprvat/actions/runs/5741091552