This repo contains versions 1 and 2 of the pipeline. The pipeline has been entirely re-written in a non-backward-compatible way for newer versions (3+), and we suggest you use those newer versions instead. See https://github.com/dms-vep/dms-vep-pipeline-3 for the newer versions.
This repository contains a `snakemake` pipeline for analysis of deep mutational scanning of barcoded viral entry proteins. To use the pipeline, include this repository as a git submodule in your own repo, which will contain the data and project-specific code.
In other words, if the repository for your specific deep mutational scanning project is called `<my_dms_repo>`, you would add `dms-vep-pipeline` as a submodule to that repo, while the master snakemake `Snakefile`, its configuration (`config.yaml`), its input data, etc. would reside in `<my_dms_repo>`. The directory structure would look like this:
```
<my_dms_repo>
├── dms-vep-pipeline [added as git submodule]
├── README.md [README for main project]
├── Snakefile [top-level snakemake file]
├── config.yaml [configuration for Snakefile]
├── data [subdirectory with input data]
├── results [subdirectory with results created by pipeline]
├── docs [sphinx summary of results created by pipeline]
└── <other files / subdirectories that are part of project>
```
The top-level `Snakefile` then includes the snakemake rules defined by the `*.smk` files in the `dms-vep-pipeline` submodule, and uses them to run the analysis. This also requires properly setting up the top-level `config.yaml` to specify details for your project (see more below).
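For instance, a minimal top-level `Snakefile` might look like the sketch below (this is illustrative, not the definitive file; it assumes the submodule layout shown above, and the exact variables your project needs are set in `config.yaml`):

```snakemake
# Illustrative top-level Snakefile for <my_dms_repo>:
# load the project configuration, then include the rules
# defined in the dms-vep-pipeline submodule.

configfile: "config.yaml"

include: "dms-vep-pipeline/pipeline.smk"  # analysis rules
include: "dms-vep-pipeline/docs.smk"      # sphinx HTML documentation rules
```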
A test example use of the pipeline is in ./test_example. To keep the example self-contained within this repo, the organization is a bit different for ./test_example: it is contained as a subdirectory of the pipeline, whereas for actual use of the repo you will make the pipeline a submodule of `<my_dms_repo>` as described above.
Therefore, your `config.yaml` will have different values for `pipeline_path` and `docs` as indicated in the comments in ./test_example/config.yaml.
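For a project that uses the pipeline as a submodule, the relevant entries in `config.yaml` would look roughly like this (the values below are hypothetical examples for illustration; see the comments in ./test_example/config.yaml for the exact keys your version of the pipeline expects):

```yaml
# Illustrative config.yaml entries for a project with dms-vep-pipeline
# as a submodule (values are hypothetical examples):
pipeline_path: dms-vep-pipeline  # submodule at top level of <my_dms_repo>
docs: docs                       # sphinx docs built here, served via GitHub Pages
```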
Despite these differences, ./test_example provides an example of how to set up your repo.
Running the `Snakefile` in ./test_example performs the whole analysis for the test example and creates the sphinx rendering in ./docs, which can be displayed via GitHub Pages as here: https://dms-vep.github.io/dms-vep-pipeline/.
The `Snakefile` you create will include pipeline.smk (which has the analysis pipeline) and docs.smk (which builds the sphinx HTML documentation). You can also optionally add other rules into your `Snakefile`. If they define an `output` named `nb` that is a Jupyter notebook (like some of the rules in pipeline.smk and its included `.smk` files), then that notebook will be included in the HTML documentation.
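A custom rule whose notebook gets pulled into the docs might look like the following sketch (the rule name, input file, notebook path, and use of papermill to execute the notebook are all hypothetical; what matters is the `output` entry named `nb`):

```snakemake
rule my_custom_analysis:
    """Hypothetical project-specific rule; because its output is named
    ``nb`` and is a Jupyter notebook, it is added to the HTML docs."""
    input:
        csv="results/some_data.csv",
    output:
        nb="results/notebooks/my_custom_analysis.ipynb",
    shell:
        "papermill my_custom_analysis.ipynb {output.nb} -p input_csv {input.csv}"
```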
You then run the pipeline with:

```bash
snakemake -j <n_jobs> --use-conda
```

Or if you are only using the `dms-vep-pipeline` conda environment in environment.yml and have already built that, you can also just do:

```bash
conda activate dms-vep-pipeline
snakemake -j <n_jobs>
```
If the ./docs output directory has already been built and you want to force a re-run, just delete it and then run the commands above. This will create the results in ./results/ and the HTML documentation in ./docs/.
To display the HTML documentation via GitHub Pages, set up your repo to serve documentation from the /docs folder of the main (or master) branch as described here. The documentation will then be at https://dms-vep.github.io/<my_dms_repo> (assuming you are using the https://github.com/dms-vep organization; otherwise replace `dms-vep` with whatever account contains your repo).
Note that dms-vep-pipeline has its own conda environment specified in environment.yml. There is a separate environment, environment_align_parse_PacBio_ccs.yml, for aligning and parsing the PacBio CCSs, so that step isn't re-run every time the main environment is updated.
To add dms-vep-pipeline as a submodule in your repo (`<my_dms_repo>`), do as follows:

```bash
git submodule add https://github.com/dms-vep/dms-vep-pipeline
```

This adds the file .gitmodules and the submodule dms-vep-pipeline, which can then be committed with:

```bash
git commit -m 'added `dms-vep-pipeline` as submodule'
```
Note that if you want a specific commit or tag of dms-vep-pipeline, follow these steps:

```bash
cd dms-vep-pipeline
git checkout <commit>
cd ../
```

and then, back in the top-level directory of `<my_dms_repo>`, add and commit the updated `dms-vep-pipeline` submodule.
You can also make changes to the dms-vep-pipeline submodule in `<my_dms_repo>` by going into that directory, making changes on a branch, and then pushing back to dms-vep-pipeline and opening a pull request.
Here are the different contents of this repo:
**conda environment:** The conda environment for the pipeline is in environment.yml.
The Python code should be formatted with black by running `black .`. Comparable formatting is done for the snakemake files (`*.smk` files) with snakefmt by running `snakefmt .`. The overall snakemake pipeline is linted by going to ./test_example and running `snakemake --lint`. The code and Jupyter notebooks are linted with flake8_nb by running `flake8_nb`.
The pipeline is tested with GitHub Actions by checking all the formatting and linting above, and then also running the pipeline on the example in ./test_example. See .github/workflows/test.yaml for details.
The repo was configured to strip output from Jupyter notebooks as described here by running:

```bash
nbstripout --install --attributes .gitattributes
```
**git lfs:** The large data files for the test example in ./test_example/sequencing_data/ are tracked with git lfs. Note the first time you set up the repo, you have to run:

```bash
git lfs install
```
If you add notebooks in subindices, to avoid sphinx errors they must have the "orphan" tag: https://nbsphinx.readthedocs.io/en/0.8.8/orphan.html
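As a sketch, this tag goes in the notebook's top-level metadata, roughly as shown below (consult the linked nbsphinx docs for the authoritative form; the `kernelspec` entry here is just typical notebook metadata for context):

```json
{
  "metadata": {
    "nbsphinx": {
      "orphan": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  }
}
```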