bowmanjeffs / paprica

paprica - PAthway PRediction by phylogenetIC plAcement

Docker/Singularity support #88

Closed gasperpetelin closed 3 years ago

gasperpetelin commented 3 years ago

Greetings

I was wondering if there is any plan to include support for Docker/Singularity? Publishing a Docker image would probably simplify the paprica installation for many users and eliminate the need to manually fix incompatible dependencies. It would also help with reproducibility and simplify use on systems where installing packages is not always possible (shared research clusters that support Singularity). If adding Docker support would be useful, I can create a pull request that adds the required code.

Setup

While I am not really familiar with the whole library, adding Docker support should hopefully be fairly simple. It requires a new file named Dockerfile with roughly the following content (a first draft, based on the linux_install.sh script):

FROM python:3.9
RUN apt-get update \
    && apt-get install -qy --no-install-recommends \
    make \
    git \
    cmake \
    autotools-dev \
    libtool \
    flex \
    bison \
    automake \
    autoconf \
    build-essential \
    zip

## Install python dependencies, including external python tools
RUN pip3 install numpy biopython joblib pandas seqmagick termcolor

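## Each RUN starts a new shell, so a standalone "RUN cd ~" has no effect.
## The PATH entries below assume everything is installed under /.
WORKDIR /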

## Install RAxML
#git clone https://github.com/stamatak/standard-RAxML.git
#cd standard-RAxML
#sudo make -f Makefile.AVX2.PTHREADS.gcc
#rm -f *.o

## Install RAxML-ng
RUN wget https://github.com/amkozlov/raxml-ng/releases/download/0.9.0/raxml-ng_v0.9.0_linux_x86_64.zip && unzip raxml-ng_v0.9.0_linux_x86_64.zip && rm raxml-ng_v0.9.0_linux_x86_64.zip

## Install infernal
RUN wget http://eddylab.org/infernal/infernal-1.1.2-linux-intel-gcc.tar.gz && tar -xzvf infernal-1.1.2-linux-intel-gcc.tar.gz && mv infernal-1.1.2-linux-intel-gcc infernal

## Install gappa
RUN git clone --recursive https://github.com/lczech/gappa.git && cd gappa && make

## Install epa-ng
## Double check that you have all dependencies as described here: https://github.com/Pbdas/epa-ng#installation.
## If the compiler yells at you about not having zlib, you will need to have zlib1g-dev installed, not just zlib1g!

RUN git clone https://github.com/Pbdas/epa-ng.git && cd epa-ng && make

## Modify PATH
ENV PATH="/pplacer:${PATH}"
ENV PATH="/.local/bin:${PATH}"
ENV PATH="/infernal/binaries:${PATH}"
ENV PATH="/infernal/easel:${PATH}"
ENV PATH="/raxml-ng:${PATH}"
ENV PATH="/paprica:${PATH}"
ENV PATH="/epa-ng/bin:${PATH}"
ENV PATH="/gappa/bin:${PATH}"
#ENV PATH="export PATH" >> .bashrc

## Download paprica - redundant cause that's probably how you got this script
RUN git clone https://github.com/bowmanjeffs/paprica.git && cd paprica && chmod a+x *py && chmod a+x *sh
CMD ["/bin/bash"]

The second step is building and publishing the image. This requires a project maintainer to create a Dockerhub account and connect it to GitHub (approximately 2-5 min of work). After that, the Docker image is built automatically on every new commit with no further action, so it is very low maintenance.

Advantages

  1. Running everything with Docker is fairly simple. If Docker is installed and there is a file test.fasta in /path/to/fasta/data:
    docker run -it -v /path/to/fasta/data:/data maintainersprofile/paprica
    > cd /data
    > paprica-pick_domain.py -in test
  2. Docker containers are portable between different operating systems
  3. Requires almost no maintenance once it is set up
  4. When a specific version of the container is created, it will always produce the same results making reproducibility very simple
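
The interactive session in item 1 can also be run as a single non-interactive command, which is handy in scripts or on a cluster. The following is a minimal sketch, not part of the proposed pull request: the function name run_paprica and the DOCKER and PAPRICA_IMAGE variables are hypothetical, and maintainersprofile/paprica is the same placeholder image name used above. It assumes the paprica scripts are on PATH inside the image, as the Dockerfile draft sets up.

```shell
# Hypothetical helper: run one paprica command inside the container.
#   $1      host directory holding the input files (mounted at /data)
#   $2...   command to run inside the container
# DOCKER and PAPRICA_IMAGE are illustrative overrides, not paprica settings.
run_paprica() {
    data_dir=$1
    shift
    # --rm removes the container when the command finishes;
    # -w /data makes the mounted directory the working directory.
    "${DOCKER:-docker}" run --rm \
        -v "$data_dir":/data \
        -w /data \
        "${PAPRICA_IMAGE:-maintainersprofile/paprica}" "$@"
}
```

With this defined, the session from item 1 becomes run_paprica /path/to/fasta/data paprica-pick_domain.py -in test, and the output files land in the mounted host directory.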

bowmanjeffs commented 3 years ago

Gašper, thanks for this suggestion. It's been in the back of my mind for a while but I haven't acted on it. If you're willing to create the required code, your contribution will be much appreciated! So that I understand the workflow correctly, I should create a Dockerhub account and connect it to GitHub AFTER accepting your pull request?

gasperpetelin commented 3 years ago

Yes, ideally it should be done after the pull request since there is still no Dockerfile in the repo and Dockerhub has no template for building. I am not that familiar with Github/Dockerhub integration but I think the following should hopefully work.

  1. Create an account
  2. Connect it to the GitHub account: go to Account Settings/Linked Accounts and click the Connect button next to the GitHub account.
  3. Create a new repository with Repositories/Create Repository
    • Enter Name (probably paprica) and Description
    • Visibility: public
    • Organization: bowmanjeffs/paprica
    • Build Settings: Everything should be similar to what the screenshot shows. The only difference might be the regular expression for tags. It is important that the regular expression /^v([0-9.]+)$/ matches how versions will be tagged in the future; if release versions are tagged paprica_v0.7.0, then /^paprica_v([0-9.]+)$/ might be required. [Screenshot: Dockerhub build settings]
    • Click Create and Build

This should create two things. The first is an image tagged latest, which is rebuilt on every new commit to master; it is the default that a new user downloads. The second is a more permanent versioning scheme: on every new release tagged paprica_v{number}, an image tagged version-{number} is created, so that someone can download an image that matches a release on GitHub. Hopefully, this works. If not, the regular expression can easily be fixed later.
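
To check the tag pattern before relying on it, the mapping from release tag to image tag can be tried locally. This is only a sketch of the rule described above: it uses sed's extended regular expressions as a stand-in for Dockerhub's own matcher, and the function name image_tag_for is made up for illustration.

```shell
# Map a Git release tag to the image tag described above.
# Prints "version-<number>" for tags of the form "paprica_v<number>";
# prints nothing and returns 1 for any other tag.
image_tag_for() {
    # -n with the trailing "p" prints only when the substitution matched.
    result=$(printf '%s\n' "$1" | sed -n -E 's/^paprica_v([0-9.]+)$/version-\1/p')
    [ -n "$result" ] || return 1
    printf '%s\n' "$result"
}
```

For example, image_tag_for paprica_v0.7.0 prints version-0.7.0, while a tag such as v0.7.0 is rejected because it lacks the paprica_ prefix.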

I will try to create a good Dockerfile based on linux_install.sh. It might not be the best and most efficient one but it can be improved in the future. The current one takes about 2 min to build and produces an image with a size of 1-1.5GB.

Just one last note: the free Dockerhub account has some limitations. Image building might take some time to start if no free resources are available, but I have never experienced long waiting times. Usually image building starts 5-10 minutes after a commit.

bowmanjeffs commented 3 years ago

Sounds good. We'll give it a go!

bowmanjeffs commented 3 years ago

Alright, I haven't gotten around to actually testing the image, but the automated builds seem to be working. Thanks for contributing the Dockerfile! If you try the image before I do, let me know how it goes.

gasperpetelin commented 3 years ago

Since I am not familiar with the majority of the features of paprica my test should be taken with a grain of salt. It appears that everything works with Docker and Singularity. Containerized ./paprica-run.sh test bacteria produces the same results as a non-containerized application.

I only have 2 more short questions but otherwise, this issue can be closed I think:

  1. Should README be updated so users know that there exists an official container and they can use it?
  2. And one more general one: is there a security reason why all .py and .sh scripts are non-executable, so that chmod a+x *py && chmod a+x *sh is required? Couldn't these scripts be made executable on GitHub so users would not have to run the command?

bowmanjeffs commented 3 years ago

I'll make that change to the README and will update shortly. File permissions are a problem that I haven't solved yet. There's no security reason for this; I just haven't been able to get the permissions to persist. It's clear how to do this from the Git command line, but for convenience I usually push commits from GitHub Desktop. Let me know if you know of a solution!
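
One possible approach, offered as a sketch rather than a tested fix for this repository: Git stores the executable bit in its index, so it can be set once from the command line with git update-index --chmod=+x and then committed. The assumption here is that later commits pushed from GitHub Desktop leave the recorded mode alone. The function name mark_executable is hypothetical.

```shell
# Record the executable bit in the Git index for the given tracked files.
# This changes the mode Git stores (100644 -> 100755), independent of
# whether the local filesystem preserves Unix permissions.
mark_executable() {
    git update-index --chmod=+x -- "$@"
}

# Example, run once from the repository root:
#   mark_executable *.py *.sh
#   git commit -m "Make scripts executable"
```

The recorded mode can be checked with git ls-files -s, which lists 100755 for executable files. If this holds for paprica, the chmod step in the Dockerfile and install script would no longer be needed.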