StaPH-B / docker-builds

:package: :whale: Dockerfiles and documentation on tools for public health bioinformatics
GNU General Public License v3.0

Enable GCS, S3, and libdeflate support for bcftools #1019

Closed pettyalex closed 2 weeks ago

pettyalex commented 2 months ago

Enable AWS S3, GCS, and libdeflate support for bcftools by running ./configure before compiling

This fixes https://github.com/StaPH-B/docker-builds/issues/1018

If you want to merge this, I don't see a way to mark another build number for an already published package, but I'd be glad to update that if it exists.

I'd also be glad to add tests that read directly from AWS S3 or GCS storage to validate that these features work.
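One rough way to check whether a built bcftools binary picked up these libraries is to look at what it links against. The sketch below is an illustration under stated assumptions, not part of this PR: `report_linked_libs` is a hypothetical helper that reads `ldd`-style output on stdin and reports whether each named library appears in it. Dynamic linking is only a proxy. The S3 and GCS handlers live inside htslib, so a real read from a bucket is still the stronger test.

```shell
#!/bin/sh
# Hypothetical helper: reads `ldd`-style output on stdin and, for each library
# name passed as an argument, prints whether a line mentioning it is present.
report_linked_libs() {
  listing=$(cat)
  for lib in "$@"; do
    # -F treats the name as a fixed string, so "libcurl" also matches
    # variants such as "libcurl-gnutls.so.4".
    if printf '%s\n' "$listing" | grep -qF -- "$lib"; then
      echo "$lib: linked"
    else
      echo "$lib: missing"
    fi
  done
}

# Intended use inside the image (assumes bcftools is on PATH):
#   ldd "$(command -v bcftools)" | report_linked_libs libcurl libdeflate
```

A "missing" result for libcurl would suggest the binary was compiled without running `./configure` or without the development packages installed at build time.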

Pull Request (PR) checklist:

kapsakcj commented 2 months ago

Could you please mark this PR as draft? The dockerfile doesn't build successfully yet (according to the GH Actions log) and I think it will require some edits prior to review from our team.

We would love to have additional tests for these features built into the Dockerfile, preferably in its test stage.

One last thought: it may be good to update the samtools and htslib Dockerfiles as well, since I imagine they are also missing these features (I have not checked, though, so don't quote me). That can be done as part of this PR or separately.

Kincekara commented 1 month ago

@pettyalex Thank you for raising this issue and opening a pull request. GCS/S3 and libdeflate support are important features that we missed when building version 1.20. As a general principle, we avoid overwriting images we have already published because we don't want to break people's pipelines and validations. Another common practice here is "one tool, one PR": it is very easy to miss something in a crowded pull request. I personally check the build logs, in addition to the tests at the end, to catch silent errors.

So, I will request a few changes from you:

Any further tests, recommendations, and feedback will be appreciated. Thank you.

# for easy upgrade later. ARG variables only persist during build time
ARG BCFTOOLS_VER="1.20"

FROM ubuntu:jammy as builder

# re-instantiate variable
ARG BCFTOOLS_VER

# install dependencies, cleanup apt garbage
RUN apt-get update && apt-get install --no-install-recommends -y \
  wget \
  ca-certificates \
  perl \
  bzip2 \
  autoconf \
  automake \
  make \
  gcc \
  zlib1g-dev \
  libbz2-dev \
  liblzma-dev \
  libcurl4-gnutls-dev \
  libssl-dev \
  libperl-dev \
  libgsl0-dev \
  libdeflate-dev \
  procps && \
  rm -rf /var/lib/apt/lists/* && apt-get autoclean

# download, compile, and install bcftools
RUN wget https://github.com/samtools/bcftools/releases/download/${BCFTOOLS_VER}/bcftools-${BCFTOOLS_VER}.tar.bz2 && \
  tar -xjf bcftools-${BCFTOOLS_VER}.tar.bz2 && \
  rm -v bcftools-${BCFTOOLS_VER}.tar.bz2 && \
  cd bcftools-${BCFTOOLS_VER} && \
  ./configure --enable-libgsl --enable-perl-filters && \
  make && \
  make install && \
  make test

### start of app stage ###
FROM ubuntu:jammy as app

# re-instantiate variable
ARG BCFTOOLS_VER

# putting the labels in
LABEL base.image="ubuntu:jammy"
LABEL dockerfile.version="1"
LABEL software="bcftools"
LABEL software.version="${BCFTOOLS_VER}"
LABEL description="Variant calling and manipulating files in the Variant Call Format (VCF) and its binary counterpart BCF"
LABEL website="https://github.com/samtools/bcftools"
LABEL license="https://github.com/samtools/bcftools/blob/develop/LICENSE"
LABEL maintainer="Erin Young"
LABEL maintainer.email="eriny@utah.gov"
LABEL maintainer2="Curtis Kapsak"
LABEL maintainer2.email="kapsakcj@gmail.com"

# install dependencies required for running bcftools
# https://github.com/samtools/bcftools/blob/develop/INSTALL#L29
RUN apt-get update && apt-get install --no-install-recommends -y \
    perl \
    zlib1g \
    gsl-bin \
    bzip2 \
    liblzma5 \
    libcurl4-gnutls-dev \
    libdeflate0 \
    procps \
    && apt-get autoclean && rm -rf /var/lib/apt/lists/*

# copy in bcftools executables from builder stage
COPY --from=builder /usr/local/bin/* /usr/local/bin/
# copy in bcftools plugins from builder stage
COPY --from=builder /usr/local/libexec/bcftools/* /usr/local/libexec/bcftools/

# set locale settings for singularity compatibility
ENV LC_ALL=C

# set final working directory
WORKDIR /data

# default command is to pull up help options
CMD ["bcftools", "--help"]

### start of test stage ###
FROM app as test

# running --help and listing plugins
RUN bcftools --help && bcftools plugin -lv

# install wget (for downloading test files) and vcftools
RUN apt-get update && apt-get install -y wget vcftools

RUN echo "downloading test SC2 BAM and FASTA and running bcftools mpileup and bcftools call test commands..." && \
  wget -q https://raw.githubusercontent.com/artic-network/artic-ncov2019/master/primer_schemes/nCoV-2019/V4/SARS-CoV-2.reference.fasta && \
  wget -q https://raw.githubusercontent.com/StaPH-B/docker-builds/master/tests/SARS-CoV-2/SRR13957123.primertrim.sorted.bam && \
  bcftools mpileup -A -d 200 -B -Q 0 -f SARS-CoV-2.reference.fasta SRR13957123.primertrim.sorted.bam | \
  bcftools call -mv -Ov -o SRR13957123.vcf

RUN echo "testing plugins..." && \
  bcftools +counts SRR13957123.vcf

RUN echo "testing polysomy..." && \
  wget https://samtools.github.io/bcftools/howtos/cnv-calling/usage-example.tgz &&\
  tar -xvf usage-example.tgz &&\
  zcat test.fcr.gz | ./fcr-to-vcf -b bcftools -a map.tab.gz -o outdir/ &&\
  bcftools cnv -o cnv/ outdir/test.vcf.gz &&\
  bcftools polysomy -o psmy/ outdir/test.vcf.gz &&\
  head psmy/dist.dat

RUN echo "reading test data from Google Cloud to validate GCS support" && \
  bcftools head -h 20 gs://genomics-public-data/references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz

RUN echo "reading test data from S3 to validate AWS support" && \
  bcftools head -h 20 s3://human-pangenomics/T2T/CHM13/assemblies/variants/GATK_CHM13v2.0_Resource_Bundle/resources-broad-hg38-v0-1000G_phase1.snps.high_confidence.hg38.t2t-chm13-v2.0.vcf.gz

pettyalex commented 1 month ago

Thank you for the feedback!

About libcurl4-gnutls-dev vs libcurl3-gnutls: https://askubuntu.com/questions/469360/what-is-the-difference-between-libcurl3-and-libcurl4

Libcurl3 is ABI-compatible with libcurl4, so the name of the compiled library package has not been incremented. That means libcurl3-gnutls is the correct runtime library for libcurl4-gnutls-dev, and if you look at the libcurl4-gnutls-dev package, it indeed depends on libcurl3-gnutls.
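One way to check the relationship between the two packages is to inspect the dev package's declared dependencies. The sketch below is a minimal illustration: `depends_on` is a hypothetical helper, and the `dpkg-query` call in the comment assumes an Ubuntu or Debian system where libcurl4-gnutls-dev is already installed.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the package named in $1 appears in a
# Debian "Depends" field read from stdin. Version constraints such as
# "(= 1.2-3)" are ignored, and each alternative in "a | b" counts as listed.
depends_on() {
  tr ',|' '\n\n' | sed 's/^ *//; s/ .*$//' | grep -qxF -- "$1"
}

# Intended use (assumes the dev package is installed):
#   dpkg-query -W -f='${Depends}\n' libcurl4-gnutls-dev | depends_on libcurl3-gnutls \
#     && echo "libcurl3-gnutls is a declared dependency"
```

If the check succeeds, installing only libcurl3-gnutls in the app stage should be enough at runtime, which keeps the headers and other build-only files out of the final image.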

Kincekara commented 1 month ago

@pettyalex Thank you very much for the changes. This looks great!

I need one minor change, as you can see in the checklist. You will need to add <li>[1.20.c](./bcftools/1.20.c/)</li> to line 120 of the main README.md, as shown below. If you enable "Allow edits from maintainers", I can make any further cosmetic changes if necessary. I will then merge and deploy this image. Thanks!

Before:

| [bcftools](https://hub.docker.com/r/staphb/bcftools/) <br/> [![docker pulls](https://badgen.net/docker/pulls/staphb/bcftools)](https://hub.docker.com/r/staphb/bcftools) | <ul><li>[1.10.2](./bcftools/1.10.2/)</li><li>[1.11](./bcftools/1.11/)</li><li>[1.12](./bcftools/1.12/)</li><li>[1.13](./bcftools/1.13/)</li><li>[1.14](./bcftools/1.14/)</li><li>[1.15](./bcftools/1.15/)</li><li>[1.16](./bcftools/1.16/)</li><li>[1.17](./bcftools/1.17/)</li><li>[1.18](bcftools/1.18/)</li><li>[1.19](./bcftools/1.19/)</li><li>[1.20](./bcftools/1.20/)</li></ul> | https://github.com/samtools/bcftools |

After:

| [bcftools](https://hub.docker.com/r/staphb/bcftools/) <br/> [![docker pulls](https://badgen.net/docker/pulls/staphb/bcftools)](https://hub.docker.com/r/staphb/bcftools) | <ul><li>[1.10.2](./bcftools/1.10.2/)</li><li>[1.11](./bcftools/1.11/)</li><li>[1.12](./bcftools/1.12/)</li><li>[1.13](./bcftools/1.13/)</li><li>[1.14](./bcftools/1.14/)</li><li>[1.15](./bcftools/1.15/)</li><li>[1.16](./bcftools/1.16/)</li><li>[1.17](./bcftools/1.17/)</li><li>[1.18](bcftools/1.18/)</li><li>[1.19](./bcftools/1.19/)</li><li>[1.20](./bcftools/1.20/)</li><li>[1.20.c](./bcftools/1.20.c/)</li></ul> | https://github.com/samtools/bcftools | 
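The same edit can be scripted rather than made by hand. The sketch below is only an illustration: `add_bcftools_entry` is a hypothetical helper, and it assumes the README row contains the 1.20 list item exactly as written in the rows above.

```shell
#!/bin/sh
# Hypothetical helper: reads README content on stdin and appends the 1.20.c
# list item directly after the existing 1.20 item. Not idempotent: running it
# twice adds the entry twice.
add_bcftools_entry() {
  sed 's#<li>\[1\.20](\./bcftools/1\.20/)</li>#&<li>[1.20.c](./bcftools/1.20.c/)</li>#'
}

# Intended use from the repository root (review the diff before committing):
#   add_bcftools_entry < README.md > README.md.new && mv README.md.new README.md
```

Lines that do not contain the 1.20 item pass through unchanged, so the rest of the README is left alone.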

Kincekara commented 2 weeks ago

@pettyalex Thank you for your contribution! You can check the image deployment here: https://github.com/StaPH-B/docker-builds/actions/runs/10493960570. The image will be available on both Docker Hub and Quay.io.