ababaian / palmid

RNA dependent RNA Polymerase Palmprint Analysis
GNU Affero General Public License v3.0

Add `Multi-Stage Builds` to the main `Dockerfile` to reduce image size #25

Open ababaian opened 1 year ago

ababaian commented 1 year ago

Problem: The palmID container image is currently 5.42GB, which is highly bloated. The goal of this issue is to reduce the image size to under 2GB to improve deployment and run times of the microservice.

Output of `sudo docker image history serratusbio/palmid:latest`:

IMAGE          CREATED       CREATED BY                                      SIZE      COMMENT
8d48d6a678c5   3 days ago    CMD ["/home/palmid/palmid.sh"]                  0B        buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   6.8kB     buildkit.dockerfile.v0
<missing>      3 days ago    COPY data/* inst/extdata/* img/* data/ # bui…   1.21MB    buildkit.dockerfile.v0
<missing>      3 days ago    COPY palmid.Rmd scripts/* ./ # buildkit         836kB     buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   381MB     buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   726kB     buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   186MB     buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   36.2MB    buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   1.38GB    buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   637MB     buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   1.73MB    buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   4.27MB    buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   1.06MB    buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   13.6MB    buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   2.03GB    buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   129MB     buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   35.3MB    buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   231MB     buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   51.5MB    buildkit.dockerfile.v0
<missing>      3 days ago    RUN |3 PROJECT=palmid TYPE=base VERSION=0.0.…   159MB     buildkit.dockerfile.v0
<missing>      4 days ago    LABEL tags=palmscan, diamond, muscle, R, pal…   0B        buildkit.dockerfile.v0
<missing>      4 days ago    LABEL software.license=GPLv3                    0B        buildkit.dockerfile.v0
<missing>      4 days ago    LABEL container.description=palmid-base image   0B        buildkit.dockerfile.v0
<missing>      4 days ago    LABEL container.version=0.0.6                   0B        buildkit.dockerfile.v0
<missing>      4 days ago    LABEL container.type=base                       0B        buildkit.dockerfile.v0
<missing>      4 days ago    LABEL project.website=https://github.com/aba…   0B        buildkit.dockerfile.v0
<missing>      4 days ago    LABEL project.name=palmid                       0B        buildkit.dockerfile.v0
<missing>      4 days ago    LABEL container.base.image=amazonlinux:2        0B        buildkit.dockerfile.v0
<missing>      4 days ago    LABEL author=ababaian                           0B        buildkit.dockerfile.v0
<missing>      4 days ago    ENV R=4                                         0B        buildkit.dockerfile.v0
<missing>      4 days ago    ENV PALMDBVERSION=2021-03-14                    0B        buildkit.dockerfile.v0
<missing>      4 days ago    ENV PALMSCANVERSION=1.0                         0B        buildkit.dockerfile.v0
<missing>      4 days ago    ENV MUSCLEVERSION=3.8.31                        0B        buildkit.dockerfile.v0
<missing>      4 days ago    ENV DIAMONDVERSION=2.0.6-dev                    0B        buildkit.dockerfile.v0
<missing>      4 days ago    ENV SEQKITVERSION=2.0.0                         0B        buildkit.dockerfile.v0
<missing>      4 days ago    ENV PALMIDVERSION=0.0.6                         0B        buildkit.dockerfile.v0
<missing>      4 days ago    ARG VERSION=0.0.6                               0B        buildkit.dockerfile.v0
<missing>      4 days ago    ARG TYPE=base                                   0B        buildkit.dockerfile.v0
<missing>      4 days ago    ARG PROJECT=palmid                              0B        buildkit.dockerfile.v0
<missing>      4 days ago    WORKDIR /home/palmid                            0B        buildkit.dockerfile.v0
<missing>      4 days ago    ENV BASEDIR=/home/palmid                        0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B        
<missing>      4 weeks ago   /bin/sh -c #(nop) COPY dir:54e5777658be1a3cc…   143MB  

The largest layers are 2.03GB and 1.38GB, which correspond to the installation of R on line 182 and the installation of postgres on line 101, respectively. I'm not sure how much these installs can be reduced directly, but a stripped-down second stage without most of the build tooling (g++ etc.) will certainly help.
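To spot the heaviest layers without scanning the table by eye, the history output can be ranked by size. This is a minimal sketch, not part of the palmid repository: the helper name `layer_bytes` is hypothetical, and it assumes sizes use the decimal `kB`/`MB`/`GB` suffixes shown in the table above.

```shell
#!/bin/sh
# Rank image layers by size, largest first.
# Reads rows of the form "<size><TAB><created by>" on stdin and
# prints "<bytes><TAB><size><TAB><created by>".
layer_bytes() {
  awk -F'\t' '
    function to_bytes(s,   n, u) {
      n = s + 0                       # numeric prefix, e.g. 2.03 from "2.03GB"
      u = s; gsub(/[0-9.]/, "", u)    # unit suffix, e.g. "GB"
      if (u == "kB") return n * 1000
      if (u == "MB") return n * 1000000
      if (u == "GB") return n * 1000000000
      return n                        # plain bytes ("0B") or unknown unit
    }
    { printf "%.0f\t%s\t%s\n", to_bytes($1), $1, $2 }
  ' | sort -t "$(printf '\t')" -k1,1nr
}

# Usage (requires Docker; image name taken from this issue):
# sudo docker image history --format '{{.Size}}\t{{.CreatedBy}}' \
#   serratusbio/palmid:latest | layer_bytes | head -n 5
```

Adding `--no-trunc` to the `docker image history` call shows the full `RUN` command for each layer, which makes it easier to map a layer back to a Dockerfile line.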

The final build should keep the name serratusbio/palmid:latest, while the new build stage should be called something like serratusbio/palmid-builder:latest to keep downstream deployment simple.

avinashsingh77 commented 1 year ago

@ababaian I have picked this up. Please assign the issue to me

avinashsingh77 commented 1 year ago

gcc-c++ is required for configuring geos. Without gcc-c++, the geos configure step fails with this error: configure: error: *** A compiler with support for C++11 language features is required.

avinashsingh77 commented 1 year ago

Added cd .. && rm -rf geos-3.9.1* &&\ at line 112 and built locally. The image size has dropped to 4.75GB. Surprisingly, the postgres layer (where I added the rm -rf) has grown slightly (1.38GB to 1.39GB), while the R layer has shrunk (2.03GB to 1.28GB). Details:

IMAGE          CREATED              CREATED BY                                      SIZE      COMMENT
17ffc664fd91   About a minute ago   /bin/sh -c #(nop)  CMD ["/home/palmid/palmid…   0B        
c263b7bdd08f   About a minute ago   |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   6.8kB     
679f9d0f4bf9   About a minute ago   /bin/sh -c #(nop) COPY multi:105165ac8efc98e…   1.21MB    
a67f70726545   About a minute ago   /bin/sh -c #(nop) COPY multi:2ef4c23b6c9f1ac…   28.7kB    
ae6038592905   About a minute ago   |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   381MB     
483ce6bed3ef   37 minutes ago       |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   726kB     
95fb21d64429   37 minutes ago       |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   186MB     
d325fe5723ab   About an hour ago    |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   50.4MB    
55b9013e1002   About an hour ago    |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   1.39GB    
f3bfd8a74a43   About an hour ago    |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   637MB     
210552fd7e4d   About an hour ago    |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   1.73MB    
3e1d12dcef12   About an hour ago    |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   4.27MB    
d9724851068c   About an hour ago    |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   1.06MB    
25ff490e6d00   About an hour ago    |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   13.6MB    
45c07171926d   About an hour ago    |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   1.28GB    
011477b3f9bc   4 hours ago          |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   144MB     
a2a6ef013958   4 hours ago          |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   49.6MB    
335f633ca71f   4 hours ago          |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   245MB     
c0a30d139dc0   4 hours ago          |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   65.8MB    
618e4176a15a   4 hours ago          |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b…   160MB     
8caf760c62ee   4 hours ago          /bin/sh -c #(nop)  LABEL tags=palmscan, diam…   0B        
3433017d48f8   4 hours ago          /bin/sh -c #(nop)  LABEL software.license=GP…   0B        
9b31c19a052d   4 hours ago          /bin/sh -c #(nop)  LABEL container.descripti…   0B        
68f279e36859   4 hours ago          /bin/sh -c #(nop)  LABEL container.version=0…   0B        
f4420b8598c2   4 hours ago          /bin/sh -c #(nop)  LABEL container.type=base    0B        
7557b322cd1c   4 hours ago          /bin/sh -c #(nop)  LABEL project.website=htt…   0B        
8bb23c0b8272   4 hours ago          /bin/sh -c #(nop)  LABEL project.name=palmid    0B        
9ad1ed1c0d14   4 hours ago          /bin/sh -c #(nop)  LABEL container.base.imag…   0B        
e854a5ea55db   4 hours ago          /bin/sh -c #(nop)  LABEL author=ababaian        0B        
7b6605b674b8   4 hours ago          /bin/sh -c #(nop)  ENV R=4                      0B        
09662c5cdced   4 hours ago          /bin/sh -c #(nop)  ENV PALMDBVERSION=2021-03…   0B        
79b2cf56ebe7   4 hours ago          /bin/sh -c #(nop)  ENV PALMSCANVERSION=1.0      0B        
38e0083f268e   4 hours ago          /bin/sh -c #(nop)  ENV MUSCLEVERSION=3.8.31     0B        
1eed014857c1   4 hours ago          /bin/sh -c #(nop)  ENV DIAMONDVERSION=2.0.6-…   0B        
23c01a538152   4 hours ago          /bin/sh -c #(nop)  ENV SEQKITVERSION=2.0.0      0B        
48a0378551ea   4 hours ago          /bin/sh -c #(nop)  ENV PALMIDVERSION=0.0.6      0B        
eab4cdd116cb   4 hours ago          /bin/sh -c #(nop)  ARG VERSION=0.0.6            0B        
bf4e11e03b8f   4 hours ago          /bin/sh -c #(nop)  ARG TYPE=base                0B        
15de5e32ab2f   4 hours ago          /bin/sh -c #(nop)  ARG PROJECT=palmid           0B        
814d88197feb   4 hours ago          /bin/sh -c #(nop) WORKDIR /home/palmid          0B        
f446891a0a77   4 hours ago          /bin/sh -c #(nop)  ENV BASEDIR=/home/palmid     0B        
d027b21cae33   2 days ago           /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B        
<missing>      2 days ago           /bin/sh -c #(nop) COPY dir:1ca4a277361366916…   144MB  

Note: I built the image using the legacy Docker builder.
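To see exactly which layers changed between two builds, the size columns of both history listings can be compared row by row. This is a rough sketch under the assumption that both images have the same number of layers in the same order; the helper name `changed_layers` and the image names in the usage comment are placeholders.

```shell
#!/bin/sh
# Print layers whose size differs between two builds.
# Each input file holds one layer size per line, newest layer first.
changed_layers() {
  paste "$1" "$2" |
    awk -F'\t' '$1 != $2 { printf "layer %d: %s -> %s\n", NR, $1, $2 }'
}

# Usage (requires Docker; old-image and new-image are placeholders):
# sudo docker image history --format '{{.Size}}' old-image > before.txt
# sudo docker image history --format '{{.Size}}' new-image > after.txt
# changed_layers before.txt after.txt
```

Layer numbers count from the top of the history listing, so layer 1 is the final `CMD`.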

avinashsingh77 commented 1 year ago

Working on how to break builds into stages for a multi-stage build.

ababaian commented 1 year ago

It should look like the multi-stage build in Serratus. See: https://github.com/ababaian/serratus/blob/1f0ed52cf4f947a45943ea281a3d81837be9aa0b/containers/serratus-align/Dockerfile#L59

So the start of the file should have

FROM amazonlinux:2023 AS palmid_builder
...

and later

FROM amazonlinux:2023 AS palmid_base

...

# diamond
COPY --from=palmid_builder /usr/local/bin/diamond /usr/local/bin/
...

and so on. Essentially, build all the software in the first stage, then create a second stage and import the binaries and libraries needed to make palmID run without errors, leaving behind all the dependency/builder installation files.

The goal here is to essentially use the current image as the builder and copy over the minimal set of binaries/libraries which are required to make the package run.

Please create a fork/branch and make frequent commits so that changes can be tracked :+1:

avinashsingh77 commented 1 year ago

Refactored the Dockerfile as suggested for a multi-stage build. I am getting this error and trying to fix it:

v  checking for file '/tmp/RtmpmikFmp/remotes6421f4212fa/ababaian-palmid-6f08186/DESCRIPTION' ...
-  preparing 'palmid':
v  checking DESCRIPTION meta-information ...
-  checking for LF line-endings in source and make files and shell scripts
-  checking for empty or unneeded directories
     NB: this package now depends on R (>= 3.5.0)
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in
     serialize/load version 3 cannot be read in older versions of R.
     File(s) containing such objects:
       'palmid/data/palmdb.RData' 'palmid/data/waxsys.msa.RData'
       'palmid/data/waxsys.palm.sra.RData'
       'palmid/data/waxsys.palmprint.RData'
       'palmid/data/waxsys.pro.df.RData' 'palmid/data/waxsys.stat.sra.RData'
       'palmid/data/waxsys.tree.df.RData'
       'palmid/data/waxsys.tree.phy.RData'
-  building 'palmid_0.0.6.tar.gz'

Installing package into '/usr/lib64/R/library'
(as 'lib' is unspecified)
ERROR: dependencies 'ggmsa', 'leaflet', 'RPostgreSQL' are not available for package 'palmid'
* removing '/usr/lib64/R/library/palmid'
Warning messages:
1: In i.p(...) : installation of package 'proj4' had non-zero exit status
2: In i.p(...) : installation of package 'ggalt' had non-zero exit status
3: In i.p(...) :
  installation of package '/tmp/RtmpmikFmp/file6425341288f/ggmsa_1.6.0.tar.gz' had non-zero exit status
4: packages 'treeio', 'Biostrings' are not available for this version of R

Versions of these packages for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages 
5: In i.p(...) : installation of package 'terra' had non-zero exit status
6: In i.p(...) : installation of package 'raster' had non-zero exit status
7: In i.p(...) :
  installation of package 'RPostgreSQL' had non-zero exit status
8: In i.p(...) :
  installation of package 'leaflet' had non-zero exit status
9: In i.p(...) :
  installation of package '/tmp/RtmpmikFmp/file6424ed51355/palmid_0.0.6.tar.gz' had non-zero exit status

I am committing WIP changes here - https://github.com/avinashsingh77/palmid/blob/avsingh_multistagebuilds/Dockerfile

ababaian commented 1 year ago

FROM amazonlinux:2023 AS palmid_base

Should be far lower in the file in the final product. It is essentially the very last step: use the builder stage to run the entire existing Dockerfile, then create the new base image with a series of COPY commands to make a "clean" image containing only the final binaries it requires.

You don't want to run any of these "build" steps in palmid_base; you want to copy the directories of the already-installed software into the base image.

avinashsingh77 commented 1 year ago

@ababaian I have pushed the latest Dockerfile here - https://github.com/avinashsingh77/palmid/blob/avsingh_multistagebuilds/Dockerfile

Built two image tags:

docker build -t serratubio/palmid-builder:latest --target palmid_builder .
docker build -t serratusbio/palmid:latest --target palmid_base .

Image sizes are as follows:

REPOSITORY                                                                            TAG       IMAGE ID       CREATED             SIZE
serratusbio/palmid                                                                    latest    423e9580701d   2 hours ago         4.59GB
serratubio/palmid-builder                                                             latest    662bf0cb3e12   2 hours ago         5.53GB

Can you verify the contents of the Dockerfile and test the images locally? Once that is done, I will update the help instructions in palmid.sh to match the new image names.

Meanwhile, I will try to cherry-pick only the exact binary files needed by palmid.sh into the final image.
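One way to work out which shared libraries a cherry-picked binary needs is to run `ldd` on it inside the builder image and collect the resolved paths. The sketch below assumes glibc-style `ldd` output; the helper name `resolved_libs` is hypothetical. It deliberately skips the vDSO and the dynamic loader line, which have no `=>` entry.

```shell
#!/bin/sh
# Extract resolved library paths from `ldd` output read on stdin.
# Matching lines look like: "libz.so.1 => /lib64/libz.so.1 (0x00007f...)"
# Lines with "not found" or without "=>" are ignored.
resolved_libs() {
  awk '$2 == "=>" && $3 ~ /^\// { print $3 }' | sort -u
}

# Usage inside the builder container (binary path taken from this thread):
# ldd /usr/local/bin/diamond | resolved_libs
```

Each path printed is a candidate for a `COPY --from=palmid_builder` line, unless the final base image already ships that library.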

ababaian commented 1 year ago

Can you please test the images with the following command:

# Run palmid analysis suite
# uses the "scripts/palmid.sh" script as entrypoint
#
# palmid -i <input_fasta> -o <output_path>
# -v | -w flags are to mount the work dir into the container
#
sudo docker run  -v `pwd`:`pwd` -w `pwd`  \
  --entrypoint "/bin/bash" serratusbio/palmid:latest \
  /home/palmid/palmid.sh -i data/waxsys.fa -d test -o waxsys

Make sure to build the final image with --no-cache (takes a minute) before testing. This ensures that no hidden cache layer is being used and that the image reflects your Dockerfile exactly.
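The rebuild and the test run can be chained so the test only runs against a freshly built image. This is a sketch, not a script from the repository: the function name `rebuild_and_test` and the `DOCKER` override are hypothetical, and the stage name `palmid_base` is assumed from the discussion above.

```shell
#!/bin/sh
# Assumption: set DOCKER=echo for a dry run that only prints the commands.
DOCKER="${DOCKER:-sudo docker}"

rebuild_and_test() {
  # Rebuild the final stage from scratch, ignoring any cached layers.
  # The stage name palmid_base is an assumption based on this thread.
  $DOCKER build --no-cache -t serratusbio/palmid:latest \
    --target palmid_base . || return 1

  # Run the waxsys test case from this thread against the fresh image.
  $DOCKER run -v "$(pwd)":"$(pwd)" -w "$(pwd)" \
    --entrypoint "/bin/bash" serratusbio/palmid:latest \
    /home/palmid/palmid.sh -i data/waxsys.fa -d test -o waxsys
}

# Usage, from the repository root:
# rebuild_and_test
```

Because the build and run are separate commands, a failed build stops the function before the test can run against a stale image.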