Open ababaian opened 1 year ago
@ababaian I have picked this up. Please assign the issue to me
gcc-c++ is required for configuring geos. Without gcc-c++, configuring geos fails with this error:
configure: error: *** A compiler with support for C++11 language features is required.
Added cd .. && rm -rf geos-3.9.1* &&\ at line 112 and built locally. The image size has dropped to 4.75GB. Surprisingly, the layer size for postgres (where I added the rm -rf) has increased, while the layer size for R has decreased. Details:
IMAGE CREATED CREATED BY SIZE COMMENT
17ffc664fd91 About a minute ago /bin/sh -c #(nop) CMD ["/home/palmid/palmid… 0B
c263b7bdd08f About a minute ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 6.8kB
679f9d0f4bf9 About a minute ago /bin/sh -c #(nop) COPY multi:105165ac8efc98e… 1.21MB
a67f70726545 About a minute ago /bin/sh -c #(nop) COPY multi:2ef4c23b6c9f1ac… 28.7kB
ae6038592905 About a minute ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 381MB
483ce6bed3ef 37 minutes ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 726kB
95fb21d64429 37 minutes ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 186MB
d325fe5723ab About an hour ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 50.4MB
55b9013e1002 About an hour ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 1.39GB
f3bfd8a74a43 About an hour ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 637MB
210552fd7e4d About an hour ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 1.73MB
3e1d12dcef12 About an hour ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 4.27MB
d9724851068c About an hour ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 1.06MB
25ff490e6d00 About an hour ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 13.6MB
45c07171926d About an hour ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 1.28GB
011477b3f9bc 4 hours ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 144MB
a2a6ef013958 4 hours ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 49.6MB
335f633ca71f 4 hours ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 245MB
c0a30d139dc0 4 hours ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 65.8MB
618e4176a15a 4 hours ago |3 PROJECT=palmid TYPE=base VERSION=0.0.6 /b… 160MB
8caf760c62ee 4 hours ago /bin/sh -c #(nop) LABEL tags=palmscan, diam… 0B
3433017d48f8 4 hours ago /bin/sh -c #(nop) LABEL software.license=GP… 0B
9b31c19a052d 4 hours ago /bin/sh -c #(nop) LABEL container.descripti… 0B
68f279e36859 4 hours ago /bin/sh -c #(nop) LABEL container.version=0… 0B
f4420b8598c2 4 hours ago /bin/sh -c #(nop) LABEL container.type=base 0B
7557b322cd1c 4 hours ago /bin/sh -c #(nop) LABEL project.website=htt… 0B
8bb23c0b8272 4 hours ago /bin/sh -c #(nop) LABEL project.name=palmid 0B
9ad1ed1c0d14 4 hours ago /bin/sh -c #(nop) LABEL container.base.imag… 0B
e854a5ea55db 4 hours ago /bin/sh -c #(nop) LABEL author=ababaian 0B
7b6605b674b8 4 hours ago /bin/sh -c #(nop) ENV R=4 0B
09662c5cdced 4 hours ago /bin/sh -c #(nop) ENV PALMDBVERSION=2021-03… 0B
79b2cf56ebe7 4 hours ago /bin/sh -c #(nop) ENV PALMSCANVERSION=1.0 0B
38e0083f268e 4 hours ago /bin/sh -c #(nop) ENV MUSCLEVERSION=3.8.31 0B
1eed014857c1 4 hours ago /bin/sh -c #(nop) ENV DIAMONDVERSION=2.0.6-… 0B
23c01a538152 4 hours ago /bin/sh -c #(nop) ENV SEQKITVERSION=2.0.0 0B
48a0378551ea 4 hours ago /bin/sh -c #(nop) ENV PALMIDVERSION=0.0.6 0B
eab4cdd116cb 4 hours ago /bin/sh -c #(nop) ARG VERSION=0.0.6 0B
bf4e11e03b8f 4 hours ago /bin/sh -c #(nop) ARG TYPE=base 0B
15de5e32ab2f 4 hours ago /bin/sh -c #(nop) ARG PROJECT=palmid 0B
814d88197feb 4 hours ago /bin/sh -c #(nop) WORKDIR /home/palmid 0B
f446891a0a77 4 hours ago /bin/sh -c #(nop) ENV BASEDIR=/home/palmid 0B
d027b21cae33 2 days ago /bin/sh -c #(nop) CMD ["/bin/bash"] 0B
<missing> 2 days ago /bin/sh -c #(nop) COPY dir:1ca4a277361366916… 144MB
Note: I built the image using the legacy Docker builder.
I am now working on how to split the build into stages for a multi-stage build.
It should look like the Multi-stage build in Serratus
See: https://github.com/ababaian/serratus/blob/1f0ed52cf4f947a45943ea281a3d81837be9aa0b/containers/serratus-align/Dockerfile#L59
So the start of the file should have
FROM amazonlinux:2023 AS palmid_builder
...
and later
FROM amazonlinux:2023 AS palmid_base
...
# diamond
COPY --from=palmid_builder /usr/local/bin/diamond /usr/local/bin/
...
and such. Essentially, build all the software in the first stage, then create a second stage and import all the binaries and libraries needed for palmID to run without errors, leaving behind the dependency/builder installation files.
The goal here is to essentially use the current image as the builder and copy over the minimal set of binaries/libraries which are required to make the package run.
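One way to work out which libraries a copied binary still needs is to inspect it with ldd inside the builder stage. The sketch below is a hypothetical helper (the name resolve_libs is not from the palmid repository) that filters ldd output down to the resolved library paths, which are candidates for COPY --from=palmid_builder lines. It assumes ldd is available in the builder image.

```shell
#!/bin/sh
# resolve_libs: hypothetical helper, not part of the palmid repository.
# Reads `ldd` output on stdin and prints the absolute paths of the
# shared libraries it resolved, one per line, de-duplicated.
resolve_libs() {
    awk '
        # "libz.so.1 => /lib64/libz.so.1 (0x...)": keep the resolved path
        $2 == "=>" && $3 ~ /^\// { print $3; next }
        # "/lib64/ld-linux-x86-64.so.2 (0x...)": loader listed by path
        $1 ~ /^\//               { print $1 }
    ' | sort -u
}
```

For example, running ldd /usr/local/bin/diamond | resolve_libs in the builder would list the library files that binary links against. Unresolved entries ("not found") and virtual entries such as linux-vdso are skipped, and R packages or scripts still need to be checked separately.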
Please create a fork/branch and make frequent commits so that changes can be tracked :+1:
Refactored the Dockerfile as suggested for a multi-stage build. I am getting the error below and am working on a fix:
v checking for file '/tmp/RtmpmikFmp/remotes6421f4212fa/ababaian-palmid-6f08186/DESCRIPTION' ...
- preparing 'palmid':
v checking DESCRIPTION meta-information ...
- checking for LF line-endings in source and make files and shell scripts
- checking for empty or unneeded directories
NB: this package now depends on R (>= 3.5.0)
WARNING: Added dependency on R >= 3.5.0 because serialized objects in
serialize/load version 3 cannot be read in older versions of R.
File(s) containing such objects:
'palmid/data/palmdb.RData' 'palmid/data/waxsys.msa.RData'
'palmid/data/waxsys.palm.sra.RData'
'palmid/data/waxsys.palmprint.RData'
'palmid/data/waxsys.pro.df.RData' 'palmid/data/waxsys.stat.sra.RData'
'palmid/data/waxsys.tree.df.RData'
'palmid/data/waxsys.tree.phy.RData'
- building 'palmid_0.0.6.tar.gz'
Installing package into '/usr/lib64/R/library'
(as 'lib' is unspecified)
ERROR: dependencies 'ggmsa', 'leaflet', 'RPostgreSQL' are not available for package 'palmid'
* removing '/usr/lib64/R/library/palmid'
Warning messages:
1: In i.p(...) : installation of package 'proj4' had non-zero exit status
2: In i.p(...) : installation of package 'ggalt' had non-zero exit status
3: In i.p(...) :
installation of package '/tmp/RtmpmikFmp/file6425341288f/ggmsa_1.6.0.tar.gz' had non-zero exit status
4: packages 'treeio', 'Biostrings' are not available for this version of R
Versions of these packages for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
5: In i.p(...) : installation of package 'terra' had non-zero exit status
6: In i.p(...) : installation of package 'raster' had non-zero exit status
7: In i.p(...) :
installation of package 'RPostgreSQL' had non-zero exit status
8: In i.p(...) :
installation of package 'leaflet' had non-zero exit status
9: In i.p(...) :
installation of package '/tmp/RtmpmikFmp/file6424ed51355/palmid_0.0.6.tar.gz' had non-zero exit status
I am committing WIP changes here - https://github.com/avinashsingh77/palmid/blob/avsingh_multistagebuilds/Dockerfile
FROM amazonlinux:2023 AS palmid_base
should be far lower in the file in the final product. It is essentially the very last step: use the builder stage to run everything in the current Dockerfile, then create the new base image with a series of COPY commands to make a "clean" image that contains only the final binaries it requires.
You don't want to run any of these "build" steps in palmid_base; you want to copy the directories of the already-installed software into the base image.
@ababaian I have pushed the latest dockerfile here - https://github.com/avinashsingh77/palmid/blob/avsingh_multistagebuilds/Dockerfile
Built two image tags :
docker build -t serratubio/palmid-builder:latest --target palmid_builder .
docker build -t serratusbio/palmid:latest --target palmid_base .
Image sizes are as follows:
REPOSITORY TAG IMAGE ID CREATED SIZE
serratusbio/palmid latest 423e9580701d 2 hours ago 4.59GB
serratubio/palmid-builder latest 662bf0cb3e12 2 hours ago 5.53GB
Can you verify the contents of the Dockerfile and test the images locally? Once that is done, I will update the help instructions in palmid.sh to match the new image names.
Meanwhile, I will try to cherry-pick only the exact binary files needed by palmid.sh into the final image.
Can you please test the images with the following command:
# Run palmid analysis suite
# uses the "scripts/palmid.sh" script as entrypoint
#
# palmid -i <input_fasta> -o <output_path>
# -v | -w flags are to mount the work dir into the container
#
sudo docker run -v `pwd`:`pwd` -w `pwd` \
--entrypoint "/bin/bash" serratusbio/palmid:latest \
/home/palmid/palmid.sh -i data/waxsys.fa -d test -o waxsys
Make sure to build the final image with --no-cache (takes a minute) prior to testing. This ensures that no hidden cached layer is being reused and that the image reflects your Dockerfile exactly.
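A clean rebuild of one stage could be wrapped roughly as follows. This is a minimal sketch: build_stage and the DOCKER override are hypothetical conveniences, while the --no-cache, --target and -t flags are the standard docker build options already used in this thread.

```shell
#!/bin/sh
# Hypothetical helper, not part of the palmid repository.
# DOCKER can be overridden (e.g. DOCKER=echo) to preview the command
# instead of running it.
DOCKER="${DOCKER:-docker}"

# build_stage <target> <tag> [context]
# Rebuilds one stage of a multi-stage Dockerfile without the layer cache.
build_stage() {
    target="$1"
    tag="$2"
    context="${3:-.}"
    # --no-cache forces every layer to be rebuilt from the Dockerfile as written
    "$DOCKER" build --no-cache --target "$target" -t "$tag" "$context"
}
```

Calling build_stage palmid_base serratusbio/palmid:latest from the repository root and then running the test command above would exercise the freshly built image.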
Problem: Currently the palmID container image is 5.42GB, which is highly bloated. The goal of this issue is to reduce the image size to a target of under 2GB to improve deployment and run times of the microservice.
sudo docker image history serratusbio/palmid:latest
The largest stages are 2.03GB and 1.38GB, which correspond to the installation of R on line 182 and the installation of postgres on line 101, respectively. Not sure how much these installs can be reduced directly, but a stripped-down second stage without much of the installation software (like g++ etc.) will certainly help.
The final build should have the same name, serratusbio/palmid:latest, while the new build stage should be called something like serratubio/palmid-builder:latest, to keep downstream deployment simple.
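To track which layers dominate the image while slimming it down, the history output can be ranked by size. The sketch below assumes the sizes are supplied in raw bytes as the first field of each line; largest_layers is a hypothetical helper name.

```shell
#!/bin/sh
# largest_layers [n]: hypothetical helper, not part of the palmid repository.
# Reads lines of "<size-in-bytes> <created-by>" on stdin and prints the
# n largest layers (default 5), biggest first.
largest_layers() {
    sort -k1,1nr | head -n "${1:-5}"
}
```

A possible invocation, assuming the installed Docker CLI supports --human=false and --format on the history subcommand:

```shell
sudo docker image history --human=false \
    --format '{{.Size}}\t{{.CreatedBy}}' serratusbio/palmid:latest | largest_layers 5
```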