ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

[🐕 Batch]: Decrease the size of Docker Images #900

Closed: miquelduranfrigola closed this issue 3 months ago

miquelduranfrigola commented 10 months ago

Is your feature request related to a problem? Please describe.

Context

Ersilia's Docker images are stored in DockerHub. For example, Ersilia's image for model eos4wt0 can be found here. All images are public in the ersiliaos DockerHub profile.

Problem

Ersilia's Docker images are very large, most often above 2 GB. This creates issues in low-resource settings. Most likely, this volume is not due to the model parameters themselves, but to the large number of (unnecessary) dependencies and tools.

Describe the solution you'd like.

We would like to have a GitHub Actions workflow that cleans Docker images a posteriori. That is, given a model identifier, this action would fetch the corresponding image, remove unnecessary dependencies, and push it back to DockerHub. I am unsure whether this is feasible, or whether it is the best approach at all. Any advice would be much appreciated.

More specifically, the steps towards a solution would include:

  1. Development of a clean_ersilia_docker_image.sh script. This script would be able to remove unnecessary dependencies in a given Ersilia model. The script could be stored here, for example.
  2. Creation of a GitHub Actions workflow named clean_docker_images.yml. This workflow would (a) pull docker images, (b) run the clean_ersilia_docker_image.sh and (c) push back the image to DockerHub. This workflow file could be stored here.
  3. A testing module to ensure that the slim docker image still works would be helpful.
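The pull/clean/push loop in steps 1 and 2 could look roughly like the dry-run sketch below. It is not an existing Ersilia tool: `build_clean_commands`, the `latest`/`slim` tags, the temporary container name and the `/clean.sh` mount point are all hypothetical, and the cleanup script itself is assumed to exist. One caveat shapes the design: deleting files inside a container and committing the result does not make an image smaller, because the original layers are kept, so the sketch flattens the cleaned container with `docker export | docker import`. That discards image metadata such as `CMD` and `ENTRYPOINT`, which would have to be restored (for example via `docker import --change`) before the slim image could serve the model.

```python
import os
import shlex


def build_clean_commands(model_id, script_path, org="ersiliaos",
                         tag="latest", slim_tag="slim"):
    """Return the shell commands for one pull/clean/push cycle (dry run).

    Hypothetical helper: tag names, the container name and the /clean.sh
    mount point are illustrative assumptions, not Ersilia conventions.
    """
    image = f"{org}/{model_id}:{tag}"
    slim_image = f"{org}/{model_id}:{slim_tag}"
    container = f"{model_id}-clean"
    script = os.path.abspath(script_path)  # bind mounts need absolute paths
    q = shlex.quote
    return [
        # (a) pull the published image
        f"docker pull {q(image)}",
        # (b) run the cleanup script in a throwaway container,
        # bypassing the image's own entrypoint
        f"docker run --name {q(container)} --entrypoint bash "
        f"-v {q(script + ':/clean.sh:ro')} {q(image)} /clean.sh",
        # flatten: re-importing the exported filesystem drops the old layers
        # (and also CMD/ENTRYPOINT, which this sketch does not restore)
        f"docker export {q(container)} | docker import - {q(slim_image)}",
        f"docker rm {q(container)}",
        # (c) push the slim image under a separate tag
        f"docker push {q(slim_image)}",
    ]


if __name__ == "__main__":
    # Print the commands instead of running them (dry run)
    for command in build_clean_commands("eos4wt0", "clean_ersilia_docker_image.sh"):
        print(command)
```

Pushing under a separate tag leaves the original image untouched, so the testing module from step 3 can compare both before anything is overwritten.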

Describe alternatives you've considered

Alternatives I've considered include rebuilding Docker images with a slim base image (e.g. Alpine). While this sounds interesting, it would be cumbersome to apply retrospectively to existing images.

Additional context.

A good model to start with is the one mentioned above (eos4wt0). This model should be really lightweight.

miquelduranfrigola commented 10 months ago

We will work on this with @mjwarren3 who helped us previously with dockerization of Ersilia models.

mjwarren3 commented 9 months ago

Hey @miquelduranfrigola I started exploring this today, and wanted to share initial findings / thoughts (primarily from here). Let me know if you have any heads-ups or perspectives on next steps.

First Finding: The base Ersilia image/repository is responsible for more than 50% of the image size (510 MB), from this build step: `git clone https://github.com/ersilia-os/ersilia.git && cd /root/ersilia && pip install -e . # buildkit`

Next Step: Explore whether we could trim down the Ersilia base repository to only what is needed to fetch and serve a model

Second Finding: Fetching the model from GitHub is responsible for ~415 MB of space, from this build step: `ersilia -v fetch $MODEL --from_github # buildkit`

Next Step: Explore the fetch function within Ersilia to see if it is bringing in any extra packages / content that is not needed to serve the model

My initial feeling is that the full Ersilia repository and fetch functionality might be needed for installation, but that we could trim them down once the model is ready to be served.
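A similar per-step breakdown can be obtained locally with `docker history`, where `--human=false` prints layer sizes in bytes and `--no-trunc` keeps the full build command. The helper below is a hypothetical sketch (the name `largest_layers` is illustrative) that ranks build steps from that output, assuming it was produced with the format string shown in the docstring.

```python
def largest_layers(history_output, top=3):
    r"""Rank build steps by size, largest first, as (MB, command) pairs.

    Expects the output of (IMAGE is a placeholder):
        docker history --human=false --no-trunc --format '{{.Size}}\t{{.CreatedBy}}' IMAGE
    i.e. one "<size in bytes><TAB><build command>" line per layer.
    """
    layers = []
    for line in history_output.strip().splitlines():
        # Split only on the first tab; the command itself may contain spaces
        size, _, command = line.partition("\t")
        layers.append((int(size), command))
    layers.sort(key=lambda layer: layer[0], reverse=True)
    # Report decimal megabytes, matching how Docker prints sizes
    return [(size / 1e6, command) for size, command in layers[:top]]
```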

miquelduranfrigola commented 9 months ago

Hello @mjwarren3 this is very useful, thanks!

This would be a great start indeed. I think that, within Ersilia, BentoML (a dependency) might be responsible for a significant part of the space.

One question: what would be the best way to explore this, in practice? Run the docker container in exec mode?

miquelduranfrigola commented 9 months ago

And one more question, @mjwarren3 - how relevant is the base OS image? For example, using alpine vs ubuntu?

GemmaTuron commented 7 months ago

Hi @miquelduranfrigola @mjwarren3

What is the status of this?

mjwarren3 commented 7 months ago

Hi @GemmaTuron no major updates since the notes above ^

Saw that this topic was being discussed in another issue though: https://github.com/ersilia-os/eos7d58/issues/2

Would be great to have someone else take a look if Dhanshree or Hellen wanted to.

mjwarren3 commented 7 months ago

> And one more question, @mjwarren3 - how relevant is the base OS image? For example, using alpine vs ubuntu?

The latest Alpine image is ~7 MB and the latest Ubuntu image is ~78 MB, so there is some room for improvement there as well. Since we're trying to cut the image size in half or more, though, I don't think the base image will be the major driver.
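A quick back-of-the-envelope check of this argument, using the base image sizes quoted above and the issue's own lower bound of 2 GB per model image:

```python
ALPINE_MB = 7     # approximate size of the latest Alpine image (quoted above)
UBUNTU_MB = 78    # approximate size of the latest Ubuntu image (quoted above)
IMAGE_MB = 2000   # lower bound: images are "most often above 2GB"

saving_mb = UBUNTU_MB - ALPINE_MB
share = 100 * saving_mb / IMAGE_MB  # percentage of the whole image
print(f"Switching base image saves ~{saving_mb} MB, about {share:.2f}% of a 2 GB image")
# Switching base image saves ~71 MB, about 3.55% of a 2 GB image
```

A few percent at most, far from the 50% target.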

mjwarren3 commented 7 months ago

I also just tried running the Docker image through Dive and have some initial findings from that.

Overall Metrics

Largest Build Steps

Largest Files / Packages in the Final Image - in Directory Format

/usr is 2.1 GB
  /bin/conda is 1.4 GB
    /envs is 646 MB
      /eos4wt0 is 454 MB
        /python3.10/site-packages is 278 MB
          rdkit is 66 MB, rdkit.libs is 49 MB, numpy is 34 MB, numpy.libs is 38 MB, botocore is 6 MB
      /eosbase-bentoml is 192 MB
        /python3.10/site-packages is 135 MB
          numpy.libs is 38 MB, numpy is 34 MB, botocore is 16 MB **(OVERLAP - packages installed twice)**
    /pkgs is 378 MB
      /cache is 101 MB (can we delete this?)
      /python3.11 is 85 MB
      /python3.10.13 is 54 MB **(Two installations of Python? Can we delete 3.11?)**
    /lib is 289 MB
      /python3.11 is 95 MB --> site-packages is 45 MB --> pip is 13 MB, setuptools, conda, etc. also in there
      /libicudata is 52 MB
      /libpython3.11 is 25 MB
      (a bunch of other libs that are <10 MB)
  /local is 376 MB
    /lib/python3.7 is 372 MB
      /site-packages is 347 MB
        botocore is 80 MB, rdkit is 63 MB, numpy is 62 MB, pip is 10 MB, SQLAlchemy is 18 MB
  /lib is 253 MB
    /x86_64-linux-gnu is 127 MB
      /libicudata.so.63.1 is 27 MB
      /perl is 23 MB
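The overlap flagged in this listing can be quantified with a short script over the package sizes Dive reported. Two caveats apply to this sketch: Dive listed only the largest packages per environment, so smaller duplicates are missed; and "keep only the largest copy" assumes a single installation could serve every environment, which does not hold across Python 3.7 and 3.10 without first consolidating the interpreters. The helper names are illustrative.

```python
from collections import defaultdict

# Package sizes (MB) copied from the Dive listing above (largest packages only)
SITE_PACKAGES = {
    "conda env eos4wt0 (py3.10)": {"rdkit": 66, "rdkit.libs": 49, "numpy": 34,
                                   "numpy.libs": 38, "botocore": 6},
    "conda env eosbase-bentoml (py3.10)": {"numpy.libs": 38, "numpy": 34, "botocore": 16},
    "/usr/local (py3.7)": {"botocore": 80, "rdkit": 63, "numpy": 62,
                           "pip": 10, "SQLAlchemy": 18},
}


def duplicated_packages(envs):
    """Map each package installed in more than one environment to its sizes."""
    copies = defaultdict(list)
    for packages in envs.values():
        for name, size in packages.items():
            copies[name].append(size)
    return {name: sizes for name, sizes in copies.items() if len(sizes) > 1}


def redundant_mb(duplicates):
    """Space freed if only the largest copy of each duplicated package were kept."""
    return sum(sum(sizes) - max(sizes) for sizes in duplicates.values())


dups = duplicated_packages(SITE_PACKAGES)
print(sorted(dups))        # ['botocore', 'numpy', 'numpy.libs', 'rdkit']
print(redundant_mb(dups), "MB")  # 191 MB
```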

SUMMARY / PRIMARY FINDINGS

mjwarren3 commented 7 months ago

Based on the findings above, any other thoughts on how we could reduce the overlap? Ideally we end up with just one installation of Python, and we only install Python packages once to reduce overlap.

mjwarren3 commented 7 months ago

@miquelduranfrigola Would be great to get your thoughts on the above when you get a chance. There seems to be significant overlap of Python packages and installations, and removing it could substantially reduce the size of the Docker image.

miquelduranfrigola commented 7 months ago

Hello @mjwarren3 , this is tremendously useful. Thanks so much. I am looping @DhanshreeA in since she has now joined Ersilia and she is much better than I am at these things.

I completely agree with your suggested approach. Since we are using the images to eventually serve the model, there is a lot of stuff we can remove safely after fetch is done. So, basically, the docker build should:

  1. Install ersilia
  2. Fetch model
  3. Delete unnecessary stuff
  4. Serve model

Point 3 is currently missing.

We can definitely:

(As an aside, I think we want to eventually remove the bentoml dependency. This will certainly help, but it will require more work and it will probably apply to future models, not legacy ones.)
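Step 3 could start from something like the hypothetical helper below (`remove_unneeded` is illustrative, not part of Ersilia), alongside existing commands such as `conda clean --all --yes` and `pip cache purge`. Which paths are actually safe to delete has to be verified per model. Note that in a Dockerfile, files removed in a later `RUN` step still take up space in the earlier layers, so a cleanup like this only shrinks the image if it runs in the same `RUN` instruction that created the files, or if the cleaned environment is then copied into a fresh stage of a multi-stage build.

```python
import shutil
from pathlib import Path


def remove_unneeded(root, relative_paths=(), prune_dir_names=("__pycache__",)):
    """Delete the given paths under `root`, plus every directory whose name is
    in `prune_dir_names`, and return the number of bytes freed.

    Hypothetical helper: the caller must confirm each path is not needed at serve time.
    """
    root = Path(root)
    targets = [root / rel for rel in relative_paths]
    # Collect matching directories up front, before anything is deleted
    targets += [p for name in prune_dir_names for p in root.rglob(name) if p.is_dir()]
    freed = 0
    for target in targets:
        if not target.exists():  # may already be gone with a parent target
            continue
        if target.is_dir():
            freed += sum(f.stat().st_size for f in target.rglob("*") if f.is_file())
            shutil.rmtree(target)
        else:
            freed += target.stat().st_size
            target.unlink()
    return freed


# Hypothetical usage, e.g. for the conda package cache seen in the Dive listing:
# remove_unneeded("/usr/bin/conda", ["pkgs/cache"])
```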

DhanshreeA commented 7 months ago

This discussion is super helpful, thank you both. I'm putting it on the top of my priority list for this week!

miquelduranfrigola commented 7 months ago

As discussed earlier today, let's start by assuming that bentoml will remain there. In the future, we may achieve smaller images if bentoml is removed, but for now, let's not touch it.

DhanshreeA commented 6 months ago

Hey folks, updating some metrics here:

After introducing multi-stage builds in the PR above (#1022), the images for models eos3b5e and eos4wt0 were reduced from 2.35 GB to 979 MB and from 2.41 GB to 985 MB, respectively. I will post more updates as I scale this to other models in the hub.
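In relative terms, assuming decimal units (1 GB = 1000 MB), as the Docker CLI uses when printing image sizes, both images shrank by close to 60%:

```python
def reduction_pct(before_gb, after_mb):
    """Percentage size reduction, assuming decimal units (1 GB = 1000 MB)."""
    before_mb = before_gb * 1000
    return 100 * (before_mb - after_mb) / before_mb


for model, before_gb, after_mb in [("eos3b5e", 2.35, 979), ("eos4wt0", 2.41, 985)]:
    print(f"{model}: {reduction_pct(before_gb, after_mb):.1f}% smaller")
# eos3b5e: 58.3% smaller
# eos4wt0: 59.1% smaller
```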

miquelduranfrigola commented 6 months ago

This is quite amazing - thanks for the update @DhanshreeA

DhanshreeA commented 6 months ago

Another interesting observation and potentially an issue with docker builds: https://github.com/ersilia-os/ersilia/issues/1097

DhanshreeA commented 3 months ago

@miquelduranfrigola @GemmaTuron do you think it's safe to close this issue now?

GemmaTuron commented 3 months ago

Yes, I think so!