Closed by miquelduranfrigola 1 month ago
I've gone through the repository and have a question: are the issues within the repo part of a future roadmap, or are they still work in progress?
Hi @DhanshreeA , the current issues are for a future roadmap.
The only realistic way to tackle this will be via a meeting. Let's discuss it in the upcoming group meeting.
Hello @DhanshreeA,
Quick summary of our conversation earlier today. Please correct me if I am missing something:
eos3b5e and eos4e40, for example.

Hi @miquelduranfrigola
Sharing a few quick updates here:
I ran the pipeline with eos3b5e, and the run was quite fast given that eos3b5e is a very simple model. eos4e40, however, ran for roughly ~6 hours, after which it timed out (6 hours is apparently the limit for any single workflow job in GitHub Actions), so no predictions were saved to S3 as a consequence. I suspect this is because chemprop is inefficient to run on CPU for 39,988 SMILES (the partition size of each reference-library partition). I'm going to benchmark it locally.

Thanks @DhanshreeA - this is useful.
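As a rough back-of-envelope (a sketch only; the 6-hour job limit and the 39,988-SMILES partition size come from the comment above, while the observed throughput figure is a placeholder assumption to be replaced by the local benchmark):

```python
# Back-of-envelope time budget for one partition on GitHub Actions.
JOB_LIMIT_S = 6 * 60 * 60   # 21,600 s: per-job limit on GitHub Actions
PARTITION_SIZE = 39_988     # SMILES per reference-library partition

budget = JOB_LIMIT_S / PARTITION_SIZE
print(f"Budget to finish in time: {budget:.2f} s/molecule")  # ~0.54 s/molecule

# Placeholder throughput (assumption, pending the local benchmark):
observed_s_per_mol = 1.0
est_hours = PARTITION_SIZE * observed_s_per_mol / 3600
print(f"Estimated runtime at {observed_s_per_mol} s/molecule: {est_hours:.1f} h")
```

In other words, anything slower than roughly half a second per molecule on CPU will blow past the job limit for a partition of this size.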
We will always find cases where a model is too slow, so in my opinion the best we can do, at least, is save what has already been produced and allow a second run to complete the remainder. This is not built in yet, but it is certainly within reach. Moving to Spark or similar frameworks is not an option right now, unfortunately.
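For illustration, a minimal sketch of what resuming could look like, assuming one result object per partition in S3; the bucket name and key layout below are hypothetical, not the pipeline's actual ones:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "ersilia-inference-results"  # hypothetical bucket name
PREFIX = "eos4e40/"                   # hypothetical layout: <model-id>/<partition>.csv

def completed_partitions():
    """Return the partition ids that already have a result object in S3."""
    done = set()
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX)
    for page in pages:
        for obj in page.get("Contents", []):
            done.add(obj["Key"].rsplit("/", 1)[-1].removesuffix(".csv"))
    return done

def pending_partitions(all_partitions):
    """The second run only schedules what the first run did not finish."""
    done = completed_partitions()
    return [p for p in all_partitions if p not in done]
```

The idea is simply that a follow-up run lists what the first run already wrote and only processes the remaining partitions.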
We definitely need to record in our AirTable how long it takes to make (for example) 1, 10, 100, and 1000 predictions for each model in a standard runner. This is very valuable information for scheduling, and most likely we want to collect it at model submission time. What do you think?
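For concreteness, a minimal timing harness along these lines, assuming a `predict(smiles_list)` callable that wraps a served model (the function is hypothetical; the actual model invocation would go inside it):

```python
import time

def benchmark(predict, smiles_pool, sizes=(1, 10, 100, 1000)):
    """Time `predict` on growing input sizes; returns {size: seconds}."""
    timings = {}
    for n in sizes:
        batch = smiles_pool[:n]
        t0 = time.perf_counter()
        predict(batch)
        timings[n] = time.perf_counter() - t0
    return timings
```

The resulting `{size: seconds}` pairs are exactly the kind of record that could go into each model's AirTable row at submission time.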
Hi @miquelduranfrigola and @DhanshreeA
This is very related to work done in the last internship and the current GDI engagement. Could we update the issue with the current state of affairs and suggest a plan for moving this forward/finishing it?
Agreed. The best option will be to close this issue and refer to everything that is happening in https://github.com/ersilia-os/model-inference-pipeline, including the major PRs and issues.
OK, I will close this issue.
Summary
The Good Data Institute team has developed a solution for running model inference at scale using GitHub Actions. By default, we calculate predictions for the ~2M molecules available in ChEMBL. Results are stored in S3 buckets and eventually written to an AWS DynamoDB table.
Their work is nicely documented in the model-inference-pipeline repository.
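For reference, writing batches of predictions into DynamoDB can be done with boto3's batch writer; a minimal sketch, where the table name and item schema are assumptions for illustration, not the pipeline's actual ones:

```python
import boto3
from decimal import Decimal

# Hypothetical table name and item schema for illustration only.
table = boto3.resource("dynamodb").Table("precalculated-predictions")

def write_predictions(items):
    """items: iterable of dicts, e.g. {"model_id", "smiles", "prediction"}.

    Note: DynamoDB rejects Python floats; numeric values must be Decimal.
    """
    with table.batch_writer() as batch:
        for item in items:
            item["prediction"] = Decimal(str(item["prediction"]))
            batch.put_item(Item=item)
```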
Objectives
We need to make a plan of adoption as soon as possible, including:
Let's please use this thread to start a discussion around this.
Documentation