ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

[🐕 Batch]: 🚀 Test model inference pipeline from GDI #926

Closed miquelduranfrigola closed 1 month ago

miquelduranfrigola commented 10 months ago

Summary

The Good Data Institute (GDI) team has developed a solution for running model inference at scale using GitHub Actions. By default, predictions are calculated for the 2M molecules available in ChEMBL. Results are stored in S3 buckets and eventually written to an AWS DynamoDB table.

Their work is nicely documented in the model-inference-pipeline repository.

Objectives

We need to make a plan of adoption as soon as possible, including:

Let's please use this thread to start a discussion around this.

Documentation

DhanshreeA commented 10 months ago

I've gone through the repository and have a question: are the issues within the repo for a future roadmap, or is this still work in progress?

miquelduranfrigola commented 10 months ago

Hi @DhanshreeA, the current issues are for a future roadmap.

miquelduranfrigola commented 10 months ago

The only realistic way to tackle this will be via a meeting. Let's discuss it in the upcoming group meeting.

miquelduranfrigola commented 10 months ago

Hello @DhanshreeA,

Quick summary of our conversation earlier today. Please correct me if I am missing something:

DhanshreeA commented 9 months ago

Hi @miquelduranfrigola

Sharing a few quick updates here:

miquelduranfrigola commented 9 months ago

Thanks @DhanshreeA - this is useful.

We will always find cases where it is too slow, so in my opinion the best we can do is to at least save what we have produced and allow a second run to complete the remainder. This is not built in yet, but it is certainly within reach. Moving to Spark or similar is unfortunately not an option right now.
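
As a rough illustration of the save-and-resume idea, here is a minimal sketch. It is not taken from the model-inference-pipeline code: the function names, the JSON Lines results file, and the `max_new` cap (standing in for a runner's time limit) are all assumptions made for the example, and it assumes each run stops cleanly between predictions.

```python
import json
import os


def load_done(path):
    """Return the set of inputs already present in the results file."""
    done = set()
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    done.add(json.loads(line)["input"])
    return done


def run_with_checkpoint(inputs, predict, path, max_new=None):
    """Append one JSON line per prediction, skipping inputs saved earlier.

    `predict` is a hypothetical callable taking one input (e.g. a SMILES
    string). `max_new` caps the predictions made in this run; it is a
    stand-in for whatever limit ends a real run early.
    Returns the number of new predictions written.
    """
    done = load_done(path)
    written = 0
    with open(path, "a") as f:
        for x in inputs:
            if x in done:
                continue  # already computed by a previous run
            if max_new is not None and written >= max_new:
                break  # stop here; everything written so far is kept
            f.write(json.dumps({"input": x, "output": predict(x)}) + "\n")
            f.flush()
            written += 1
    return written
```

Calling `run_with_checkpoint` a second time with the same inputs and results file would only compute the inputs that are still missing.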

We definitely need to record in our AirTable the time it takes to make (for example) 1, 10, 100, 1000 predictions for each model in a standard runner. This is very valuable information for scheduling, and most likely we want to collect this info at model submission time. What do you think?
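
To make the timing idea concrete, here is a minimal sketch of how such measurements could be collected. `predict_batch` is a hypothetical callable that takes a list of inputs; it does not correspond to an existing Ersilia function, and writing the results to AirTable is left out.

```python
import time


def time_predictions(predict_batch, inputs, sizes=(1, 10, 100, 1000)):
    """Time predict_batch on the first n inputs for each n in sizes.

    Returns a dict mapping n to elapsed wall-clock seconds. Sizes larger
    than the number of available inputs are skipped.
    """
    timings = {}
    for n in sizes:
        if n > len(inputs):
            continue
        batch = inputs[:n]
        start = time.perf_counter()
        predict_batch(batch)
        timings[n] = time.perf_counter() - start
    return timings
```

The timings only mean something relative to the machine they were measured on, so the runner type would need to be recorded alongside them.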

GemmaTuron commented 1 month ago

Hi @miquelduranfrigola and @DhanshreeA

This is closely related to the work done in the last internship and the current GDI engagement. Could we update the issue with the current state of affairs and suggest a plan for moving this forward and finishing it?

miquelduranfrigola commented 1 month ago

Agreed. The best option is to close this issue and refer to everything that's happening in https://github.com/ersilia-os/model-inference-pipeline, including the major PRs and the issues.

GemmaTuron commented 1 month ago

OK, I'll close this issue.