Hi @Abellegese and @DhanshreeA

The GDI engagement has come to an end, and we need to plan the adoption of their contributions. Below are some thoughts that should help us come up with a clear sequence of tasks.
Background
The GDI contribution is aimed at caching calculations in an S3 bucket. These precalculations can be queried with Athena. At a high level, GDI worked on two fronts. First, they created a model-inference-pipeline repository that leverages GitHub Actions to run precalculations on a reference library of 2M compounds for a given model identifier. This pipeline eventually caches results in AWS. Second, they contributed to the Ersilia CLI to provide a client that can query the results seamlessly through the usual Ersilia commands.
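For reference, querying the precalculations from Athena should look roughly like the sketch below. This is a minimal, untested sketch: the database, table, column names, and output bucket are assumptions, since I have not yet checked the exact schema GDI used.

```python
import time
import boto3

# Hypothetical names: the actual Athena database/table and the S3 output
# location should be taken from the model-inference-pipeline configuration.
DATABASE = "precalculations"             # assumption
TABLE = "eos3b5e"                        # assumption: one table per model id
OUTPUT = "s3://ersilia-athena-results/"  # assumption

athena = boto3.client("athena")

# Ask Athena for the cached predictions of a few input molecules
# (column names are placeholders).
query = f"""
SELECT input_key, output
FROM {TABLE}
WHERE input_key IN ('KEY1', 'KEY2')
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```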
Below is a tentative list of tasks to be completed in order.
1. The model inference pipeline
[ ] Reproduce the pipeline in GitHub Actions. We should run the pipeline for a fast model, for example, eos3b5e, and check that results are successfully stored in S3.
[ ] Select a set of 5-10 representative models to test throughout the model incorporation procedure. At this stage, it is not necessary to run predictions for all 2M compounds in the reference library. It is more important to test different types of models (slower, faster, memory-intensive, etc.).
[ ] Make sure the pipeline can also be run locally. This will be important for the more computationally demanding models.
[ ] Optionally, check if containers can be cached in the GitHub registry to speed up the fetching procedure inside the workflow.
[ ] Do some basic checks, for example: what happens if we run the pipeline twice? Is the tool able to skip molecules that are already precalculated (see the sketch after this list)? And what happens with spurious input molecules? How are results for these stored, if at all?
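For the idempotency check in the last item, something along these lines could be used to see which outputs already exist in S3 before re-running a model. The bucket name and key layout are assumptions; the actual conventions are defined in the model-inference-pipeline repository.

```python
import boto3

# Assumed layout: s3://<bucket>/<model_id>/... ; adjust to whatever the
# pipeline actually writes.
BUCKET = "ersilia-precalculations"  # assumption
MODEL_ID = "eos3b5e"

s3 = boto3.client("s3")

def cached_keys(bucket: str, prefix: str) -> set[str]:
    """Return the set of object keys already stored under a model's prefix."""
    keys = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys

existing = cached_keys(BUCKET, f"{MODEL_ID}/")
print(f"{len(existing)} objects already cached for {MODEL_ID}")
# A second pipeline run could diff its planned outputs against `existing`
# and skip molecules that are already covered.
```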
2. The Ersilia CLI client
[ ] Revise the code contributed by GDI and assess how much extra development it needs. Act accordingly.
[ ] For now, if I am not mistaken, the code is not able to accept a mixture of molecules where some are new and others have already been precalculated. We need to write the functionality to handle this (see the partitioning sketch after this list).
[ ] We need to decide, from a CLI perspective, how we are going to specify that we want to use cached calculations. Should we do it at serve time? For example, ersilia serve eos3b5e --use-cache. Or at run time? For example, ersilia run -i input.csv -o output.csv --use-cache.
[ ] Also, we need to check the Python API, since it is progressively losing functionality. Does the retrieval of cached predictions work in the Python API too (see the second sketch after this list)?
[ ] Optionally, check whether, for some models, it is always faster to calculate than to query. In those cases, what do we do? And how do we identify the fast models that do not benefit much from caching?
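For the mixed-input case, the client logic could be roughly as follows. This is only a sketch of the control flow; query_cache and run_model stand in for whatever GDI's client and the Ersilia runner actually expose.

```python
def predict_with_cache(smiles_list, query_cache, run_model):
    """Serve a mixed batch: cached molecules come from Athena/S3,
    the rest are computed on the fly, and results are merged back
    in the original input order.

    query_cache: callable taking a list of SMILES and returning a
                 dict {smiles: result} for the molecules it finds.
    run_model:   callable taking a list of SMILES and returning a
                 dict {smiles: result} of freshly computed results.
    """
    cached = query_cache(smiles_list)                # cache hits
    misses = [smi for smi in smiles_list if smi not in cached]
    fresh = run_model(misses) if misses else {}      # compute the rest
    # Merge, preserving the caller's input order.
    return [cached.get(smi, fresh.get(smi)) for smi in smiles_list]
```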
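And for the Python API question, the check could start from the ErsiliaModel entry point. Whether a cache-related option is surfaced there is exactly what needs to be verified, so the use of the cache in the snippet below is implicit and hypothetical.

```python
from ersilia import ErsiliaModel

# ErsiliaModel is the existing Python entry point; whether it picks up
# cached predictions (and through which argument, if any) is the open
# question to verify.
model = ErsiliaModel("eos3b5e")
model.serve()
model.run(input="input.csv", output="output.csv")
model.close()
```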
3. Scheduled running
[ ] As soon as points 1 and 2 are robustly checked, we need to start running precalculations in the background. A scheduling system needs to be decided (a possible dispatch sketch follows this list). We can start with the reference library.
[ ] Decide the priority of models to be run.
[ ] Monitor scheduled runs with the Splunk dashboard.
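As a starting point for the scheduler, one option is a thin script that walks the prioritized model list and triggers the model-inference-pipeline workflow through the standard GitHub Actions workflow_dispatch endpoint. The repository path, workflow file name, workflow inputs, and model queue below are all assumptions for illustration.

```python
import os
import requests

# Assumptions: repository path, workflow file name, and the model_id
# workflow input; the dispatch endpoint itself is the standard GitHub API.
REPO = "ersilia-os/model-inference-pipeline"   # assumption
WORKFLOW = "inference.yml"                     # assumption
TOKEN = os.environ["GITHUB_TOKEN"]

# Prioritized queue of models to precalculate (priorities to be decided).
MODEL_QUEUE = ["eos3b5e"]                      # placeholder

for model_id in MODEL_QUEUE:
    response = requests.post(
        f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW}/dispatches",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={"ref": "main", "inputs": {"model_id": model_id}},
    )
    response.raise_for_status()  # GitHub returns 204 No Content on success
    print(f"Dispatched precalculation run for {model_id}")
```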
As discussed, let's use this as a framework to start this work. Please feel free to revise the list of tasks and convert them into more granular batches and tasks. Also, feel free to add or remove tasks.
Objective(s)
Test GDI's precalculations pipeline.
Integrate the querying system in the Ersilia CLI.
Schedule prediction runs.
Documentation
Check this folder for more information about this project. The information might be currently outdated, but it gives a good idea of the full scope of the project.