ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🐕 Batch: Running the same model in parallel #1223

Open miquelduranfrigola opened 3 months ago

miquelduranfrigola commented 3 months ago

Summary

As we work on running more than one Ersilia model in parallel, @JHlozek highlighted the scenario where we want to run the same model in multiple processes/terminals. This would be a very interesting case to consider for repositories like Olinda, where we need to run precalculations across a large set of inputs.

Objective(s)

  1. Run the same model in multiple terminals/processes (i.e. sessions).
  2. Optionally, make sure parallelization works in both Docker and Conda serving modes.
  3. Ideally, parallelization should work both from the CLI (i.e. `ersilia run -i ...`) and from the Python API (`mdl.run(...)`); a sketch of the Python API case follows this list. Parallelizing via the Python API may be more difficult, and it is less critical.
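
For concreteness, a minimal sketch of what the Python API case could look like, assuming the `ErsiliaModel` serve/run/close interface; the model identifier and the chunked input/output files are placeholders:

```python
# Sketch: run the same Ersilia model from several processes at once.
# The model identifier and file names are placeholders, and the
# ErsiliaModel serve/run/close interface is assumed here.
from multiprocessing import Process

from ersilia import ErsiliaModel

MODEL_ID = "eosxxxx"  # placeholder model identifier


def worker(input_file: str, output_file: str) -> None:
    # Each process opens its own session for the same model; whether
    # these sessions can coexist safely is precisely what this issue
    # needs to establish.
    mdl = ErsiliaModel(MODEL_ID)
    mdl.serve()
    mdl.run(input=input_file, output=output_file)
    mdl.close()


if __name__ == "__main__":
    chunks = [("chunk_0.csv", "out_0.csv"), ("chunk_1.csv", "out_1.csv")]
    processes = [Process(target=worker, args=chunk) for chunk in chunks]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```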

Documentation

There is no specific documentation for this yet; we should include parallelization in our main GitBook documentation and in the README file.

miquelduranfrigola commented 3 months ago

@DhanshreeA what's your take on this? Is this something that may be relatively easy to address?

DhanshreeA commented 2 months ago

I am hoping this works out of the box, really, since all session artifacts go into their respective folders. I'll test this.
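
One quick way to exercise this would be to drive two concurrent CLI sessions from a script. A rough sketch, assuming the standard `ersilia serve` and `ersilia run` verbs, with a placeholder model identifier and input files (whether two subshells spawned this way count as distinct Ersilia sessions is itself part of what needs checking):

```python
# Sketch: serve and run the same model in two concurrent subshells,
# mimicking two terminals. The model identifier and file names are
# placeholders.
import subprocess

MODEL_ID = "eosxxxx"  # placeholder model identifier


def session(input_file: str, output_file: str) -> subprocess.Popen:
    # Each Popen acts roughly like a separate terminal session.
    cmd = f"ersilia serve {MODEL_ID} && ersilia run -i {input_file} -o {output_file}"
    return subprocess.Popen(["bash", "-c", cmd])


procs = [session("chunk_0.csv", "out_0.csv"),
         session("chunk_1.csv", "out_1.csv")]
for p in procs:
    p.wait()
```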

GemmaTuron commented 1 month ago

Hi @DhanshreeA

Was this tested? Is this something we could describe better and label as "good first issue"?

DhanshreeA commented 1 month ago

@GemmaTuron Right off the bat, this is not a safe good first issue, for several reasons:

  1. Presently, even though we have implemented running multiple Ersilia models simultaneously across different terminals, we cannot serve the same model across two different terminals, because before Ersilia starts a model container it stops all previous containers related to that model. I am not clear on why we built it this way, but my understanding is that this behavior is probably a safety measure, for the following reason:

If a user has a container running from an older version of the model's image, stopping that container and spawning a new one ensures the user has access to the latest model code. In practice this guarantee is weak: our models are not updated very frequently, and we don't always fetch a model before serving it. I think we should rewrite this functionality to stop all containers related to a model only when the model is being fetched, not when it is being served. The PulledDockerImageService class should not call the method _stop_all_containers_of_image when the model is being served, which I think can be achieved with a flag (see the sketch after this list).

  2. Parallelization currently does not work in the mixed conda-docker case: it works when all models are either Docker-based or Conda-based, but not in a mixed scenario. This was specifically mentioned as a requirement here. There is a potential fix for it here, but we need to make sure we want this functionality.

  3. I cannot comment on this at all, because we need to tackle the Python API completely and make sure it has caught up with recent developments within Ersilia. I have proposed this as one of the key areas for Outreachy, and I will come up with a plan for it.
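
On point 1, a rough sketch of the flag idea. Only the `PulledDockerImageService` class and `_stop_all_containers_of_image` method names come from the codebase as mentioned above; the surrounding structure is illustrative, not the actual implementation:

```python
# Illustrative sketch only: stop stale containers on the fetch path,
# but not when merely serving. Only the class and method names below
# are taken from the actual codebase.
class PulledDockerImageService:
    def __init__(self, model_id: str, stop_stale_containers: bool = False):
        self.model_id = model_id
        # True when called from "fetch", False when called from "serve".
        self.stop_stale_containers = stop_stale_containers

    def serve(self):
        if self.stop_stale_containers:
            # Fetch path: guarantee no container runs an older image.
            self._stop_all_containers_of_image()
        self._start_container()

    def _stop_all_containers_of_image(self):
        """Existing behavior: stop every running container of this image."""

    def _start_container(self):
        """Spin up a fresh container for this model."""
```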

DhanshreeA commented 1 month ago

@miquelduranfrigola we need to revisit this at some point.