ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0
203 stars 131 forks source link

🐈 Task: Ersilia tries to close docker containers when any model is fetched, even if from source #1197

Closed DhanshreeA closed 1 week ago

DhanshreeA commented 2 months ago

Summary

TODO

Objective(s)

TODO

Documentation

No response

DhanshreeA commented 2 months ago

@Malikbadmus do you want to take a look into this?

Malikbadmus commented 2 months ago

@Malikbadmus do you want to take a look into this?

Sure @DhanshreeA

Malikbadmus commented 2 months ago

@DhanshreeA, could you please provide more details about the task? I need a bit more information to understand it fully. Thank you.

Malikbadmus commented 1 month ago

Environment Interference

There's a resource conflicts that occurs at different sessions when models from different sources are being run across multiple terminals.

Fetching

  1. Model's being fetched from Github, or using repo_path causes model that are being fetched from DockerHub to fail.
  2. Simultaneous fetching from Github and repo path will result in a CondaEnvironmentExistsError, this is also true if the model is being fetched from DockerHub.

Error_one.txt

Error_two.txt

Serving

Ersilia attempts to serve but is stuck in a loop where it continually tries to wake up the server, this problem does not affect models fetched from DockerHub, which can be run concurrently in different sessions without resource conflicts.

Running inference

Malikbadmus commented 1 month ago

@DhanshreeA , the conflicts happening at fetching are caused by this method. It is responsible for clearing temporary directories that match the prefix "ersilia-". However, since we are running multiple sessions simultaneously, one session may delete a temporary directory that the other session is still using, in this case, /tmp/ersilia-749mg2go/docker_pull.log which keeps the log of Image that was pulled from DockerHub for models running in a docker container.

The debug logger in the clean_temp_dir method only logs the directories it finds in the current session, so it doesn't detect or log directories created by other sessions. However, it still attempts to delete those directories if they match the "ersilia-" pattern, which is causing issues as other sessions are using them.

I was thinking of two ways this can be resolved, by using a model-specific or session-specific identifier to the temporary directories, though in a case where we are fetching the same model from different sources across multiple sessions, model-specific identifier will still run into the same resource conflicts. Let me know your thought regarding this.

DhanshreeA commented 1 month ago

2. Simultaneous fetching from Github and repo path will result in a CondaEnvironmentExistsError, this is also true if the model is being fetched from DockerHub.

@Malikbadmus are you running this on WSL? Could you confirm if you experience this on Codespaces as well?

Malikbadmus commented 1 month ago

2. Simultaneous fetching from Github and repo path will result in a CondaEnvironmentExistsError, this is also true if the model is being fetched from DockerHub.

@Malikbadmus are you running this on WSL? Could you confirm if you experience this on Codespaces as well?

On Ubuntu LTS, yes I encountered the issue on codespace as well.

DhanshreeA commented 1 month ago

@DhanshreeA , the conflicts happening at fetching are caused by this method. It is responsible for clearing temporary directories that match the prefix "ersilia-". However, since we are running multiple sessions simultaneously, one session may delete a temporary directory that the other session is still using, in this case, /tmp/ersilia-749mg2go/docker_pull.log which keeps the log of Image that was pulled from DockerHub for models running in a docker container.

The debug logger in the clean_temp_dir method only logs the directories it finds in the current session, so it doesn't detect or log directories created by other sessions. However, it still attempts to delete those directories if they match the "ersilia-" pattern, which is causing issues as other sessions are using them.

I was thinking of two ways this can be resolved, by using a model-specific or session-specific identifier to the temporary directories, though in a case where we are fetching the same model from different sources across multiple sessions, model-specific identifier will still run into the same resource conflicts. Let me know your thought regarding this.

Hi @Malikbadmus this is interesting - I feel like we can update the tmp directory prefix to include the session pid as well. You can get the session PID from utils/session.py. And then the deletion code can be made to delete the tmp directories only related to the session. What do you think?

Malikbadmus commented 1 month ago

@DhanshreeA , the conflicts happening at fetching are caused by this method. It is responsible for clearing temporary directories that match the prefix "ersilia-". However, since we are running multiple sessions simultaneously, one session may delete a temporary directory that the other session is still using, in this case, /tmp/ersilia-749mg2go/docker_pull.log which keeps the log of Image that was pulled from DockerHub for models running in a docker container.

The debug logger in the clean_temp_dir method only logs the directories it finds in the current session, so it doesn't detect or log directories created by other sessions. However, it still attempts to delete those directories if they match the "ersilia-" pattern, which is causing issues as other sessions are using them.

I was thinking of two ways this can be resolved, by using a model-specific or session-specific identifier to the temporary directories, though in a case where we are fetching the same model from different sources across multiple sessions, model-specific identifier will still run into the same resource conflicts. Let me know your thought regarding this.

Hi @Malikbadmus this is interesting - I feel like we can update the tmp directory prefix to include the session pid as well. You can get the session PID from utils/session.py. And then the deletion code can be made to delete the tmp directories only related to the session. What do you think?

Yes, I think this is the best way, I was also thinking along this line.

DhanshreeA commented 1 week ago

Refer to discussion here: https://github.com/ersilia-os/ersilia/issues/702#issue-1756694049

This is actually expected behavior, and not a bug, at present.