Closed DhanshreeA closed 1 week ago
@Malikbadmus do you want to take a look into this?
Sure @DhanshreeA
@DhanshreeA, could you please provide more details about the task? I need a bit more information to understand it fully. Thank you.
Environment Interference
There is a resource conflict that occurs between sessions when models from different sources are run across multiple terminals.
Fetching
Serving
Ersilia attempts to serve the model but gets stuck in a loop, continually trying to wake up the server. This problem does not affect models fetched from DockerHub, which can be run concurrently in different sessions without resource conflicts.
Running inference
@DhanshreeA, the conflicts happening at fetching are caused by this method. It is responsible for clearing temporary directories that match the prefix "ersilia-". However, since we are running multiple sessions simultaneously, one session may delete a temporary directory that another session is still using; in this case, /tmp/ersilia-749mg2go/docker_pull.log, which keeps the log of the image pulled from DockerHub for models running in a Docker container.
The debug logger in the clean_temp_dir method only logs the directories it finds in the current session, so it doesn't detect or log directories created by other sessions. However, it still attempts to delete any directory that matches the "ersilia-" pattern, which causes issues when other sessions are still using them.
I was thinking this could be resolved in two ways: by adding either a model-specific or a session-specific identifier to the temporary directories. However, if the same model is fetched from different sources across multiple sessions, a model-specific identifier would still run into the same resource conflict. Let me know your thoughts on this.
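To make the failure mode concrete, here is a minimal sketch of the cleanup pattern described above (the actual clean_temp_dir implementation in Ersilia may differ; this function and its signature are assumptions for illustration). Any directory under the system temp root that starts with "ersilia-" is removed, regardless of which session created it:

```python
import os
import shutil
import tempfile

def clean_temp_dir(prefix="ersilia-"):
    # Hypothetical sketch of the problematic pattern: the scan matches
    # every "<prefix>*" directory in the temp root, regardless of which
    # session created it, so concurrent sessions can delete each other's
    # working directories (e.g. the one holding docker_pull.log).
    tmp_root = tempfile.gettempdir()
    for name in os.listdir(tmp_root):
        path = os.path.join(tmp_root, name)
        if name.startswith(prefix) and os.path.isdir(path):
            shutil.rmtree(path, ignore_errors=True)
```

Because the prefix match is the only filter, a directory created by session A is indistinguishable from one created by session B, which is exactly the race described above.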
2. Simultaneous fetching from GitHub and a repo path will result in a CondaEnvironmentExistsError; this is also true if the model is being fetched from DockerHub.
@Malikbadmus are you running this on WSL? Could you confirm if you experience this on Codespaces as well?
On Ubuntu LTS. Yes, I encountered the issue on Codespaces as well.
Hi @Malikbadmus, this is interesting. I feel like we can update the tmp directory prefix to include the session PID as well. You can get the session PID from utils/session.py. And then the deletion code can be made to delete only the tmp directories related to that session. What do you think?
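A minimal sketch of the PID-scoped approach, assuming the session PID is embedded in the directory prefix (the function names and the naming scheme here are hypothetical; in Ersilia the PID would come from utils/session.py rather than os.getpid()):

```python
import os
import shutil
import tempfile

def session_tmp_prefix(pid=None):
    # Assumed naming scheme: embed the session PID in the prefix so that
    # directories from different sessions are distinguishable.
    pid = pid if pid is not None else os.getpid()
    return "ersilia-{}-".format(pid)

def clean_session_temp_dirs(pid=None):
    # Delete only the temp directories created by this session, leaving
    # other sessions' "ersilia-<other-pid>-*" directories untouched.
    prefix = session_tmp_prefix(pid)
    tmp_root = tempfile.gettempdir()
    for name in os.listdir(tmp_root):
        path = os.path.join(tmp_root, name)
        if name.startswith(prefix) and os.path.isdir(path):
            shutil.rmtree(path, ignore_errors=True)
```

With this scheme, a session cleaning up after itself can no longer remove another session's /tmp/ersilia-&lt;pid&gt;-*/docker_pull.log, since the other session's PID never matches its prefix.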
Yes, I think this is the best approach; I was also thinking along this line.
Refer to discussion here: https://github.com/ersilia-os/ersilia/issues/702#issue-1756694049
This is actually expected behavior, and not a bug, at present.