Remote errors in interactive mode when larger workflows are run

Matgenix / jobflow-remote

jobflow-remote is a Python package to run jobflow workflows on remote resources.

https://matgenix.github.io/jobflow-remote/

Remote errors in interactive mode when larger workflows are run #174

Closed JaGeo closed 3 days ago

JaGeo commented 1 week ago

I have been running into many remote errors when I start a larger workflow in the runner's interactive mode, but this does not happen when I start, for example, the Phonon workflow. I currently suspect that this is related to the fact that only one connection to the remote cluster is established in interactive mode. If I rerun the jobs, they eventually run through. Sometimes downloads also fail, and a restart lets the run continue.

Could we do anything about this? For example, could we add the possibility to use more than 1 connection in interactive mode to make it more stable? I would be fine with adding more than one OTP if it helps with execution.

gpetretto commented 1 week ago

Hi @JaGeo, can you provide more details to better understand the problem?

Which states are affected the most? Is it the download phase?
Which errors do you get as the remote error? Is it always the same, or does it change?
Is the workflow large in the sense that it has many jobs, that its jobs have a lot of data, or both?
Is it clear how this is related to the fact that the workflow is large? From what you have seen, would you expect the same kind of error if many smaller workflows were submitted instead?
When these errors happen, do you need to restart the runner? (I am asking because, in principle, if the connection drops in interactive mode, reconnecting would require reinserting the OTP. So if you don't need to restart the runner, the connection is at least still alive, even if an error happened.)

I think it could be possible to enable multiple connections by inserting the OTP multiple times. I will investigate how to do that.

JaGeo commented 1 week ago

@gpetretto Thank you for your response.

I will make a few additional tests and then answer your questions in more detail.

With regard to the size of the workflow: I was referring to one with many jobs. The size of the data per job would not be bigger than a PhononDos object from the phonon workflow or standard VASP outputs.

Restarting the job solves the issue. I don't need to restart the runner. I mostly get REMOTE_ERROR, and sometimes a process stops in the middle (e.g., it gets stuck while downloading the data).

An additional suspicion that I have is that there could be a connectivity error within the flow.

JaGeo commented 3 days ago

I looked closer into the errors: it sometimes seems to pick up an old project and the paths of the outputs. Maybe related to #177.

gpetretto commented 3 days ago

Thanks for the updates. When you mention an "old project", do you really mean a different project with a different configuration file in the ~/.jfremote folder? Or another workflow in the same project? In principle, the case in #177 is mainly an issue if the user tries to insert the same instance of the Flow multiple times. Otherwise a random collision between uuids should be extremely unlikely, and it seems very improbable that this happened more than once in your case. Anyway, it should be relatively easy to check: if you query the jobs that had an issue by uuid, it will show whether more than one job has the same uuid (except those with a different job index).

Do you maybe have the stack trace reported when the jobs got into the REMOTE_ERROR state?
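
The uuid check suggested above can be sketched in plain Python. This is only an illustration: it assumes the job documents have already been fetched from the queue DB as dictionaries, and the field names `uuid`, `index`, and `db_id`, as well as the helper `find_uuid_collisions`, are assumptions made for the example rather than a confirmed jobflow-remote API. Documents that share a uuid but differ in job index are not reported.

```python
from collections import defaultdict


def find_uuid_collisions(job_docs):
    """Return {(uuid, index): [db_id, ...]} for keys that occur more than once.

    job_docs is assumed to be an iterable of dicts with the (hypothetical)
    keys "uuid", "index" and "db_id".
    """
    seen = defaultdict(list)
    for doc in job_docs:
        # Group by uuid AND index, so reruns with a new index do not count.
        seen[(doc["uuid"], doc["index"])].append(doc["db_id"])
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


# Made-up sample documents, for illustration only.
jobs = [
    {"uuid": "a1", "index": 1, "db_id": "10"},
    {"uuid": "a1", "index": 2, "db_id": "11"},  # same uuid, different index: fine
    {"uuid": "b2", "index": 1, "db_id": "12"},
    {"uuid": "b2", "index": 1, "db_id": "57"},  # same uuid and index: collision
]

print(find_uuid_collisions(jobs))  # {('b2', 1): ['12', '57']}
```

An empty result means that no two of the checked documents share both uuid and index.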

JaGeo commented 3 days ago

I am really referring to an old project. I will check if there is still an old jf runner running on a different computer and get back to you...

gpetretto commented 3 days ago

Thanks for the clarification. Do they use the same queue DB? Checking whether an old runner is still active is indeed a good idea. If that is the problem, #150 could prevent such an occurrence.

JaGeo commented 3 days ago

I think we can close this. There was simply a leftover jf remote still running in the background on the other cluster since mid-August, even after logging out of the cluster.
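
For anyone hitting the same situation, a rough way to spot such a leftover process is to filter the output of `ps -eo pid,args` on each machine. The sketch below is a generic text filter, not part of jobflow-remote: the helper names and the marker string `"jf runner"` are assumptions, and the actual command line of a running runner may look different on your system.

```python
import subprocess


def matching_processes(ps_output, marker):
    """Return (pid, command) pairs from `ps -eo pid,args` output whose
    command line contains the given marker string."""
    matches = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # Skip the header and blank lines: keep only "<pid> <command>" rows.
        if len(parts) == 2 and parts[0].isdigit() and marker in parts[1]:
            matches.append((int(parts[0]), parts[1]))
    return matches


def list_processes():
    """Run `ps -eo pid,args` (POSIX systems) and return its text output."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Made-up ps output, for illustration only.
sample = """  PID COMMAND
    1 /sbin/init
 4242 python /home/user/bin/jf runner start
 5000 bash
"""

print(matching_processes(sample, "jf runner"))
```

On a real machine, `matching_processes(list_processes(), marker)` applies the same filter to the live process list, with `marker` chosen to match however the runner shows up there.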

gpetretto commented 3 days ago

Thanks for the update.