OSC / ondemand

Supercomputing. Seamlessly. Open, Interactive HPC Via the Web
https://openondemand.org/
MIT License

losing interactive sessions #3198

Open johrstrom opened 12 months ago

johrstrom commented 12 months ago

From discourse - https://discourse.openondemand.org/t/interactive-jobs-missing/3105

It appears that we're able to lose a session. I'm guessing this is because of some interconnectivity issue with the scheduler. Here's what I believe is happening:

abujeda commented 12 months ago

I took a look at the session code and the error reported in discourse and I think this is only happening on submit (session creation). The call to the scheduler to create the job from OOD is timing out. OOD throws an exception and no session card is created as no jobId was returned. The scheduler continues to process the request and at some point, the job is created.

If the session card is created and OOD fails to get the status info from the scheduler (for any reason), it will show the card with the undetermined status and try again in the next card update.

Session cards will be deleted by OOD if the session JSON data is corrupted and cannot be read.

johrstrom commented 12 months ago

Yeah, somewhere on Discourse I instructed someone to put a sleep after submitting but before returning the redirect. I was going to add a configurable sleep at that point for that Discourse user.
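
For illustration, a minimal sketch of what such a configurable pause could look like (the `submit_sleep` setting and the surrounding method/route names are hypothetical, not the actual OOD code):

```ruby
# Hypothetical sketch: after submitting, pause briefly so a slow scheduler
# has time to register the job before we redirect back to the sessions page.
def submit_and_redirect(session)
  session.submit                                # call out to the scheduler adapter
  sleep(submit_sleep) if submit_sleep.positive?
  redirect_to batch_connect_sessions_path
end

# Value read from configuration; 0 disables the pause entirely.
def submit_sleep
  @submit_sleep ||= Float(config.fetch(:submit_sleep, 0))
end
```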

That said - if they never got the card - or if it was always in undetermined state - I'm not sure how they'd get the connection button to connect in the first place. The users in this discourse topic had a card at some point, because they're able to connect to jupyter.

I can't imagine they could have figured out the URL without the card.

abujeda commented 12 months ago

What I understood from discourse is that the card never shows up and that they are constructing the connection URL based on a pattern from other sessions.

I had a similar issue locally. My local environment runs a Slurm cluster and sometimes the Docker environment is too slow. When submitting a session, Slurm is too slow to respond and the OOD call times out. There is no card, but checking the Active Jobs app, the job appears as running.

johrstrom commented 12 months ago

This particular group has learned to work around it by manually constructing an appropriate URL to reach their running job.

I suppose I underestimated them! Maybe you're right, and a sleep is all we need. But I do wonder if, in addition to that initial call, we need some sort of retry, because instability can happen at any time and we can lose that card.
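
For what that retry might look like, a rough sketch in generic Ruby (not OOD's actual adapter code; the `adapter.submit` call in the usage comment is assumed):

```ruby
# Hypothetical sketch: retry a flaky scheduler call a few times with a short
# backoff before giving up, so a transient hiccup does not lose the card.
def with_retries(attempts: 3, wait: 2)
  tries = 0
  begin
    yield
  rescue StandardError
    tries += 1
    raise if tries >= attempts
    sleep(wait * tries)
    retry
  end
end

# e.g. job_id = with_retries { adapter.submit(script) }
```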

abujeda commented 12 months ago

In this particular situation I don't think a sleep or retries will work.

We could increase the timeout of the call from the OOD server to the scheduler. I think this is an Open3.capture3 call.
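
As a rough illustration (the helper below is hypothetical; Open3.capture3 itself has no timeout option, so this simply wraps it in Timeout.timeout, which is blunt and does not clean up the child process):

```ruby
require 'open3'
require 'timeout'

# Hypothetical sketch: run a scheduler command with a longer, configurable
# deadline instead of the default one.
def call_scheduler(cmd, *args, deadline: 120)
  Timeout.timeout(deadline) do
    stdout, stderr, status = Open3.capture3(cmd, *args)
    raise "#{cmd} failed: #{stderr}" unless status.success?
    stdout
  end
end

# e.g. call_scheduler('sbatch', 'job.sh', deadline: 300)
```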

I don't think that, if the request times out in the browser, the Passenger app will detect it and kill the execution thread/process midway through, causing this issue. But I am not sure of the internals of the Phusion Passenger framework.

I think the better solution would be to create a session card in cases of timeout, without a jobId and with the default status of undetermined. Then, on session updates, we would look up in the scheduler a job that carries the session id in one of the job metadata fields set at submission; in the case of Slurm, the job_name could be repurposed to hold the session_id. If the job is found, we update the session data; if not, we can "complete" it.
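
A rough sketch of that recovery lookup for the Slurm case (assuming the session id was stored as the job_name at submit time; this is illustrative, not the actual adapter code):

```ruby
require 'open3'

# Hypothetical sketch: given a card with no job id, ask Slurm for a job whose
# name matches the session id.
def find_job_id_by_session(session_id)
  stdout, _stderr, status = Open3.capture3(
    'squeue', '--noheader', '--format=%i', "--name=#{session_id}"
  )
  return nil unless status.success?
  job_id = stdout.strip
  job_id.empty? ? nil : job_id
end

# On each card update: if the job id is still unknown, try to recover it with
# this lookup; if nothing is found, mark the session completed.
```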