2i2c-org / docs

Documentation for 2i2c community JupyterHubs.
https://docs.2i2c.org

Advice needed for long running jobs on hubs #184

Closed jmunroe closed 1 year ago

jmunroe commented 1 year ago

Context

Original source: https://2i2c.freshdesk.com/a/tickets/706

A community asked:

Can you point me to the documentation for how a pod will auto shutdown? We occasionally run long processes that may take a week to finish. There is a little confusion among our developers about whether jobs have been killed early by memory limits being hit (I can see on Grafana this has happened a few times) or whether there is some other 'pod idle detection'.

My response was

Regarding auto shutdown for either Jupyter server instances or kernels, the relevant docs are

https://docs.2i2c.org/admin/howto/control-user-server/#stop-user-servers-after-inactivity

and

https://infrastructure.2i2c.org/sre-guide/manage-k8s/culling/#configure-culling

In particular,

Stop user servers after inactivity

To ensure efficient resource usage, user servers without interactive usage for a period of time (default 1h) are automatically stopped (via jupyterhub-idle-culler). This means your notebook server might be stopped for inactivity even if you have a long running process in the notebook. This timeout can be configured.

While the 1h default is good for most interactive sessions, I don't think changing it to a really long time (like 168h) makes sense for long-running processes. There is too much risk that a server will inadvertently be left running.
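For reference, my understanding is that on a zero-to-jupyterhub-style deployment this timeout lives in the chart's `cull` values; a minimal sketch (key names assumed from the z2jh chart, numbers illustrative only):

```yaml
# Sketch of idle-culling settings in zero-to-jupyterhub-style helm values.
# Key names are assumed from the z2jh chart; the numbers shown are the usual defaults.
cull:
  enabled: true   # run jupyterhub-idle-culler as a hub-managed service
  timeout: 3600   # seconds of inactivity before a user server is stopped (1h)
  every: 600      # how often, in seconds, the culler checks for idle servers
  maxAge: 0       # 0 means servers are not culled based on age alone
```

Bumping `timeout` to something like 604800 seconds (168h) would be technically possible, which is exactly why I'd like advice rather than just raising the number.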

I recall some discussion (perhaps on Slack?) about solutions other groups have used for submitting long-running jobs on a JupyterHub instance, but I can't find anything in the service documentation.

Proposal

Could @2i2c-org/engineering please provide guidance on the original Freshdesk ticket?

Once we have resolved it for this particular community, we should add advice to our service documentation on how to submit long-running jobs on our infrastructure.

Updates and actions

No response

consideRatio commented 1 year ago

For QCL, I think they would benefit from having all culling logic disabled to avoid issues - but we should warn them that they need to shut down their own servers when they aren't using them.

If they have very expensive machines running for long durations, and those runs incorrectly fail along the way due to culling, that is by far the bigger cost, I expect.
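A minimal sketch of what that could look like in the hub's helm values, assuming a zero-to-jupyterhub-style `cull` block (the exact nesting would need to be confirmed against our config):

```yaml
# Sketch: disable idle culling entirely for the QCL hub.
# Assumes a zero-to-jupyterhub-style `cull` block; in our charts this may be
# nested under a top-level `jupyterhub:` key.
cull:
  enabled: false   # no automatic culling; users must stop their own servers
```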

Related

Action points

consideRatio commented 1 year ago

Advice provided; I'll probably reconfigure something for QCL as a follow-up, so I re-assigned myself to the support ticket.