2i2c-org / docs

Documentation for 2i2c community JupyterHubs.
https://docs.2i2c.org

Advice needed for long running jobs on hubs #184

Closed jmunroe closed 1 year ago

jmunroe commented 1 year ago

Context

Original source: https://2i2c.freshdesk.com/a/tickets/706

A community asked:

Can you point me to the documentation for how a pod will auto shutdown? We occasionally run long processes that may take a week to finish. There is a little confusion among our developers about whether jobs have been killed early by memory limits being hit (I can see on Grafana this has happened a few times) or whether there is some other 'pod idle detection'.

My response was

Regarding auto shutdown for either Jupyter server instances or kernels, the relevant docs are

https://docs.2i2c.org/admin/howto/control-user-server/#stop-user-servers-after-inactivity

and

https://infrastructure.2i2c.org/sre-guide/manage-k8s/culling/#configure-culling

In particular,

Stop user servers after inactivity

To ensure efficient resource usage, user servers without interactive usage for a period of time (default 1h) are automatically stopped (via jupyterhub-idle-culler). This means your notebook server might be stopped for inactivity even if you have a long running process in the notebook. This timeout can be configured.

While the 1h default is good for most interactive sessions, I don't think changing it to a really long time (like 168h) makes sense for long-running processes. There is too much risk that a server will inadvertently be left running.
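For reference, my understanding is that on a zero-to-jupyterhub-style deployment this timeout lives in the chart's `cull` values; a minimal sketch (key names assumed from the z2jh chart, numbers illustrative only):

```yaml
# Sketch of idle-culling settings in zero-to-jupyterhub-style helm values.
# Key names are assumed from the z2jh chart; the numbers shown are the usual defaults.
cull:
  enabled: true   # run jupyterhub-idle-culler as a hub-managed service
  timeout: 3600   # seconds of inactivity before a user server is stopped (1h)
  every: 600      # how often, in seconds, the culler checks for idle servers
  maxAge: 0       # 0 means servers are not culled based on age alone
```

Bumping `timeout` to something like 604800 seconds (168h) would be technically possible, which is exactly why I'd like advice rather than just raising the number.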

I recall some discussion (perhaps on Slack?) about solutions other groups have used for submitting long-running jobs on a JupyterHub instance, but I can't find anything in the service documentation.

Proposal

Could @2i2c-org/engineering please provide guidance on the original Freshdesk ticket?

Once we have resolved it for this particular community, we should add advice to our service documentation on how to submit long-running jobs on our infrastructure.

Updates and actions

No response

consideRatio commented 1 year ago

For QCL, I think they would benefit from having all culling logic disabled to avoid issues - but we should warn them that they need to shut down their own servers when they aren't using them.

If they have very expensive machines running for long durations, and those runs incorrectly fail along the way due to culling, that is by far the bigger cost, I expect.
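A minimal sketch of what that could look like in the hub's helm values, assuming a zero-to-jupyterhub-style `cull` block (the exact nesting would need to be confirmed against our config):

```yaml
# Sketch: disable idle culling entirely for the QCL hub.
# Assumes a zero-to-jupyterhub-style `cull` block; in our charts this may be
# nested under a top-level `jupyterhub:` key.
cull:
  enabled: false   # no automatic culling; users must stop their own servers
```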

Related

Action points

consideRatio commented 1 year ago

Advice provided; I'll probably reconfigure something for QCL as a follow-up, so I re-assigned myself to the support ticket.