Closed jmunroe closed 1 year ago
For QCL, I think they would benefit from having all culling logic disabled to avoid issues, but we should warn them that they need to shut down their own servers when they aren't using them.
If they have very expensive machines running long-duration jobs, and those jobs incorrectly fail along the way due to culling, I expect that to be the far bigger cost.
`cull_idle_timeout` defaults to 0, which disables culling of idle kernels.
We are setting it to 3600, which means that a server with a long-running job will lose its state after 3600 seconds.

Advice provided. I'll probably reconfigure something for QCL as a follow-up, so I re-assigned myself to the support ticket.
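For reference, the kernel culling setting discussed above could be expressed in a Jupyter Server config file roughly as follows. This is only a sketch: `MappingKernelManager.cull_idle_timeout` is the standard Jupyter Server option, but the file name and the way the config reaches the user image are assumptions, not a description of how the hub is actually deployed.

```python
# jupyter_server_config.py (file name and location are assumptions)
# `get_config()` is injected by Jupyter when it loads a config file,
# so this fragment is meant to be loaded by Jupyter, not run directly.
c = get_config()  # noqa: F821

# Current setting described above: cull kernels that have been idle
# for more than 3600 seconds (1 hour).
c.MappingKernelManager.cull_idle_timeout = 3600

# What disabling kernel culling for QCL might look like:
# 0 is the upstream default and turns idle-kernel culling off.
# c.MappingKernelManager.cull_idle_timeout = 0
```

With culling disabled, nothing shuts idle kernels down automatically, so users would need to stop their own servers when they are done.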
Context
Original source: https://2i2c.freshdesk.com/a/tickets/706
A community asked:
My response was
While the `1h` default is good for most interactive sessions, I don't think changing it to a really long time (like 168h) makes sense for long-running processes. There is too much risk that a server will inadvertently be left running.

I recall some discussion (perhaps on Slack?) about solutions other groups have used for submitting a long-running job on a JupyterHub instance, but I can't find anything in the service documentation.
Proposal
Could @2i2c-org/engineering please provide guidance on the original Freshdesk ticket?
Once we have resolved it for this particular community, we should add advice to our service documentation on how to submit long-running jobs on our infrastructure.
Updates and actions
No response