2i2c-org / docs

Documentation for 2i2c community JupyterHubs.
https://docs.2i2c.org

Refine docs in this repo and upstream about server and kernel culling #185

Open consideRatio opened 1 year ago

consideRatio commented 1 year ago

2i2c JupyterHub setups come with two systems, enabled by default, that handle inactivity. In this issue I'm summarizing what I think can be used to update the docs we provide about server culling and kernel culling.

A jupyter kernel culling system

If a jupyter notebook file is opened, a "kernel" is started. The kernel retains state (variables' values etc.) from the code that has run in it (executed notebook cells). What the kernel culling system does is terminate kernels that have been "idle" for one hour or more.

In practice this means that if you have a long-running job in a jupyter notebook and want to retain the kernel's state after the notebook execution completes, then the kernel culling system should be disabled.
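The decision this system makes can be sketched roughly like this. This is a simplified illustration of the idle-timeout idea, not the actual jupyter_server implementation; the function name and signature are hypothetical:

```python
import time

def should_cull(last_activity_ts, cull_idle_timeout=3600, now=None):
    """Return True if a kernel whose last activity happened at
    last_activity_ts (a Unix timestamp) has been idle for at least
    cull_idle_timeout seconds. A timeout of 0 disables culling."""
    if cull_idle_timeout <= 0:
        return False
    if now is None:
        now = time.time()
    return (now - last_activity_ts) >= cull_idle_timeout
```

Note how setting the timeout to 0 short-circuits the check entirely, which is why `cull_idle_timeout: 0` below disables the system.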

Disabling it

This can be disabled for individual users by providing a ~/.jupyter/jupyter_server_config.json file like:

{
  "MappingKernelManager": {
    "cull_idle_timeout": 0
  }
}
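For example, a user could create that file from a terminal on their server. This is a sketch; `~/.jupyter` is the standard per-user Jupyter config location:

```shell
# create the per-user Jupyter config directory if it doesn't exist
mkdir -p ~/.jupyter

# write a config that disables kernel culling for this user
cat > ~/.jupyter/jupyter_server_config.json <<'EOF'
{
  "MappingKernelManager": {
    "cull_idle_timeout": 0
  }
}
EOF
```

The setting takes effect the next time the user server starts.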

To disable it for an entire hub, a 2i2c engineer can re-configure the file we inject into user servers via the basehub chart:

jupyterhub:
  singleuser:
    extraFiles:
      # ensure kernel culling is disabled so the in-memory state of a long
      # running job is retained after it completes
      jupyter_server_config.json:
        data:
          MappingKernelManager:
            cull_idle_timeout: 0

A jupyter server culling system

When a user server is started by jupyterhub, jupyterhub registers to receive information about "activity" from the user server. If the user server hasn't been accessed via the network recently (a user's browser does things), and the server reports no activity in the last hour, then it's shut down.

A big drawback of this system is that it fails to recognize all activity. For example, if a user starts a user server, runs a command in a terminal, and comes back a week later to check, the server could have been terminated for a perceived lack of activity. Something was running in a terminal, but it likely didn't register as server activity to this system. Not even busy kernels register as server activity by themselves; they only do if, for example, the busy kernel writes a status message regularly.

A big upside of this system is that it helps protect users from forgetting to shut down a powerful server, which can be costly.

I suggest three strategies to protect long running jobs:

  1. We disable the server culling system for everyone
  2. We increase the inactivity duration from 1 hour to 24 hours or more
  3. Individual users adopt a workaround when needed by manually running this "keep alive" script in a notebook: https://github.com/jupyterhub/jupyterhub-idle-culler/issues/55#issuecomment-1413510651
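The idea behind such a keep-alive script can be sketched like this. This is a minimal illustration of the approach, not the linked script itself; the function name and interval are made up:

```python
import threading
import time

def start_keep_alive(interval_seconds=300):
    """Print a heartbeat from a background thread at a fixed interval.
    The kernel output registers as activity, so the server culler keeps
    seeing the server as active while a long job runs elsewhere."""
    def _beat():
        while True:
            print("keep-alive heartbeat")
            time.sleep(interval_seconds)
    thread = threading.Thread(target=_beat, daemon=True)
    thread.start()
    return thread
```

Running `start_keep_alive()` in a notebook cell returns immediately; the daemon thread keeps emitting output until the kernel itself shuts down, so the kernel culling system must also be disabled for this to help a multi-hour job.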

Note that you can track user server activity as understood by jupyterhub, and each server's status, by visiting https://jupyter.quantifiedcarbon.com/hub/admin. If the server culling system is disabled, it may be worth checking in there from time to time to avoid leaving a large server running without a user attending to it.
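That same activity information is also exposed by JupyterHub's REST API (`GET /hub/api/users`, with an admin-scoped token). A small sketch of parsing such a response; the sample payload is made up but shaped like the API's JSON:

```python
import json

def summarize_activity(users_json):
    """Parse the JSON body of JupyterHub's GET /hub/api/users response
    into (name, last_activity, has_running_server) tuples."""
    return [
        (u["name"], u.get("last_activity"), bool(u.get("servers")))
        for u in json.loads(users_json)
    ]

# Hypothetical example payload for illustration only:
sample = json.dumps([
    {"name": "alice", "last_activity": "2024-01-01T00:00:00Z",
     "servers": {"": {}}},
    {"name": "bob", "last_activity": None, "servers": {}},
])
```

This could back a periodic check for servers that have been running a long time with no recent activity.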

Disabling it

It's configured in the basehub chart, via its jupyterhub dependency chart:

jupyterhub:
  # ensure user server culling is disabled so servers that appear inactive
  # (including ones with busy kernels that emit nothing while computing)
  # don't get shut down
  cull:
    enabled: false

Related