dask / dask-labextension

JupyterLab extension for Dask
BSD 3-Clause "New" or "Revised" License

Slow startup on AMI which has large Data Science image baked in #271

Closed seanturner026 closed 2 months ago

seanturner026 commented 3 months ago

Describe the issue:

I'm trying to optimize JupyterHub launch speeds by ensuring that some version of our large Data Science image is always available on every Node, new or old.

I'm using AWS EC2 Image Builder to produce an AMI that has our large Data Science image baked in. This is done with the containerd CLI (ctr), pulling the image into the k8s.io namespace that EKS uses.

The Image Builder pipeline looks like this:

name: ml-image-pull
description: Pulls the latest ml-image Docker Image.
schemaVersion: 1.0

phases:
  - name: build
    steps:
      - name: pull-ml-image
        action: ExecuteBash
        inputs:
          commands:
            - password=$(aws ecr get-login-password --region us-west-2)
            - echo "pulling ml-image:latest..."
              # Redirecting stdout because the process creates thousands of log lines.
            - sudo ctr --namespace k8s.io images pull --user AWS:$password account_id.dkr.ecr.us-west-2.amazonaws.com/ml-image:latest > /dev/null
              # This command also has a ton of output which creates noise, so only printing what we want.
            - sudo ctr --namespace k8s.io images list | head -n 1
            - sudo ctr --namespace k8s.io images list | grep ml-image
  - name: test
    steps:
      - name: confirm-ml-image-pulled
        action: ExecuteBash
        inputs:
          commands:
            - set -e
            - sudo ctr --namespace k8s.io images list | grep ml-image
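
For reference, one way to double-check on a node built from this AMI that the image content is fully present and unpacked, not just listed, might be something like the following (the grep pattern is illustrative):

# A "complete" status means every blob referenced by the image manifest is in the local content store.
sudo ctr --namespace k8s.io images check | grep ml-image

# Unpacked layers show up as committed snapshots; if nothing is listed, the unpack
# would still have to happen the first time a container is created from the image.
sudo ctr --namespace k8s.io snapshots list | head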

This AMI is then launched by Karpenter, which always deploys the newest version of the AMI whenever the cluster needs to scale.
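
For context, the newest-AMI selection can be wired up on the Karpenter side roughly like the sketch below. This is an illustrative EC2NodeClass only: the API version, role, tags, and AMI name pattern are placeholders and will differ in a real setup.

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: ml-nodes
spec:
  amiFamily: AL2
  role: KarpenterNodeRole-example                # placeholder
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster  # placeholder
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster  # placeholder
  amiSelectorTerms:
    # Wildcard name match: when several AMIs match, Karpenter resolves to the newest,
    # so freshly built Image Builder AMIs get picked up automatically.
    - name: "ml-image-ami-*"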

While this greatly reduces the time needed to pull the image (the pull now takes 300 ms to 20 seconds, depending on how much has changed since the image was baked in), it still takes almost a minute to load the extensions:

Defaulted container "notebook" out of: notebook, block-cloud-metadata (init)
Coiled user token is not set. Skipping login.
[I 2024-08-02 17:56:59.286 SingleUserLabApp mixins:547] Starting jupyterhub single-user server version 4.0.0
[I 2024-08-02 17:56:59.286 SingleUserLabApp mixins:561] Extending jupyterlab.labhubapp.SingleUserLabApp from jupyterlab 3.6.3
[I 2024-08-02 17:56:59.286 SingleUserLabApp mixins:561] Extending jupyter_server.serverapp.ServerApp from jupyter_server 1.23.6
[D 2024-08-02 17:56:59.484 SingleUserLabApp application:190] Searching ['/home/explorer/.config/jupyter', '/python/etc/jupyter', '/usr/local/etc/jupyter', '/etc/xdg/jupyter'] for config files
[D 2024-08-02 17:56:59.485 SingleUserLabApp application:902] Looking for jupyter_config in /etc/xdg/jupyter
[D 2024-08-02 17:56:59.485 SingleUserLabApp application:902] Looking for jupyter_config in /usr/local/etc/jupyter
[D 2024-08-02 17:56:59.485 SingleUserLabApp application:902] Looking for jupyter_config in /python/etc/jupyter
[D 2024-08-02 17:56:59.485 SingleUserLabApp application:902] Looking for jupyter_config in /home/explorer/.config/jupyter
[D 2024-08-02 17:56:59.486 SingleUserLabApp application:902] Looking for jupyter_server_config in /etc/xdg/jupyter
[D 2024-08-02 17:56:59.486 SingleUserLabApp application:902] Looking for jupyter_server_config in /usr/local/etc/jupyter
[D 2024-08-02 17:56:59.486 SingleUserLabApp application:902] Looking for jupyter_server_config in /python/etc/jupyter
[D 2024-08-02 17:56:59.486 SingleUserLabApp application:902] Looking for jupyter_server_config in /home/explorer/.config/jupyter
[D 2024-08-02 17:56:59.488 SingleUserLabApp config_manager:93] Paths used for configuration of jupyter_server_config:
        /etc/xdg/jupyter/jupyter_server_config.json
[D 2024-08-02 17:56:59.488 SingleUserLabApp config_manager:93] Paths used for configuration of jupyter_server_config:
        /usr/local/etc/jupyter/jupyter_server_config.json
[D 2024-08-02 17:56:59.488 SingleUserLabApp config_manager:93] Paths used for configuration of jupyter_server_config:
        /python/etc/jupyter/jupyter_server_config.d/dask_labextension.json
        /python/etc/jupyter/jupyter_server_config.d/jupyter-lsp-jupyter-server.json
        /python/etc/jupyter/jupyter_server_config.d/jupyter-server-proxy.json
        /python/etc/jupyter/jupyter_server_config.d/jupyter_resource_usage.json
        /python/etc/jupyter/jupyter_server_config.d/jupyter_server_fileid.json
        /python/etc/jupyter/jupyter_server_config.d/jupyter_server_mathjax.json
        /python/etc/jupyter/jupyter_server_config.d/jupyter_server_ydoc.json
        /python/etc/jupyter/jupyter_server_config.d/jupyterlab.json
        /python/etc/jupyter/jupyter_server_config.d/jupyterlab_git.json
        /python/etc/jupyter/jupyter_server_config.d/jupyterlab_link_share.json
        /python/etc/jupyter/jupyter_server_config.d/nbclassic.json
        /python/etc/jupyter/jupyter_server_config.d/nbdime.json
        /python/etc/jupyter/jupyter_server_config.d/notebook_shim.json
        /python/etc/jupyter/jupyter_server_config.d/panel-client-jupyter.json
        /python/etc/jupyter/jupyter_server_config.d/trame_jupyter_extension.json
        /python/etc/jupyter/jupyter_server_config.d/voila.json
        /python/etc/jupyter/jupyter_server_config.json
[D 2024-08-02 17:56:59.490 SingleUserLabApp config_manager:93] Paths used for configuration of jupyter_server_config:
        /home/explorer/.config/jupyter/jupyter_server_config.json
# NOTE(SMT): This takes 50 seconds
# 17:56:59 ---> 17:57:48
[I 2024-08-02 17:57:48.213 SingleUserLabApp manager:344] dask_labextension | extension was successfully linked.
[I 2024-08-02 17:57:48.213 SingleUserLabApp manager:344] jupyter_lsp | extension was successfully linked.

Compare this to pods running on the standard (non-custom) AMI, which load the extensions in 2 seconds.

Minimal Complete Verifiable Example:

Anything else we need to know?:

Environment:

dask-labextension==6.1.0, jupyterhub==4.0.0, jupyterlab==3.6.3

seanturner026 commented 2 months ago

Upgrading to jupyterlab 4.2.4 does not noticeably improve performance.

Launching a second notebook server on a Node with this AMI loads the extension almost instantly (e.g. the first notebook server takes 50 seconds, the second takes 0.5 seconds).

mrocklin commented 2 months ago

My guess is that this has very little to do with the dask-labextension project, but cc'ing @jacobtomlinson, who can maybe say so with more authority.

jacobtomlinson commented 2 months ago

I'd be curious to know what happens if you stop the container and start it again. Is this some kind of first-run penalty for extensions, or does it happen every load?

I'd also recommend that you head over to the Jupyter forum and ask there, as they will be able to give far better guidance than we can. If they suggest it might be extension related and there's something we can do to help, then feel free to report back here and we can see what can be done.
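
For the first question, something along these lines might be a quick way to compare (namespace and pod name are placeholders):

# grab the extension-link timestamps from the first start of the pod
kubectl -n jupyterhub logs jupyter-someuser | grep "successfully linked"

# delete the pod, start the server again from the Hub so it lands on the same node,
# then compare the timestamps on the second run
kubectl -n jupyterhub delete pod jupyter-someuser
kubectl -n jupyterhub logs jupyter-someuser | grep "successfully linked"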

seanturner026 commented 2 months ago

I'd be curious to know what happens if you stop the container and start it again. Is this some kind of first-run penalty for extensions, or does it happen every load?

Big first-time penalty. Launching a second notebook server on the same Node loads the extension in roughly 0.2 seconds. I have also noticed that the Node takes twice as long to come online as Nodes running the regular Karpenter AMIs (the ones we didn't build with Image Builder), which is likely related to the performance issues we're seeing.

Appreciate the feedback, and I have actually opened a thread on the Jupyter forum already. I'll go ahead and close this, as the slow extension load is likely a symptom of the underlying issue rather than the problem itself (but feel free to respond if anything comes to mind :) ).

https://discourse.jupyter.org/t/slow-startup-on-ami-which-has-large-data-science-image-backed-in/27369/4