RhodiumGroup / helm-chart

Helm chart for Rhodium's JupyterHub deployment
https://rhodiumgroup.github.io/helm-chart/

everything is dead #8

Closed delgadom closed 4 years ago

delgadom commented 4 years ago

We're in a simultaneous multi-cluster crash loop backoff. Fun times.

If you're coming upon this thread from the interwebs... we're running multiple z2jh-based clusters modeled loosely on the pangeo-data/pangeo hub. We're in various stages of upgrading these clusters to bring them more in line with the pangeo master, but in what appears to be an unrelated turn, all our hubs went down simultaneously (user notebooks and dask clusters are still running), each reporting slightly different errors. If you find yourself in this situation... may the force be with you.

This is the stacktrace for compute-test, deployed from the attempted-upgrade branch with helm2:

[E 2019-12-08 17:22:00.354 JupyterHub app:1623]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 1620, in launch_instance_async
    yield self.initialize(argv)
  File "/usr/lib/python3.6/types.py", line 204, in __next__
    return next(self.__wrapped)
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 1358, in initialize
    self.load_config_file(self.config_file)
  File "<decorator-gen-5>", line 2, in load_config_file
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 598, in load_config_file
    raise_config_file_errors=self.raise_config_file_errors,
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 562, in _load_config_files
    config = loader.load_config()
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/loader.py", line 457, in load_config
    self._read_file_as_dict()
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/loader.py", line 489, in _read_file_as_dict
    py3compat.execfile(conf_filename, namespace)
  File "/usr/local/lib/python3.6/dist-packages/ipython_genutils/py3compat.py", line 198, in execfile
    exec(compiler(f.read(), fname, 'exec'), glob, loc)
  File "/srv/jupyterhub_config.py", line 46, in <module>
    c.KubeSpawner.singleuser_image_spec = os.environ['SINGLEUSER_IMAGE']
  File "/usr/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'SINGLEUSER_IMAGE'

This error occurred trying to use the following spec:

We're seeing similar, but not identical, issues on other, more updated versions of these packages.

Rebuilding the stacktrace with links to the source material

Note on finding the correct version of the Python package jupyterhub/jupyterhub: the Chart.yaml file in v0.8.2 of jupyterhub/k8s-hub suggests we should be using jupyterhub v0.9.6. However, the stacktrace below does not match any version >=0.9.0, and there is no tagged release or PyPI package version 0.8.2 of jupyterhub/jupyterhub. The only version of jupyterhub/jupyterhub I can find that matches the lines shown in the stacktrace is 0.8.1, which I've linked to here.

  1. jupyterhub/app.py@0.8.1#L1620, in launch_instance_async: yield self.initialize(argv)
  2. File "/usr/lib/python3.6/types.py", line 204, in __next__ return next(self.__wrapped)
  3. jupyterhub/app.py@0.8.1#L1358, in initialize self.load_config_file(self.config_file)
  4. File "<decorator-gen-5>", line 2, in load_config_file
  5. traitlets/config/application.py@v4.3.2#L87, in catch_config_error return method(app, *args, **kwargs)
  6. traitlets/config/application.py@v4.3.2#L598, in load_config_file raise_config_file_errors=self.raise_config_file_errors,
  7. traitlets/config/application.py@v4.3.2#L562, in _load_config_files config = loader.load_config()
  8. traitlets/config/loader.py@v4.3.2#L457, in load_config self._read_file_as_dict()
  9. traitlets/config/loader.py@v4.3.2#L489, in _read_file_as_dict py3compat.execfile(conf_filename, namespace)
  10. File "/usr/local/lib/python3.6/dist-packages/ipython_genutils/py3compat.py", line 198, in execfile exec(compiler(f.read(), fname, 'exec'), glob, loc)
  11. THIS DOES NOT MATCH THE k8s-hub@0.8.2 SOURCE: /srv/jupyterhub_config.py, line 46, in <module> c.KubeSpawner.singleuser_image_spec = os.environ['SINGLEUSER_IMAGE']
  12. File "/usr/lib/python3.6/os.py", line 669, in __getitem__ raise KeyError(key) from None

KeyError: 'SINGLEUSER_IMAGE'
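
For anyone repeating this exercise: the frame-by-frame matching above can be partly automated. The following is only a sketch. `FRAME_RE` and `parse_frames` are names made up here, not part of any library, and the sample text is a shortened copy of the stacktrace above. It pulls the file, line number, and function out of each frame so they can be compared against a tagged source tree.

```python
import re

# Matches the 'File "...", line N, in func' header of each traceback frame.
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def parse_frames(traceback_text):
    """Return (path, line_number, function) for every frame in a traceback."""
    return [
        (m.group("path"), int(m.group("line")), m.group("func"))
        for m in FRAME_RE.finditer(traceback_text)
    ]


# Shortened copy of the stacktrace from this thread.
SAMPLE = '''\
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 1620, in launch_instance_async
    yield self.initialize(argv)
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 1358, in initialize
    self.load_config_file(self.config_file)
  File "/srv/jupyterhub_config.py", line 46, in <module>
    c.KubeSpawner.singleuser_image_spec = os.environ['SINGLEUSER_IMAGE']
'''

frames = parse_frames(SAMPLE)
for path, line, func in frames:
    print(f"{path}#L{line} ({func})")
```

Each printed `path#Lnnn` can then be checked by hand against the candidate release to see whether the source line at that number matches the one in the trace.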

The thing is, entry 11 of our stacktrace shows a line, c.KubeSpawner.singleuser_image_spec = os.environ['SINGLEUSER_IMAGE'], that is not in the GitHub source for that version. I haven't yet been able to track down where this line comes from.
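
As an aside, the failure itself is just a bare `os.environ[...]` lookup on a variable the pod never received. A minimal sketch of a friendlier lookup follows. `require_env` is a hypothetical helper, not something from jupyterhub or z2jh, and the image name is a placeholder.

```python
import os


def require_env(name, environ=os.environ):
    """Return environ[name], or raise a RuntimeError naming the missing variable."""
    try:
        return environ[name]
    except KeyError:
        raise RuntimeError(
            f"{name} is not set; the chart that rendered this pod may not "
            "match the config file baked into the hub image"
        ) from None


# A plain dict stands in for os.environ so the example runs anywhere.
print(require_env("SINGLEUSER_IMAGE", {"SINGLEUSER_IMAGE": "example/notebook:some-tag"}))

try:
    require_env("SINGLEUSER_IMAGE", {})
except RuntimeError as err:
    print(err)
```

This would not have fixed anything here, but an error that names the likely chart/image mismatch is easier to act on than a raw KeyError.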

In subsequent versions of jupyterhub/k8s-hub, the jupyterhub_config.py file has been removed. The Dockerfile in that directory still refers to jupyterhub_config.py, so the file must be created somewhere. A quick search of the jupyterhub/jupyterhub repo shows that it is a configuration file meant to be supplied by the user.

delgadom commented 4 years ago

The z2jh guide: administrator: advanced: arbitrary extra code and configuration in jupyterhub_config.py explains that arbitrary code can be injected into jupyterhub_config.py using the chart values hub.extraConfig.
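
Roughly, the idea is that each hub.extraConfig entry is a named snippet of Python that gets run with the config object `c` in scope. The sketch below only imitates that behaviour and is not the chart's actual code: the `SimpleNamespace` stand-in for `c`, the snippet names, and the image tag are all assumptions for illustration.

```python
from types import SimpleNamespace

# Stand-in for the traitlets config object that JupyterHub exposes as `c`.
c = SimpleNamespace(KubeSpawner=SimpleNamespace())

# Hypothetical hub.extraConfig values: snippet name -> Python source.
extra_config = {
    "01-image": "c.KubeSpawner.image_spec = 'example/notebook:some-tag'",
    "02-timeout": "c.KubeSpawner.start_timeout = 600",
}

# Run each snippet in name order with `c` in scope.
for name, source in sorted(extra_config.items()):
    exec(source, {"c": c})

print(c.KubeSpawner.image_spec)
print(c.KubeSpawner.start_timeout)
```

The upshot for debugging is that code showing up in a jupyterhub_config.py stacktrace may come from chart values rather than from any file in the image.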

Rabbit hole No. 2, here we come...

delgadom commented 4 years ago

Ok I think that jupyterhub_config.py gets mounted from some repo as a drive (same way as our templates): https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/0.8.2/jupyterhub/templates/hub/deployment.yaml#L69

If that got updated and isn't pinned in 0.8.2 maybe we're pulling the live version of the default jupyterhub config and that's what's failing...

delgadom commented 4 years ago

This is helpful: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/371de88048e4e31388bcbfaff671c173b2337d8c/jupyterhub/schema.yaml#L325

delgadom commented 4 years ago

This is super weird. In our values.yaml, we seem to be relying on the version of z2jh-k8s that comes after this PR, while the stacktrace clearly has code from before. Twilight zone theme commence.

Our chart:

jupyterhub:
  singleuser:
    image:
      name: rhodium/notebook
      tag: c66b6d910bccff118e37534c2414569ea3b6b023

In jupyterhub/images/hub/jupyterhub_config.py before the PR (consistent with our stacktrace):

# Use env var for this, since we want hub to restart when this changes  
c.KubeSpawner.image_spec = os.environ['SINGLEUSER_IMAGE']   

And after (consistent with our spec):

c.KubeSpawner.image_spec = get_config('singleuser.image-spec')

Tied to this, in templates/configmap.yaml after the PR:

singleuser.image-spec: {{ .Values.singleuser.image.name }}:{{ .Values.singleuser.image.tag }}

This key is missing from templates/configmap.yaml before the PR (consistent with our stacktrace).
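
The before/after difference comes down to where the hub reads the image spec from: an environment variable before the PR, a key in the mounted configmap after it. Here is a sketch of the "after" style lookup, assuming the configmap is mounted as one file per key. This `get_config` is a simplified stand-in, the default `config_dir` path is an assumption, and `singleuser.other-key` is a made-up key.

```python
import os
import tempfile


def get_config(key, default=None, config_dir="/etc/jupyterhub/config"):
    """Read one config value from a file named after its key (assumed layout)."""
    try:
        with open(os.path.join(config_dir, key)) as f:
            return f.read().strip()
    except FileNotFoundError:
        return default


with tempfile.TemporaryDirectory() as config_dir:
    # "After the PR": the chart writes the key, so the lookup succeeds.
    with open(os.path.join(config_dir, "singleuser.image-spec"), "w") as f:
        f.write("rhodium/notebook:c66b6d910bccff118e37534c2414569ea3b6b023\n")
    after = get_config("singleuser.image-spec", config_dir=config_dir)
    # A key the chart never wrote falls back to the default instead of raising.
    missing = get_config("singleuser.other-key", default="unset", config_dir=config_dir)

print(after)
print(missing)
```

A hub image that still reads the environment variable, paired with a chart that only writes the configmap key, would fail in the way our stacktrace shows.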

delgadom commented 4 years ago

Bizarrely, this PR wasn't deployed until jupyterhub/helm-chart@0.9.0-alpha1 on Oct 17, 2019

delgadom commented 4 years ago

Progress? Updated our jupyterhub/helm-chart dependency to 0.9.0-alpha1 and our error is now:

[C 2019-12-08 20:47:14.740 JupyterHub app:2461] Failed to start proxy
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 2459, in start
    await self.proxy.start()
  File "/usr/local/lib/python3.6/dist-packages/jupyterhub/proxy.py", line 650, in start
    cmd, env=env, start_new_session=True, shell=shell
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'configurable-http-proxy': 'configurable-http-proxy'

delgadom commented 4 years ago

huh. there's an issue (https://github.com/jupyterhub/jupyterhub/issues/193) and a fix (jupyterhub/jupyterhub#195):

self.log.error(
    "Failed to find proxy %r\n"
    "The proxy can be installed with `npm install -g configurable-http-proxy`"

But how to install an npm package? The rabbit hole gets deeper...
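
Before going down that hole, it's cheap to confirm from inside the hub container that the executable really is missing from PATH. A minimal check follows; `find_proxy` is a hypothetical helper name, while `shutil.which` is from the standard library.

```python
import shutil


def find_proxy(command="configurable-http-proxy"):
    """Return the full path of the proxy executable, or None if it is not on PATH."""
    return shutil.which(command)


path = find_proxy()
if path is None:
    print("configurable-http-proxy is not on PATH in this environment")
else:
    print(f"found proxy at {path}")
```

If this reports the proxy as missing inside the hub image, the FileNotFoundError above is the expected result of the hub trying to launch the proxy itself.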

delgadom commented 4 years ago

upgrading everything simultaneously and slowly and carefully and painfully and not-backwards-compatibly solved the issues. ugh.