Hi Eric - please keep in mind this is completely experimental stuff and is subject to change (which includes its complete removal) at any time - caveat emptor. 😄
Yes - that is the Notebook PR on which to base your testing. Given the branch is 2+ years old and completely reworks kernel management, it will certainly have merge issues. In addition, I've had to make changes to get remote kernels working but haven't conveyed those back to @takluyver.
I'll try to add further instructions for setting up an environment. It will consist of manual branch pulls, builds, and installs. I've been able to run kernels in YARN Cluster mode so far - looking at Kubernetes now. There are still issues with restarts and configuration that need to be worked out.
Not sure what your yarn_kernel_provider question is about. The yarn provider implementation was posted 5 days ago in https://github.com/gateway-experiments/yarn_kernel_provider which is another thing that might change (i.e., one repo per provider + the base).
Happy to cook, pull and merge branches when you provide the recipe.
Oops, I was meaning docker_kernel_provider
(upon yarn and k8s ones)
Ok - once kubernetes is in a working state, docker should just follow suit. However, before we include more than k8s, yarn (and the remote base), I want to take a step back and make sure this "repo layout" and deployment model is what we really want.
Also keep in mind that all the configuration support for k8s and docker (helm charts, compose scripts, etc.) is geared toward EG being the server. Those will require rework to have Notebook be the server, and that may prove challenging.
Happy to test on K8S or YARN with EG as server.
Thanks. EG will not be usable with kernel providers for some time. I suspect you meant to say with Notebook as a server.
yes, notebook as a server. sorry.
Hi Eric - I really apologize for the delay here (been side-tracked lately). I've managed to get back to this stuff this week. I'm currently in the process of getting the k8s kernels off the ground. I'm running into issues with the base notebook branch (that we need to use) - which is preventing me from getting to the crux of the kubernetes stuff.
At any rate, my plan is to get k8s kernels minimally working. This should validate the ContainerKernelLifecycleManager on which the Docker lifecycle managers also depend (we've changed 'process proxies' to 'kernel lifecycle managers' - after all, what's a few more characters to type :smile:). Then, if you're okay with things, it would be great if you could follow the k8s lead to get the docker stuff working!
I also owe some instructions for getting things going. I'd prefer to wait until one of the PRs is merged on the jupyter_kernel_mgmt repo that allows us to derive from the KernelSpecProvider rather than implement our own 'find' logic. Once that is merged, we'll make a release of that module - which will simplify deployments. That said, I'd be happy to share which 'branches' are necessary if you're chomping at the bit.
There are many issues to work out in this 'provider world' and I'd have to say the jury is still out on this (IMHO).
I hope that helps.
@kevin-bates Thx for the updates and the hard road you are following to lead us to a distributed and managed kernel world. Ping me if I can help to test or solve outstanding issues.
Hi Eric - thank you for your patience. I was able to get a "minimally viable" kubernetes instance running. I also have created a docker_kernel_provider repo with the basic substitutions in place (which is essentially all I needed for k8s) but have not tried anything in a Docker/DockerSwarm env. I'm hoping you might be able to try things out.
I also decided that, while we're still checking things out, a "fake" release might be the best way to convey setup instructions. This allows me to attach pre-built files to the release so others don't need to pull branches, wait for PRs to be merged, etc.
You can find this 'interim-dev' release here.
Thx @kevin-bates. I have been through the interim-dev release instructions and tried the on-prem (non-container) setup: notebook, jupyter_kernel_mgmt, remote_kernel_provider and the docker concrete implementation of remote_kernel_provider.
I deployed the kernelspec and tried the Docker kernel. This fails with the following exception.
[I 18:22:49.596 NotebookApp] DockerSwarmKernelLifecycleManager: kernel launched. Kernel image: elyra/kernel-py:2.0.0rc2, KernelID: 66edaa4f-72e3-4a35-a0a8-62eff718fdf9, cmd: '['/opt/datalayer/opt/miniconda3/envs/datalayer/bin/python', '/usr/local/share/jupyter/kernels/py/scripts/launch_docker.py', '--RemoteProcessProxy.kernel-id', '66edaa4f-72e3-4a35-a0a8-62eff718fdf9', '--RemoteProcessProxy.response-address', '192.168.1.23:55374']'
Traceback (most recent call last):
File "/usr/local/share/jupyter/kernels/py/scripts/launch_docker.py", line 114, in <module>
launch_docker_kernel(kernel_id, response_addr, spark_context_init_mode)
File "/usr/local/share/jupyter/kernels/py/scripts/launch_docker.py", line 43, in launch_docker_kernel
param_env.pop('PATH') # Let the image PATH be used. Since this is relative to images, we're probably safe.
KeyError: 'PATH'
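For context (a sketch, not the actual launch_docker.py code): the crash is simply dict.pop raising KeyError because 'PATH' was never copied into the launcher's environment dict; passing a default makes the pop tolerant:

# param_env stands in for whatever launch_docker.py collected; no PATH entry here.
param_env = {"KERNEL_ID": "66edaa4f-72e3-4a35-a0a8-62eff718fdf9"}
param_env.pop('PATH', None)  # no-op when PATH is absent, so the image's own PATH is used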
Comments on the on-prem-start-notebook.sh:
Then I deployed the K8S spec on minikube. I get nb-kernel-mgmt and kernel-image-puller running, which run the notebooks. When I run Python with a Python on Kubernetes kernel, a Pod for the kernel is launched and it works fine.
I wonder how to deploy a notebook outside of K8S which would act as an Enterprise Gateway.
@kevin-bates To complement the previous comment, I wanted first to install the kubernetes_kernel_provider as the concrete implementation in the on-prem scenario, but decided not to do that as the K8S nb_kernel_mgmt pod is reachable in my case on http://192.168.64.10:31240 (minikube) and I didn't see how to set this in the launch script... Maybe EG_REMOTE_HOSTS, but how to set the port?
Hi Eric - thanks for the update.
The on-prem stuff only applies to the YARN support. I've minimally tested with YARN and Kubernetes. For K8s, the notebook server runs in k8s as well. Same will be the case for Docker.
We do not support container envs using on-prem kernels or vice versa.
Regarding your comments:
It enforces gateway user - I removed that?
That's fine. The startup script was pulled (obviously) from my EG YARN env.
There are configs for YARN. Maybe you tested it with YARN and it would be easier if I first try with YARN instead of Docker?
As noted above, I "tested" YARN using on-prem configs and K8s using the full container configs. If you have a YARN env to hit, it might be a good exercise to ensure you see the same YARN behaviors, but that config is not applicable to the container configs. If you choose to check out on-prem YARN, you might want to validate YARN using EG first.
There are also EG_* env variables set - I guess in the on-prem case they don't come in the picture?
Some might come into play. It depends on where those variables get used in EG. However, some apply to the DistributedProcessProxy - which hasn't been (and may never be) implemented in kernel provider land yet.
I wonder how to deploy a notebook outside of K8S which would act as an Enterprise Gateway.
Not sure what you mean. If you're saying you'd like to take a Notebook server and run it in headless mode, where a "front-end" configured with NB2KG hits that server... you won't be able to do that for a couple of reasons: 1. the token management stuff will get in the way; 2. Kernel Gateway has handlers that look for an 'env' stanza in the kernel startup request - this functionality doesn't exist in the Notebook server.
The play is to plumb the new jupyter_server with this capability. At that time, we'd likely add support for parameterized kernels where the kernel parameters also appear in a stanza in the kernel startup request.
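Purely for illustration (field names beyond 'env' are assumptions, not the documented payload), a kernel-start request body with an 'env' stanza, plus a hypothetical stanza for parameterized kernels, could look like this as a Python dict:

# Illustrative shape only - 'parameters' is a hypothetical future stanza.
start_request = {
    "name": "spark_python_yarn_cluster",              # kernelspec to launch
    "env": {                                          # stanza the gateway handlers look for
        "KERNEL_USERNAME": "eric",
        "KERNEL_EXTRA_SPARK_OPTS": "--conf spark.executor.memory=2g",
    },
    "parameters": {"spark_context_initialization_mode": "lazy"},
}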
EG_REMOTE_HOSTS only applies to the DistributedProcessProxy. If you're describing how to access the web server to run a kernel, yeah, you either need to set up an ingress or locate the IP and port (which I suspect is what you list in your comment) and hit that URL for accessing the notebook.
The on-prem stuff only applies to the YARN support.
Good to know. So it is like Notebook_Server_On_Host <---> Kernel_Manager_On_YARN - see the parallel for K8S in the next paragraph...
I will further work on validating on-prem with YARN, with preliminary validation of my setup with EG on YARN first.
Not sure what you mean. If you're saying you'd like to take a Notebook server and run it in headless mode, where a "front-end" configured with NB2KG hits that server... you won't be able to do that for a couple reasons. 1, the token management stuff will get in the way. 2. Kernel Gateway has handlers that look for an 'env' stanza in the kernel startup request. This functionality doesn't exist in Notebook server. The play is to plumb the new jupyter_server with this capability. At that time, we'd likely add support for parameterized kernels where the kernel parameters also appear in a stanza in the kernel startup request.
I was thinking of Notebook_Server_On_Host <---> Kernel_Manager_On_K8S (parallel to the YARN case above) - the token security would remain local and I would expect the manager on K8S to have all that is needed.
I'm still not sure I completely understand what you're driving at.
So your Kernel_manager_On_xxx is the headless server? If so, the kernel manager needs a "hosting application" since it doesn't know how to respond to the REST calls that get forwarded from Notebook_Server_On_Host.
With your original question, I thought you wanted something like EG running OUTSIDE of k8s but launching kernels INSIDE k8s - which we don't support. The entity that launches the kernels must also be within the same "network".
If that's still not what you're getting at, I'd be happy to setup a webex or something. Feel free to ping me at kbates4@gmail.com or on our gitter channel/room.
For the inside/outside questions, I will take more time to align my thoughts in a picture and share it for discussion.
For the on-premise YARN setup, I have a working setup in EG. Trying the setup for these remote kernels, I receive the following exception (issue with the port range):
Traceback (most recent call last):
File "/usr/local/share/jupyter/kernels/spark_python_yarn_client_2/scripts/launch_ipykernel.py", line 269, in _validate_port_range
lower_port = int(port_ranges[0])
ValueError: invalid literal for int() with base 10: '{port_range}'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/share/jupyter/kernels/spark_python_yarn_client_2/scripts/launch_ipykernel.py", line 360, in <module>
lower_port, upper_port = _validate_port_range(arguments['port_range'])
File "/usr/local/share/jupyter/kernels/spark_python_yarn_client_2/scripts/launch_ipykernel.py", line 279, in _validate_port_range
raise RuntimeError("Port range validation failed for range: '{}'. Error was: {}".format(port_range, ve))
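For context (a simplified stand-in, not the actual launcher code): the ValueError arises because the literal template string '{port_range}' reaches int() when the spec's launch parameters were never substituted:

def validate_port_range(port_range):
    # Simplified stand-in for launch_ipykernel.py's _validate_port_range; the
    # "lower..upper" range format is an assumption based on the EG kernelspecs.
    port_ranges = port_range.split("..")
    lower_port = int(port_ranges[0])   # int('{port_range}') -> the ValueError above
    upper_port = int(port_ranges[-1])
    return lower_port, upper_port

print(validate_port_range("40000..40500"))  # works once the placeholder is actually substituted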
I'm afk at the moment but are you using the yarn kernelspecs tar file? This sounds like it's not using the yarn lifecycle manager that should be defined in a yarn_kernel.json file.
Can you provide the json file you're using?
Right, I was using the wrong spec.
From https://github.com/gateway-experiments/remote_kernel_provider/releases/download/v0.1-interim-dev/yarn_kernelspecs.tar I am trying the spark_python_yarn_client spec, which contains the yarn_kernel.json, but the spec does not show up in the kernel list in the notebook UI, maybe because it does not find a kernel.json file?
My py libs:
pip list | grep kernel
ipykernel 5.1.0
jupyter-kernel-gateway 2.3.0
jupyter-kernel-mgmt 0.3.0
nb-conda-kernels 2.2.2
remote-kernel-provider 0.1.0.dev0
yarn-kernel-provider 0.1.0.dev0
I have renamed yarn_kernel.json to kernel.json and the kernel shows up in the UI but returns the same exception.
The content of my json, which BTW is YARN Cluster mode - there is no YARN Client mode in the interim distribution: normal? I would feel better with YARN Client...
{
"language": "python",
"display_name": "3 Spark - Python (YARN Cluster Mode)",
"metadata": {
"lifecycle_manager": {
"class_name": "yarn_kernel_provider.yarn.YarnKernelLifecycleManager"
}
},
"env": {
"SPARK_HOME": "/opt/spark",
"PYSPARK_PYTHON": "/opt/conda/envs/datalayer/bin/python",
"PYTHONPATH": "${HOME}/.local/lib/python3.7/site-packages:/opt/spark/python/lib/py4j-0.10.7-src.zip",
"SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/${KERNEL_USERNAME}/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=${HOME}/.local/lib/python3.7/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/conda/bin:$PATH ${KERNEL_EXTRA_SPARK_OPTS}",
"LAUNCH_OPTS": ""
},
"argv": [
"/usr/local/share/jupyter/kernels/spark_python_yarn_cluster_3/bin/run.sh",
"--RemoteProcessProxy.kernel-id",
"{kernel_id}",
"--RemoteProcessProxy.response-address",
"{response_address}",
"--RemoteProcessProxy.port-range",
"{port_range}",
"--RemoteProcessProxy.spark-context-initialization-mode",
"lazy"
]
}
Correct. Yarn client mode is performed via the distributed process proxy, which is not implemented and may never be. If you want to work with yarn in this new config you must use cluster mode for now.
When working with kernels in a "provider world", files named kernel.json will be "located" by the KernelSpecProvider (id = spec). It's like that provider "owns the rights" to kernel.json. As a result, because we still want to follow the kernel spec file format for managing kernels, we've "extended" the KernelSpecProvider's search capabilities to allow for other names - each "sub-provider" looks for a differently named file. YarnKernelProvider will locate kernels where yarn_kernel.json is used, KubernetesKernelProvider locates kernel specs where k8s_kernel.json is used, etc.
The RemoteKernelProvider subclasses only get their "search" functionality from the KernelSpecProvider. The launch code is independent.
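For illustration, a minimal sketch of such a sub-provider (the class name and attribute usage are assumptions based on the kernel_file support described in this thread, not the actual repo contents):

# Minimal sketch - assumes a jupyter_kernel_mgmt build where KernelSpecProvider
# exposes a kernel_file attribute that subclasses can override.
from jupyter_kernel_mgmt.discovery import KernelSpecProvider

class YarnKernelProviderSketch(KernelSpecProvider):
    id = 'yarn'                        # kernel type prefix, e.g. yarn/<spec-name>
    kernel_file = 'yarn_kernel.json'   # only spec dirs containing this file are found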
I already see there's going to be confusion about YARN client mode not being related to the YarnKernelProvider. However, in reality, YARN client mode is purely a launch thing - no special lifecycle management is required for client mode. I could modify a kernel.json file that starts a kernel against a YARN cluster such that the Spark workers utilize the cluster. In EG, we use the DistributedProcessProxy for YARN client mode, but all this process proxy does is launch a kernel using ssh across a set of hosts (specified via EG_REMOTE_HOSTS or directly in the kernel.json file); again, nothing pertaining to YARN is used for lifecycle management.
Not sure this is helpful. Looking forward to your diagram.
Thx @kevin-bates, explanation is helpful here. Final goal is K8S (which works locally with your interim release) but YARN is useful for me also. I have YARN Cluster running fine on EG.
For the YARN Cluster remote kernel (with jupyter-kernel-mgmt 0.3.0 and notebook 6.0.0.dev0): the YARN application shows up as ERROR__NO__KERNEL_ID and the logs directly show an exception with RuntimeError: Kernel died before replying to kernel_info (the YARN app remains in status ACCEPTED), and this goes on with serial attempts.
PS: This will be a busy week for me; then I will first concentrate on getting your view of the world (on-premise with YARN Cluster, the rest being container-based) running, before discussing any alternate scenario, which I doubt will make sense at all...
Hi Eric - I suspect your YARN cluster issues are because you're using the version of jupyter_kernel_mgmt from PyPI and not the wheel file attached to the interim-dev release and, as a result, you're missing the code to recognize prefixed kernel.json files of other providers.
Renaming yarn_kernel.json to kernel.json will result in it being found by the default KernelSpecProvider, which will result in none of the kernel launch parameters being substituted. So things will attempt to get off the ground, but crash in flames.
To confirm you're using the correct jupyter_kernel_mgmt file, its source contents should resemble those of this PR, namely the use of the self.kernel_file parameter - which defaults to kernel.json but subclasses override.
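A quick, hedged way to check which build is installed (the module that holds the constant is an assumption, hence the getattr fallback):

# Hedged check that the installed jupyter_kernel_mgmt carries the prefixed
# kernel-file support; the discovery module is assumed to hold the constant.
from jupyter_kernel_mgmt import discovery
print(getattr(discovery, 'DEFAULT_KERNEL_FILE', 'not found - possibly the PyPI build'))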
Thx for the hint Kevin. I was suspecting that but I double-checked and I use the interim wheel (I can see DEFAULT_KERNEL_FILE = 'kernel.json' in the source). Maybe I need more pip voodoo but I don't find where my issue is.
A 0.0.4 version, different from the one already published on pip, may remove my confusion.
So far my user feedback is: more doc (I can help) - impressed with the K8S setup where the server directly spins up distributed kernels without an additional Gateway. My dev feedback is the discussion on the monorepo (see #12).
Sorry for the difficulty Eric. I will create a fresh conda env and try following the directions from the interim-dev release to see if I encounter similar behaviors. You should also be sure you're using the correct notebook build since using the released version would also exhibit the behaviors you're seeing.
Regarding the 0.0.4 release, that's kinda out of my control IMO. Once the PR I referenced above is merged, then I can talk to Thomas about that, but there are a number of other changes required at that level that I'd like to get in as well (kernel id and general parameters).
Progressing a bit: I have found that in yarn_kernel_provider, line 37 of yarn.py does not take the yarn_endpoint config at all:
self.yarn_endpoint \
= lifecycle_config.get('yarn_endpoint',
kernel_manager.app_config.get('yarn_endpoint', "localhost")) # TODO - default val
and was returning the default value localhost (where you have a TODO BTW). I have forced it in source to http://localhost:8088/ws/v1/cluster and at least the YARN app starts (I love this 95%-working code; it lets you dig in). Now I have an issue with the connection to the notebook (websocket I think).
Enough for today for me. It would be good if you installed the interim release in a fresh env and confirmed it is working on your side with a YARN cluster.
Ah - yeah, I was experimenting with how to access config entries and I also run on the YARN master - so localhost works.
I'm sorry for the hassle here, but it may be the case that this is just not ready for others to play with. There's just WAY TOO MUCH to look into at this point and I just don't have the bandwidth to perform a lot of support right now.
I really appreciate your comments and will likely move to a single repo, but, as you noted, that's going to take time and I feel like I need to understand all of the possible show stoppers first.
Here's what the kernelspec list should look like: [screenshot] And after the kernel has started: [screenshot]
And this is mine :) [screenshot]
I had to pin tornado to 5.1.1 like you said, plus force yarn_endpoint to http://localhost:8088/ws/v1/cluster in the code.
FWIW, I updated the yarn provider to include updates for the yarn-api-client changes that occurred a few weeks ago. Apparently, the copy of yarn.py used in the kernel providers was fairly stale. I've updated to resemble the master branch from EG. See https://github.com/gateway-experiments/yarn_kernel_provider/pull/4
Interim-dev assets have been updated (yarn wheel only)
I'm closing this since the INTERIM-DEV release is serving this purpose and we can ensure the proper documentation is in place prior to its removal (or sooner at this point).
Hi, Great to see this!
How can I test it with the notebook? Should I first apply https://github.com/jupyter/notebook/pull/4170 (which btw has merge issues)?
PS: Any plan to create the yarn_kernel_provider implementation?