aws-samples / sagemaker-ssh-helper

A helper library to connect into Amazon SageMaker with AWS Systems Manager and SSH (Secure Shell)
MIT No Attribution

[Feature] SageMaker job as Studio kernel #15

Open athewsey opened 1 year ago

athewsey commented 1 year ago

Lately I work mainly in SageMaker Studio, and I'd really like to be able to debug / interact with a running job using the same UI.

Solution idea

Create a custom Studio kernel image using an IPython extension and/or custom magic through which users can connect to a running SSH Helper job and run notebook cells on that instead of the Studio app.

The user experience would be something like using EMR clusters in Studio.

Since the sagemaker_ssh_helper library is pip-installable, it might even be possible to get this working with default (e.g. Data Science 3.0) kernels? I'm not sure; I assume it depends on how much hacking is possible during IPython extension load versus what needs setting up in advance.
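To make the shape of this a bit more concrete, here's a rough sketch of the kind of IPython extension I mean. The magic name and the remote-execution call are placeholders I made up, not anything SSH Helper provides today:

# Sketch of a Studio-side IPython extension registering a cell magic that would
# forward the cell body to the running job. %%on_ssh_job and run_on_job() are
# hypothetical placeholders, not part of sagemaker_ssh_helper.
from IPython.core.magic import Magics, magics_class, cell_magic

@magics_class
class SshJobMagics(Magics):
    @cell_magic
    def on_ssh_job(self, line, cell):
        # `line` could name the training job to target; `cell` is the code to
        # run remotely. A real implementation would need a persistent remote
        # process so variables survive between cells.
        job_name = line.strip()
        print(run_on_job(job_name, cell))  # placeholder for the remote call

def load_ipython_extension(ipython):
    # Invoked by %load_ext, so in principle this could work on stock kernels
    # like Data Science 3.0 as long as the package is pip-installed.
    ipython.register_magics(SshJobMagics)

def run_on_job(job_name: str, code: str) -> str:
    # Placeholder: however the connection is made (SSM/SSH), run `code` inside
    # the job's container and return its output.
    raise NotImplementedError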

Why this route

To my knowledge, JupyterLab is a bit more fragmented in its support for remote kernels than IDEs like VSCode/PyCharm/etc. There seem to be ways to set up SSH kernels, but it's a tricky topic to navigate because so many pages online are about "accessing your remotely-running Jupyter server" instead. Investigating the standard Jupyter kernel-spec paths, I see /opt/conda/envs/studio/share/jupyter/kernels exists but contains only a single python3 kernel, which doesn't appear in the Studio UI. It looks like a custom sagemaker_nb2kg Python library manages kernels, but there are no obvious integration points there for alternative kernel sources besides the Studio "Apps" system, and it's sufficiently internal/complex that patching it seems like a bad idea.

...So it looks like directly registering the remote instance as a kernel in JupyterLab would be a non-starter.

If the magic-based approach works, it might also be possible to use it with other existing kernel images (as mentioned above), and even inline in the same notebook after a training job is kicked off. Hopefully it would also allow toggling over to a new job/instance without having to run CLI commands to change the installed Jupyter kernels.

ivan-khvostishkov commented 1 year ago

Hi, Alex! It's definitely an interesting idea. I will do some research on the best route to take here.

In the meantime, there's already an API that can help you achieve similar results. You can run the following code in notebook cells:

proxy = ssh_wrapper.start_ssm_connection(11022)  # open an SSM connection to the job (local port 11022)
proxy.run_command_with_output("ps xfa")          # run a shell command inside the job's container
proxy.disconnect()                               # close the connection when done
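For context, ssh_wrapper above is the wrapper you attach to the estimator before launching the job, along these lines (see the README for the full setup and exact parameters, e.g. adding the SSH Helper dependency to the estimator):

# Attach SSH Helper to the estimator before starting the job
from sagemaker_ssh_helper.wrapper import SSHEstimatorWrapper

ssh_wrapper = SSHEstimatorWrapper.create(estimator, connection_wait_time_seconds=600)
estimator.fit(wait=False)  # the job starts and SSH Helper waits for connections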

Let me know if it helps and I will update the documentation accordingly.

athewsey commented 1 year ago

Thanks Ivan, this is certainly helpful but I guess I'm hoping it's possible to do a bit more...

First, AFAICT the current implementation of run_command_with_output only returns the results of long-running commands after they complete, which can be frustrating or even unusable for some tools: compare, for example, subprocess.check_output("echo hi && sleep 5 && echo bye", shell=True) with !echo hi && sleep 5 && echo bye in a notebook; the former buffers everything until the command finishes, while the latter streams output as it arrives. It would be nice to have a more interactive implementation here, although I know from my own attempts on related topics that it can get complicated quickly...
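To illustrate the behaviour I'm after (not a proposal for the actual implementation, which presumably has to go over the SSM/SSH channel), line-by-line streaming with plain subprocess looks something like this:

import subprocess

def run_streaming(cmd: str) -> int:
    # Echo each output line as soon as it is produced, instead of buffering
    # everything until the command exits.
    proc = subprocess.Popen(
        cmd, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        text=True, bufsize=1,
    )
    for line in proc.stdout:
        print(line, end="")
    return proc.wait()

run_streaming("echo hi && sleep 5 && echo bye")  # "hi" appears immediately, "bye" after 5 seconds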

Second, I guess it's more a question of the intended overall workflow for drafting training (/processing/etc) script bundles in Studio JupyterLab and how this tool would fit in.

I'm thinking of SSH Helper mainly as a workaround for the lack of local mode in SageMaker Studio and some limitations of warm pools: I'm looking for a way to iterate quickly on the scripts in the JupyterLab UI and try running them in a training/processing job context, finding and fixing basic functional/syntax errors without having to wait for infrastructure spin-up. Features that seem helpful to me in this context include:

  1. (Full interactive debugging if it's possible, but I think that's a stretch)
  2. A notebook-like Python REPL in the context of the job, with visibility of the data and the uploaded code dir. (This is useful because the Studio kernels don't always match the available framework container images today, and because it lets us see the pre-loaded data channels & folder structure.)
    • Diagnostic shell commands like ps/top matter less to me for this kind of functional debugging
    • Shell commands like pwd/ls are of course still useful, but mainly for helping us understand/check the folder structure for our Python scripts
  3. An "easy button" to replace the job's source code dir with an updated one I've drafted in Studio JupyterLab
  4. An "easy button" to launch (updated) job source code dir and entrypoint the same way the framework/toolkit would: Without having to know about e.g. the CLI parameters or environment variables that get created (but maybe being able to override them if needed?)

I raised this issue with (2) originally in mind, but thinking that magics could be used to provide (3) and (4) too. The main goal is to give JupyterLab-minded scientists a super easy-to-use way (after the initial platform SSM/etc. setup is done, of course) to iterate on their scripts until they work functionally, before quitting the interactive training/processing job and running a "proper" non-interactive one to do the training/processing in a known, reproducible way.

I appreciate that there are other use-cases for SSH Helper of course (like diagnosing processes/threads/etc. in a genuinely in-progress job) - I'm just wondering if it has the potential to deliver a purpose-built, friction-free script debugging experience from Studio.