aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

SSH into a SageMaker instance for debugging purposes #344

Closed mklissa closed 6 years ago

mklissa commented 6 years ago

I am trying to connect to a SageMaker instance through SSH from my local machine, but I cannot find a way to do it. This seems like important functionality, either for debugging (through PyCharm) or for uploading files with SCP. Is there any way to do this?

jesterhazy commented 6 years ago

SageMaker doesn't support SSH access to running jobs or endpoints. There are a couple of ways to get files into your instances:

There's currently no way to do remote debugging of a training job. You might be able to do this by using a customized container to run your job in local mode.
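
As a rough illustration of the local mode suggestion, here is a minimal sketch, assuming version 2 of the SageMaker Python SDK (older versions used different argument names) and a working local Docker installation. The image URI, role ARN, data directory, and both function names are hypothetical placeholders, not values from this thread.

```python
def local_mode_kwargs(image_uri, role_arn):
    """Estimator arguments that select local mode instead of a managed instance."""
    return {
        "image_uri": image_uri,      # your customized training container (placeholder)
        "role": role_arn,            # IAM role ARN (placeholder)
        "instance_count": 1,
        "instance_type": "local",    # "local" runs the container on this machine
    }


def run_local_training(image_uri, role_arn, data_dir):
    """Start a local-mode training job; needs the sagemaker package and Docker."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from sagemaker.estimator import Estimator

    estimator = Estimator(**local_mode_kwargs(image_uri, role_arn))
    # file:// inputs keep the training data on the local disk as well.
    estimator.fit({"training": f"file://{data_dir}"})
    return estimator
```

Because the container runs on your own machine in this setup, ordinary Docker tooling can be used to inspect it while the job runs.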

yonatanp commented 6 years ago

If you have another instance that you can ssh into from both the instance and your local machine, then you can tunnel through and achieve ssh access. I'm using this for the same purpose of SCPing stuff in and out.

For example, assuming "bastion" is the additional middle instance:

# run this command from within a terminal on your notebook instance (New -> Terminal), pushes port 22 to bastion's locally accessible port 10022
sh-4.2$ ssh user@bastion -R 10022:localhost:22 -f -N

# run this command from your local machine, pulls port 10022 of the bastion to local machine port 10022
[you@yourmachine]$ ssh user@bastion -L 10022:localhost:10022 -f -N

# now you can ssh or scp as you'd like, using the localhost port 10022 as the target
[you@yourmachine]$ ssh localhost -p 10022 -l ec2-user

You'll of course have to take care of authentication in the right directions (e.g. create private keys and add to authorized_keys as applicable).
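
The commands above can also be assembled programmatically, for example to script file transfers through the tunnel. This is a minimal sketch, assuming the same `user@bastion` login, forwarded port 10022, and `ec2-user` account as in the example; the function names and file paths are illustrative and not part of any library.

```python
import shlex

BASTION = "user@bastion"   # bastion login from the example above (placeholder)
TUNNEL_PORT = 10022        # port forwarded through the bastion


def reverse_tunnel_cmd(bastion=BASTION, port=TUNNEL_PORT):
    """Run on the notebook instance: expose its sshd (port 22) on the bastion."""
    return ["ssh", bastion, "-R", f"{port}:localhost:22", "-f", "-N"]


def local_forward_cmd(bastion=BASTION, port=TUNNEL_PORT):
    """Run on the local machine: pull the bastion's port to localhost."""
    return ["ssh", bastion, "-L", f"{port}:localhost:{port}", "-f", "-N"]


def scp_upload_cmd(local_path, remote_path, port=TUNNEL_PORT, user="ec2-user"):
    """Copy a file to the notebook instance through the tunnel.

    scp takes the port as -P (uppercase), unlike ssh's -p.
    """
    return ["scp", "-P", str(port), local_path, f"{user}@localhost:{remote_path}"]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    commands = (
        reverse_tunnel_cmd(),
        local_forward_cmd(),
        scp_upload_cmd("train.py", "/home/ec2-user/SageMaker/train.py"),
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run` on the machine where that step belongs, once authentication is set up as described.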

elgalu commented 5 years ago

Update 2022

This is now solved via https://github.com/aws-samples/sagemaker-ssh-helper

kot-behemoth commented 5 years ago

@mklissa I know this is quite late, but it looks like AWS has thought about your particular use case: Tutorial: Set Up PyCharm Professional with a Development Endpoint. It works via AWS Glue's ability to create a developer endpoint. However, it looks like it only supports Py2.7.

mariokostelac commented 4 years ago

AWS does not natively support SSH-ing into SageMaker notebook instances, but nothing really prevents you from setting up SSH yourself.

The only problem is that these instances do not get a public IP address, which means you have to either create a reverse proxy (with ngrok for example) or connect to it via bastion box.

Steps to make the ngrok solution work:

If you want to automate it, I suggest using lifecycle configuration scripts.

Another good trick is wrapping downloading, unzipping, authenticating and starting ngrok into some binary in /usr/bin so you can just call it from SageMaker console if it dies.

It's a little bit too long to explain completely how to automate it with lifecycle scripts, but I've written a detailed guide on https://biasandvariance.com/sagemaker-ssh-setup/.

daysm commented 4 years ago

Thank you @mariokostelac! I used the most recent ngrok and needed to change two things:

elgalu commented 3 years ago

Update 2022

This is now solved via https://github.com/aws-samples/sagemaker-ssh-helper

Old text

This can also be solved via https://docs.aws.amazon.com/systems-manager/latest/userguide/managed_instances.html by setting up the SageMaker machine as if it were an on-prem computer that AWS SSM can manage; you can then ssh/scp/tunnel into it.

laptop> $ aws ssm start-session --region=eu-central-1 --target i-083ee1e47a95416c3

Starting session with SessionId: lgallucci-0d662d7d50462b043

ec2> $ nvidia-smi
Thu Nov 19 08:58:45 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8    14W / 150W |      0MiB /  7618MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
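
For the ssh/scp part, one common pattern is to use the SSM session as an SSH proxy. The sketch below only builds the command line. It assumes the AWS CLI and the Session Manager plugin are installed locally, that an SSH server is running on the target, and that your public key is authorized there. The function name and the login user are hypothetical; the target id and region are the ones from the session above.

```python
def ssh_over_ssm_cmd(target_id, user, region=None):
    """Build an ssh command that tunnels through an SSM session.

    ssh substitutes %h and %p with the target host and port, and the
    AWS-StartSSHSession document forwards the connection to that port.
    """
    proxy = (
        "aws ssm start-session --target %h "
        "--document-name AWS-StartSSHSession --parameters portNumber=%p"
    )
    if region:
        proxy += f" --region {region}"
    return ["ssh", "-o", f"ProxyCommand={proxy}", f"{user}@{target_id}"]


if __name__ == "__main__":
    # The user name is an assumption; use whichever account exists on the target.
    print(ssh_over_ssm_cmd("i-083ee1e47a95416c3", "ec2-user", region="eu-central-1"))
```

The same `-o ProxyCommand=...` option works with `scp`, which is what makes file transfer possible without opening any inbound port.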

ghost commented 3 years ago

This can also be solved via https://docs.aws.amazon.com/systems-manager/latest/userguide/managed_instances.html by setting up the SageMaker machine as if it were an on-prem computer that AWS SSM can manage; you can then ssh/scp/tunnel into it.

How do I know my SageMaker Studio notebook target id?

mariokostelac commented 3 years ago

This can also be solved via https://docs.aws.amazon.com/systems-manager/latest/userguide/managed_instances.html by setting up the SageMaker machine as if it were an on-prem computer that AWS SSM can manage; you can then ssh/scp/tunnel into it.

This is great, thanks a lot for that information. I'll try to set it up soon.

elgalu commented 3 years ago

@hanan-vian SM doesn't give you any target id; you have to set everything up yourself, as if it were a computer in your basement (so to speak). Update: this is now solved via https://github.com/aws-samples/sagemaker-ssh-helper

philschmid commented 3 years ago

@elgalu if I understand you correctly, I have to start an EC2 instance with a Deep Learning AMI? I cannot use this together with Estimator.fit() using the SDK?

elgalu commented 3 years ago

@philschmid we are discussing SSH access in SageMaker Studio/Notebooks in this thread. With EC2 you can already ssh, it's solved there.

moon-home commented 2 years ago

I am using SM with a custom Docker image, not a prebuilt AMI and not a notebook. I didn't find the instance id on the training job page. Did you find the instance id? @philschmid @elgalu

I tried getting instance metadata from inside the job and checking the logs in CloudWatch, but curling metadata or dynamic data (doc) didn't return a response, not even a 400-level error. Based on this doc, there are 3 possible solutions (using session-oriented IMDSv2, increasing the hop limit, and turning on metadata access). I will continue investigating this.
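
For reference, the session-oriented (IMDSv2) request mentioned above works roughly as follows on a regular EC2 instance: first obtain a token with a PUT request, then send it as a header on the metadata request. This is a minimal sketch using only the standard library; the function names are illustrative, and whether the metadata service is reachable at all from inside a SageMaker training container is not established in this thread.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"  # link-local instance metadata address


def token_request(ttl_seconds=21600):
    """PUT request that asks IMDSv2 for a session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def metadata_request(path, token):
    """GET request for a metadata path, authenticated with the session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_instance_id(timeout=2):
    """Return the instance id; raises URLError if the service is unreachable."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(metadata_request("instance-id", token), timeout=timeout) as resp:
        return resp.read().decode()
```

The short timeout matters here: if the endpoint is blocked, the call fails quickly instead of hanging the job.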


Got reply from AWS support:

Unfortunately, at this moment it is not possible to do so. As you may already know, the EC2 instances that are spun up sit in the SageMaker service team's account, so for security purposes SSH access to the instances is not permitted. If you wish to debug your training job, I'd suggest using local mode. Note that local mode is not available inside SM Studio because running a container inside a container is unstable.

ruslanmv commented 2 years ago

I've found not being able to SSH into notebook instances too limiting, so I've written a guide to setting it up using a bastion box: https://ruslanmv.com/blog/How-to-connect-to-Sagemaker-Notebook-via-SSH I hope this is helpful.

ivan-khvostishkov commented 2 years ago

I know this thread is quite old, but developers keep bumping into this discussion when searching for SageMaker and SSH.

Now there's an AWS repo with sample scripts to automate the SSH setup: https://github.com/aws-samples/sagemaker-ssh-helper.

It uses the managed instances capability of AWS Systems Manager (SSM), as suggested earlier by @elgalu.

As a result, the solution is secure, serverless, and supports not only connection into running jobs and endpoints with SSH/SSM, but also into SageMaker Studio containers, and allows integration with PyCharm and VSCode.

julien-c commented 2 years ago

really cool @ivan-khvostishkov, that's very helpful

harrypawar commented 1 year ago

For SageMaker inference endpoints, you can now use SSM to get shell-level access to the container by enabling it through the API: https://docs.aws.amazon.com/sagemaker/latest/dg/ssm-access.html
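
As a sketch of what enabling it through the API can look like, the linked page describes an `EnableSSMAccess` flag on each production variant of the endpoint configuration. The example below assumes boto3 is installed and credentials are configured. The function names, model name, config name, variant name, and instance type are placeholders, and the IAM permissions the endpoint's role needs are covered in the linked documentation rather than here.

```python
def production_variant_with_ssm(model_name, instance_type="ml.m5.xlarge",
                                variant_name="AllTraffic"):
    """Production variant definition with SSM shell access switched on."""
    return {
        "VariantName": variant_name,
        "ModelName": model_name,          # an existing SageMaker model (placeholder)
        "InitialInstanceCount": 1,
        "InstanceType": instance_type,    # placeholder instance type
        "EnableSSMAccess": True,          # the flag described in the linked page
    }


def create_endpoint_config(config_name, model_name):
    """Create an endpoint configuration whose variant allows SSM access."""
    # Imported lazily so the helper above can be used without boto3 installed.
    import boto3

    client = boto3.client("sagemaker")
    return client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[production_variant_with_ssm(model_name)],
    )
```

An endpoint created from (or updated to) such a configuration should then accept SSM sessions as described in that documentation.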