Thoughts on using a configuration management framework?

jmahlik commented 11 months ago

It's pretty hard to get this up and running in an account that has restricted internet access.

I had to fork and refactor almost all of the bash scripts. This was quite a challenge as they are a little unwieldy (I mean, it is bash after all). So I had a thought based on how I handle setting up dev environments on Linux boxes.

Moving the install/run functionality to a declarative configuration management system would make maintaining, extending and using the project easier.

What would your thoughts be on managing the installs and configurations via something like Ansible? I suggest Ansible since it's lightweight and easy to work with. It's a Python package, so it only needs Python, which we already have. But it could be any config system.

The user experience could remain the same: the bash scripts would be shims around the config manager. Likely, it could even be simplified, with fewer steps to get up and running. You just run a command and it gets the system into the desired state, instead of having to nohup a bunch of bash scripts.
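A minimal sketch of what such a shim could look like, assuming Ansible is installed on the box. The function name `run_configure`, the playbook path, and the `PLAYBOOK` / `ANSIBLE_PLAYBOOK_CMD` variables are hypothetical and not part of the project; only the `ansible-playbook` flags (`--connection`, `--inventory`) are real.

```shell
#!/usr/bin/env bash
# Hypothetical shim: keeps a familiar entry point but delegates the actual
# work to a playbook that runs against the local machine.
run_configure() {
  # Hypothetical default path; override with PLAYBOOK=/some/other.yml
  local playbook="${PLAYBOOK:-playbooks/configure.yml}"

  # ANSIBLE_PLAYBOOK_CMD is overridable so the shim can be dry-run or stubbed.
  # 'localhost,' (with the trailing comma) is an inline one-host inventory.
  "${ANSIBLE_PLAYBOOK_CMD:-ansible-playbook}" \
    --connection=local \
    --inventory 'localhost,' \
    "$playbook" "$@"
}

# Example call (commented out, since it needs Ansible and a real playbook):
# run_configure --extra-vars "install_url=https://example.com/tool.tar.gz"
```

Any extra arguments are passed through to `ansible-playbook`, so options such as install URLs could be supplied with `--extra-vars` rather than by editing the scripts.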

It'd be easier to:

Allow options like install URLs for the dependencies
Not rely on the working directory to source bash files
Avoid multiple re-installs to make it easier to run in a lifecycle config
Extend it by modifying or including additional config
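To illustrate two of these points (configurable install URLs and skipping work that is already done), here is a rough sketch in plain bash rather than Ansible. The variable `TOOL_URL`, its default value, and the function `install_if_missing` are made up for illustration; the real scripts use different names, and the "install" step here only prints what it would do.

```shell
#!/usr/bin/env bash
# Hypothetical variable and placeholder URL: the caller can point the
# download at an internal mirror without patching the script.
TOOL_URL="${TOOL_URL:-https://example.com/tool.tar.gz}"

# Skip the install step when the command is already on PATH, so repeated
# runs (e.g. from a lifecycle config) do not reinstall anything.
install_if_missing() {
  local cmd="$1"
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd already installed, skipping"
  else
    # A real script would download and install here; this sketch only reports.
    echo "would install $cmd from $TOOL_URL"
  fi
}

install_if_missing sh
install_if_missing some-tool-that-is-not-installed
```

A config manager gives the same two properties declaratively, which is the main appeal over hand-written guards like this one.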

I'd be willing to contribute work towards this since maintaining a copy of the bash scripts is quite painful. I'm already in the process of exploring a playbook for starting the ssh helper.

ivan-khvostishkov commented 11 months ago

Hi, Justin, great to see your interest and the willingness to contribute!

Would you elaborate a little bit more on the problem that you're trying to solve?

Also, have you already checked the FAQ section "I'm running SageMaker in a VPC. Do I need to make extra configuration?" It shows example Dockerfiles where everything comes pre-installed, so you don't need to "configure" anything extra, change URLs, etc.

jmahlik commented 10 months ago

The particular use case is connecting SageMaker Studio's Jupyter server app to the kernel gateway apps to enable interactive plotting libraries that need a web server running, similar to the web VNC example.

I did see the Dockerfiles. Building them in an environment without direct internet access isn't possible (same issue as running the scripts directly).

A couple specific things I thought a config manager could help address:

If one has to patch the bash files, e.g. to change a download/install location, it has to be done in place, or all of them have to be moved to a different directory, since the bash scripts source each other based on the directory of the script. Let's say one wanted to keep the artifacts on S3 so we aren't reliant on GitHub to download a binary. It's hard to pull that off currently.
The other thing I ran into was repeated apt/yum installs even though things were already installed, which made it hard to finish within the timeout of a lifecycle script.
The scripts don't exit on failure; they continue execution. So you're not really sure whether parts completed successfully until it hits the end with hard-to-debug errors from prior failed scripts. I had to add set -euo pipefail to all of them to debug through the failing parts.
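For anyone unfamiliar with what `set -euo pipefail` changes, here is a small self-contained demonstration. It is not taken from the project's scripts; it only compares how bash reports a failing pipeline with and without these options.

```shell
#!/usr/bin/env bash
# Without pipefail, a pipeline's exit status is that of its last command
# (cat), so the failure of `false` is hidden.
status_without=$(bash -c 'false | cat; echo $?')

# With pipefail, the pipeline returns the failing command's non-zero status.
status_with=$(bash -c 'set -o pipefail; false | cat; echo $?')

# With -e added, the script stops at the failing pipeline, so the echo
# after it never runs and nothing is captured.
reached=$(bash -c 'set -euo pipefail; false | cat; echo reached' || true)

echo "exit status without pipefail: $status_without"
echo "exit status with pipefail:    $status_with"
echo "output after failure with -euo pipefail: '$reached'"
```

The `-u` flag additionally turns references to unset variables into errors, which helps catch typos in variable names.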

ivan-khvostishkov commented 10 months ago

Thank you, @jmahlik, I will take a look at your concerns. As to the last point, what version of the library do you use on the client and on the remote? The pipefail option has been added to some scripts in the latest version. Which ones do you think still need this option turned on?

DrJeckyl commented 10 months ago

I ended up doing the same thing, @jmahlik. I forked the code, refactored it to my needs and built all the prerequisites into a custom image for SageMaker Studio. Then a lifecycle config simply registers the instance and sets the SSHOwner tag etc.

On the local side, I also refactored some of the code into a Python install to integrate with VSCode for our Windows users.

ivan-khvostishkov commented 10 months ago

Hi, @DrJeckyl, do you also have no Internet access during the build of the custom image and need to download tools like AWS CLI and SSM Agent from internal locations?

DrJeckyl commented 10 months ago

No @ivan-khvostishkov - We use a code pipeline with internet access when building the custom images. However, a lifecycle config is needed to set the Owner tags when a kernel is launched. We had to modify the sm-ssh-ide, sm-init-ssm and sm-start-ssh.

Admittedly, we are a few versions behind and should update to see what's different now.

ivan-khvostishkov commented 23 hours ago

Hi, @DrJeckyl, did you have a chance to try the latest version 2.2.0 of SSH Helper to see if the pipefail option helps you to debug the scripts?

If you have Internet access in the build pipeline, then you can just add this command to your Dockerfile:

RUN sm-ssh-ide configure

It will download and install all libraries, so later, when you run the lifecycle config script, it will detect that everything is already configured and won't try to install anything from the Internet. In this case you don't need to patch the locations of the libraries.

I understand that you want to patch the lifecycle script with the specific value for LOCAL_USER_ID, but I don't yet understand how Ansible can help you in this case. A better option, in my opinion, would be to fetch the values from Systems Manager Parameter Store.

Of course, you need to modify the scripts a little bit to call the Systems Manager API, and you are encouraged to do so, because this repository is sample code.
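As a rough sketch of that modification, assuming the value is stored as a plain String parameter: the function name `get_local_user_id`, the default parameter name, and the `AWS_CMD` override are hypothetical, while `aws ssm get-parameter` with `--name`, `--query` and `--output` is the standard AWS CLI call.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a value from Systems Manager Parameter Store.
# The default parameter name below is made up for illustration.
get_local_user_id() {
  local param_name="${1:-/sagemaker-ssh-helper/local-user-id}"

  # AWS_CMD is overridable so the function can be stubbed without credentials.
  "${AWS_CMD:-aws}" ssm get-parameter \
    --name "$param_name" \
    --query 'Parameter.Value' \
    --output text
}

# Usage in a lifecycle script (needs credentials allowing ssm:GetParameter):
# LOCAL_USER_ID="$(get_local_user_id /my/param/name)"
```

For a SecureString parameter, the call would also need the `--with-decryption` flag.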

But is there any logic that you propose to be part of the main branch? If we add a new lifecycle configuration script that fetches the user IDs from Parameter Store, will it help to resolve your issue?

Let me know your thoughts.

ivan-khvostishkov commented 23 hours ago

@jmahlik Following up on your original post, could you please help me understand at which point you propose to run Ansible? As part of the lifecycle configuration script, as part of the sm-ssh-ide script, etc.?

You mentioned that you're already in the process of creating the playbook. Have you succeeded? It would be great if you could share your learnings.

aws-samples / sagemaker-ssh-helper

Thoughts on using a configuration management framework? #37