aws-samples / sagemaker-studio-apps-lifecycle-config-examples


auto-stop-idle scripts sometimes not working. #12

Closed greyes-trc closed 4 months ago

greyes-trc commented 5 months ago

I've tried these scripts on both Code Editor and JupyterLab apps. Yesterday they were working. Today the scripts still appear to start, as indicated by the last log line in LifecycleConfigOnStart:

JupyterLab/default/LifecycleConfigOnStart

+ echo */2 * * * * /bin/bash -ic '/opt/conda/bin/python /var/tmp/auto-stop-idle/sagemaker_studio_jlab_auto_stop_idle/auto_stop_idle.py --idle-time 300 --hostname 0.0.0.0 --port 8888 --base-url /jupyterlab/default/ --ignore-connections True --skip-terminals False --state-file-path /var/tmp/auto-stop-idle/auto_stop_idle.st >> /var/log/apps/app_container.log'

/CodeEditor/default/LifecycleConfigOnStart

echo */2 * * * * /bin/bash -ic '/opt/conda/bin/python /var/tmp/auto-stop-idle/sagemaker_code_editor_auto_shut_down/auto_stop_idle.py --time 300 --region eu-central-1 >> /var/log/apps/app_container.log'

After that, the last few JupyterLab/default logs only show:

[I 2024-03-28 15:13:50.675 ServerApp] Client connected. ID: b897bf3a043b44db9b39988623623653
[I 2024-03-28 15:13:54.440 ServerApp] Client disconnected. ID: b897bf3a043b44db9b39988623623653

These lines repeat several times with different IDs, instead of the usual logs that track the time of the last activity in JupyterLab.

The /CodeEditor/default logs weren't shown at all yesterday, yet the script was working as intended. Today they are showing, just not outputting anything useful, and the script is not working either.

2024-03-28 15:05:53,228 INFO success: codeeditorserver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
[15:05:53] Extension host agent started.
[15:06:26] [169.255.255.1][ad61f395][ManagementConnection] New connection established.
[15:06:26] [169.255.255.1][dabd5c36][ExtensionHostConnection] New connection established.
[15:06:27] [169.255.255.1][dabd5c36][ExtensionHostConnection] <310> Launched Extension Host Process.

Have you experienced something like this? I find it odd that it's suddenly behaving like this without any apparent reason.


UPDATE

Apparently this might have to do with a new SageMaker Distribution image. Yesterday, the latest SageMaker Distribution image was 1.5; today 1.6 seems to be the newest one. I tried this with version 1.5 and it worked again, then with version 1.6 and it didn't. This might be worth investigating to make sure the LCC works with the newest SageMaker Distribution images.

giuseppeporcelli commented 5 months ago

Thanks for reporting this issue @greyes-trc

We have done some research, and it looks like while the cron package is now preinstalled in SageMaker Distribution 1.6, the cron service is not running. I have opened an issue on GitHub for this: https://github.com/aws/sagemaker-distribution/issues/354
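To check this diagnosis on a given image, a small probe along the following lines could be run from a terminal in the app. It is only a sketch: the function name `cron_verdict` is made up for illustration, and it assumes `pgrep` and `crontab` are present in the image (when they are missing, the probe falls back to reporting "not running" or "no entry").

```shell
#!/bin/bash
# Diagnostic sketch: is cron installed, and is the daemon actually running?

# Map (package status, daemon flag) to a one-line verdict. Pure function, no side effects.
cron_verdict() {
    local status="$1" running="$2"
    if [ "$status" != "installed" ]; then
        echo "cron package missing: install it"
    elif [ "$running" != "yes" ]; then
        echo "cron installed but not running: start it with 'sudo cron'"
    else
        echo "cron installed and running"
    fi
}

# Gather the two facts; both probes fall back to a safe default on failure.
status="$(dpkg-query -W --showformat='${db:Status-Status}' cron 2>/dev/null || echo not-installed)"
if pgrep -x cron >/dev/null 2>&1; then running=yes; else running=no; fi

cron_verdict "$status" "$running"

# Show the current user's crontab entries that mention the auto-stop script, if any.
crontab -l 2>/dev/null | grep auto_stop_idle || echo "no auto_stop_idle crontab entry found"
```

On an image with the behaviour described above, the expected verdict would be the "installed but not running" line.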

In the meantime, you can implement the following logic in the LCC to:

  1. Avoid installing the cron package if it is already installed
  2. Start the cron service manually

# Check if cron needs to be installed
status="$(dpkg-query -W --showformat='${db:Status-Status}' "cron" 2>&1)"
if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
    # Fixing invoke-rc.d: policy-rc.d denied execution of restart.
    sudo /bin/bash -c "echo '#!/bin/sh
    exit 0' > /usr/sbin/policy-rc.d"

    # Installing cron.
    echo "Installing cron..."
    sudo apt install -y cron
else
    echo "Package cron is already installed."
    sudo cron
fi

As an example, for the JupyterLab LCC, you would replace lines 32-38 of this file with the above code: https://github.com/aws-samples/sagemaker-studio-apps-lifecycle-config-examples/blob/main/jupyterlab/auto-stop-idle/on-start.sh

greyes-trc commented 5 months ago

Thanks a lot! Not sure if I did something wrong, but for me this worked great in version 1.6, while the LCC exited with a non-zero status in version 1.5. To fix this, I changed one line in your code to this:

status="$(dpkg-query -W --showformat='${db:Status-Status}' "cron" 2>&1)" || status="not-installed"

This prevents the non-zero exit while the logic stays the same. Posting it in case someone else comes across this issue.
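For anyone combining both suggestions, here is a rough sketch of how the check and the manual start might fit together in a script that runs under `set -e`. The function names (`package_status`, `cron_running`, `ensure_cron`) are hypothetical and not part of the repository's on-start.sh, and `pgrep` is assumed to be available in the image.

```shell
#!/bin/bash
set -eu

# Print the dpkg status of package $1, or "not-installed" if the query fails.
# The "||" fallback keeps a failed lookup from aborting a script under "set -e".
package_status() {
    dpkg-query -W --showformat='${db:Status-Status}' "$1" 2>/dev/null || echo "not-installed"
}

# Succeed only if a process named exactly "cron" is running (assumes pgrep exists).
cron_running() {
    pgrep -x cron >/dev/null 2>&1
}

# Install cron when it is missing, then make sure the daemon is up.
ensure_cron() {
    if [ "$(package_status cron)" != "installed" ]; then
        # Work around "invoke-rc.d: policy-rc.d denied execution" during install.
        printf '#!/bin/sh\nexit 0\n' | sudo tee /usr/sbin/policy-rc.d >/dev/null
        echo "Installing cron..."
        sudo apt-get install -y cron
    else
        echo "Package cron is already installed."
    fi
    if ! cron_running; then
        # Starting cron a second time fails, so tolerate a non-zero exit here.
        sudo cron || echo "Could not start cron (it may already be running)."
    fi
}
```

The block only defines functions. Calling `ensure_cron` before the crontab entry is registered should behave the same on both 1.5 and 1.6, since the lookup failure is handled inside `package_status` rather than by the caller.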