script.sh - feedback / troubleshooting assistance

Adam-McGill commented 2 months ago

First of all, thank you for sharing your code base/project.

I am running script.sh and running into a couple of issues.

I updated the code to create the user if it doesn't exist. I assume this script/step expects the image to have the user created already.
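A minimal sketch of a "create the user if it doesn't exist" step. It assumes an Ubuntu-style `useradd` and root privileges; `ensure_user` and its argument are hypothetical names, not taken from script.sh:

```bash
#!/usr/bin/env bash
# Sketch: create the runner user only when it does not already exist.
# ensure_user is a hypothetical helper; script.sh may use its own variable names.
ensure_user() {
  local user="$1"
  if id -u "$user" >/dev/null 2>&1; then
    return 0  # user already exists (e.g. pre-created with the VMSS image)
  fi
  # -m creates the home directory, -s sets the login shell (requires root)
  useradd -m -s /bin/bash "$user"
}
```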
The issue I am running into is with this line of code: "find /opt/post-generation -mindepth 1 -maxdepth 1 -type f -name '*.sh' -exec bash {} \;", because the directory doesn't exist. I am running this on an Ubuntu 22.04 image. What is this directory used for? Can it just be ignored?

mysteq commented 2 months ago

Yes, the script assumes the user has been pre-created as part of creating an Azure Virtual Machine Scale Set (VMSS).

For the images we run, some of the installed applications need some setup done in the user context after the VMSS instance is created and started, and this line runs those scripts. If you run your own custom image without such needs, that line can be ignored.
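Rather than deleting the line, one option is to guard it so the step is simply skipped when the directory is absent. This is a sketch only; `run_post_generation` is a hypothetical wrapper around the original `find` command, with the directory as an optional argument:

```bash
#!/usr/bin/env bash
# Sketch: run post-generation scripts only if the directory exists.
# run_post_generation is a hypothetical wrapper, not part of the original script.sh.
run_post_generation() {
  local dir="${1:-/opt/post-generation}"
  if [[ ! -d "$dir" ]]; then
    echo "No post-generation directory at $dir; skipping" >&2
    return 0
  fi
  # Same command as in script.sh: run each top-level *.sh file with bash
  find "$dir" -mindepth 1 -maxdepth 1 -type f -name '*.sh' -exec bash {} \;
}
```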

Adam-McGill commented 2 months ago

Sounds good, thanks for that. Appreciate the quick reply.

I was able to spin up the instance and connect it to GitHub. After it disconnects, I see the offline runner in GitHub; according to GitHub's documentation, an ephemeral runner is cleaned up automatically within 24 hours. I can test this to validate.

So to control the VMSS autoscaling, I will need some mechanism to handle the job queue from GitHub.

Adam-McGill commented 1 month ago

@mysteq Regarding de-registration of the runner / VMSS instance: the instance is removed and the GitHub runner goes offline. Are you manually cleaning up offline runners in GitHub, or waiting the 14 days for them to be deleted automatically?

mysteq commented 1 month ago

This might be a bit difficult to answer specifically for your situation, since I don't know all the details of your VMSS and how you need to set it up. But here is a somewhat generic answer:

Part of the onboarding script regularly checks for shutdown events on the VMSS instance; if it finds one, it should properly deregister the runner from GitHub, so there is no manual cleanup and no waiting. The documentation is a bit messy, but it is the section about enabling notification for instance termination: https://docs.fortytwo.io/marketplace-offerings/self-hosted-runners/github/step2/#offboard-github-runner-upon-termination-eventsscale-down-linux
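The general pattern of checking for termination events can be sketched against Azure's Scheduled Events endpoint in the Instance Metadata Service (IMDS). This is not the script's actual implementation; it assumes termination notification is enabled on the VMSS, and `RUNNER_DIR`, `REMOVE_TOKEN`, and the function names are hypothetical:

```bash
#!/usr/bin/env bash
# Sketch: poll IMDS Scheduled Events and deregister the runner on Terminate.
# Assumptions: termination notification is enabled on the VMSS; RUNNER_DIR and
# REMOVE_TOKEN are hypothetical variables (runner install dir, removal token).

IMDS_URL="http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

# Returns 0 if the Scheduled Events JSON on stdin contains a Terminate event.
has_terminate_event() {
  grep -Eq '"EventType"[[:space:]]*:[[:space:]]*"Terminate"'
}

# Polls IMDS every 10 seconds; on Terminate, removes the runner registration.
watch_for_termination() {
  while true; do
    if curl -s --max-time 5 -H Metadata:true "$IMDS_URL" | has_terminate_event; then
      (cd "$RUNNER_DIR" && ./config.sh remove --token "$REMOVE_TOKEN")
      break
    fi
    sleep 10
  done
}
```

Stopping the runner service first and acknowledging the event (so Azure can proceed before the notification timeout) are left out of this sketch.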

This works fine for our regular use cases, where we use only the native VMSS scaling rules, and I would expect it to work fine for you too with a custom Ubuntu image.

If you base it on the ephemeral option, the script also has some parts for killing off the instance using automatic health repair and re-imaging, but that part is currently somewhat untested and undocumented.

mysteq commented 1 month ago

And yes, as mentioned, we've mainly used the native VMSS scaling rules, since they have covered our needs so far. Cleverer ways of scaling, such as scaling on GitHub queue sizes, are not covered by the script and would need to be handled by other means, such as Azure Automation, Function Apps, or Logic Apps.
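A rough sketch of the queue-based approach, meant to run from something like Azure Automation rather than from script.sh. It assumes repository-level runners, `jq`, and treats one queued workflow run as needing one instance (a run can contain several jobs, so this is an approximation); `OWNER`, `REPO`, `GH_TOKEN`, `RG`, `VMSS`, and the function names are hypothetical:

```bash
#!/usr/bin/env bash
# Sketch: size the VMSS from the number of queued GitHub workflow runs.
# Assumptions: OWNER, REPO, GH_TOKEN, RG and VMSS are hypothetical variables
# set by the caller; jq and the Azure CLI (az) are installed and logged in.

# Clamp the queued-run count into [min, max] to get the desired instance count.
desired_capacity() {
  local queued="$1" min="$2" max="$3"
  local n="$queued"
  if (( n < min )); then n="$min"; fi
  if (( n > max )); then n="$max"; fi
  echo "$n"
}

# Count queued workflow runs via the GitHub REST API.
queued_runs() {
  curl -s \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer $GH_TOKEN" \
    "https://api.github.com/repos/$OWNER/$REPO/actions/runs?status=queued&per_page=1" \
    | jq '.total_count'
}

# Apply the clamped capacity to the scale set; args: min max.
scale_vmss() {
  local capacity
  capacity="$(desired_capacity "$(queued_runs)" "$1" "$2")"
  az vmss scale --resource-group "$RG" --name "$VMSS" --new-capacity "$capacity"
}
```

For example, `scale_vmss 1 10` would keep between 1 and 10 instances. Scaling in this way does not know which instances are busy, so it still relies on proper deregistration when an instance is removed.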

Adam-McGill commented 1 month ago

Understood. As far as cleanup / de-registering the runner in GitHub goes: I am seeing it de-register the runner when scaling down, but the instance remains in GitHub as offline. Looking at the script, ephemeral is set to true by default, which is why I was expecting the offline instance to disappear from GitHub after 24 hours. I believe that is what is supposed to happen, but I could be wrong. If I scale up the VMSS, it picks up the offline instance and works as normal. I was just wondering whether I need a nightly cleanup process to remove offline GitHub runners, unless you are saying that on your end it does de-register the instance and remove it completely from GitHub, in which case it could be something on my end. Just confirming; I appreciate the feedback.
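If a nightly fallback is wanted, it could look roughly like this, using the GitHub REST API endpoints for listing and deleting self-hosted runners. It assumes repository-level runners, `jq`, a token with admin rights on the repository, and hypothetical variables `OWNER`, `REPO`, and `GH_TOKEN`; it only reads the first page of up to 100 runners:

```bash
#!/usr/bin/env bash
# Sketch: delete offline self-hosted runners from a repository.
# Assumptions: OWNER, REPO and GH_TOKEN are hypothetical variables; jq is
# installed; only the first 100 runners are checked (no pagination).

# Print the IDs of runners whose status is "offline" from the list JSON on stdin.
offline_runner_ids() {
  jq -r '.runners[] | select(.status == "offline") | .id'
}

# Thin curl wrapper adding the GitHub API headers.
gh_api() {
  curl -s -H "Accept: application/vnd.github+json" \
       -H "Authorization: Bearer ${GH_TOKEN}" "$@"
}

cleanup_offline_runners() {
  local api="https://api.github.com/repos/${OWNER}/${REPO}/actions/runners"
  local id
  gh_api "${api}?per_page=100" | offline_runner_ids | while read -r id; do
    gh_api -X DELETE "${api}/${id}"
  done
}
```

This also removes runners that are only temporarily disconnected, so it is best scheduled when no instances are expected to be starting up.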

mysteq commented 1 month ago

If it de-registers properly, there should be nothing left offline in GitHub. Ephemeral is not set to true by default; it is only set to true when you add the '-e' option.

Adam-McGill commented 1 week ago

Understood: add the '-e' option on the command line to use it. Reviewing script.sh, shouldn't line 53, "while getopts 's:g:n:r:u:l:df' opt; do", be "while getopts 's:g:n:r:u:l:def' opt; do" to include the e option?

mysteq commented 8 hours ago

Hi! Yes, there is a little bug there. Thanks!
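A sketch of the option parsing with `e` added to the optstring, and of how the flag could then be passed to the runner as `--ephemeral`. Only `-e` is handled explicitly; the other letters are accepted but their handling in the original script.sh is not reproduced, and `parse_opts`, `runner_config_args`, `EPHEMERAL`, and `RUNNER_ARGS` are hypothetical names:

```bash
#!/usr/bin/env bash
# Sketch: optstring with 'e' included, so -e is recognised.
# Only -e is handled here; the other options' original handling is omitted.
parse_opts() {
  local OPTIND=1 opt
  EPHEMERAL=false
  while getopts 's:g:n:r:u:l:def' opt; do
    case "$opt" in
      e) EPHEMERAL=true ;;          # only set when -e is passed
      s|g|n|r|u|l|d|f) : ;;         # placeholder for the original handling
      *) return 1 ;;                # invalid option (getopts prints the error)
    esac
  done
  return 0
}

# Build extra arguments for the runner's config.sh from the parsed flags.
runner_config_args() {
  RUNNER_ARGS=()
  if [[ "$EPHEMERAL" == true ]]; then
    RUNNER_ARGS+=(--ephemeral)
  fi
}
```

`RUNNER_ARGS` could then be appended to the existing registration call, e.g. `./config.sh --url "$URL" --token "$TOKEN" "${RUNNER_ARGS[@]}"`, with `URL` and `TOKEN` as placeholders.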

fortytwoservices / terraform-azurerm-selfhostedrunnervmss, issue #205