Instance randomly not connecting to SSM

michael-kutsch commented 1 year ago

Bug Description

In some cases, the SSM agent takes longer than the default stop-timeout of 5 minutes to connect to SSM, so the EC2 instance comes up but the user cannot create an SSM session. The SSM agent could also have crashed for some reason, with the same result: you cannot connect.

Steps to Reproduce

Disclaimer: this is hard to reproduce because it depends on several factors that are outside the user's control.

1) basti init
2) basti connect
3) basti instance does not show up in session manager
4) connection times out
5) basti instance stops

Expected Behavior

Basti instance is usable via SSM

Current Behavior

See Steps to Reproduce

Possible Solution (Optional)

I see three options:

1) increase / make the default stop-timeout configurable

2) add additional reboot to the instance after init (manual reboot helped in my case)

3) (preferred option) make the basti instance check whether the SSM agent connected successfully so a session can be initialized. If not, (force) restart the SSM agent, wait X s, recheck. If the basti instance still cannot connect to SSM, perform a reboot. If the reboot does not help, terminate the instance.
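
The escalation in option 3 could look roughly like the sketch below. It is only an illustration, not Basti code: `ensure_ssm_connected` and all four callables are hypothetical placeholders for whatever the instance would actually use to check the agent, restart it, reboot, and terminate, and the 30 s default stands in for the unspecified "X s".

```python
import time
from typing import Callable


def ensure_ssm_connected(
    is_connected: Callable[[], bool],
    restart_agent: Callable[[], None],
    reboot: Callable[[], None],
    terminate: Callable[[], None],
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Escalate step by step: restart the agent, then reboot, then terminate.

    All callables are hypothetical hooks supplied by the caller.
    Returns a label naming the last step that was needed.
    """
    if is_connected():
        return "connected"

    # Try the cheaper remedy first, then the more disruptive one.
    for remedy, label in ((restart_agent, "agent-restarted"), (reboot, "rebooted")):
        remedy()
        sleep(wait_seconds)  # give the agent time to register with SSM
        if is_connected():
            return label

    # Nothing helped: give up on this instance.
    terminate()
    return "terminated"
```

Note that a real reboot would interrupt a script running on the instance itself, so an on-instance version would need to persist which step it had reached (or run the check from outside the instance). The sketch ignores that and treats every step as a blocking call.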

Related Issues/PRs

none

BohdanPetryshyn commented 1 year ago

@DrFunk-n-stein, thank you for submitting the issue!

First of all, I'd ask you to make sure you're using the latest version of Basti. Recently, I added additional retries on the client side (in basti connect) which solved the issue of the port forwarding session not starting when the bastion instance is online in SSM.

The problem you described is different from what I recently fixed. To be honest, I haven't noticed such SSM agent behavior even though we use Basti hundreds of times a day. However, this can happen for sure, and I think the best solution would be a variation of solution #3 you suggested.

I think the solution can be slightly simplified by rebooting the instance right after noticing the SSM agent malfunction (skipping the SSM agent restart).

I'd like to ask you if you want to become a contributor and introduce such a health check. This would really help the project🤗 Otherwise, I'll do this as soon as I have some free time.

BohdanPetryshyn commented 1 year ago

Regarding the stop timeout: the basti connect command starts marking the bastion instance as in use as soon as it begins trying to connect to it. So the instance shouldn't stop unless the Basti CLI gave up retrying or was manually stopped.

michael-kutsch commented 1 year ago

About becoming a contributor: yes please :D

This is such a common use case, and I like the very simple UX of basti.

BohdanPetryshyn commented 1 year ago

It's always so nice when somebody volunteers to become a contributor. Thank you!

When can you start working on this?

michael-kutsch commented 1 year ago

Soonish 😅 (Father of two, I'll try to onboard myself tonight, first PR will take a while)

michael-kutsch commented 1 year ago

@BohdanPetryshyn can you assign this one to me please? It seems I lack the permissions.

michael-kutsch commented 1 year ago

Just confirmed that it also does not work in different setups.

The only commonality I could find is that the overall setup uses a central egress pattern, which means that the instances' traffic is routed via a transit gateway to another AWS account that passes the traffic through central NAT gateways.

Maybe it's a latency or routing thing. Either way, a manual restart of the instance did the trick again.

That is indeed an interesting issue 😅

I will soon add private endpoints for the required services to the setup and test whether that already resolves it. Nonetheless, the SSM agent is not able to connect, so this is worth fixing.

BohdanPetryshyn commented 1 year ago

Hi, @DrFunk-n-stein 👋

Are you still up to implementing the health check yourself?

michael-kutsch commented 1 year ago

Thanks for pinging me - feel free to hand it over to someone else. My job and private schedule prevented me from putting time into it. Sorry for that - I will let you know once I'm available.

BohdanPetryshyn commented 1 year ago

No problem! Thank you for letting me know!

bobveringa commented 1 year ago

@DrFunk-n-stein Could you maybe provide the basti logs for this? Logs are stored in /var/log/basti/stop-if-not-used.log. Perhaps this is caused by some type of error.

michael-kutsch commented 1 year ago

I can check them the next time it happens.

But I'm 99% certain it's because the instance is not connected to Session Manager and shuts off after 5 minutes. I checked via both the console and awscli that the instance is not showing up in Session Manager (I used https://pypi.org/project/aws-ssm-tools/, which is a nice wrapper for SSM commands).

On reboot, most of the time, the SSM agent can connect properly and then it works.
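
For anyone who wants to script the "is the instance showing up in Session Manager" check, here is a minimal sketch. It assumes a boto3 SSM client is passed in and uses its `describe_instance_information` call; the helper name `is_online_in_ssm` is made up for this example and is not part of Basti or aws-ssm-tools.

```python
from typing import Any


def is_online_in_ssm(ssm_client: Any, instance_id: str) -> bool:
    """Return True if SSM reports the instance with PingStatus 'Online'.

    ssm_client is expected to behave like boto3.client("ssm").
    An instance whose agent never registered is simply absent from the list.
    """
    response = ssm_client.describe_instance_information(
        Filters=[{"Key": "InstanceIds", "Values": [instance_id]}]
    )
    return any(
        info.get("PingStatus") == "Online"
        for info in response.get("InstanceInformationList", [])
    )
```

With boto3 installed and credentials configured, this would be called as `is_online_in_ssm(boto3.client("ssm"), instance_id)`. An empty result list is the symptom described in this issue.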

maartenvanderhoef commented 3 months ago

The problem is the sheer speed of AWS.

First the Role and Instance get created. The instanceID and Role are then used to create granular inline permissions for ec2:CreateTags.

The problem is that when the instance starts quickly, which it does, it comes up without the SSM permissions being active yet. The ssm-agent then wants to wait for 28 minutes or so, and well before that the instance shuts down by design.

Better would be to

1) Create the Role
2) Add the first inline permissions for ssm
3) Create the instance
4) Add the second inline permissions for Tagging
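
A rough sketch of that ordering, assuming boto3-style IAM and EC2 clients are passed in. This is not Basti's actual provisioning code: the function name, the policy names "ssm-access" and "self-tagging", the instance profile step, and all policy documents are placeholders supplied by the caller.

```python
import json
from typing import Any, Callable, Dict


def provision_bastion(
    iam: Any,
    ec2: Any,
    role_name: str,
    profile_name: str,
    trust_policy: Dict[str, Any],
    ssm_policy: Dict[str, Any],
    tagging_policy_for: Callable[[str], Dict[str, Any]],
    launch_params: Dict[str, Any],
) -> str:
    """Create the role with SSM permissions before the instance exists."""
    # 1) Create the role.
    iam.create_role(
        RoleName=role_name, AssumeRolePolicyDocument=json.dumps(trust_policy)
    )
    # 2) Add the SSM permissions first, so they exist when the agent starts.
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="ssm-access",  # hypothetical policy name
        PolicyDocument=json.dumps(ssm_policy),
    )
    iam.create_instance_profile(InstanceProfileName=profile_name)
    iam.add_role_to_instance_profile(
        InstanceProfileName=profile_name, RoleName=role_name
    )
    # 3) Only now create the instance.
    response = ec2.run_instances(
        MinCount=1,
        MaxCount=1,
        IamInstanceProfile={"Name": profile_name},
        **launch_params,
    )
    instance_id = response["Instances"][0]["InstanceId"]
    # 4) Add the tagging permissions, which need the instance ID.
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="self-tagging",  # hypothetical policy name
        PolicyDocument=json.dumps(tagging_policy_for(instance_id)),
    )
    return instance_id
```

Even with this ordering, IAM changes can take a moment to propagate, so this reduces the race rather than guaranteeing it away.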

maartenvanderhoef commented 3 months ago

Or we can use a generic policy that allows an instance to tag itself, as shown here:

https://unixorn.github.io/post/iam-self-tagging/

I can PR if there is support for it.

BohdanPetryshyn commented 2 months ago

Hey @maartenvanderhoef, that's a very nice catch! Thank you for figuring this out 👍

I would go with the "iam-self-tagging" approach you mentioned to simplify the code (to still create all the inline policies in one place and with a single log message). I would very much appreciate your help with implementing this stability improvement ❤️
