Open michael-kutsch opened 1 year ago
@DrFunk-n-stein, thank you for submitting the issue!
First of all, I'd ask you to make sure you're using the latest version of Basti. Recently, I added additional retries on the client side (in basti connect), which solved the issue of the port forwarding session not starting when the bastion instance is online in SSM.
The problem you described is different from what I recently fixed. To be honest, I haven't noticed such SSM agent behavior even though we use Basti hundreds of times a day. However, this can happen for sure, and I think the best solution would be a variation of solution #3 you suggested.
I think the solution can be slightly simplified by rebooting the instance right after noticing the SSM agent malfunction (skipping the SSM agent restart).
I'd like to ask you if you want to become a contributor and introduce such a health check. This would really help the project. Otherwise, I'll do this as soon as I have some free time.
Regarding the stop timeout: the basti connect command starts marking the bastion instance as in use as soon as it starts trying to connect to it. So the instance shouldn't stop unless the Basti CLI gave up retrying or was manually stopped.
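The idle-stop rule described above can be sketched roughly as follows. This is a minimal illustration, not Basti's actual code: `UsageState`, `lastMarkedInUseMs`, and `shouldStop` are hypothetical names, and the only value taken from this thread is the 5-minute default stop timeout.

```typescript
// Hypothetical model of the idle-stop rule; names are illustrative only.
interface UsageState {
  // Timestamp (ms) of the last "in use" mark. Assumed to be refreshed
  // while a client is still trying to connect or is connected.
  lastMarkedInUseMs: number;
}

// Default stop timeout of 5 minutes, as mentioned in this issue.
const STOP_TIMEOUT_MS = 5 * 60 * 1000;

// The instance should stop only after it has gone unmarked for the full timeout.
function shouldStop(
  state: UsageState,
  nowMs: number,
  timeoutMs: number = STOP_TIMEOUT_MS,
): boolean {
  return nowMs - state.lastMarkedInUseMs >= timeoutMs;
}
```

Under this model, a client that keeps retrying keeps the instance alive, while an instance that nobody can reach (for example, because the SSM agent never connected) is never marked and stops once the timeout elapses.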
About becoming a contributor: yes please :D
This is such a common use case, and I like the very simple UX of basti.
It's always so nice when somebody volunteers to become a contributor. Thank you!
When can you start working on this?
Soonish (father of two; I'll try to onboard myself tonight, but the first PR will take a while).
@BohdanPetryshyn can you assign this one to me, please? It seems I lack the permissions.
Just confirmed that it also fails in different setups.
The only commonality I could find is that the overall setup uses a central egress pattern, which means that the instances' traffic is routed via a transit gateway to another AWS account that passes the traffic through central NAT gateways.
Maybe it's a latency or routing issue. In any case, a manual restart of the instance did the trick again.
That is indeed an interesting issue.
I will soon add private endpoints for the required services to the setup and test whether that already resolves it. Nonetheless, the SSM agent is not able to connect, so this is worth fixing.
Hi, @DrFunk-n-stein!
Are you still up to implementing the health check yourself?
Thanks for pinging me - feel free to hand it over to someone else. My job and private schedule prevented me from putting time into it. Sorry for that - I will let you know once I'm available.
No problem! Thank you for letting me know!
@DrFunk-n-stein Could you maybe provide the basti logs for this? Logs are stored in /var/log/basti/stop-if-not-used.log. Perhaps this is caused by some type of error.
I can check them the next time it happens.
But I'm 99% certain it's because the instance is not connected to Session Manager and shuts off after 5 minutes. I checked via the console and awscli that the instance is not showing up in Session Manager (I used https://pypi.org/project/aws-ssm-tools/, which is a nice wrapper for SSM commands).
On reboot, most of the time, the SSM agent can connect properly and then it works.
The problem is the sheer speed of AWS.
First the Role and Instance get created. The instance ID and Role are then used to create granular inline permissions for ec2:CreateTags.
The problem is that when the instance boots quickly, which they do, it starts before the SSM permissions are active. The ssm-agent then wants to wait for 28 minutes or so, and well before that the instance shuts down by design.
Better would be to
Or we can use a generic policy that allows an instance to tag itself, as shown here:
https://unixorn.github.io/post/iam-self-tagging/
I can PR if there is support for it.
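For illustration, a generic self-tagging policy could look roughly like the sketch below. The point of the approach is that the policy contains no concrete instance ID, so it can be attached to the role before the instance launches, which avoids the race described above. The `selfTaggingPolicy` helper and its types are hypothetical, and the condition (`aws:ARN` compared against `${ec2:SourceInstanceARN}`) is my recollection of the commonly cited pattern, so please verify it against the linked post and the AWS documentation before relying on it.

```typescript
// Hypothetical helper: builds an IAM policy document that is meant to let
// an EC2 instance create tags only on itself. Not Basti's actual code.
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string;
  Condition?: Record<string, Record<string, string>>;
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

function selfTaggingPolicy(): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["ec2:CreateTags"],
        Resource: "*",
        // Assumption: restricts the call to the instance making it.
        // "${ec2:SourceInstanceARN}" is an IAM policy variable, kept as a
        // literal string here (it is not a TypeScript template expression).
        Condition: {
          StringEquals: { "aws:ARN": "${ec2:SourceInstanceARN}" },
        },
      },
    ],
  };
}

console.log(JSON.stringify(selfTaggingPolicy(), null, 2));
```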
Hey @maartenvanderhoef, that's a very nice catch! Thank you for figuring this out.
I would go with the "iam-self-tagging" approach you mentioned to simplify the code (to still create all the inline policies in one place and with a single log message). I would very much appreciate your help with implementing this stability improvement.
Bug Description
In some cases, the SSM agent takes more than the default stop-timeout of 5 minutes to connect to SSM. The EC2 instance can therefore come up, but the user is not able to create an SSM session. The SSM agent could also have crashed for some reason, which has the same result: you cannot connect.
Steps to Reproduce
Disclaimer: this is hard to reproduce, as it depends on several factors that are outside the user's control.
Expected Behavior
Basti instance is usable via SSM
Current Behavior
See Steps to Reproduce.
Possible Solution (Optional)
I see three options:
1) increase / make the default stop-timeout configurable
2) add additional reboot to the instance after init (manual reboot helped in my case)
3) (preferred option) make the basti instance check whether the SSM agent connected successfully so a session can be initialized. If not, (force) restart the SSM agent, wait X s, recheck. If the basti instance still cannot connect to SSM, perform a reboot. If the reboot does not help, terminate the instance.
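The escalation in option 3 can be sketched as a small decision function. This is only an illustration under assumptions: `HealthState`, `Action`, `nextAction`, and the `skipAgentRestart` flag are hypothetical names, and how the instance would actually detect the agent's connection state, wait, or perform each action is left out.

```typescript
// Hypothetical escalation logic for option 3; not Basti's actual code.
type Action = "none" | "restart-agent" | "reboot" | "terminate";

interface HealthState {
  agentConnected: boolean; // result of the latest SSM connectivity check
  agentRestarted: boolean; // an agent restart was already attempted
  rebooted: boolean;       // an instance reboot was already attempted
}

// Returns the next step to take after a health check.
// skipAgentRestart models a simplified variant that goes straight to a reboot.
function nextAction(state: HealthState, skipAgentRestart: boolean = false): Action {
  if (state.agentConnected) return "none";
  if (!skipAgentRestart && !state.agentRestarted) return "restart-agent";
  if (!state.rebooted) return "reboot";
  return "terminate";
}
```

Keeping the decision separate from the side effects (restarting, rebooting, terminating) would make the escalation order easy to unit test without an actual instance.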
Related Issues/PRs
none