aws / aws-ec2-instance-connect-config

This is the ssh daemon configuration and necessary EC2 instance scripting to enable EC2 Instance Connect. Also included are package manager configurations for packaging on various Linux distributions.
Apache License 2.0

eic_harvest_hostkeys fails in local zones #28

Open rcj4747 opened 3 years ago

rcj4747 commented 3 years ago

ec2-instance-connect breaks during host key harvesting for instances launched in local zones.

I have tested with Ubuntu 21.04 (Hirsute) using that distro's package (version 1.1.13-0ubuntu1) but inspection of the code here indicates the problem is also in this upstream code @ https://github.com/aws/aws-ec2-instance-connect-config/blob/c15b99fa223f277787e50b044baf39e483dedf8c/src/bin/eic_harvest_hostkeys#L101

Recreate steps

  1. Launch a VM in a local zone with ec2-instance-connect installed (I used Ubuntu in us-west-2-lax-1a).
  2. Observe the state of the ec2-instance-connect systemd unit (it will be in the failed state)
$ systemctl is-system-running
degraded

$ systemctl list-units --failed
  UNIT LOAD ACTIVE SUB DESCRIPTION
● ec2-instance-connect.service loaded failed failed EC2 Instance Connect Host Key Harvesting

$ journalctl --unit ec2-instance-connect
-- Logs begin at Wed 2021-02-10 22:47:47 UTC, end at Wed 2021-02-10 22:55:46 UTC. --
Feb 10 22:48:16 ip-172-31-51-82 systemd[1]: Starting EC2 Instance Connect Host Key Harvesting...
Feb 10 22:48:16 ip-172-31-51-82 systemd[1]: ec2-instance-connect.service: Main process exited, code=exited, status=255/EXCEPTION
Feb 10 22:48:16 ip-172-31-51-82 systemd[1]: ec2-instance-connect.service: Failed with result 'exit-code'.
Feb 10 22:48:16 ip-172-31-51-82 systemd[1]: Failed to start EC2 Instance Connect Host Key Harvesting.
  3. Run the harvest hostkeys script with debug output via bash -x /usr/share/ec2-instance-connect/eic_harvest_hostkeys to observe the failure:
    ...
    ++ /usr/bin/curl -s -f -m 1 -H 'X-aws-ec2-metadata-token:<<TOKEN>>' http://169.254.169.254/latest/meta-data/placement/availability-zone/
    + zone=us-west-2-lax-1a
    + zone_exit=0
    + '[' 0 -ne 0 ']'
    + /bin/echo us-west-2-lax-1a
    + /bin/grep -Eq '^([a-z]+-){2,3}[0-9][a-z]$'
    + /usr/bin/head -n 1
    + exit 255
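
The failing step in the trace can be reproduced in isolation. The following is a minimal sketch that runs the validation pattern from the trace against a few zone names; the helper name `zone_matches_original` is made up for illustration and is not part of the package.

```shell
#!/bin/sh
# Pattern copied from the grep call in the trace above.
ORIGINAL_ZONE_REGEX='^([a-z]+-){2,3}[0-9][a-z]$'

# Hypothetical helper: succeeds only if the zone name passes the original check.
zone_matches_original() {
    printf '%s\n' "$1" | head -n 1 | grep -Eq "$ORIGINAL_ZONE_REGEX"
}

# A standard zone, a GovCloud-style zone, and the local zone from this report.
for zone in us-west-2a us-gov-west-1a us-west-2-lax-1a; do
    if zone_matches_original "$zone"; then
        echo "$zone: accepted"
    else
        echo "$zone: rejected"
    fi
done
```

Only the local zone name is rejected, because the pattern expects the name to end with a single digit followed by a single letter directly after the region prefix.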
rcj4747 commented 3 years ago

I have created bug #1915345 in the Ubuntu bug tracker as well.

rcj4747 commented 3 years ago

And I haven't tested, but this would also fail in Wavelength zones, which have names in the form us-east-1-wl1-bos-wlz-1. Even if instance connect isn't supported in Wavelength zones, I would rather not have the service fail and mark the system status as degraded.

ohitspaul commented 3 years ago

Hi @rcj4747 , thanks for bringing this to our attention. We're currently looking into this issue.

ohitspaul commented 3 years ago

This has been addressed as part of release here: https://github.com/aws/aws-ec2-instance-connect-config/releases/tag/1.1.14

OS vendors are working on shipping this change automatically with AMI updates and on publishing it to package repositories. We will leave this issue open until the package is available from the corresponding repositories.

rcj4747 commented 3 years ago

@ohitspaul Do we not want host key harvesting in local/wavelength zones? Yes, the systemd unit no longer fails because the patch ignores the script failure, but does it work? The release still makes no changes to /usr/share/ec2-instance-connect/eic_harvest_hostkeys to match the zone name format for local/wavelength zones. So while the feature is offered in local and wavelength zones (the user can try, but it appears to be non-functional; see next comment), the code fails to address the underlying failure, and ec2-instance-connect is still broken in those zones (just silently now). And if it were unsupported in those zones, couldn't the script be updated to still recognize the zones, print a message about not being supported, and exit? As it stands, this still leaves the error in journalctl --unit ec2-instance-connect.

rcj4747 commented 3 years ago

I'll amend my comment. Removing the check and pushing the key to https://ec2-instance-connect.us-west-2.amazonaws.com/PutEC2HostKeys/ results in a failure {"__type":"com.amazon.coral.service#InternalFailure", so it seems that instance connect is not enabled for this zone (yet?), but the web console allows me to try. I still think you need to consider the usability of your patch. Papering over the failure by ignoring the return code still leaves the 255 exit code in the unit log file while the service still looks supported. So when a user has ec2-instance-connect installed (which is the case in the stock Ubuntu images in AWS for 20.04 and later) and attempts to connect, they only get a "There was a problem connecting to your instance" error and are invited to wait a few minutes and retry. Looking at the service, it has succeeded, and looking at the log, there is an error code that gives no indication that this is not supported.

My suggestion is to look for regex matches on local zones (and wavelength?) and print a message that instance-connect is not currently supported by the package. Or catch the {"__type":"com.amazon.coral.service#InternalFailure" and then print a message about support (wishing that was more specific), but that feels fragile and the rest of the code would be untested in a local zone. Either way, could you provide the end user with more breadcrumbs so they can understand what is happening when they fail to connect?
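
One way the first suggestion might look is sketched below. This is not upstream code: the function names, the local and Wavelength zone patterns (inferred only from the two example names in this thread), the message text, and the return codes are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch only: classify a zone name and explain unsupported cases instead of
# exiting 255 with no message. Everything except STANDARD_ZONE_REGEX is assumed.

# Pattern used by the existing script.
STANDARD_ZONE_REGEX='^([a-z]+-){2,3}[0-9][a-z]$'
# Assumed local zone shape, based on us-west-2-lax-1a.
LOCAL_ZONE_REGEX='^([a-z]+-){2,3}[0-9]-[a-z]+-[0-9][a-z]$'
# Assumed Wavelength zone shape, based on us-east-1-wl1-bos-wlz-1.
WAVELENGTH_ZONE_REGEX='^([a-z]+-){2,3}[0-9]-wl[0-9]+-[a-z]+-wlz-[0-9]+$'

# Hypothetical helper: prints standard, local, wavelength, or unknown.
classify_zone() {
    if printf '%s\n' "$1" | grep -Eq "$STANDARD_ZONE_REGEX"; then
        echo standard
    elif printf '%s\n' "$1" | grep -Eq "$LOCAL_ZONE_REGEX"; then
        echo local
    elif printf '%s\n' "$1" | grep -Eq "$WAVELENGTH_ZONE_REGEX"; then
        echo wavelength
    else
        echo unknown
    fi
}

# Hypothetical helper: returns 0 for standard zones, otherwise logs a reason.
check_zone_supported() {
    kind="$(classify_zone "$1")"
    case "$kind" in
        standard)
            return 0
            ;;
        local|wavelength)
            echo "EC2 Instance Connect is not currently supported in $kind zone '$1'." >&2
            return 1
            ;;
        *)
            echo "Unrecognized availability zone format: '$1'." >&2
            return 255
            ;;
    esac
}
```

With these definitions, `check_zone_supported us-west-2-lax-1a` writes the "not currently supported" message to stderr and returns 1, so the reason would show up in the unit's journal. Whether the unit should then exit cleanly or fail is a separate policy choice left to the caller.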

ohitspaul commented 3 years ago

@rcj4747 thank you for your feedback. Currently we do not explicitly support local/wavelength zones. We have added an item to our backlog for local zone support. We will also be adding more detailed output in each of the different components of EC2 Instance Connect in an upcoming upgrade to these on-instance scripts.

RainaWLK commented 1 year ago

I found the same issue with local zones and cannot log in over SSH using ec2-connect.

The issue is in this script: https://github.com/aws/aws-ec2-instance-connect-config/blob/master/src/bin/eic_curl_authorized_keys line 105

# Validate the zone is aa-bb-#c (or aa-bb-cc-#d for special partitions like AWS GovCloud)
/bin/echo "${zone}" | /usr/bin/head -n 1 | /bin/grep -Eq "^([a-z]+-){2,3}[0-9][a-z]$" || exit 255

For example, in the Oregon region, the Los Angeles local zone AZ is us-west-2-lax-1a.

That name cannot pass the regex check, so the script exits with code 255.

I tried modifying that regex check on my EC2 instance:

sudo vi /usr/share/ec2-instance-connect/eic_curl_authorized_keys
......
# Validate the zone is aa-bb-#c (or aa-bb-cc-#d for special partitions like AWS GovCloud)
#/bin/echo "${zone}" | /usr/bin/head -n 1 | /bin/grep -Eq "^([a-z]+-){2,3}[0-9][a-z]$" || exit 255        # bug
/bin/echo "${zone}" | /usr/bin/head -n 1 | /bin/grep -Eq "^([a-z]+-){2,3}[0-9]([a-z]|-[a-z]+-[0-9][a-z])$" || exit 255
.......

With that change, EC2 connect works fine for me.
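
As a quick check of what the modified pattern accepts, here is a small sketch that runs it against the zone names mentioned in this thread. The helper name `zone_matches_modified` is hypothetical; the pattern itself is copied from the modified line above.

```shell
#!/bin/sh
# Pattern copied from the modified validation line above.
MODIFIED_ZONE_REGEX='^([a-z]+-){2,3}[0-9]([a-z]|-[a-z]+-[0-9][a-z])$'

# Hypothetical helper: succeeds only if the zone name passes the modified check.
zone_matches_modified() {
    printf '%s\n' "$1" | head -n 1 | grep -Eq "$MODIFIED_ZONE_REGEX"
}

# Standard, GovCloud-style, local, and Wavelength zone names.
for zone in us-west-2a us-gov-west-1a us-west-2-lax-1a us-east-1-wl1-bos-wlz-1; do
    if zone_matches_modified "$zone"; then
        echo "$zone: accepted"
    else
        echo "$zone: rejected"
    fi
done
```

The first three names are accepted. The Wavelength-style name is still rejected, since its `wl1` segment contains a digit and the added alternative only allows letters there, so Wavelength zones would need a further change to the pattern.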

Can you fix this issue?