connelldave / botocove

A simple decorator to run Python functions across multiple AWS accounts, OUs and/or regions, with or without an AWS Organization.
https://pypi.org/project/botocove/
GNU Lesser General Public License v3.0

Would it be possible to use the decorator session within @cove() annotation to load the regions that are relevant for session account? #74

Open iainelder opened 1 year ago

iainelder commented 1 year ago

Would it be possible to use the decorator session within @cove() annotation to load the regions that are relevant for session account?

It is common for accounts to have opt-in regions disabled in their configuration. Using the regions of the Management Account (or any other account) as the target list then often results in exceptions like this: An error occurred (UnrecognizedClientException) when calling the ListTrails operation: The security token included in the request is invalid. This happens because the region is disabled or not opted in under Account -> AWS Regions, and/or because tokens issued by the global STS endpoint are valid only in regions enabled by default unless the user explicitly changes this under IAM -> Security Token Service (STS) -> Global endpoint.

Another side effect of not using each account's enabled regions is that you can miss regions that are not enabled or opted in for the account supplying the region list, i.e. the Management Account.

There are currently ~10 regions that require opt-in: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html?icmpid=docs_iam_console#id_credentials_region-endpoints

Originally posted by @alencar in https://github.com/connelldave/botocove/issues/55#issuecomment-1577091152

iainelder commented 1 year ago

@alencar, it sounds like you are describing different issues:

  1. A member account has disabled a region that is enabled in the management account
  2. A member account has enabled a region that is disabled in the management account
  3. Global STS endpoint tokens being only valid on regions enabled by default

I'm not familiar with the third issue so you may need to give me more information before I can help.

botocove assumes that the target regions are accessible in all accounts, and so will call the function once in each target region in each target account. It works this way because it was good enough to solve my problems back in #28.

There were times when I wished I had more control over exactly which regions were accessed per account, such as when I needed to remediate the resources of just a few account-regions across a large organization. It would have been faster to skip the account-regions needing no remediation. So I'm open to the idea of making it more flexible, but I want to understand your use case first, because there are many ways to work around it before making changes to botocove.

Given the way that botocove works today, the only sure way to access all the enabled regions of all the accounts in a single pass is to target all the regions that are known to be enabled in at least one account.

Unless all your accounts enable the whole set of target regions, then, as you showed, some security token exceptions will occur: ClientError: An error occurred (UnrecognizedClientException) when calling the ListTrails operation: The security token included in the request is invalid. For example in my account where region eu-south-2 is not enabled, I can generate that error like this:

from boto3 import Session
Session().client("cloudtrail", region_name="eu-south-2").list_trails()["Trails"]

One way around that is to just ignore the exceptions in the cove output for the account-regions that you know are disabled.

In the same example account I run this to generate one good result and one exception.

from botocove import cove
from botocore.exceptions import ClientError

@cove(
    rolename="AWSControlTowerExecution",
    regions=["eu-central-1", "eu-south-2"],
    target_ids=["111111111111"]
)
def test_caller_identity(session):
    if session.client("sts").get_caller_identity():
        return "OK"

cove_output = test_caller_identity()

You could post-process the cove_output object to remove any results in the Exceptions list whose ExceptionDetails is a ClientError with error message "The security token included in the request is invalid".

{'Results': [{'Id': '111111111111',
   'RoleName': 'AWSControlTowerExecution',
   'RoleSessionName': 'AWSControlTowerExecution',
   'AssumeRoleSuccess': True,
   'Region': 'eu-central-1',
   'Partition': 'aws',
   'Name': 'Log Archive',
   'Arn': 'arn:aws:organizations::222222222222:account/o-aaaaaaaaaa/111111111111',
   'Email': '...',
   'Status': 'ACTIVE',
   'Result': 'OK'}],
 'Exceptions': [{'Id': '111111111111',
   'RoleName': 'AWSControlTowerExecution',
   'RoleSessionName': 'AWSControlTowerExecution',
   'AssumeRoleSuccess': True,
   'Region': 'eu-south-2',
   'Partition': 'aws',
   'ExceptionDetails': botocore.exceptions.ClientError('An error occurred (InvalidClientTokenId) when calling the GetCallerIdentity operation: The security token included in the request is invalid'),
   'Name': 'Log Archive',
   'Arn': 'arn:aws:organizations::222222222222:account/o-aaaaaaaaaa/111111111111',
   'Email': '...',
   'Status': 'ACTIVE'}],
 'FailedAssumeRole': []}
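
The post-processing described above could look like the following minimal sketch. The helper name drop_invalid_token_exceptions is hypothetical and not part of botocove. To stay free of third-party imports it matches on the exception's message text instead of checking for ClientError. The same message can also indicate genuinely bad credentials, so treat the match as a heuristic.

```python
INVALID_TOKEN_MESSAGE = "The security token included in the request is invalid"


def drop_invalid_token_exceptions(cove_output):
    """Return a copy of cove_output without invalid-token exceptions.

    Assumes the output shape shown above: a dict with an "Exceptions" list
    whose items carry the raised exception under "ExceptionDetails".
    """
    kept = [
        item
        for item in cove_output["Exceptions"]
        if INVALID_TOKEN_MESSAGE not in str(item.get("ExceptionDetails", ""))
    ]
    # Shallow copy, so the original output object is left untouched.
    return {**cove_output, "Exceptions": kept}
```

Calling drop_invalid_token_exceptions(cove_output) on the output above would leave Results as they are and remove the eu-south-2 entry from Exceptions.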

Would that work for you? Or were you looking for something else?

connelldave commented 1 year ago

If I understand the problem statement here (and it'd be useful to restate it as a user story: as an X I want to Y, today only Z) - "What if my org has inconsistent regions enabled across many accounts"; I don't think this is something we can solve in-library without a large burden of API calls.

Roughly two routes I can think of initially:

I don't think we can address this preflight without assuming into each account and calling its get-regions API. I'm not super familiar with that API in general, and it would hit the same constraints as the difficult-to-solve in-band validation problem.

alencar commented 1 year ago

@iainelder / @connelldave indeed, this is a related but different issue from #55.

A high-level example is to go across each account and region and retrieve the EC2 instances. AllRegions=True returns every valid AWS region, whether or not it is opted in for the account providing the initial credentials (i.e. the Management Account). Watch out for any exception that can indicate the region is disabled, and handle it as appropriate for the end user's use case.

from botocove import cove
import boto3
from botocore.exceptions import ClientError
# UnrecognizedClientException is not an importable class. It is an error code:
# catch ClientError and check e.response["Error"]["Code"].
...

"""
Get all regions, including ones not opted-in
"""
@cove(
    regions=[
        # AllRegions is a parameter of EC2 DescribeRegions, not of Account ListRegions.
        r['RegionName'] for r in boto3.client('ec2').describe_regions(AllRegions=True)['Regions']
    ]
)
def example(session):
    ec2 = session.client('ec2')
    response = ec2.get_paginator('describe_instances').paginate().build_full_result()
    return response

...
    ...
    results = example()

    """
    Check results["Exceptions"] for any UnrecognizedClientException as it may indicate
    an opt-in region that's disabled from this account OR Global STS Endpoint not configured
    to use v2Token. See IAM.Client.set_security_token_service_preference()

    Regions that are disabled by other means like SCP, would return ClientError/AccessDenied
    """
    ...

With Regions being loaded by the Botocove Session factory, the above issues related to Disabled/Not opted-in regions would be addressed.

Issues caused by the Global STS Token Version (v1Token, v2Token) can already be addressed if the Session is constructed from the Regional Endpoint.

A workaround would be to call IAM.Client.get_account_summary() and look at the value of SummaryMap["GlobalEndpointTokenVersion"] to decide whether to use the Global STS Endpoint or a Regional STS Endpoint to obtain Sessions.
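
A minimal sketch of that decision, under stated assumptions: the function name sts_endpoint_url is hypothetical, the regional hostname pattern is the one used in the standard aws partition, and a missing key is treated conservatively as version 1. The input is the SummaryMap dict returned by get_account_summary(), so the sketch itself needs no AWS access.

```python
def sts_endpoint_url(summary_map, region):
    """Pick an STS endpoint URL, or None to keep the default global endpoint.

    Version 1 tokens from the global endpoint are valid only in regions
    enabled by default; version 2 tokens are valid in all regions.
    """
    # Assumption: if the key is absent, behave as if tokens were version 1.
    version = summary_map.get("GlobalEndpointTokenVersion", 1)
    if version >= 2:
        return None
    # Assumption: hostname pattern of the standard "aws" partition.
    return f"https://sts.{region}.amazonaws.com"
```

The result could then be passed as endpoint_url, together with region_name, when creating the STS client that assumes the role.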

At a high level, this could be done from the Management Account using Account.Client, potentially from the place where the Sessions are set up? Sorry, I don't know my way around the Botocove code :(

     ...
     """
     Get the Account.Client from the region that cannot be disabled and find out which regions are enabled
     using the Management Account credential
     """
     account = boto3.client('account', region_name='us-east-1')
     """
     Iterate over each Organization Member Account (Adds 1 API call per member account)
     - Get the list of regions of each account
     -  Update the _decorated_ session with the list of enabled regions
     """
         response = account.list_regions(
             AccountId=<Account Id>,
             RegionOptStatusContains=[
                 'ENABLED',
                 'ENABLED_BY_DEFAULT'
             ]
         )
         """
         Update the _decorated_ session to have the list of regions as default
         """
         ...
    ...
...  


iainelder commented 1 year ago

@connelldave, for all the reasons you give, I don't think we need to make botocove aware of the opt-in status of a region.

If the aim here is to avoid disabled account-regions, both accessing them and referencing them in the output, then the use case is similar to mine when I wanted to access only the account regions needing remediation.

We can support both with a new parameter to cove that describes the set of account-regions to be accessed.

@cove(
    regions_per_account={
        "111111111111": ["rr-aaaa-1", "rr-bbbb-1", "rr-cccc-1"],
        "222222222222": ["rr-aaaa-1", "rr-bbbb-1"],
        "333333333333": ["rr-aaaa-1", "rr-bbbb-1", "rr-cccc-1", "rr-dddd-1"],
    }
)
def example(session):
    ...

The regions_per_account parameter would override the target_ids, ignored_ids, and regions parameters.

So configured, botocove would call example in these account-regions:

And the output object would have only results or exceptions for those combinations.
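
The account-regions implied by the example mapping can be enumerated with a short sketch. The helper name expand_account_regions is hypothetical, and regions_per_account is still only a proposed parameter at this point in the discussion.

```python
def expand_account_regions(regions_per_account):
    """Flatten {account_id: [regions]} into a list of (account_id, region) pairs."""
    return [
        (account_id, region)
        for account_id, regions in regions_per_account.items()
        for region in regions
    ]


example_mapping = {
    "111111111111": ["rr-aaaa-1", "rr-bbbb-1", "rr-cccc-1"],
    "222222222222": ["rr-aaaa-1", "rr-bbbb-1"],
    "333333333333": ["rr-aaaa-1", "rr-bbbb-1", "rr-cccc-1", "rr-dddd-1"],
}

# Prints one "account region" line per combination, nine in total.
for account_id, region in expand_account_regions(example_mapping):
    print(account_id, region)
```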

In my use case, I would find the correct value for regions_per_account by doing a first pass with botocove over all account-regions to identify regions that need to be remediated. The first pass would run some listing and describing APIs. After studying the output of the first pass, I would pass a description of where to remediate to regions_per_account and a description of how to remediate as a new decorated function that runs some create/update/delete APIs.

In @alencar's use case, the client code would call a function like this before passing the return value to regions_per_account.

def get_active_account_regions(session):
    org_client = session.client("organizations")
    account_client = session.client("account")

    mgmt_account_id = org_client.describe_organization()["Organization"]["MasterAccountId"]
    pages = org_client.get_paginator("list_accounts").paginate()
    active_member_accounts = [
        account
        for page in pages
        for account in page["Accounts"]
        if account["Status"] == "ACTIVE" and account["Id"] != mgmt_account_id
    ]

    # boto3 has no paginator for ListRegions. MaxResults allows up to 50 regions
    # in one response. In June 2023 there are 31 launched regions [1].
    # [1]: https://aws.amazon.com/about-aws/global-infrastructure/
    active_account_regions = {}
    for account in active_member_accounts:
        active_regions = account_client.list_regions(
            AccountId=account["Id"],
            MaxResults=50,
            RegionOptStatusContains=["ENABLED", "ENABLED_BY_DEFAULT"]
        )["Regions"]
        active_account_regions[account["Id"]] = [r["RegionName"] for r in active_regions]

    return active_account_regions

To make the function work in my test account, I needed to enable trusted access for AWS Account Management like this:

$ aws organizations enable-aws-service-access \
    --service-principal account.amazonaws.com
$ aws organizations list-aws-service-access-for-organization
{
    "EnabledServicePrincipals": [
        {
            "ServicePrincipal": "account.amazonaws.com",
            "DateEnabled": "2023-06-07T11:00:49.362000+02:00"
        },
        {
            "ServicePrincipal": "cloudtrail.amazonaws.com",
            "DateEnabled": "2023-05-19T12:20:06.578000+02:00"
        },
        {
            "ServicePrincipal": "config.amazonaws.com",
            "DateEnabled": "2023-05-19T12:27:58.513000+02:00"
        },
        {
            "ServicePrincipal": "controltower.amazonaws.com",
            "DateEnabled": "2023-05-19T12:20:05.228000+02:00"
        },
        {
            "ServicePrincipal": "sso.amazonaws.com",
            "DateEnabled": "2023-05-19T12:20:44.899000+02:00"
        }
    ]
}

Without trusted access, using the AccountId parameter of ListRegions causes this error:

AccessDeniedException: An error occurred (AccessDeniedException) when calling the ListRegions operation: User: arn:aws:sts::111111111111:assumed-role/AWSReservedSSO_AWSAdministratorAccess_aaaaaaaaaaaaaaaa/... is not authorized to perform: account:ListRegions (Your organization must first enable trusted access with AWS Account Management.)

@alencar, would something like that work better for you than post-processing the botocove output?

To be clear, I'm not suggesting that we add get_active_account_regions to botocove. When the cove host account isn't an organization management account or delegated administrator, the function wouldn't make sense. Instead that function would be in your client code that calls cove.
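
Until something like regions_per_account exists, client code could approximate it with today's regions parameter, roughly as sketched below. Both helper names are hypothetical. The mapping is assumed to have the shape returned by get_active_account_regions, and the output is assumed to have the Results and Exceptions lists shown earlier in this thread.

```python
def union_of_regions(regions_per_account):
    """Sorted list of every region enabled in at least one account."""
    return sorted(
        {region for regions in regions_per_account.values() for region in regions}
    )


def keep_listed_account_regions(cove_output, regions_per_account):
    """Drop Results and Exceptions entries whose account-region is not in the mapping."""

    def wanted(item):
        return item.get("Region") in regions_per_account.get(item.get("Id"), [])

    return {
        **cove_output,
        "Results": [item for item in cove_output["Results"] if wanted(item)],
        "Exceptions": [item for item in cove_output["Exceptions"] if wanted(item)],
    }
```

The union would be passed as regions to cove and the filter applied to the output afterwards. Note that this only hides the unwanted account-regions from the output: cove would still attempt the calls in them.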

iainelder commented 1 year ago

> Issues caused by the Global STS Token Version (v1Token, v2Token) can be already addressed if the Session is constructed from the Regional Endpoint.

Can you show an example of the problem the STS token version causes? I've read about the SetSecurityTokenServicePreferences and GetAccountSummary APIs for controlling the version, but I don't yet understand how it interacts with botocove.

There is one path in the code that uses the boto3 default session. I wonder whether here it would matter.

https://github.com/connelldave/botocove/blob/af25603060dd3d99f87ac75fe45325fb2cbdbc2d/botocove/cove_host_account.py#L162-L165

iainelder commented 1 year ago

@alencar , did you find a solution to the problem?

alencar commented 1 year ago

@iainelder applying what is discussed in https://github.com/connelldave/botocove/issues/74#issuecomment-1580351228 seems like a good solution. Adding a regions_per_account parameter for user-controlled account-region combinations would be great.

alencar commented 1 year ago

@iainelder STS Global/Regional endpoints only affect calls to STS [1], basically where you call sts.assume_role(...), like

https://github.com/connelldave/botocove/blob/af25603060dd3d99f87ac75fe45325fb2cbdbc2d/botocove/cove_session.py#L58

and perhaps indirectly

https://github.com/connelldave/botocove/blob/af25603060dd3d99f87ac75fe45325fb2cbdbc2d/botocove/cove_host_account.py#L175

[1] https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html

iainelder commented 1 year ago

@alencar , thanks. When I have a moment I'll try to add support for regions_per_account to Botocove. You're also welcome to try it yourself if you don't want to wait for me. I doubt that this week I will get around to it.

Thanks also for the references to the code where the STS global/region endpoint setting matters. I'll experiment in my own environment to see whether I can break Botocove with a certain configuration of regions. If we can reproduce any errors, then we can fix that as well.

iainelder commented 1 year ago

@alencar , did you find a solution to your problem?

I've started working in an environment with a "ragged regional" setup. I need to take an inventory of trails from a region in a member account that is disabled in the management account. I get the same error we discussed before: ClientError('An error occurred (UnrecognizedClientException) when calling the ListTrails operation: The security token included in the request is invalid').

I would like to fix this so that I can complete my inventory checking using botocove.

This issue has gotten a bit muddled, so I'll create a new one to track that specific issue when I have a simple repro.

alencar commented 1 year ago

@iainelder I have parsed the Exceptions in the results with jq to identify disabled regions.

connelldave commented 1 year ago

> @alencar , thanks. When I have a moment I'll try to add support for regions_per_account to Botocove. You're also welcome to try it yourself if you don't want to wait for me. I doubt that this week I will get around to it.

I'm supportive of this, as well as shipping a helper function for get_active_account_regions, although I'd suggest naming it get_organization_active_account_regions since it depends on there being an org (just to differentiate it from the non-org use case of taking a list of accounts that trust another account, which I doubt is very common).