aws-solutions / instance-scheduler-on-aws

A cross-account and cross-region solution that allows customers to automatically start and stop EC2 and RDS Instances
https://aws.amazon.com/solutions/implementations/instance-scheduler-on-aws/
Apache License 2.0
557 stars 279 forks source link

Error writing to logstream due to name exceeding maximum length #254

Closed HansScharrer closed 1 year ago

HansScharrer commented 3 years ago

Hello,

I am attempting to deploy the instance scheduler to a single account and have it manage the EC2/RDS instances in 337 target accounts. It appears that the logstream name is being generated based on the account IDs and due to the number of accounts the logstream name is exceeding the maximum length. Here is an example of the error that is being thrown -

Error writing to logstream Scheduler-rds-string-of-accountIds-us-east-1 with token None (An error occurred (InvalidParameterException) when calling the DescribeLogStreams operation: 1 validation error detected: Value 'Scheduler-rds-string-of-accountIds-us-east-1 ...' at 'logStreamNamePrefix' failed to satisfy constraint: Member must have length less than or equal to 512)

Do you have any suggestions on what would be the most straight forward way to fix the python code so that the logstream name does not use the account ID list?

Thanks, Hans

gockle commented 3 years ago

Hi @HansScharrer Did you deploy the remote template in the target account(s) and update the CloudFormation parameter "Cross-account roles" (deployment instructions)? The scheduler will try to create a log of the format Scheduler-ec2-[ACCOUNT_ID]-[REGION]-[YYYYMMDD]. If configured correctly, the log stream name should not exceed the length limit.

HansScharrer commented 3 years ago

Hi @gockle Thanks for your response. Because the CloudFormation parameter value limit of 4,096 bytes I wrote a custom python script to populate the Dynamodb cross_account_roles item with the 337 target account ARNs. Is that how I introduced this bug? The log format that is being created is -

Scheduler-ec2-[ACCOUNT_ID]-[ACCOUNT_ID]-[ACCOUNT_ID]-[ACCOUNT_ID]-[ACCOUNT_ID]-[FOLLOWED BY ALL ACCOUNT_IDs]-[REGION]-[YYYYMMDD]

gockle commented 3 years ago

Hi @HansScharrer You can consider using the Systems Parameter store to work around the 4096 limit for the cloudformation parameter, please refer to the documentation.

Comma-delimited list of cross-account roles. For example, arn:aws:iam::111122223333:role/SchedulerCrossAccountRole. If you store your cross-account ARNs in the AWS Systems Manager Parameter Store, use the format {param:name}. For more information, refer to AWS Systems Manager Parameter Store.

You can enter multiple Systems Manager Parameter Store names, using the format, {param: ssm-param1}, {param: ssm-param2}

HansScharrer commented 3 years ago

In case other folks run into a similar issue I believe I found the workaround. In the two files -

cloudwatch_event_handler.py scheduler_request_handler.py

this line is creating the log stream name with the account IDs separated by a dash -

s = "-".join([LOG_STREAM_PREFIX, service, "-".join(account_names), "-".join(self.configuration.regions)])

Changing it to -

s = "-".join([LOG_STREAM_PREFIX, service, "allaccounts", "-".join(self.configuration.regions)])

resolved the problem for me. Hope that helps!

gockle commented 3 years ago

Hi @HansScharrer I would recommend that you use the SSM parameter store to provide the Cross Account Roles to the solution and not update the dynamoDB config attribute. The fix you provided above will work only to solve the log stream name length issue, But the solution is designed to execute the scheduling logic in multiple Lambda instances,

For example, if the solution is configured to execute in 1 region and 2 accounts, for both EC2 and RDS the solution will create the following log streams

Each of these log stream will be executed in a specific lambda, if all the lmabda files are going to log information to the same log stream, which seems to be how the configuration you have suggested above does, there is a possibility that the lambda will running into throttling issues with the cloudwatch API's.

Let me know if I am missing anything here in my understanding.

HansScharrer commented 3 years ago

Hi @gockle Thanks for your response and for pointing out that behavior. It seems storing the Cross Account Roles in the SSM parameter will make the solution more scalable because of the parallelization. Due to the number of account ARNs I believe we would exceed the Standard parameter limit of 4 KB and likely the advanced parameter value limit of 8 KB. Would you recommend the approach outlined here to work around that? - https://github.com/awslabs/aws-instance-scheduler/issues/120

gockle commented 3 years ago

Hi @HansScharrer You can pass multiple SSM parameter names to the solution, in the format,

{param: ssm-param1}, {param: ssm-param2}.

Or you can launch multiple stacks for the same solution and split the accounts across them. Please use the latest v1.4.0 version of the solution.

HansScharrer commented 3 years ago

Hi @gockle Thank you. I will investigate that approach. Yep, I am using the latest v1.4.0 version.

HansScharrer commented 3 years ago

Hi @gockle This is the behavior I am seeing using the SSM parameter approach.

I created two SSM parameters. The first - /instance-scheduler/account-list1 with the ARN list - arn:aws:iam::accountid1:role/instance-scheduler-role, arn:aws:iam::accountid2:role/instance-scheduler-role, arn:aws:iam::accountid3:role/instance-scheduler-role, arn:aws:iam::accountid4:role/instance-scheduler-role

and the second - /instance-scheduler/account-list2 with the ARN list - arn:aws:iam::accountid5:role/instance-scheduler-role, arn:aws:iam::accountid6:role/instance-scheduler-role, arn:aws:iam::accountid7:role/instance-scheduler-role, arn:aws:iam::accountid8:role/instance-scheduler-role

I then set the CrossAccountRoles parameter to - {param: /instance-scheduler/account-list1},{param: /instance-scheduler/account-list2}

When the instance scheduler ran it created two log streams - Scheduler-ec2-accountid1-accountid2-accountid3-accountid4-accountid5-accountid6-accountid7-accountid8-us-east-1-20210520 Scheduler-rds-accountid1-accountid2-accountid3-accountid4-accountid5-accountid6-accountid7-accountid8-us-east-1-20210520

It appears to me that it is not starting a lambda for each account. Looking at the number of lambda invocations it looks like three were invoked when based on your previous comment I believe I should have seen at least eight. Is this what I should expect?

To clarify, the instance scheduler I am deploying is pulling the instance-schedule code from this location -

Main: 
  Type: "AWS::Lambda::Function"
  Properties: 
    Code: 
      S3Bucket: 
        Fn::Join: 
          - ""
          - 
            - "solutions-"
            - 
              Ref: "AWS::Region"
      S3Key: "aws-instance-scheduler/v1.4.0/instance-scheduler.zip
gockle commented 3 years ago

Hi @HansScharrer Can you confirm what is being loaded into the DynamoDB table config attribute, since you are updating the cloudformation paramater with {param: /instance-scheduler/account-list1},{param: /instance-scheduler/account-list2}

is it {param: /instance-scheduler/account-list1},{param: /instance-scheduler/account-list2} in the dynamoDB table as well?

Also can you confirm if you selecting the option StringList when creating the SSM Parameter?

HansScharrer commented 3 years ago

Hi @gockle I confirmed the following:

gockle commented 3 years ago

HI @HansScharrer Thanks for confirming I am trying to recreate the issue.

gockle commented 3 years ago

Hi @HansScharrer Can you enable enhanced logging parameter named "Enable CloudWatch Logs", select Yes if not already done, and update the stack, once the update is complete, can you verify if the logs are being generated with prefix [DEBUG] in the log group named [STACK_NAME]-logs,

{
    "action": "scheduler:run",
    "configuration": {
        "tag_name": "Schedule",
        "default_timezone": "US/Eastern",
        "trace": true,
        "enable_SSM_maintenance_windows": true,
        "use_metrics": true,
        "schedule_clusters": true,
        "create_rds_snapshot": false,
        "schedule_lambda_account": false,
        "started_tags": "started=start",
        "stopped_tags": "stopped=stop",
        "regions": [
            "us-east-1"
        ],
        "cross_account_roles": [
            "arn:aws:iam::[ACCOUNT_ID]:role/ROLE_NAME",
            "arn:aws:iam::[ACCOUNT_ID]:role/ROLE_NAME",
            "arn:aws:iam::[ACCOUNT_ID]:role/ROLE_NAME",
            ....
        ],
        "scheduled_services": [
            "ec2"
        ],
    "schedules": { ... }
    "periods": { ... }
}

Can you verify if the attribute cross_account_roles has a list of role(s)? And if the format for the role is being maintained with a correct format that you included in the SSM parameter?

HansScharrer commented 3 years ago

Hi @gockle In the log stream named "InstanceScheduler-20210520" I see the following -

2021-05-20T11:45:22.376-05:00 DEBUG : Event is { "action": "scheduler:run", "configuration": { "tag_name": "Schedule", "default_timezone": "UTC", "trace": true, "enable_SSM_maintenance_windows": true, "use_metrics": true, "schedule_clusters": true, "create_rds_snapshot": false, "schedule_lambda_account": false, "started_tags": "ScheduleMessage=Started on 2021/05/20 at 16:45 UTC", "stopped_tags": "ScheduleMessage=Stopped at 2021/05/20 at 16:45 UTC", "regions": [ "us-east-1" ], "cross_account_roles": [ "arn:aws:iam::[ACCOUNT_ID]:role/instance-scheduler-role", " arn:aws:iam::[ACCOUNT_ID]:role/instance-scheduler-role", " arn:aws:iam::[ACCOUNT_ID]:role/instance-scheduler-role", " arn:aws:iam::[ACCOUNT_ID]:role/instance-scheduler-role", "arn:aws:iam::[ACCOUNT_ID]:role/instance-scheduler-role", " arn:aws:iam::[ACCOUNT_ID]:role/instance-scheduler-role", " arn:aws:iam::[ACCOUNT_ID]:role/instance-scheduler-role", " arn:aws:iam::[ACCOUNT_ID]:role/instance-scheduler-role" ], "scheduled_services": [ "ec2" ],

I also see another log entry similar to the one above occuring nearly at the same time but rds is listed instead of ec2.

gockle commented 3 years ago

Hi @HansScharrer

I was able to recreate the scenario but only after making some changes to the lambda code, in the file

cloudwatch_event_handler (path lambda/requesthandlers/)

there is a method handle_request, here there is small conditional check running_as_lambda()

if this method (runnning_as_lambda) returns false, the function self._run_schedulers_in_process() will be executed.

Which will result in the scenario you are observing. This ideally should not happen, i.e the method running_as_lambda() should never return False when the solution is deployed using the CloudFormation Template provided by us here,

Can you try uploading the lambda code in your solution using the this URL Instance Scheduler Zip US-EAST-1.

Note this will completely remove the changes you might have made to the Lambda, so I suggest you take a back of the Lambda Code.

Please let me know if you will be able to try this and also if this resolves the issue.

Another note, there is another possibility which might have caused this issue, if the CloudWatch Event Rule which triggers the Lambda has an attribute no_lambda is set to true then this scenario will occur, again we don't host the solution with this attribute so I am not sure how this can accidently be part of the event object received by the file cloudwatch_event_handler.py

HansScharrer commented 3 years ago

Hi @gockle

Thanks for taking the time to reproduce the problem I was observing. I will go ahead and try the code you referenced in the "Instance Scheduler Zip US-EAST-1" link above.

I will keep you posted.

Thanks, Hans

HansScharrer commented 3 years ago

Hi @gockle

Bad news. Same behavior as before. Did this work for you on your end? If so, I am confused why I would see different behavior.

gockle commented 3 years ago

Hi @HansScharrer If you are referring to the cross_account roles yes it worked. Can you provide the cloudwatch event rule JSON, it will be logged into the logstream [STACK_NAME]-YYYYMMDD,

INFO   :  InstanceScheduler, version v1.4.0
DEBUG   : Event is {
    "version": "0",
    "id": "0c9e8bc1-c091-d377-bb94-28cde978dce9",
    "detail-type": "Scheduled Event",
    "source": "aws.events",
    "account": "[ACCOUNT_ID]",
    "time": "2021-05-20T20:20:00Z",
    "region": "[REGION]",
    "resources": [
        "arn:aws:events:[REGION]:[ACCOUNT_ID]:rule/[STACK_NAME]-SchedulerRule-ITPU68N4YOOZ"
    ],
    "detail": {}
}
HansScharrer commented 3 years ago

Hi @gockle After a fair amount of debugging I believe I may have discovered the bug.

It appears this section of code in the cloudwatch_event_handler.py -

    def number_of_accounts():
        self._logger.info(len(self.configuration.cross_account_roles))
        return len(self.configuration.cross_account_roles) + 1 if self.configuration.schedule_lambda_account else 0

is not properly handling if the Dynamodb entry "schedule_lambda_account" is set to "false". As soon as I set it to "true" which is the default and most likely the setting you were using then I observe the parallel and logging behavior you have mentioned above.

gockle commented 3 years ago

@HansScharrer Thanks for finding that, it sure seems to be the root cause of the issue, we will investigate this further and add an bugfix to the solution backlog, After making the change were you able to use the solution? Or are you planning to skip using the solution since you don't want to schedule instances in the account where you are deploying the solution? And let me know if there are further questions.

HansScharrer commented 3 years ago

@gockle What I will probably end up doing so that I can set "schedule_lambda_account" to "false" is change the line - return len(self.configuration.cross_account_roles) + 1 if self.configuration.schedule_lambda_account else 0 to - return len(self.configuration.cross_account_roles)

That appears to fix it for me. Unless you have a better idea?

gockle commented 3 years ago

Hi @HansScharrer For your scenario looks like the fix that you made will work.

lorchda commented 2 years ago

@gockle I wonder if the highlighted block is indeed correct. Shouldn't it be:

def number_of_accounts():
     return len(self.configuration.cross_account_roles) + 1 if self.configuration.schedule_lambda_account else len(self.configuration.cross_account_roles)

The total count of accounts is +1 if we consider the Lambda account for scheduling also, otherwise it's just the list of cross account roles. It should probably not be 0, unless I am missing something.

hearde commented 1 year ago

Fixed in v1.5.1