aws-solutions / instance-scheduler-on-aws

A cross-account and cross-region solution that allows customers to automatically start and stop EC2 and RDS Instances
https://aws.amazon.com/solutions/implementations/instance-scheduler-on-aws/
Apache License 2.0
542 stars, 264 forks

Invalid PhysicalResourceId and Received response status [FAILED] from custom resource #525

Closed MicahQuinland closed 3 months ago

MicahQuinland commented 6 months ago

Describe the bug

Very similar to this bug that was previously closed: https://github.com/aws-solutions/instance-scheduler-on-aws/issues/339

When I use 'Custom::ServiceInstanceSchedule' to create schedules I intermittently get the CloudFormation error 'Invalid PhysicalResourceId', and a retry of the deployment works the second time.

I also occasionally get this CloudFormation error: 'CREATE_FAILED | Received response status [FAILED] from custom resource.'

To Reproduce

To reproduce, simply deploy, update, and delete custom schedules via CloudFormation; eventually the error appears without warning. Most of the time it works, and sometimes it doesn't.

I am currently working with a customer that deploys the Scheduler Hub alongside the IaC schedules in one template via a pipeline. In this case it fails 75% of the time and works only after retrying.

Expected behavior

My customer would expect the Custom Resource creation to work consistently, every time.

Screenshots attached: Invalid PhysicalResourceId; Custom Resource Failed

aws-khargita commented 6 months ago

Thank you for opening this bug report.

We have identified the issue to be the parallel creation of multiple custom resources each attempting to create the same log stream at once. This results in a ResourceAlreadyExistsException.

Once the log stream exists, further invocations of the custom resource's Lambda handler will not try to recreate it, which would explain the success on retry. Since a new log stream is created daily, this problem will resurface on a daily basis.
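For readers who want to see the race in isolation, here is a minimal sketch of one common way to make log stream creation safe under parallel invocation: treat "already exists" as success. This is not the solution's actual code. `FakeLogsClient` is a hypothetical in-memory stand-in that only mimics the `create_log_stream` call and the `ResourceAlreadyExistsException` error of a CloudWatch Logs client, and the group and stream names are made up.

```python
import threading
from types import SimpleNamespace


class ResourceAlreadyExistsException(Exception):
    """Stand-in for the CloudWatch Logs error of the same name."""


class FakeLogsClient:
    """Hypothetical in-memory stand-in for a CloudWatch Logs client."""

    # Mirrors the `client.exceptions.<Name>` layout so the helper below
    # reads the same as it would against a real client.
    exceptions = SimpleNamespace(
        ResourceAlreadyExistsException=ResourceAlreadyExistsException
    )

    def __init__(self):
        self._streams = set()
        self._lock = threading.Lock()

    def create_log_stream(self, logGroupName, logStreamName):
        key = (logGroupName, logStreamName)
        with self._lock:
            if key in self._streams:
                raise ResourceAlreadyExistsException(logStreamName)
            self._streams.add(key)


def ensure_log_stream(client, group, stream):
    """Create the stream; return False if another caller created it first."""
    try:
        client.create_log_stream(logGroupName=group, logStreamName=stream)
        return True
    except client.exceptions.ResourceAlreadyExistsException:
        # Losing the race is fine: the stream exists, which is all we need.
        return False


def simulate_parallel_handlers(count=8):
    """Run `count` handlers at once, all targeting the same log stream."""
    client = FakeLogsClient()
    results = []
    results_lock = threading.Lock()

    def handler():
        created = ensure_log_stream(client, "scheduler-logs", "custom-resource-today")
        with results_lock:
            results.append(created)

    threads = [threading.Thread(target=handler) for _ in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With this pattern every handler finishes normally: exactly one call creates the stream and the rest fall through the `except` branch instead of failing the custom resource.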

A fix for this bug will be included in the upcoming release. For now, when using CloudFormation templates to manage schedules, you can use the DependsOn attribute to make one custom resource complete before the others start. This should prevent the race condition.

Example of what that would look like:

Resources:
  SampleSchedule1:
    Type: 'Custom::ServiceInstanceSchedule'
    Properties:
      ServiceToken: !Ref ServiceInstanceScheduleServiceTokenARN #do not edit this line
      NoStackPrefix: 'False'
      Name: my-renamed-sample-schedule
      Description: a full sample template for creating cfn schedules showing all possible values
      Timezone: America/New_York
      Enforced: 'True'
      Hibernate: 'True'
      RetainRunning: 'True'
      StopNewInstances: 'True'
      UseMaintenanceWindow: 'True'
      SsmMaintenanceWindow: 'my_window_name'
      Periods:
      - Description: run from 9-5 on the first 3 days of March
        BeginTime: '9:00'
        EndTime: '17:00'
        InstanceType: 't2.micro'
        MonthDays: '1-3'
        Months: '3'
      - Description: run from 2pm-5pm on the weekends
        BeginTime: '14:00'
        EndTime: '17:00'
        InstanceType: 't2.micro'
        WeekDays: 'Sat-Sun'

  SampleSchedule2:
    Type: 'Custom::ServiceInstanceSchedule'
    Properties:
      ServiceToken: !Ref ServiceInstanceScheduleServiceTokenARN #do not edit this line
      NoStackPrefix: 'True'
      Description: a sample template for creating simple cfn schedules
      Timezone: Europe/Amsterdam
      Periods:
      - Description: stop at 5pm every day
        EndTime: '17:00'
    DependsOn: SampleSchedule1
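
For templates with many schedules, the same workaround can be applied programmatically. The sketch below is only an illustration and assumes the template has already been loaded into a plain Python dict (loading YAML that contains tags such as `!Ref` needs a CloudFormation-aware parser, which is not shown). The function name `serialize_schedules` is hypothetical.

```python
import copy

SCHEDULE_TYPE = "Custom::ServiceInstanceSchedule"


def serialize_schedules(template):
    """Return a copy where every schedule after the first depends on the first."""
    result = copy.deepcopy(template)
    resources = result.get("Resources", {})
    schedule_ids = [
        logical_id
        for logical_id, resource in resources.items()
        if resource.get("Type") == SCHEDULE_TYPE
    ]
    if len(schedule_ids) < 2:
        return result  # nothing to order

    first = schedule_ids[0]
    for logical_id in schedule_ids[1:]:
        existing = resources[logical_id].get("DependsOn", [])
        # DependsOn may be a single logical ID or a list of them.
        if isinstance(existing, str):
            existing = [existing]
        if first not in existing:
            existing = [*existing, first]
        resources[logical_id]["DependsOn"] = (
            existing[0] if len(existing) == 1 else existing
        )
    return result
```

Only the first schedule runs alone; the remaining schedules can still be created in parallel with each other once it has completed, which matches the workaround described above.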

aws-khargita commented 5 months ago

Hi @TranVanDung-Leo, we have identified this issue and have a pending fix for it. It will be resolved in the next release of Instance Scheduler.

FugroEgger commented 3 months ago

FYI, the issue still exists in the current version, 1.5.6 (released 5/2024). Using the DependsOn property still works around it.

aws-khargita commented 3 months ago

Hi @FugroEgger, 1.5.6 was an unexpected patch release to resolve two CVEs. To clarify, this fix will be in the next major release. Apologies for the confusion!

CrypticCabub commented 3 months ago

Fixed in v3.0.0