getsocial-rnd / neo4j-aws-causal-cluster

Neo4j Enterprise Causal Cluster on AWS ECS by GetSocial
Apache License 2.0

CloudFormation starts rolling back after Neo4jClusterAutoScalingGroup #2

Closed bkaganyildiz closed 5 years ago

bkaganyildiz commented 5 years ago

I've tried to get the cluster up and running, but the CF template gets stuck at Neo4jClusterAutoScalingGroup with the Status Reason: Received 0 SUCCESS signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement. With no backup configured, this is where I'm stuck. When ReplicasCount is 1, the template also fails at Neo4jReplicasTask with the Status Reason: Property Environment contains duplicate values. I'm not sure whether I've done something wrong, whether there are prerequisites I need to complete, or whether there's a problem with the CF template.

taraspos commented 5 years ago

@bkaganyildiz thanks for raising the issue!

I would need some information from you so I can try to replicate your problem.

  1. Which region are you using?
  2. Can you share the Parameters you used for the template (except passwords of course :))
  3. Did you build the Docker image for neo4j as mentioned in the Usage section?

bkaganyildiz commented 5 years ago

  1. I'm using eu-west-1
  2. Parameters:
    • Do you agree? true
    • VpcId? neo4j-test
    • KeyName? neo4j-test
    • ECSAMI? /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id
    • NodeSecurityGroups? neo4j-test-ssh-access
    • SNSTopicArn?
    • ClusterInstanceType? t2.medium
    • SubnetID? neo4j-test-eu-west-1b, neo4j-test-eu-west-1a, neo4j-test-eu-west-1c
    • DesiredCapacity? 3
    • EBSSize? 10
    • EBSType? gp2
    • ReplicasInstanceType? t2.medium
    • ReplicasCount? 0 and 1; I tried both, and it failed in either case.
    • ReplicasSubnetID? neo4j-test-replica-eu-west-1a, neo4j-test-replica-eu-west-1b, neo4j-test-replica-eu-west-1c
    • DockerImage? ${ACCOUNT_ID}.dkr.ecr.eu-west-1.amazonaws.com/repository/neo:causal
    • DockerECRARN? arn:aws:ecr:eu-west-1:${ACCOUNT_ID}:repository/repository/neo
    • AdminUser? neo4j
    • AdminPass? ***
    • ReadOnlyUser? neo4j-read
    • ReadOnlyUserPassword? ***
    • CloudMapNamespaceID?
    • CloudMapNamespaceName? neo4j.testing
    • Neo4jCoreSubdomain? core
    • Neo4jReplicasSubdomain? replica
    • BackupPath?
    • BackupHourlyStoreForDays? 1
    • BackupDailyStoreForDays? 14
    • AllowUpgrade? false
    • Does your AWS account has system for automatic ECS instance draining deployed? true and false; I tried both.
    • SlowQueryLog? disabled.
  3. Yes, I've pushed the image to ECR using make push_image

taraspos commented 5 years ago

Do your subnets have access to the internet? Is an Internet Gateway or NAT Gateway attached?

Neo4j instances have to call back to the CloudFormation service to report a successful launch. If the instances have no internet access, that call will fail and CloudFormation never receives the signal.

This is my guess based on the message: Received 0 SUCCESS signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement
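
One way to test this guess from inside an instance is a plain TCP connection attempt to the regional CloudFormation endpoint. The following is only a minimal sketch using the Python standard library; the function name can_reach is made up for illustration, and the endpoint host assumes the eu-west-1 region mentioned in this thread.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves that the network path is open (DNS, routing, security
    groups); it does not send a real CloudFormation signal.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example (run on the instance itself; hostname assumes eu-west-1):
#   can_reach("cloudformation.eu-west-1.amazonaws.com")
```

If this returns False on an instance, the success signal is unlikely to get through either, which would match the "Received 0 SUCCESS signal(s)" message.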

taraspos commented 5 years ago

Also, this is probably not related to the current problem, but your ECR ARN doesn't seem to be correct. It looks like: arn:aws:ecr:eu-west-1:${ACCOUNT_ID}:repository/repository/neo but should be: arn:aws:ecr:eu-west-1:${ACCOUNT_ID}:repository/neo

bkaganyildiz commented 5 years ago

Yes, they're attached to an IGW.

bkaganyildiz commented 5 years ago

When creating the repository, I named it repository/neo. I guess that's the reason for the duplicated repository in the ARN.
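
For reference, an ECR repository ARN has the form arn:aws:ecr:REGION:ACCOUNT_ID:repository/REPOSITORY_NAME, so a repository whose name itself starts with repository/ produces the doubled segment. Below is a small sketch that derives the ARN from an image URI; the helper name ecr_arn_from_image and the account ID 123456789012 are placeholders, and it assumes the standard aws partition.

```python
import re

# ACCOUNT.dkr.ecr.REGION.amazonaws.com/REPOSITORY[:TAG or @DIGEST]
_ECR_IMAGE = re.compile(
    r"^(?P<account>[^.]+)\.dkr\.ecr\.(?P<region>[^.]+)\.amazonaws\.com/"
    r"(?P<repo>[^:@]+)(?:[:@].+)?$"
)


def ecr_arn_from_image(image: str) -> str:
    """Build the ECR repository ARN that corresponds to an image URI."""
    match = _ECR_IMAGE.match(image)
    if match is None:
        raise ValueError(f"not an ECR image URI: {image!r}")
    # The repository name is used as-is, including any slashes it contains.
    return (
        f"arn:aws:ecr:{match['region']}:{match['account']}"
        f":repository/{match['repo']}"
    )


# Placeholder account ID; the repository name mirrors the one in this thread.
print(ecr_arn_from_image(
    "123456789012.dkr.ecr.eu-west-1.amazonaws.com/repository/neo:causal"
))
```

With a repository named just neo, the same helper would yield an ARN ending in :repository/neo.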

taraspos commented 5 years ago

Can you please go to https://console.aws.amazon.com/ec2/autoscaling/home, find the autoscaling group for the neo4j core instances, and check the Activity History tab? Are there any errors there?

bkaganyildiz commented 5 years ago

I guess the rollback removed the ASG after the failed attempt. Let me run the template again and try to catch the error during creation.

taraspos commented 5 years ago

Yes, please. Also, once it fails, can you please copy (or take a screenshot of) all the failed CloudFormation events/resources?

taraspos commented 5 years ago

One more thing: if you see that the instances are created, please check whether they have a public IP attached, or even SSH in and check the internet connection.

Also, the logs /var/log/cloud-init.log and /var/log/cloud-init-output.log will be helpful as well.

bkaganyildiz commented 5 years ago

Timestamp Logical ID Status Status reason
2019-10-08 17:01:04 UTC+0300 Neo4jClusterAutoScalingGroup CREATE_FAILED Received 0 SUCCESS signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement
2019-10-08 17:01:04 UTC+0300 Neo4jClusterAutoScalingGroup UPDATE_IN_PROGRESS Failed to receive 1 resource signal(s) for the current batch. Each resource signal timeout is counted as a FAILURE.
2019-10-08 16:46:02 UTC+0300 Neo4jClusterAutoScalingGroup CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:46:01 UTC+0300 Neo4jClusterAutoScalingGroup CREATE_IN_PROGRESS -
2019-10-08 16:45:56 UTC+0300 Neo4jClusterLanuchTemplate CREATE_COMPLETE -
2019-10-08 16:45:56 UTC+0300 Neo4jClusterLanuchTemplate CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:45:55 UTC+0300 Neo4jClusterLanuchTemplate CREATE_IN_PROGRESS -
2019-10-08 16:45:52 UTC+0300 Neo4jInstanceProfile CREATE_COMPLETE -
2019-10-08 16:45:30 UTC+0300 TaskStateChangeInvokeLambdaPermission CREATE_COMPLETE -
2019-10-08 16:45:19 UTC+0300 TaskStateChangeInvokeLambdaPermission CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:45:19 UTC+0300 TaskStateChangeInvokeLambdaPermission CREATE_IN_PROGRESS -
2019-10-08 16:45:16 UTC+0300 TaskStateChangeRule CREATE_COMPLETE -
2019-10-08 16:45:12 UTC+0300 TaskMonitorEventRule CREATE_COMPLETE -
2019-10-08 16:45:02 UTC+0300 BackupEventRule CREATE_COMPLETE -
2019-10-08 16:44:15 UTC+0300 TaskStateChangeRule CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:44:15 UTC+0300 TaskStateChangeRule CREATE_IN_PROGRESS -
2019-10-08 16:44:12 UTC+0300 CloudMapSyncFunction CREATE_COMPLETE -
2019-10-08 16:44:12 UTC+0300 CloudMapSyncFunction CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:44:12 UTC+0300 Neo4jClusterECSservice CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:44:11 UTC+0300 TaskMonitorEventRule CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:44:11 UTC+0300 Neo4jClusterECSservice CREATE_IN_PROGRESS -
2019-10-08 16:44:11 UTC+0300 CloudMapSyncFunction CREATE_IN_PROGRESS -
2019-10-08 16:44:11 UTC+0300 TaskMonitorEventRule CREATE_IN_PROGRESS -
2019-10-08 16:44:08 UTC+0300 Neo4jClusterTask CREATE_COMPLETE -
2019-10-08 16:44:08 UTC+0300 Neo4jClusterTask CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:44:08 UTC+0300 Neo4jClusterTask CREATE_IN_PROGRESS -
2019-10-08 16:44:01 UTC+0300 BackupEventRule CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:44:01 UTC+0300 DiscoveryServiceCoreA CREATE_COMPLETE -
2019-10-08 16:44:01 UTC+0300 DiscoveryServiceCoreA CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:44:01 UTC+0300 BackupEventRule CREATE_IN_PROGRESS -
2019-10-08 16:44:00 UTC+0300 DiscoveryServiceCoreSRV CREATE_COMPLETE -
2019-10-08 16:44:00 UTC+0300 DiscoveryServiceCoreSRV CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:59 UTC+0300 DiscoveryServiceCoreA CREATE_IN_PROGRESS -
2019-10-08 16:43:59 UTC+0300 DiscoveryServiceCoreSRV CREATE_IN_PROGRESS -
2019-10-08 16:43:57 UTC+0300 BackupEventRole CREATE_COMPLETE -
2019-10-08 16:43:56 UTC+0300 DiscoveryNamespace CREATE_COMPLETE -
2019-10-08 16:43:51 UTC+0300 Neo4jInstanceProfile CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:51 UTC+0300 Neo4jInstanceProfile CREATE_IN_PROGRESS -
2019-10-08 16:43:48 UTC+0300 EC2Role CREATE_COMPLETE -
2019-10-08 16:43:38 UTC+0300 BackupEventRole CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:38 UTC+0300 BackupEventRole CREATE_IN_PROGRESS -
2019-10-08 16:43:35 UTC+0300 Neo4jBackupTask CREATE_COMPLETE -
2019-10-08 16:43:35 UTC+0300 EC2Role CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:35 UTC+0300 Neo4jBackupTask CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:35 UTC+0300 CloudMapSyncFuncRole CREATE_COMPLETE -
2019-10-08 16:43:35 UTC+0300 EC2Role CREATE_IN_PROGRESS -
2019-10-08 16:43:35 UTC+0300 Neo4jBackupTask CREATE_IN_PROGRESS -
2019-10-08 16:43:31 UTC+0300 BackupBucket CREATE_COMPLETE -
2019-10-08 16:43:22 UTC+0300 TaskMonitorEventSNSPolicy CREATE_COMPLETE -
2019-10-08 16:43:22 UTC+0300 Neo4jClusterHighCPU CREATE_COMPLETE -
2019-10-08 16:43:22 UTC+0300 TaskMonitorEventSNSPolicy CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:22 UTC+0300 Neo4jClusterHighMemory CREATE_COMPLETE -
2019-10-08 16:43:22 UTC+0300 Neo4jClusterHighCPU CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:22 UTC+0300 Neo4jClusterHighMemory CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:22 UTC+0300 TaskMonitorEventSNSPolicy CREATE_IN_PROGRESS -
2019-10-08 16:43:22 UTC+0300 Neo4jClusterHighCPU CREATE_IN_PROGRESS -
2019-10-08 16:43:21 UTC+0300 Neo4jClusterHighMemory CREATE_IN_PROGRESS -
2019-10-08 16:43:19 UTC+0300 Neo4jSNSTopic CREATE_COMPLETE -
2019-10-08 16:43:18 UTC+0300 Neo4jSecurityGroupHTTPinbound CREATE_COMPLETE -
2019-10-08 16:43:18 UTC+0300 Neo4jSecurityGroupTransactionInboundFromCluster CREATE_COMPLETE -
2019-10-08 16:43:18 UTC+0300 Neo4jSecurityGroupBoltInboundFromCluster CREATE_COMPLETE -
2019-10-08 16:43:18 UTC+0300 Neo4jSecurityGroupHTTPinboundFromCluster CREATE_COMPLETE -
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupTransactionInboundFromCluster CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupBoltInboundFromCluster CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupBoltInbound CREATE_COMPLETE -
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupRaftInboundFromCluster CREATE_COMPLETE -
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupTransactionInboundFromCluster CREATE_IN_PROGRESS -
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupDiscoveryInboundFromCluster CREATE_COMPLETE -
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupBoltInboundFromCluster CREATE_IN_PROGRESS -
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupHTTPinboundFromCluster CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupHTTPinboundFromCluster CREATE_IN_PROGRESS -
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupHTTPinbound CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupHTTPinbound CREATE_IN_PROGRESS -
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupBoltInbound CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupRaftInboundFromCluster CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupDiscoveryInboundFromCluster CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupBoltInbound CREATE_IN_PROGRESS -
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupRaftInboundFromCluster CREATE_IN_PROGRESS -
2019-10-08 16:43:17 UTC+0300 Neo4jSecurityGroupDiscoveryInboundFromCluster CREATE_IN_PROGRESS -
2019-10-08 16:43:14 UTC+0300 Neo4jSecurityGroup CREATE_COMPLETE -
2019-10-08 16:43:13 UTC+0300 Neo4jSecurityGroup CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:13 UTC+0300 CloudMapSyncFuncRole CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:13 UTC+0300 CloudMapSyncFuncRole CREATE_IN_PROGRESS -
2019-10-08 16:43:10 UTC+0300 DiscoveryNamespace CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:10 UTC+0300 BackupBucket CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:10 UTC+0300 CloudwatchLogsGroup CREATE_COMPLETE -
2019-10-08 16:43:10 UTC+0300 LambdaLogGroup CREATE_COMPLETE -
2019-10-08 16:43:09 UTC+0300 LambdaLogGroup CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:09 UTC+0300 CloudwatchLogsGroup CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:09 UTC+0300 BackupBucket CREATE_IN_PROGRESS -
2019-10-08 16:43:09 UTC+0300 LambdaLogGroup CREATE_IN_PROGRESS -
2019-10-08 16:43:09 UTC+0300 DiscoveryNamespace CREATE_IN_PROGRESS -
2019-10-08 16:43:09 UTC+0300 Neo4jSecurityGroup CREATE_IN_PROGRESS -
2019-10-08 16:43:09 UTC+0300 CloudwatchLogsGroup CREATE_IN_PROGRESS -
2019-10-08 16:43:08 UTC+0300 Neo4jCluster CREATE_COMPLETE -
2019-10-08 16:43:08 UTC+0300 Neo4jSNSTopic CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:08 UTC+0300 Neo4jCluster CREATE_IN_PROGRESS Resource creation Initiated
2019-10-08 16:43:08 UTC+0300 Neo4jSNSTopic CREATE_IN_PROGRESS -
2019-10-08 16:43:08 UTC+0300 Neo4jCluster CREATE_IN_PROGRESS -
2019-10-08 16:43:03 UTC+0300 neo4j-test-asg-error CREATE_IN_PROGRESS User Initiated

bkaganyildiz commented 5 years ago

These are the events from the failed CF stack. I also checked the ASG, and what I don't understand is that the instances were all healthy and successfully created. After the rollback, I saw this warning message in the Activity Log:

At 2019-10-08T14:01:34Z instance <Instance_ID1> was selected for termination. At 2019-10-08T14:01:24Z a user request update of AutoScalingGroup constraints to min: 0, max: 0, desired: 0 changing the desired capacity from 3 to 0. At 2019-10-08T14:01:34Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 3 to 0. At 2019-10-08T14:01:34Z instance <Instance_ID1> was selected for termination. At 2019-10-08T14:01:34Z instance <Instance_ID2> was selected for termination.

Also I've checked the instances and they do not have any Public IP that's attached to them.

taraspos commented 5 years ago

Also I've checked the instances and they do not have any Public IP that's attached to them.

That must be the problem. It looks like the instances don't have internet access because no public IP is assigned to them. You will need to go to your VPC Subnets configuration and set Auto-assign public IPv4 address to Yes for each subnet you use, then try creating the cluster again. It should work (hopefully).
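
If you would rather script this than click through the console, the same setting is exposed in the EC2 API as the subnet attribute MapPublicIpOnLaunch. This is a rough sketch, assuming boto3 is installed and credentials are configured; the helper name and the subnet IDs in the usage comment are placeholders, not values from this thread.

```python
from typing import Any, Iterable, List


def enable_auto_assign_public_ip(ec2_client: Any, subnet_ids: Iterable[str]) -> List[str]:
    """Turn on "Auto-assign public IPv4 address" for each given subnet.

    ec2_client is expected to be a boto3 EC2 client (or anything exposing
    the same modify_subnet_attribute method). Returns the subnet IDs that
    were updated, in order.
    """
    updated: List[str] = []
    for subnet_id in subnet_ids:
        ec2_client.modify_subnet_attribute(
            SubnetId=subnet_id,
            MapPublicIpOnLaunch={"Value": True},
        )
        updated.append(subnet_id)
    return updated


# Usage (placeholder subnet IDs; region taken from this thread):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="eu-west-1")
#   enable_auto_assign_public_ip(ec2, ["subnet-aaaa1111", "subnet-bbbb2222"])
```

The setting only affects instances launched after the change, so the stack still has to be created again afterwards.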

bkaganyildiz commented 5 years ago

Thanks for the help, by the way. It's resolved.