getsocial-rnd / neo4j-aws-causal-cluster

Neo4j Enterprise Causal Cluster on AWS ECS by GetSocial
Apache License 2.0

Neo4j Causal Cluster setup for AWS by GetSocial

A setup for a Neo4j Enterprise Causal Cluster on top of AWS ECS.

You can obtain Neo4j from the official website. Please contact sales@neo4j.com for Enterprise licensing.


Upgrade guide for 3.5.x -> 4.2.x


Content

  1. Why
  2. Features
  3. Includes
  4. Limitations
  5. Prerequisites
  6. About
    1. Core Servers
    2. Read Replicas
    3. Discovery endpoints
    4. Spot Setup
  7. Usage
  8. Upgrade version
    1. Patch version upgrades
    2. Major and minor version upgrades
  9. Neo4j cluster operations manual
  10. Troubleshooting

Why

Here at GetSocial, we started using Neo4j a few years ago to enhance our product with the power of social connections, and we have been looking for an architecture that can keep up with our growing service.

We run our infrastructure on AWS, and our first approach to hosting Neo4j was the HA Cluster architecture suggested at that time (neo4j-aws-ha-cluster). HA clustering has since been deprecated, and the new Causal Cluster architecture is now the preferred approach given its scaling possibilities and resilience to failing nodes.

Features

Includes

Limitations

Prerequisites

About

Infrastructure

The Neo4j graph database is deployed as a Causal Cluster (HA clustering is deprecated in the latest Neo4j versions). It uses Bolt, a highly efficient, lightweight binary client-server protocol designed for database applications.

Essentially, it is a Neo4j cluster with a minimum of 3 nodes. Three nodes are required for successful startup and leader election, but at runtime the cluster still functions with 2 healthy nodes. Rolling operations rely on this: nodes are removed or restarted one by one while the other two keep functioning.
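
The node counts above follow from majority voting. As a minimal sketch, assuming the core servers follow the usual majority rule of Raft-style consensus (the function names are illustrative and not part of this repository), the quorum size and the number of nodes that can be offline at once can be computed like this:

```shell
#!/bin/sh
# Majority quorum for a cluster of N voting members: floor(N / 2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# How many members can be offline while a majority is still available.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for cores in 3 5; do
  echo "cores=$cores quorum=$(quorum_size "$cores") tolerated_failures=$(tolerated_failures "$cores")"
done
```

With 3 core nodes the quorum is 2, so only one node can be taken out at a time, which is why rolling operations proceed node by node.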

The setup is logically split into 2 ECS clusters (yet it is still 1 Neo4j cluster):

Usage

  1. Create an ECR repository for the custom Neo4j images. You will need its URL, which looks like 1234567890.dkr.ecr.us-east-1.amazonaws.com/neo.

  2. Export the environment variables used by the makefile (customize the values first):

    export NEO_ECR_REPO=<paste here URL of your ECR repo>
    export NEO_AWS_REGION=<your AWS region>
  3. Build a Docker image and push it to your ECR:

    make push_image
  4. If you know what you are doing, feel free to modify cloudformation.yml in any way you like before spinning up the infrastructure. However, most things are customizable via parameters.

  5. Create a CloudFormation stack using cloudformation.yml with your parameters.

    If you don't need Read Replicas, you can set ReplicasCount=0 and ignore the rest of the Read Replica parameters (except SlaveSubnetID, for which you still need to choose a subnet).

    Parameters reference


    Neo4j License

    Parameter Description
    AcceptLicense Before using Neo4j, you must accept the license

    Global configurations

    Parameter Description
    VpcId Existing AWS VPC to deploy the Neo4j cluster in
    KeyName SSH key to use for cluster EC2 instances access
    ECSAMI ECS-optimized AMI version, given as an SSM AMI metadata parameter path. By default, the recommended AMI is used; however, keeping this setting as is during later stack updates may result in an unexpected AMI update (when a new AMI version becomes the recommended one). If you don't want the AMI to be updated, pin it to a specific version with a value like /aws/service/ecs/optimized-ami/amazon-linux-2/amzn2-ami-ecs-hvm-2.0.20181112-x86_64-ebs
    NodeSecurityGroups List of additional Security Groups to assign to the EC2 instances (for example, your custom SG for SSH or VPN access)
    SNSTopicArn SNS topic to send CloudWatch alerts to. You can provide the ARN of an existing topic; if you don't specify one, a new topic will be created

    Core Nodes Configuration

    Parameter Description
    ClusterInstanceType AWS instance type to use for the Neo4j Cluster Core nodes (spot instances can be used, see details)
    SubnetID List of subnets to deploy your cluster into. Must include at least 3 subnets in different AZs (see details)
    DesiredCapacity Number of desired Neo4j Core nodes. Must be at least 3 and must match the number of subnets in different AZs (see details)
    EBSSize Size of EBS volume for Neo4j data in GBs
    EBSType Type of EBS volume

    Read replicas configuration

    Parameter Description
    ReplicasInstanceType AWS instance type to use for the Neo4j Cluster Read Replicas (spot instances can be used, see details)
    ReplicasCount Number of desired Neo4j Read Replicas. Set to 0 if you don't want to deploy Read Replicas; in that case none of the resources associated with Read Replicas are created, and all other replica-related parameters can be ignored
    ReplicasSubnetID List of subnets to deploy your read replicas into.

    Docker image configurations

    Parameter Description
    DockerImage URL of your custom-built Neo4j image (in the format 111111111111.dkr.ecr.us-east-1.amazonaws.com/neo:c531de3a6655b8c885330ca91b867431760392bf)
    DockerECRARN ARN of your private ECR repo (in the format arn:aws:ecr:us-east-1:111111111111:repository/neo)

    Neo4j users configuration

    Parameter Description
    AdminUser Must be neo4j
    AdminPassword Password for the neo4j user
    ReadOnlyUser Name for the Neo4j Read-Only user
    ReadOnlyUserPassword Password for the Neo4j Read-Only user

    Cloud Map discovery settings

    Parameter Description
    CloudMapNamespaceID ID of an existing CloudMap Namespace to use for discovery. If not set, a new CloudMap Namespace will be created for you automatically
    CloudMapNamespaceName Name of the CloudMap Namespace. If CloudMapNamespaceID is set to use an existing Namespace, CloudMapNamespaceName should match the existing Namespace name. If CloudMapNamespaceID is not set, a new CloudMap Namespace will be created automatically with the provided name (see more details)
    Neo4jCoreSubdomain The subdomain used for the Neo4j Core Cluster, in the form <subdomain>.<namespace>, for example core.neo4j.service. Defaults to core
    Neo4jReplicasSubdomain The subdomain used for the Neo4j Read Replicas, in the form <subdomain>.<namespace>, for example replica.neo4j.service. Defaults to replica

    Neo4j operations

    Parameter Description
    BackupPath Full S3 path (in the format <bucket_name>/path/to/backup.zip). This parameter is used ONLY when you want to start/restore the Neo4j cluster from a backup (see related limitations)
    BackupHourlyStoreForDays Number of days to keep hourly backups
    BackupDailyStoreForDays Number of days to keep daily backups (the hourly backup made at midnight is considered the daily backup)
    IsDrainSupported If set to true, instances are replaced automatically in a rolling way using ASG Termination Hooks. Set this parameter to true only after deploying ecs-drain-lambda (see additional information). If it is set to false, a Rolling Update of the Auto Scaling Group is not triggered automatically, because that would result in downtime; you will need to do the operations manually (not recommended)
    SlowQueryLog Enables logging of Cypher queries that take longer than 500 ms to CloudWatch Logs. If enabled, an additional tiny sidecar container is deployed next to the Neo4j containers to tail slow_query.log and push it to CloudWatch Logs (an additional CloudWatch log group is created as well)
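
The sizing rules from the parameters reference can be checked before the stack is created. The following is a minimal sketch and not part of this repository: the function names and subnet IDs are placeholders assumed for illustration, and only a few of the parameter keys from the reference are shown. It verifies that DesiredCapacity is at least 3 and equals the number of subnets, then prints a parameters array in the JSON format the AWS CLI accepts:

```shell
#!/bin/sh
# Sketch only: subnet IDs are placeholders, and just a few parameter keys are shown.

# Count the items in a comma-separated list (an empty string counts as 0).
count_items() {
  [ -n "$1" ] || { echo 0; return 0; }
  ( IFS=','; set -f; set -- $1; echo $# )
}

# Succeed only if the capacity is >= 3 and matches the number of subnets.
validate_core_sizing() {
  desired=$1
  subnet_count=$(count_items "$2")
  if [ "$desired" -lt 3 ]; then
    echo "DesiredCapacity must be at least 3 (got $desired)" >&2
    return 1
  fi
  if [ "$desired" -ne "$subnet_count" ]; then
    echo "DesiredCapacity ($desired) must match the subnet count ($subnet_count)" >&2
    return 1
  fi
}

# Print a JSON parameters array for the given capacity and subnet list.
render_params() {
  validate_core_sizing "$1" "$2" || return 1
  printf '[\n'
  printf '  {"ParameterKey": "DesiredCapacity", "ParameterValue": "%s"},\n' "$1"
  printf '  {"ParameterKey": "SubnetID", "ParameterValue": "%s"},\n' "$2"
  printf '  {"ParameterKey": "ReplicasCount", "ParameterValue": "0"}\n'
  printf ']\n'
}

render_params 3 "subnet-aaa,subnet-bbb,subnet-ccc"
```

After adding the remaining required parameters, the output could be saved as params.json and passed to aws cloudformation create-stack with --template-body file://cloudformation.yml --parameters file://params.json.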

Upgrade version

Patch version upgrades

Upgrades between patch versions can be done as rolling upgrades. However, this is possible only when no store format upgrade is needed (see the release notes of the particular version).

Major and minor version upgrades

Major and minor version upgrades have not been tested yet and may require an offline upgrade.

Neo4j cluster operations manual

Most cluster operations are done via the ECS Console:

Troubleshooting

In most cases, the procedure is as follows:

  1. Try the Neo4j UI on port 7474. Run :sysinfo to see the nodes present in the cluster.
  2. Check the CloudWatch Logs output of the Neo4j container for any problems.
  3. Check the Neo4j ECS Cluster/Service/Tasks for stopped tasks, ECS/Docker errors, and container exit codes.
  4. Check the Neo4j debug logs for any problems. Debug logs can be found on the disk, or in CloudWatch Logs if CaptureDebugLogs is enabled. The ECS tasks can be found in the AWS ECS Console.

Possible problems:

  1. The cluster leader keeps changing. Possible reasons are:

    • Neo4j containers are being restarted. Check ECS for stopped tasks; if there are any, containers are being restarted and you need to figure out why. They can be killed due to a HealthCheck failure, an instance replacement, or some internal error, and the ECS Console should tell you which. If it is an internal error, check the Neo4j container output logs for errors.

    • Neo4j re-elects the cluster leader because of long GC pauses. You should be able to see the corresponding entries in the debug log. The causes are usually application-related, such as a very heavy query.

  2. The cluster is not forming. Check the output of the Neo4j containers (in CloudWatch Logs) or the debug logs. Discovery happens in the cloudmap_discover function of the ecs-extension.sh script via calls to the AWS CloudMap API, so if any of the containers cannot start during the cluster-forming stage, discovery will fail.

    Also, check for configuration problems. Remember that the number of configured Availability Zones (via subnets) must match the number of core nodes in the cluster, so that the RexRay plugin can create one EBS volume per AZ.

TODO