Apache License 2.0

Neptune Export

Exports Amazon Neptune property graph data to CSV or JSON, or RDF graph data to Turtle.

Usage

Neptune-Export service

You can now deploy neptune-export as a service inside your Neptune VPC. Use these CloudFormation templates to install the Neptune-Export service.

Standalone usage

neptune-export can also be run as an ad hoc command. See prerequisites.

Best practices

Export from a cloned cluster

neptune-export cannot guarantee the consistency of exported data if you export from a Neptune cluster whose data is changing while the export is taking place. Therefore, we recommend exporting from a clone of your cluster. This ensures the export takes place against a static version of the data at the point in time the database was cloned. Further, exporting from a clone ensures the export doesn’t impact the query performance of the original cluster.

neptune-export makes it easy to export from a clone: supply the --clone-cluster option with the command. You can also use the --clone-cluster-replica-count option to specify the number of read replicas to add to the cloned cluster, and the --clone-cluster-instance-type option to specify which instance type (for example, db.r5.2xlarge) to use for each instance in the cloned cluster. By default, neptune-export uses the same instance type as the primary in the original cluster.

If you clone your cluster using the --clone-cluster option, neptune-export will ignore any --concurrency option supplied in the params, and will instead work out a concurrency setting based on the number of instances in the cloned cluster and their instance types.

If you use the cluster cloning features of neptune-export, you must ensure the AWS Identity and Access Management identity with which the process runs can perform the following actions:

Use a config file

Use the export-pg-from-config command in preference to export-pg when exporting property graphs from Neptune. The export-pg command makes two passes over your data: the first to generate metadata, the second to create the data files. This first pass takes place on a single thread, and for very large datasets can take many hours – often much longer than the export itself.

The preferred approach is to generate the metadata once using create-pg-config, store the config file in S3, and then refer to it from export-pg-from-config using the --config-file option.

Supply approximate node and edge counts

When performing a parallel export (--concurrency is larger than one), neptune-export must first query your database to determine the number of nodes and edges to be exported. These numbers are then used to calculate ranges for each query in a set of parallel queries. Counting the nodes and edges in a large dataset can take many minutes.

neptune-export now includes --approx-node-count and --approx-edge-count options that allow you to supply estimates for the number of nodes and edges you expect to export. By specifying approximate counts you can reduce the export time, because neptune-export will no longer have to query the database to count the nodes and edges.

The numbers you supply need only be approximate: an estimate within about ten percent of the real counts is fine. One way of calculating these numbers is to use the counts from a previous export, adjusted based on the approximate number of additions and deletions that have taken place in the interim.

Exporting to the Bulk Loader CSV Format

When exporting to the CSV format used by the Amazon Neptune bulk loader, neptune-export generates CSV files based on a schema derived from scanning your graph. This schema is persisted in a JSON file. There are three ways in which you can use the tool to generate bulk load files:

Generating schema

export-pg and create-pg-config both generate schema JSON files describing the properties associated with each node and edge label. By default, these commands will scan the entire database. For large datasets, this can take a long time.

Both commands also allow you to sample a range of nodes and edges in order to create this schema. If you are confident that sampling your data will yield the same schema as scanning the entire dataset, specify the --sample option with these commands. If, however, you have reason to believe the same property on different nodes or edges could yield different datatypes, or different cardinalities, or that nodes or edges with the same labels could contain different sets of properties, you should consider retaining the default behaviour of a full scan.

Once you have generated a schema file, either with export-pg or create-pg-config, you can reuse it for subsequent exports in export-pg-from-config. You can also modify the file to restrict the labels and properties that will be exported.

Label filters

All three commands allow you to supply vertex and edge label filters.

Token-only export

For some offline use cases you may want to export only the structural data in the graph: that is, just the labels and IDs of vertices and edges. export-pg allows you to specify a --tokens-only option with the value nodes, edges or both. A token-only export does not generate a schema, nor does it export any property data: for vertices it simply exports ~id and ~label; for edges, it exports ~id, ~from, ~to and ~label. You can still use label filters to determine exactly which vertices and edges will be exported.

Parallel export

The export-pg and export-pg-from-config commands support parallel export. You can supply a concurrency level, which determines the number of client threads used to perform the parallel export, and, optionally, a range or batch size, which determines how many nodes or edges will be queried by each thread at a time. If you specify a concurrency level but don't supply a range, the tool will calculate a range such that each thread queries approximately (number of nodes or edges) / (concurrency level) nodes or edges.
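The default range calculation described above can be sketched in Python. This is only an illustration: the function name compute_ranges and the choice to leave the final range open-ended (-1) are assumptions, not neptune-export's actual implementation.

```python
import math


def compute_ranges(total_count, concurrency):
    """Split total_count items into one (start, end) range per thread.

    Hypothetical helper. The last range uses -1 as an open end, mirroring
    Gremlin's range(low, -1) convention, so that an approximate count
    still covers every item.
    """
    if total_count < 0 or concurrency < 1:
        raise ValueError("total_count must be >= 0 and concurrency >= 1")
    # Each thread gets roughly total_count / concurrency items.
    range_size = max(1, math.ceil(total_count / concurrency))
    ranges = []
    for i in range(concurrency):
        start = i * range_size
        end = -1 if i == concurrency - 1 else start + range_size
        ranges.append((start, end))
    return ranges


print(compute_ranges(1000, 4))
# [(0, 250), (250, 500), (500, 750), (750, -1)]
```

Because the final range is open-ended in this sketch, a count that is slightly too low still reaches every node or edge; the last thread simply does a little more work.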

If using parallel export, we recommend setting the concurrency level to the number of vCPUs on your Neptune instance.

You can load balance requests across multiple instances in your cluster (or even multiple clusters) by supplying multiple --endpoint options.

Long-running queries

neptune-export uses long-running queries to generate the schema and the data files. You may need to increase the neptune_query_timeout DB parameter in order to run the tool against large datasets.

For large datasets, we recommend running this tool against a standalone database instance that has been restored from a snapshot of your database.

Serializer

The latest version of neptune-export uses the GraphBinary serialization format introduced in Gremlin 3.4.x. Previous versions of neptune-export used Gryo. To revert to using Gryo, supply --serializer GRYO_V3D0.

Character Encoding

neptune-export attempts to use the JVM system default text encoding for all output files. This can be configured manually if needed by setting the file.encoding system property.

java -Dfile.encoding=UTF8 -jar neptune-export.jar ...

Exporting the Results of User-Supplied Queries

neptune-export's export-pg-from-queries command allows you to supply groups of Gremlin queries and export the results to CSV or JSON.

Every user-supplied query should return a resultset whose every result comprises a Map. Typically, these are queries that return a valueMap() or a projection created using project().by().by()....

Queries are grouped into named groups. All the queries in a named group should return the same columns. Named groups allow you to 'shard' large queries and execute them in parallel (using the --concurrency option). Note that query sharding is not done automatically, so if you just supply one query, you will get no benefit from increasing the concurrency level past one. The resulting CSV or JSON files will be written to a directory named after the group.

If there is a possibility that individual rows in a query's resultset will contain different keys, use the --two-pass-analysis flag to force neptune-export to determine the superset of keys or column headers for the query.

You can supply multiple named groups using multiple --queries options. Each group comprises a name, an equals sign, and then a semi-colon-delimited list of Gremlin queries. Surround the list of queries in double quotes. For example:

-q person="g.V().hasLabel('Person').range(0,100000).valueMap();g.V().hasLabel('Person').range(100000,-1).valueMap()"

Sharding queries for concurrent execution can create a large number of queries, especially with a high concurrency level. In order to avoid inputting all of these queries as command-line arguments, you can also supply them in a JSON file with the --queriesFile option. The JSON file should be formatted like this:

[
  {
    "name": "NamedQueryGroup",
    "queries": ["list", "of", "sharded", "queries", "in", "group"]
  },
  ...
]

This file can be given as a local path, or over https or s3.
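As a rough illustration, a queries file in this shape could be generated with a short Python script rather than written by hand. The helper name build_query_group, the label, and the counts below are all hypothetical; only the JSON layout ("name" and "queries") comes from the format shown above.

```python
import json


def build_query_group(name, label, total_count, shards):
    """Return one named group of range()-sharded valueMap() queries.

    Hypothetical helper: total_count is an estimate supplied by the
    caller, and the last shard is left open-ended with -1.
    """
    size = max(1, -(-total_count // shards))  # ceiling division
    queries = []
    for i in range(shards):
        low = i * size
        high = -1 if i == shards - 1 else low + size
        queries.append(
            f"g.V().hasLabel('{label}').range({low},{high}).valueMap()"
        )
    return {"name": name, "queries": queries}


groups = [build_query_group("person", "Person", 200000, 2)]

# Print the JSON document; redirect it to a file to use with --queriesFile.
print(json.dumps(groups, indent=2))
```

With these inputs the group contains the same two queries as the -q person=... example: one for range(0,100000) and one for range(100000,-1).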

Split Queries

The --split-queries option may be used to automatically shard queries. When invoked, the tool will calculate ranges in the same manner as the export-pg command's parallel export, and then split each query into --concurrency number of shards.

The sharded queries use injected range() steps at the beginning of the query to divide the ranges. For example, g.V().hasLabel("person") may be sharded as:

g.V().range(0, 250).hasLabel("person")
g.V().range(250, 500).hasLabel("person")
g.V().range(500, 750).hasLabel("person")
g.V().range(750, -1).hasLabel("person")

This range()-based sharding may not be uniformly balanced, and may produce different results with certain queries. Any Gremlin steps that operate on the entire input stream at once (such as order(), dedup(), and group()) should be used with caution, as this sharding inevitably alters their inputs.
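To see why whole-stream steps are affected, the following Python sketch simulates a traversal stream as a plain list and compares a dedup() over the full stream with a dedup() applied separately to each shard. It does not call Gremlin or neptune-export; the data and the dedup helper are purely illustrative.

```python
# Simulate a traversal stream as a plain list of property values.
names = ["ann", "bob", "ann", "cat", "bob", "ann"]


def dedup(values):
    """Keep the first occurrence of each value, like Gremlin's dedup()."""
    seen = set()
    unique = []
    for value in values:
        if value not in seen:
            seen.add(value)
            unique.append(value)
    return unique


# Unsharded: dedup() sees the whole stream.
unsharded = dedup(names)

# Sharded: with range() injected first, dedup() only sees its own slice,
# so duplicates that span shards survive in the combined output.
shards = [names[0:3], names[3:]]
sharded = [value for shard in shards for value in dedup(shard)]

print(unsharded)  # ['ann', 'bob', 'cat']
print(sharded)    # ['ann', 'bob', 'cat', 'bob', 'ann']
```

The same reasoning applies to order() and group(): each shard is sorted or grouped on its own, so the concatenated output is not what the unsharded query would return.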

For any queries which are incompatible with range()-based sharding, or in situations where more precise balancing is required, it is recommended to avoid using --split-queries and instead provide a --queriesFile with pre-sharded queries.

Parallel execution of queries

If using parallel export, we recommend setting the concurrency level to the number of vCPUs on your Neptune instance. When neptune-export executes named groups of queries in parallel, it simply flattens all the queries into a queue, and spins up a pool of worker threads according to the concurrency level you have specified using --concurrency. Worker threads continue to take queries from the queue until the queue is exhausted.
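The queue-and-workers model described above can be sketched in Python as follows. This is an assumption-laden illustration, not the tool's code: run_query_groups and the execute callback are hypothetical names, and the "queries" are placeholder strings.

```python
import queue
import threading


def run_query_groups(groups, concurrency, execute):
    """Flatten named groups into one queue and drain it with worker threads.

    `execute` is a caller-supplied function (hypothetical here) that runs
    one query and returns its result.
    """
    work = queue.Queue()
    for name, queries in groups.items():
        for query in queries:
            work.put((name, query))

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name, query = work.get_nowait()
            except queue.Empty:
                return  # queue exhausted, so this worker stops
            outcome = execute(query)
            with lock:
                results.append((name, query, outcome))

    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


groups = {
    "person": ["p-shard-0", "p-shard-1"],
    "company": ["c-shard-0"],
}
# len stands in for a real query executor in this sketch.
done = run_query_groups(groups, concurrency=2, execute=len)
print(sorted(done))
```

The sketch also shows why a single unsharded query gains nothing from a higher concurrency level: with one item in the queue, only one worker ever has work to do.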

Batching

Queries whose results contain very large rows can sometimes trigger a CorruptedFrameException. If this happens, you can either adjust the batch size (--batch-size) to reduce the number of results returned to the client in a batch (the default is 64), or increase the frame size (--max-content-length, default value 65536).

Exporting an RDF Graph

At present neptune-export supports exporting an RDF dataset to Turtle, NQuads, and NTriples with a single-threaded long-running query.

Exporting Named Graphs

The default scope for export-rdf is to export the entire dataset (union of all named graphs). Use the --named-graph <NamedGraphURI> argument to limit the scope to a single named graph. This can only be used with the default graph scope (--rdf-export-scope graph).

Exporting from User-Supplied SPARQL query

To export the results from a SPARQL query, use the --rdf-export-scope query and --sparql <SPARQL Query> arguments.

Security

Encryption in transit

By default, neptune-export connects to your database using SSL. If your target does not support SSL connections, use the --disable-ssl flag.

(SSL used to be an opt-in feature for neptune-export, with a --use-ssl option for turning SSL on. This behaviour has changed: SSL is now on by default, but can be turned off using --disable-ssl. The --use-ssl option no longer has any effect.)

If you are using a load balancer or a proxy server (such as HAProxy), you must use SSL termination and have your own SSL certificate on the proxy server.

IAM DB authentication

neptune-export supports exporting from databases that have IAM database authentication enabled. Supply the --use-iam-auth option with each command. Remember to set the SERVICE_REGION environment variable – e.g. export SERVICE_REGION=us-east-1.

neptune-export also supports connecting through a load balancer to a Neptune database with IAM DB authentication enabled. However, this feature is currently supported only for property graphs, with support for RDF graphs coming soon.

If you are connecting through a load balancer, and have IAM DB authentication enabled, you must also supply either an --nlb-endpoint option (if using a network load balancer) or an --alb-endpoint option (if using an application load balancer), and an --lb-port.

For details on using a load balancer with a database with IAM DB authentication enabled, see Connecting to Amazon Neptune from Clients Outside the Neptune VPC.

Exporting to an Amazon Kinesis Data Stream

When exporting to an Amazon Kinesis Data Stream, records are aggregated by default: that is, multiple exported records are packed into a single Kinesis Data Streams record. In your stream client you will need to deaggregate the records. If you are using a Python client, you can use the record deaggregation module from the Kinesis Aggregation/Deaggregation Modules for Python.

You can turn off stream record aggregation when you export to a Kinesis Data Stream using the --disable-stream-aggregation option.

Building neptune-export

To build the jar, run:

mvn clean install

Deploying neptune-export as an AWS Lambda Function

The neptune-export jar can be deployed as an AWS Lambda function. To access Neptune, you will either have to configure the function to access resources inside your VPC, or expose the Neptune endpoints via a load balancer.

Be mindful of the AWS Lambda limits, particularly with regard to function timeouts (max 15 minutes) and /tmp directory storage (512 MB). Large exports can easily exceed these limits.

When deployed as a Lambda function, neptune-export will automatically copy the export files to an S3 bucket of your choosing. Optionally, it can also write a completion file to a separate S3 location (useful for triggering additional Lambda functions). You must configure your function with an IAM role that has write access to these S3 locations.

The Lambda function expects a number of parameters, which you can supply either as environment variables or via a JSON input parameter. Fields in the JSON input parameter override any environment variables you have set up.

| Environment Variable | JSON Field | Description | Required |
| --- | --- | --- | --- |
| COMMAND | command | neptune-export command and command-line options: e.g. export-pg -e <neptune_endpoint> | Mandatory |
| OUTPUT_S3_PATH | outputS3Path | S3 location to which exported files will be written | Mandatory |
| CONFIG_FILE_S3_PATH | configFileS3Path | S3 location of a JSON config file to be used when exporting a property graph from a config file | Optional |
| COMPLETION_FILE_S3_PATH | completionFileS3Path | S3 location to which a completion file should be written once all export files have been copied to S3 | Optional |
| SSE_KMS_KEY_ID | sseKmsKeyId | ID of the customer managed AWS-KMS symmetric encryption key to be used for server-side encryption when exporting to S3 | Optional |
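The precedence rule (JSON fields override environment variables) can be illustrated with a small Python sketch. The resolve_params function, the bucket paths, and the error handling are hypothetical; only the variable and field names come from the table above, and this is not how the Lambda function itself is implemented.

```python
import os

# Environment variable name -> JSON field name, as listed in the table.
PARAMS = {
    "COMMAND": "command",
    "OUTPUT_S3_PATH": "outputS3Path",
    "CONFIG_FILE_S3_PATH": "configFileS3Path",
    "COMPLETION_FILE_S3_PATH": "completionFileS3Path",
    "SSE_KMS_KEY_ID": "sseKmsKeyId",
}
MANDATORY = {"command", "outputS3Path"}


def resolve_params(event, environ=None):
    """Merge parameters, letting JSON input fields win over env variables."""
    environ = os.environ if environ is None else environ
    resolved = {}
    for env_name, field in PARAMS.items():
        value = event.get(field, environ.get(env_name))
        if value is not None:
            resolved[field] = value
    missing = MANDATORY - set(resolved)
    if missing:
        raise ValueError(f"missing mandatory parameters: {sorted(missing)}")
    return resolved


# Hypothetical values for illustration only.
env = {
    "COMMAND": "export-pg -e <neptune_endpoint>",
    "OUTPUT_S3_PATH": "s3://example-bucket/exports",
}
event = {"outputS3Path": "s3://example-bucket/override"}

params = resolve_params(event, env)
print(params)
```

Here command comes from the environment, while outputS3Path is taken from the JSON input because that field overrides the environment variable.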

Samples

AWS CDK Wrapper for Machine Learning

A CDK Wrapper around Neptune Export and Neptune ML CloudFormation stacks to run fake news detection jobs.