Hi @alxcnt
Thanks for reporting this. I've been able to reproduce, and have a fix.
However, as of 28th February, neptune-export has moved to a new repository, https://github.com/aws/neptune-export, so as to better facilitate its ongoing development independent of the other projects in amazon-neptune-tools. We'll be posting a deprecation notice in this project shortly.
The fix, therefore, is currently a PR in the new repository:
https://github.com/aws/neptune-export/pull/11
Regarding the file-system writes: it's not possible to turn off file-system writes completely. Results coming from Neptune to the Java Gremlin client, which the export tool uses, must be handled immediately: any delay in handling can cause the client-side Netty buffer to fill and the Java Gremlin client to crash. Writing directly to Kinesis exerts back-pressure, which can delay the handling of results and crash the client. Hence the tool writes immediately to disk, and from there to Kinesis (using the KinesisProducer library) at a rate Kinesis can sustain.
Hello @iansrobinson, I appreciate the prompt response, and thank you for the fix; I'll try it ASAP and let you know! Regarding the file output, thanks for confirming that it can't be turned off. While looking at the code myself I noticed that files are used as a buffering solution, but I had hoped it could be avoided somehow.
Hello @iansrobinson. When trying a build made from your PR branch I get the following error, which seems unrelated to the original problem, but it didn't appear in v1.11, so I assume it's related to changes introduced after that release.
java.lang.IllegalStateException: Unable to identify cluster ID from endpoints: [neptune endpoint-redacted]
at com.amazonaws.services.neptune.cluster.NeptuneClusterMetadata.createFromEndpoints(NeptuneClusterMetadata.java:84)
at com.amazonaws.services.neptune.cli.CommonConnectionModule.clusterMetadata(CommonConnectionModule.java:96)
at com.amazonaws.services.neptune.ExportPropertyGraphFromGremlinQueries.lambda$run$0(ExportPropertyGraphFromGremlinQueries.java:90)
at com.amazonaws.services.neptune.util.Timer.timedActivity(Timer.java:41)
at com.amazonaws.services.neptune.util.Timer.timedActivity(Timer.java:34)
at com.amazonaws.services.neptune.ExportPropertyGraphFromGremlinQueries.run(ExportPropertyGraphFromGremlinQueries.java:88)
at com.amazonaws.services.neptune.export.NeptuneExportRunner.run(NeptuneExportRunner.java:72)
at com.amazonaws.services.neptune.export.NeptuneExportService.execute(NeptuneExportService.java:235)
at com.amazonaws.services.neptune.export.NeptuneExportLambda.handleRequest(NeptuneExportLambda.java:173)
at com.amazonaws.services.neptune.RunNeptuneExportSvc.run(RunNeptuneExportSvc.java:60)
at com.amazonaws.services.neptune.export.NeptuneExportRunner.run(NeptuneExportRunner.java:72)
at com.amazonaws.services.neptune.NeptuneExportCli.main(NeptuneExportCli.java:48)
[main] WARN com.amazonaws.services.neptune.export.ExportToS3NeptuneExportEventHandler - Uploading results of failed export to S3
[main] WARN com.amazonaws.services.neptune.export.ExportToS3NeptuneExportEventHandler - S3 output path is empty
In our case the Neptune cluster is accessible over VPN via its cluster (and other) endpoints, while the export streams to a Kinesis stream created under a different AWS account (than the Neptune cluster), using credentials for that account.
How could we solve this using the latest version of the neptune-export tool with the fix applied?
Hi @alxcnt
You need to run the tool with credentials that can access the Neptune Management API for your Neptune cluster. If you're running it in a different account to your Neptune cluster, the tool won't be able to access any metadata about the cluster. At the moment there isn't a way to supply credentials from another account, though we can consider adding that as a feature.
Access to the Management API for inferring the clusterId has been necessary since the end of Nov 2022, when the clusterId inferencing was improved. The clone cluster feature, which is necessary for ensuring a static view of the data for databases undergoing a write workload, has needed access to the Management API for a couple of years.
The Management API client uses the default credentials provider chain. These are the locations in which it will look for the relevant credentials:
public DefaultAWSCredentialsProviderChain() {
    super(new EnvironmentVariableCredentialsProvider(),
          new SystemPropertiesCredentialsProvider(),
          WebIdentityTokenCredentialsProvider.create(),
          new ProfileCredentialsProvider(),
          new EC2ContainerCredentialsProviderWrapper());
}
Hello @iansrobinson, thanks for the clarification! My understanding is that if there are no writes, you wouldn't need to clone the cluster to get a consistent view of the data for the export (cloning the cluster was/is an optional parameter, as far as I recall). It would be great if we could specify two sets of credentials: one for accessing the cluster and one for writing to the stream/S3 bucket upload. This may be needed when operating under the same account but with separate roles for service access, or when using two different accounts. The latter is our current case; could you please consider adding this as a feature?
If you're running this from the command line, how would you expect to pass those credentials?
Perhaps the ability to configure a prefix for environment variables providing alternative credentials for accessing the Neptune cluster could be a solution, for example:
{
  "command": "export-pg-from-queries",
  "params": {
    "authVarsPrefix": "AWS_NEPTUNE_",
    ...
  }
}
That way you could specify an alternative profile, session tokens, and so on.
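A sketch of how such a prefix might be resolved against the standard AWS variable names (purely hypothetical; this feature does not exist in the tool, and the variable list is only illustrative):

import os

# Hypothetical prefix taken from the proposed authVarsPrefix parameter.
PREFIX = "AWS_NEPTUNE_"
STANDARD_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN", "AWS_PROFILE"]

# Build an environment for the Neptune Management API calls by letting any
# prefixed variable override its standard counterpart.
neptune_env = dict(os.environ)
for name in STANDARD_VARS:
    prefixed = PREFIX + name
    if prefixed in os.environ:
        neptune_env[name] = os.environ[prefixed]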
@alxcnt Thanks. I've opened an enhancement issue for this: https://github.com/aws/neptune-export/issues/12
Please comment there if you've anything else to add. Thanks.
Hello @iansrobinson, I've tried your fix by applying the patch to the version tagged amazon-neptune-tools-1.12, and I still observe the behaviour reported in the issue's description. Namely, when reading the stream (with boto3) the following error occurs (same as before):
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte
The exports/../*.json files have an incomplete line at the end.
[kpl-daemon-0003] INFO com.amazonaws.services.kinesis.producer.LogInputStreamReader - [2023-03-01 19:17:47.170641] [0x00010d31][0x00007000099e7000] [info] [processing_statistics_logger.cc:114] Stage 2 Triggers: { stream: 'your-export-stream-name', manual: 0, count: 0, size: 0, matches: 0, timed: 0, KinesisRecords: 0, PutRecords: 0 }
Hi @alxcnt
There was an additional fix after the version tagged 1.12: https://github.com/awslabs/amazon-neptune-tools/commit/2d281ea293165673a218a8f01a3d5d2ab76e5929
Files in the export should be deleted once the export completes: the fact that the tool doesn't terminate and you still have export files suggests that the issue is at least partly a result of this fix not having been applied.
Thanks, I'll include this commit when testing your fix and will let you know the results.
OK - testing with a highly concurrent export, and I see the unicode errors again. This needs a larger piece of diagnosis work, which we'll have to schedule.
Confirmed: after applying those 2 patches to 1.12, the files get deleted and the tool terminates on completion, but the unicode decoding issue remains, as you say. Would it be possible to share an ETA for when the fix will be available?
Hi @alxcnt
So, it turns out the unicode issue is not a bug, but a feature of the way in which data is written to a Kinesis Data Stream.
neptune-export uses the Kinesis Producer Library to write to Kinesis, and by default, records are aggregated before being written to the stream. This means that the client must deaggregate the records before processing them.
In Python you can use the record deaggregation module from the Kinesis Aggregation/Deaggregation Modules for Python to deaggregate records.
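As a minimal consumer sketch, assuming the aws_kinesis_agg package from that project (pip install aws_kinesis_agg); the stream name, shard ID, and iterator type below are placeholders for illustration, not values produced by neptune-export:

import base64
import boto3
from aws_kinesis_agg.deaggregator import iter_deaggregate_records

kinesis = boto3.client("kinesis")

# Placeholder stream/shard details; adjust for your environment.
shard_iterator = kinesis.get_shard_iterator(
    StreamName="your-export-stream-name",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=shard_iterator)

# iter_deaggregate_records expects Lambda-event-style records, so wrap the
# raw boto3 records (whose Data field is bytes) in that shape first.
wrapped = [
    {"kinesis": {"data": base64.b64encode(r["Data"]).decode("utf-8")}}
    for r in response["Records"]
]

for record in iter_deaggregate_records(wrapped):
    # Each deaggregated user record is returned base64-encoded; decode it to
    # recover the original UTF-8 payload written by the export.
    payload = base64.b64decode(record["kinesis"]["data"])
    print(payload.decode("utf-8"))

With aggregation disabled (see the PR below), each record's Data field should already be a plain UTF-8 payload and the deaggregation step can be skipped.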
I've also submitted a PR that adds a "disableStreamAggregation": true option to the command parameters: https://github.com/aws/neptune-export/pull/15
Hello @iansrobinson, thanks for tracking this down!
With the 3 patches on top of 1.12 I was able to stream the records and read them back using the simple Python consumer (I used the disableStreamAggregation option and didn't change the reading code).
This is awesome news; I'm looking forward to the PR being merged and to the separate cluster authorisation credentials option being supported. Thank you again for this.
We're using the neptune-export tool for graph data export and have encountered the following problem:
The rows in the resulting files look fine: one per line, with no unexpectedly terminated lines or garbage at the file's end (last line).
HOWEVER:
The data gets sent to the stream, but if you examine the files in the export dir that the Kinesis producer is tailing during the export, you'll sometimes see rows separated by '[' characters, and the last line will not be completely written, as if the output was abruptly terminated in the middle.
While reading from the Kinesis stream the data was written to, you'll encounter byte sequences that are not UTF-8 decodable, and you'll fail to read the data.
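For reference, a minimal sketch of the kind of naive read that hits this error (the consumer details here are assumptions for illustration, not the exact reporting setup): a plain boto3 get_records call that decodes each record's Data as UTF-8.

import boto3

kinesis = boto3.client("kinesis")

# Placeholder stream/shard details for illustration.
shard_iterator = kinesis.get_shard_iterator(
    StreamName="your-export-stream-name",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=shard_iterator)["Records"]:
    # Raises UnicodeDecodeError when the record is a KPL-aggregated blob
    # rather than a single UTF-8 payload.
    print(record["Data"].decode("utf-8"))

Deaggregating the records, or disabling aggregation in the export, avoids this, as covered earlier in the thread.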
Could you please take a look at this? It'd also be nice to have an option to turn off file-system writes completely (no buffering to the export dir, just streaming directly to Kinesis).