Hi @alxcnt
Thanks for reporting this. I've been able to reproduce, and have a fix.
However, as of 28th February, neptune-export has moved to a new repository, https://github.com/aws/neptune-export, so as to better facilitate its ongoing development independent of the other projects in amazon-neptune-tools. We'll be posting a deprecation notice in this project shortly.
The fix, therefore, is currently a PR in the new repository:
https://github.com/aws/neptune-export/pull/11
Regarding the file-system writes: it's not possible to turn off file-system writes completely. Results coming from Neptune to the Java Gremlin client, which the export tool uses, must be handled immediately: any delay in handling can cause the client-side Netty buffer to fill and the Java Gremlin client to crash. Writing directly to Kinesis exerts back-pressure, which can delay the handling of results and crash the client. Hence the tool writes immediately to disk, and from there to Kinesis (using the KinesisProducer library) at a rate Kinesis can sustain.
Hello @iansrobinson, I appreciate the prompt response, and thank you for the fix; I'll try it ASAP and let you know! Regarding the file output, thanks for confirming that it can't be turned off. While looking at the code myself I noticed that files are used as a buffering solution, but I had hoped it could be avoided somehow.
Hello @iansrobinson. When trying a build made from your PR branch I get the following error, which seems unrelated to the original problem, but it didn't appear in v1.11, so I assume it's related to changes introduced after that release.
java.lang.IllegalStateException: Unable to identify cluster ID from endpoints: [neptune endpoint-redacted]
at com.amazonaws.services.neptune.cluster.NeptuneClusterMetadata.createFromEndpoints(NeptuneClusterMetadata.java:84)
at com.amazonaws.services.neptune.cli.CommonConnectionModule.clusterMetadata(CommonConnectionModule.java:96)
at com.amazonaws.services.neptune.ExportPropertyGraphFromGremlinQueries.lambda$run$0(ExportPropertyGraphFromGremlinQueries.java:90)
at com.amazonaws.services.neptune.util.Timer.timedActivity(Timer.java:41)
at com.amazonaws.services.neptune.util.Timer.timedActivity(Timer.java:34)
at com.amazonaws.services.neptune.ExportPropertyGraphFromGremlinQueries.run(ExportPropertyGraphFromGremlinQueries.java:88)
at com.amazonaws.services.neptune.export.NeptuneExportRunner.run(NeptuneExportRunner.java:72)
at com.amazonaws.services.neptune.export.NeptuneExportService.execute(NeptuneExportService.java:235)
at com.amazonaws.services.neptune.export.NeptuneExportLambda.handleRequest(NeptuneExportLambda.java:173)
at com.amazonaws.services.neptune.RunNeptuneExportSvc.run(RunNeptuneExportSvc.java:60)
at com.amazonaws.services.neptune.export.NeptuneExportRunner.run(NeptuneExportRunner.java:72)
at com.amazonaws.services.neptune.NeptuneExportCli.main(NeptuneExportCli.java:48)
[main] WARN com.amazonaws.services.neptune.export.ExportToS3NeptuneExportEventHandler - Uploading results of failed export to S3
[main] WARN com.amazonaws.services.neptune.export.ExportToS3NeptuneExportEventHandler - S3 output path is empty
In our case the Neptune cluster is accessible over VPN via its cluster (and other) endpoints, while the export streams to a Kinesis stream created under a different AWS account (than the Neptune cluster), using credentials for that account.
How could we solve this using the latest version of the neptune-export tool with the fix applied?
Hi @alxcnt
You need to run the tool with credentials that can access the Neptune Management API for your Neptune cluster. If you're running it in a different account to your Neptune cluster, the tool won't be able to access any metadata about the cluster. At the moment there isn't a way to supply credentials from another account, though we can consider adding that as a feature.
Access to the Management API for inferring the clusterId has been necessary since the end of Nov 2022, when the clusterId inferencing was improved. The clone cluster feature, which is necessary for ensuring a static view of the data for databases undergoing a write workload, has needed access to the Management API for a couple of years.
The Management API client uses the default credentials provider chain. These are the locations in which it will look for the relevant credentials:
public DefaultAWSCredentialsProviderChain() {
    super(new EnvironmentVariableCredentialsProvider(),
          new SystemPropertiesCredentialsProvider(),
          WebIdentityTokenCredentialsProvider.create(),
          new ProfileCredentialsProvider(),
          new EC2ContainerCredentialsProviderWrapper());
}
Hello @iansrobinson, thanks for the clarification! My understanding is that if there are no writes, you wouldn't need to clone the cluster to get a consistent view of the data for the export (cloning the cluster was/is an optional parameter, as far as I recall). It would be great if we could specify two sets of credentials: one for accessing the cluster and one for writing to the stream/S3 bucket upload. This may be needed when operating under the same account but with separate roles for service access, or when using two different accounts. The latter is our current case; could you please consider adding this as a feature?
If you're running this from the command line, how would you expect to pass those credentials?
Perhaps the ability to configure a prefix for environment variables providing alternative credentials for accessing the Neptune cluster could be a solution, for example:
{
  "command": "export-pg-from-queries",
  "params": {
    "authVarsPrefix": "AWS_NEPTUNE_",
    ...
  }
}
That way you could specify an alternative profile, session tokens, and so on.
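A sketch of how such a prefix might be resolved against the standard AWS variable names (purely hypothetical; this feature does not exist in the tool, and the variable list is only illustrative):

import os

# Hypothetical prefix taken from the proposed authVarsPrefix parameter.
PREFIX = "AWS_NEPTUNE_"
STANDARD_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN", "AWS_PROFILE"]

# Build an environment for the Neptune Management API calls by letting any
# prefixed variable override its standard counterpart.
neptune_env = dict(os.environ)
for name in STANDARD_VARS:
    prefixed = PREFIX + name
    if prefixed in os.environ:
        neptune_env[name] = os.environ[prefixed]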
@alxcnt Thanks. I've opened an enhancement issue for this: https://github.com/aws/neptune-export/issues/12
Please comment there if you've anything else to add. Thanks.
Hello @iansrobinson, I've tried your fix by applying the patch to the version tagged amazon-neptune-tools-1.12, and I still observe the behaviour reported in the issue's description. Namely, when reading the stream (with boto3) the following error occurs (same as before):
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte
The exports/../*.json files have an incomplete line at the end.
[kpl-daemon-0003] INFO com.amazonaws.services.kinesis.producer.LogInputStreamReader - [2023-03-01 19:17:47.170641] [0x00010d31][0x00007000099e7000] [info] [processing_statistics_logger.cc:114] Stage 2 Triggers: { stream: 'your-export-stream-name', manual: 0, count: 0, size: 0, matches: 0, timed: 0, KinesisRecords: 0, PutRecords: 0 }
Hi @alxcnt
There was an additional fix after the version tagged 1.12: https://github.com/awslabs/amazon-neptune-tools/commit/2d281ea293165673a218a8f01a3d5d2ab76e5929
Files in the export should be deleted once the export completes: the fact that the tool doesn't terminate and you still have export files suggests that the issue is at least partly a result of this fix not having been applied.
Thanks, I'll include this commit when testing your fix and will let you know the results.
OK - testing with a highly concurrent export, and I see the unicode errors again. This needs a larger piece of diagnosis work, which we'll have to schedule.
Confirmed: after applying those 2 patches to 1.12, the files get deleted and the tool terminates on completion, but the unicode decoding issue remains, as you say. Would it be possible to share an ETA for when the fix will be available?
Hi @alxcnt
So, it turns out the unicode issue is not a bug, but a feature of the way in which data is written to a Kinesis Data Stream.
neptune-export uses the Kinesis Producer Library to write to Kinesis, and by default, records are aggregated before being written to the stream. This means that the client must deaggregate the records before processing them.
In Python you can use the record deaggregation module from the Kinesis Aggregation/Deaggregation Modules for Python to deaggregate records.
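As a minimal consumer sketch, assuming the aws_kinesis_agg package from that project (pip install aws_kinesis_agg); the stream name, shard ID, and iterator type below are placeholders for illustration, not values produced by neptune-export:

import base64
import boto3
from aws_kinesis_agg.deaggregator import iter_deaggregate_records

kinesis = boto3.client("kinesis")

# Placeholder stream/shard details; adjust for your environment.
shard_iterator = kinesis.get_shard_iterator(
    StreamName="your-export-stream-name",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=shard_iterator)

# iter_deaggregate_records expects Lambda-event-style records, so wrap the
# raw boto3 records (whose Data field is bytes) in that shape first.
wrapped = [
    {"kinesis": {"data": base64.b64encode(r["Data"]).decode("utf-8")}}
    for r in response["Records"]
]

for record in iter_deaggregate_records(wrapped):
    # Each deaggregated user record is returned base64-encoded; decode it to
    # recover the original UTF-8 payload written by the export.
    payload = base64.b64decode(record["kinesis"]["data"])
    print(payload.decode("utf-8"))

With aggregation disabled (see the PR below), each record's Data field should already be a plain UTF-8 payload and the deaggregation step can be skipped.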
I've also submitted a PR that adds a "disableStreamAggregation": true option to the command parameters: https://github.com/aws/neptune-export/pull/15
Hello @iansrobinson, thanks for tracking this down!
With the 3 patches on top of 1.12 I was able to stream the records and read them back using the simple Python consumer (I used the disableStreamAggregation option and didn't change the reading code).
This is awesome news; I'm looking forward to the PR being merged and to the separate cluster authorisation credentials option being supported. Thank you again for this.
We're using the neptune-export tool for graph data export and have encountered the following problem:
The rows in the resulting files look fine: one per line, with no unexpectedly terminated lines or garbage at the file's end (last line).
HOWEVER:
The data gets sent to the stream, but if you examine the files in the export dir that the Kinesis producer is tailing during the export, you'll sometimes see rows separated by '[' characters, and the last line will not be completely written, as if the output was abruptly terminated in the middle.
While reading from the Kinesis stream the data was written to, you'll encounter byte sequences that are not UTF-8 decodable, and you'll fail to read the data.
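For reference, a minimal sketch of the kind of naive read that hits this error (the consumer details here are assumptions for illustration, not the exact reporting setup): a plain boto3 get_records call that decodes each record's Data as UTF-8.

import boto3

kinesis = boto3.client("kinesis")

# Placeholder stream/shard details for illustration.
shard_iterator = kinesis.get_shard_iterator(
    StreamName="your-export-stream-name",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=shard_iterator)["Records"]:
    # Raises UnicodeDecodeError when the record is a KPL-aggregated blob
    # rather than a single UTF-8 payload.
    print(record["Data"].decode("utf-8"))

Deaggregating the records, or disabling aggregation in the export, avoids this, as covered earlier in the thread.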
Could you please take a look at this? It'd also be nice to have an option to turn off file-system writes completely (no buffering to the export dir, just streaming directly to Kinesis).