awslabs / amazon-neptune-tools

Tools and utilities to enable loading data and building graph applications with Amazon Neptune.
Apache License 2.0

Neptune export-pg generate edge files with missing headers #31

Closed jyoti-datadata closed 4 years ago

jyoti-datadata commented 5 years ago

I am trying to replicate a database from one AWS region to another, and I am using this utility to export the data from the master DB.

The utility runs fine, but when I try to load the files into the other Neptune DB using the Neptune bulk loader, the edge inserts fail with errors:

"errorCode" : "PARSING_ERROR", "errorMessage" : "Record has more columns than header", "fileName" : "s3://{edge file location}.csv", "recordNum" : 60

Steps to replicate issue:

  1. Export the data using bin/neptune-export.sh export-pg --log-level error -e {endpoint} -d ~/Downloads/

  2. Run bulk loader command for other DB:

    curl -X POST     -H 'Content-Type: application/json'     http://{endpoint}:8182/loader -d '
             {
                "source" : "s3://{csv files folder location}",
                "format" : "csv",
                "iamRoleArn" : "{iam role}",
                "region" : "us-west-2",
                "failOnError" : "FALSE"
              }'
  3. Check the status of the load:

     Output:
    
            {
                "fullUri" : "s3://{s3 bucket}/sync/edges/{filename}.csv",
                "runNumber" : 1,
                "retryNumber" : 2,
                "status" : "LOAD_FAILED",
                "totalTimeSpent" : 0,
                "startTime" : 1564090152,
                "totalRecords" : 80497,
                "totalDuplicates" : 0,
                "parsingErrors" : 80265,
                "datatypeMismatchErrors" : 0,
                "insertErrors" : 232
            }
        ],
        "errors" : {
            "startIndex" : 1,
            "endIndex" : 3,
            "loadId" : "{loadid}",
            "errorLogs" : [
                {
                    "errorCode" : "PARSING_ERROR",
                    "errorMessage" : "Record has more columns than header",
                    "fileName" : "{file location}",
                    "recordNum" : 60
                },
                {
                    "errorCode" : "PARSING_ERROR",
                    "errorMessage" : "Record has more columns than header",
                    "fileName" : "{file location}",
                    "recordNum" : 61
                },
                {
                    "errorCode" : "PARSING_ERROR",
                    "errorMessage" : "Record has more columns than header",
                    "fileName" : "{file location}",
                    "recordNum" : 62
                }
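
The status check in step 3 is not shown with a command, so here is a minimal Python sketch of how it could be scripted. It assumes the loader's Get-Status endpoint at `/loader/{loadId}` with the `details` and `errors` query parameters, and plain `http` on port 8182 as in the curl command above. The function names are hypothetical, and `count_error_codes` expects an `errors` object shaped like the one in the output above.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter


def loader_status_url(endpoint, load_id, port=8182):
    """Build the status URL for one bulk load job (assumed endpoint layout)."""
    query = urllib.parse.urlencode({"details": "true", "errors": "true"})
    return f"http://{endpoint}:{port}/loader/{urllib.parse.quote(load_id)}?{query}"


def fetch_loader_status(endpoint, load_id):
    """Fetch and decode the status response. Needs network access to the cluster."""
    with urllib.request.urlopen(loader_status_url(endpoint, load_id)) as resp:
        return json.load(resp)


def count_error_codes(errors):
    """Count the entries of an 'errors' object by their errorCode."""
    return Counter(entry["errorCode"] for entry in errors.get("errorLogs", []))
```

Counting by `errorCode` makes it easy to see at a glance whether a failed load is dominated by parsing errors, as it is here.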

Fixes tried:

  1. Provided the config file.
  2. I noticed hundreds of newline characters in the exported edge file. I removed them, but the load still failed.

Our sample edge data looks like this.

Headers: ~id,~label,~from,~to,createdBy:string,createdTimestamp:date,weight:double,updatedBy:string,endDate:date,updatedTimestamp:date

Data:

The export utility never created three of the headers in the header row: updatedBy:string,endDate:date,updatedTimestamp:date. I added them later to fix the issue. Not all rows of data have these values.

Size of data: there are 80323 edges in the data for this label.
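
To find the offending rows before uploading, a local check along these lines might help. This is only a sketch: it uses Python's standard `csv` module rather than the bulk loader's own parser, the function name is hypothetical, and the record numbering (1 for the first row after the header) is a convention chosen here that may not match the loader's `recordNum`.

```python
import csv
import io


def find_wide_records(lines, limit=10):
    """Return (header_width, [(record_number, column_count), ...]) for
    records that have more columns than the header row."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return 0, []  # empty input: nothing to check
    wide = []
    for record_num, row in enumerate(reader, start=1):
        if len(row) > len(header):
            wide.append((record_num, len(row)))
            if len(wide) >= limit:
                break  # a few examples are enough to diagnose the file
    return len(header), wide


sample = io.StringIO(
    "~id,~label,~from,~to,createdBy:string,createdTimestamp:date,weight:double\n"
    "e1,knows,v1,v2,alice,2019-07-01,0.5\n"
    "e2,knows,v2,v3,bob,2019-07-02,0.7,carol,2019-07-20,2019-07-21\n"
)
header_width, wide_records = find_wide_records(sample)
```

For a real export, pass a file opened with `open(path, newline="")` instead of the in-memory sample.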

iansrobinson commented 5 years ago

Thanks @gauravsinghh for reporting this. From your description, it sounds as though the tool isn't picking up the fact that some edges have more properties than others, resulting in too few headers. Is that correct?

I'll look to reproduce this over the next few days and put in a fix.

ian

jyoti-datadata commented 5 years ago

Hi @iansrobinson, you are correct.

Also, I am using release 1.0. I built the latest code as well and got the same issue.

iansrobinson commented 5 years ago

I've not been able to reproduce this issue, despite creating a dataset containing over 1 million edges, half of which have only a subset of properties.

For each edge (or vertex) with a particular label, the tool only outputs the property values that it knows about via its metadata collection. The fact that the additional properties are being output in individual rows, even though the headers are missing, suggests that the tool has generated the metadata for these properties.
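
To illustrate the idea, here is a minimal Python sketch of writing rows against the union of property keys seen for a label. It is not the tool's actual implementation: the in-memory edge structure and the `write_edges` name are hypothetical, chosen only to show why every property that appears in any row needs a header.

```python
import csv
import io

FIXED_COLUMNS = ["~id", "~label", "~from", "~to"]


def write_edges(edges, out):
    """Write edges as CSV, using the union of all property keys as headers."""
    # First pass: collect every property key, in first-seen order.
    property_keys = []
    for edge in edges:
        for key in edge["properties"]:
            if key not in property_keys:
                property_keys.append(key)

    fieldnames = FIXED_COLUMNS + property_keys
    # restval="" leaves the cell empty when an edge lacks a property.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    # Second pass: write each edge against the full header.
    for edge in edges:
        row = {name: edge[name] for name in FIXED_COLUMNS}
        row.update(edge["properties"])
        writer.writerow(row)
    return fieldnames


edges = [
    {"~id": "e1", "~label": "knows", "~from": "v1", "~to": "v2",
     "properties": {"createdBy:string": "alice"}},
    {"~id": "e2", "~label": "knows", "~from": "v2", "~to": "v3",
     "properties": {"createdBy:string": "bob", "updatedBy:string": "carol"}},
]
buffer = io.StringIO()
headers = write_edges(edges, buffer)
```

With headers built this way, an edge that has only a subset of properties gets empty cells instead of producing a row wider than the header.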

I have been able to reproduce a situation in which 3 property headers are written to the second line of the CSV file, and lots of extra newline characters inserted throughout the output, by generating a dataset containing a newline in both a property key and property value.

Given that you've seen lots of additional newlines, I wonder whether there are any additional newline characters in any of your dataset's edge keys or values, in particular the 'updatedBy' property? When the tool runs, it should generate a config.json file containing the metadata it has inferred for all labels (the location of this file is detailed in the output on the command line). Please would you share this config file, or at least review it to see a) whether those 3 properties are there, and b) whether any of them contains a newline character.
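
One way to review the file for check b) is a small script like the sketch below. It makes no assumption about the config.json schema: it simply walks whatever JSON it is given and reports every key or string value containing a newline or carriage return. The function name and the sample structure are hypothetical.

```python
import json


def find_newline_strings(node, path="$"):
    """Return (path, kind, text) for every key or string value in a decoded
    JSON document that contains a newline or carriage return."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            if "\n" in key or "\r" in key:
                hits.append((path, "key", key))
            hits.extend(find_newline_strings(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_newline_strings(item, f"{path}[{index}]"))
    elif isinstance(node, str):
        if "\n" in node or "\r" in node:
            hits.append((path, "value", node))
    return hits


# Made-up document for illustration; the real config.json layout may differ.
sample = json.loads(
    '{"edges": [{"label": "knows", "properties": '
    '[{"property": "updatedBy\\n"}, {"property": "endDate"}]}]}'
)
hits = find_newline_strings(sample)
```

For the real file, load it with `json.load(open(path))` and print each hit using `repr()` so the control characters are visible.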

Thanks

ian

On Sun, 28 Jul 2019 at 00:50, CAPITAL ONE SERVICES LLC < notifications@github.com> wrote:

Hi @iansrobinson https://github.com/iansrobinson, you are correct.

This is my official ID. I posted from my personal ID, @gauravsinghh https://github.com/gauravsinghh.


beebs-systap commented 4 years ago

Closing as unable to reproduce.