aws / neptune-export

Special characters not being handled #70

Closed morisoinc closed 1 year ago

morisoinc commented 1 year ago

After exporting the data from a Neptune database using the following command:

java -jar neptune-export.jar nesvc \
  --root-path /home/ec2-user \
  --json '{
            "command": "export-pg",
            "outputS3Path" : "s3://(your Amazon S3 bucket)/neptune-export",
            "params": {
              "endpoint" : "(your neptune DB cluster endpoint)"
            }
          }'

The data was successfully stored in S3. However, nodes and edges whose property values contained special characters were not handled correctly: each special character was replaced with a question mark. For example: "Café" -> "Caf?".

Cole-Greer commented 1 year ago

Thanks for submitting this @morisoinc, I will look into it shortly and update with any findings.

Cole-Greer commented 1 year ago

@morisoinc Would you mind sharing which application you are using to read the exported CSV files? I've tried reproducing the issue and the special character é looks fine in my results. In my tests the results file appears to be properly encoded UTF-8.

morisoinc commented 1 year ago

i used this link to set up the export logic, and these are the specific steps i documented for our case:

Exporting data

now, when i check the .csv files that were exported to s3, the special characters there are not encoded correctly. this happens both when i download the files manually and open them in a text editor and when i use the import logic i have in place. it seems like the .csv files are already saved without proper encoding

Cole-Greer commented 1 year ago

It seems I'm still unable to reproduce this. I'm also running via EC2. This is my command:

[ec2-user@*** ~]$ java -jar neptune-export.jar nesvc   --root-path /home/ec2-user   --json '{
            "command": "export-pg",
            "outputS3Path" : "s3://***/neptune-export/",
            "params": {
              "endpoint" : "***.neptune.amazonaws.com"
            }
          }'

Afterwards I'm checking the files:

[ec2-user@*** ~]$ cat output/dc9e6527e02b47da841d041e6bdd712a/nodes/test-3.modified.csv    # check local output file
~id,~label,test:string
"ecc59d4e-4184-8917-98f4-a8c52f694765","test","Café"

[ec2-user@*** ~]$ file output/dc9e6527e02b47da841d041e6bdd712a/nodes/test-3.modified.csv
output/dc9e6527e02b47da841d041e6bdd712a/nodes/test-3.modified.csv: UTF-8 Unicode text

[ec2-user@*** ~]$ aws s3 cp s3://***/neptune-export/dc9e6527e02b47da841d041e6bdd712a/nodes/test-3.modified.csv ./
download: s3://***/neptune-export/dc9e6527e02b47da841d041e6bdd712a/nodes/test-3.modified.csv to ./test-3.modified.csv

[ec2-user@*** ~]$ cat test-3.modified.csv    # check output file from S3
~id,~label,test:string
"ecc59d4e-4184-8917-98f4-a8c52f694765","test","Café"

[ec2-user@*** ~]$ file test-3.modified.csv
test-3.modified.csv: UTF-8 Unicode text

Are you able to query your graph through some other means (graph notebook, gremlin-console...) to verify that the data is in fact correctly encoded within Neptune? I would like to confirm that your issue is on the export side of things and that there weren't any issues when you loaded the data into the graph. For what it's worth, I added my data through gremlin-console and copy-pasted "Café" from your original post here.

gremlin> g.addV("test").property("test", "Café")
==>v[ecc59d4e-4184-8917-98f4-a8c52f694765]

morisoinc commented 1 year ago

running cat output/... on a generated file shows its content with the wrong data, and when i ran file output/... on it i got this:

output/c0c92e77bea34f24b6f30643a8d6a376/nodes/place-13.modified.csv: ASCII text, with very long lines

it looks like this is the issue, it's saving as ASCII text instead of UTF-8. is it due to the file being large?

also, querying the graph db with gremlin queries as you suggested returns the data as it's supposed to be, with special characters

Cole-Greer commented 1 year ago

That's interesting, I would not expect your file to use a different encoding from mine. I will have to dig deeper to understand what's going on here. Would you mind sharing which version and distribution of java you are using? I see earlier you mentioned installing both openjdk 11 and 1.8.0. Also roughly how large is your output file?

morisoinc commented 1 year ago

that's odd, isn't it? i ran java -version and got this:

openjdk version "1.8.0_382"
OpenJDK Runtime Environment (build 1.8.0_382-b05)
OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)

let me know if there's anything else you need. if you test with these specific java/openjdk versions and are able to reproduce the issue, could you share the versions you were originally using so i can at least do the export on my side before the issue gets fixed? that would be really great for my team and me, as we're blocked from making big changes in our stack due to the neptune export issue.

also, thanks for all the investigation you've been doing on this in these past few days!

morisoinc commented 1 year ago

oh and the file sizes vary: the ones in question are 6 MB, 6.6 MB, 9.8 MB and 4.4 MB. i checked another one that has special characters and it's only 268.6 KB

Cole-Greer commented 1 year ago

Thanks, I will run some more tests with this, I'll let you know as soon as I find anything.

Cole-Greer commented 1 year ago

I'm still having some difficulty reproducing the issue. I want to ensure I am using the same version of Linux and Neptune Export as you are. This is what I am currently running in my test:

[ec2-user@*** ~]$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
SUPPORT_END="2028-03-01"

[ec2-user@*** ~]$ java -jar neptune-export.jar export-pg -e ***.neptune.amazonaws.com -d testout
neptune-export.jar: [buildVersion='1.0.7', buildTime='2023-09-27T23:19:22+0000', commitId='cfcc206364273423381d0572fa15007d88a492ff', commitTime='2023-09-27T23:16:27+0000']
...

morisoinc commented 1 year ago

this is what i have:

[ec2-user@*** ~]$ cat /etc/os-release 
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"

the other command returned this:

[ec2-user@*** ~]$ java -jar neptune-export.jar export-pg -e ***.neptune.amazonaws.com -d testout
neptune-export.jar: [buildVersion='1.0.7', buildTime='2023-09-27T23:19:22+0000', commitId='cfcc206364273423381d0572fa15007d88a492ff', commitTime='2023-09-27T23:16:27+0000']

i wonder if the ec2 instance details have any influence on the results... the one we're using was created with default settings in a cdk project we have.
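
One way to check whether the environment plays a role is to ask the JVM itself what it has picked up. The following is a minimal diagnostic sketch, not part of Neptune Export; the class name ShowEncoding is made up for illustration. On Linux the JVM typically derives its default charset from the locale environment, so the sketch prints LANG and LC_ALL next to the Java-side values.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of Neptune Export.
class ShowEncoding {
    // Collects the values that usually determine how a JVM encodes text output.
    static String describe() {
        return "java.version=" + System.getProperty("java.version")
            + " file.encoding=" + System.getProperty("file.encoding")
            + " defaultCharset=" + Charset.defaultCharset().name()
            + " LANG=" + System.getenv("LANG")       // prints "null" when unset
            + " LC_ALL=" + System.getenv("LC_ALL");  // prints "null" when unset
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it as the same user and in the same shell that launches the export should show whether two machines with the same OS and JDK still differ in their default charset.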

Cole-Greer commented 1 year ago

I think I have a solution for you. Neptune Export currently does not specify which encoding is used for the output files; it relies on Java's default charset. Any time I run it (even with the same OS and JDK as you), the default charset is UTF-8 and I cannot reproduce your issue. However, the default can be overridden with a system property.
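
The effect of the charset choice can be illustrated in isolation. The sketch below is not Neptune Export code; the class CharsetDemo and its roundTrip helper are hypothetical names. It encodes the same string once as US-ASCII and once as UTF-8: String.getBytes substitutes the charset's replacement byte for characters it cannot represent, which for US-ASCII is a question mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only illustrates the charset behavior.
class CharsetDemo {
    // Encodes text with the given charset, then decodes the bytes with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "Caf" followed by e-acute, written as an escape so the source file stays pure ASCII.
        String value = "Caf\u00e9";
        System.out.println("default charset: " + Charset.defaultCharset());
        // e-acute has no US-ASCII representation, so it is replaced during encoding.
        System.out.println("US-ASCII: " + roundTrip(value, StandardCharsets.US_ASCII));
        // UTF-8 can represent the character, so the string survives unchanged.
        System.out.println("UTF-8:    " + roundTrip(value, StandardCharsets.UTF_8));
    }
}
```

How the UTF-8 line appears on screen also depends on the terminal and on the encoding Java uses for standard output, so the in-memory comparison is the reliable part of this sketch.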

If I run this command, I get ASCII output with the ? instead of é:

java -Dfile.encoding=ASCII -jar neptune-export.jar nesvc ...

I expect that running this command will resolve your issue:

java -Dfile.encoding=UTF8 -jar neptune-export.jar nesvc ...

morisoinc commented 1 year ago

thank you SO MUCH! that solved the issue for us and i'm now finally able to export the data correctly. i appreciate your time and effort to find out about this issue

Cole-Greer commented 1 year ago

No worries, I'm glad that's working for you. Now that we understand what is happening here, I kind of like that Neptune Export follows the default character encoding of the system it's using. I think there is some value in that functionality as well as the ability to override it through JVM properties. Does it seem reasonable to you to maintain the current behavior (perhaps with a note added to the troubleshooting docs), or would you expect Neptune Export to always output as UTF-8 regardless of the system configuration?

morisoinc commented 1 year ago

i agree, neptune export becomes a "closed box" that behaves according to the environment it runs on. i think you should definitely add that information to the docs, as other users who installed java and created ec2 instances with the default settings will probably face this too. i don't really know much about encodings... but i know UTF-8 is kind of the standard everywhere, as it covers a large number of characters (if not all). wouldn't it be better for neptune export to default to it instead? and if the user would like to change it to something else (which is probably rare), they should explicitly declare that when running the command. what do you think? just a thought; as i mentioned, i don't understand a lot about encodings

Cole-Greer commented 1 year ago

As far as I know, Java applications cannot distinguish if a system property was set from the external environment or through a user explicitly setting a -D flag. The 2 choices seem to be to continue relying on Java to determine the file encoding or to replace it with an internal system. At this time, my preference is to continue to leverage Java to pick up the default encoding from the system. I will add a bit of documentation regarding this.
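
For comparison, the second option (an internal system) might look roughly like the sketch below. It is an assumption-laden illustration rather than how Neptune Export is implemented: the class ExplicitEncodingWriter and the property name export.encoding are invented here. The writer always uses UTF-8 unless a dedicated, application-specific property names another charset, so the result no longer depends on file.encoding or the locale.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; not taken from the Neptune Export code base.
class ExplicitEncodingWriter {
    // Invented property name, used only for this illustration.
    static final String ENCODING_PROPERTY = "export.encoding";

    // UTF-8 unless the application-specific property names another charset.
    static Charset outputCharset() {
        String name = System.getProperty(ENCODING_PROPERTY);
        return name == null ? StandardCharsets.UTF_8 : Charset.forName(name);
    }

    // Writes one line to the file using the explicitly chosen charset.
    static void writeLine(Path file, String line) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(file, outputCharset())) {
            writer.write(line);
            writer.newLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("export-demo", ".csv");
        // The value ends in e-acute, escaped so the source file stays pure ASCII.
        writeLine(file, "\"id-1\",\"test\",\"Caf\u00e9\"");
        System.out.println(Files.readAllBytes(file).length + " bytes written to " + file);
    }
}
```

Because the property is specific to the application, its presence can only mean that the user asked for a different encoding, which sidesteps the question of where file.encoding came from. One behavioral difference to keep in mind: a writer from Files.newBufferedWriter throws an IOException for text that cannot be encoded in the chosen charset instead of silently writing a question mark.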