awslabs / amazon-neptune-tools

Tools and utilities to enable loading data and building graph applications with Amazon Neptune.
Apache License 2.0

'illegal multibyte sequence' error on long list of numbers as string #13

Closed bramson closed 5 years ago

bramson commented 6 years ago

In my Neo4j database I have nodes with GIS coordinates stored as a property. These are long lists of pairs of ~13 digit numbers separated by commas. Neo4j stores these as strings, I believe in UTF-8 format.

I used the APOC command to export the graph in GraphML format, and now I'm trying to convert it to CSV for upload into Neptune using this script. I get an 'illegal multibyte sequence' error:

UnicodeDecodeError('cp932', b'5.4397040833, 133.328841248 35.439539551, 133.32875060...3757508892 35', 332, 333, 'illegal multibyte sequence')

Two odd things appear at the end: (1) the single quote after "35" and (2) the "332, 333", neither of which is in the data. I guess those are error codes, but I really have no idea. My first guess is that the real problem is that the list is simply too long, because the error occurs at the numbers rather than at the Asian script text, but I really don't know.

Any information on what could be generating this error and how to avoid it?
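
For reference, the two numbers are not error codes. A Python `UnicodeDecodeError` carries five fields in order: the codec name, the bytes being decoded, the start and end offsets of the offending bytes within that chunk, and a reason string. Read that way, "332, 333" means a single undecodable byte at offset 332 of the chunk shown, and the quote after "35" simply closes the abbreviated bytes literal. A minimal sketch of reading those fields, using a deliberately invalid UTF-8 byte rather than the reporter's data; `describe_decode_error` is a hypothetical helper name:

```python
def describe_decode_error(data: bytes, encoding: str):
    """Return the fields of the UnicodeDecodeError raised for data, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return {
            "encoding": err.encoding,   # codec that failed
            "start": err.start,         # offset of the first bad byte
            "end": err.end,             # offset just past the bad bytes
            "bad_bytes": err.object[err.start:err.end],
            "reason": err.reason,       # human-readable explanation
        }
    return None


# 0xFF can never appear in valid UTF-8, so decoding fails at offset 3.
info = describe_decode_error(b"abc\xff", "utf-8")
print(info)
```

The offsets are relative to the chunk handed to the decoder, not to the whole file, so they do not by themselves say where in the file the problem is.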

beebs-systap commented 6 years ago

@bramson Are you running Python2 or Python3? If not Python3, can you try it with Python3? Also, can you share a snippet of the data?

bramson commented 5 years ago

I'm running Python3. On the path to sharing a snippet of data, I found that the file reader gave the same error at the same location (322) unless I specified the encoding as 'utf-8' (which of course it would be). This error indicates that graphml2csv.py is using a different encoding (cp932). Is there a way to tell the program to use 'utf-8' instead?
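
One likely explanation, consistent with the above: when no encoding is passed, Python 3's `open()` decodes with the locale's preferred encoding, which is cp932 on a Japanese-locale Windows machine, rather than with UTF-8. A minimal sketch of checking the default and reading with an explicit encoding; the file name and sample text are made up for illustration:

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given (outside UTF-8 mode).
print("default text encoding:", locale.getpreferredencoding(False))

sample = "POLYGON ((133.3352475761 35.4405125122)) 鳥取県米子市\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.graphml")  # hypothetical file name

    # Write the text as UTF-8, matching the encoding the GraphML header declares.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)

    # Passing encoding= explicitly makes the read independent of the locale.
    with open(path, encoding="utf-8") as handle:
        text = handle.read()

print(text == sample)
```

If the script itself cannot be changed, Python 3.7+ also offers UTF-8 mode (`python -X utf8` or `PYTHONUTF8=1`), which makes UTF-8 the default for `open()`. Whether that is enough here depends on how graphml2csv.py opens its input, which is an assumption.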

One node looks like this, although most nodes have MUCH longer polygon sequences:

`['<?xml version="1.0" encoding="UTF-8"?>\n', '<graphml xmlns="http://graphml.graphdrawing.org/xmlns" ...

:Chome35.4403133.333463POLYGON ((133.3352475761 35.4405125122, 133.3348922205 35.4407121097, 133.3341474462 35.4411163299, 133.3329744207 35.4417679452, 133.3329141306 35.4417997427, 133.3326984842 35.4419134722, 133.3324592193 35.4420396585, 133.3312633591 35.4407340535, 133.3312010912 35.4407174098, 133.3311529674 35.4407053569, 133.3310064612 35.4405565062, 133.3311624414 35.4404062198, 133.331368021 35.4402252883, 133.3315258067 35.4401096988, 133.3316113037 35.4400432266, 133.3318180557 35.4399204015, 133.3319202046 35.4398711085, 133.3320201384 35.4398269728, 133.3322433885 35.4397336839, 133.3324769207 35.4396647043, 133.3326477087 35.4396246052, 133.3332773467 35.4395122145, 133.3342852757 35.4393424198, 133.3349709417 35.4392287212, 133.3356377488 35.438564666, 133.335844996 35.4392447685, 133.336034546 35.4398809941, 133.3360891393 35.4400732332, 133.3355122842 35.4403785024, 133.3352475761 35.4405125122))81312.3321276.13鳥取県米子市米原3丁目\n`

beebs-systap commented 5 years ago

@bramson I wasn't able to reproduce this one locally, but my hypothesis is that your source file is actually encoded in cp932 and the script is interpreting it as utf-8, which is causing the error.

Here is a branch that adds the ability to specify the input file encoding: https://github.com/awslabs/amazon-neptune-tools/tree/issue13/graphml2csv.

Can you try this with -e cp932 and see if it resolves the error?
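
For readers wondering what such an option involves, here is a rough sketch of an encoding flag wired into `open()`. This is not the code on the branch: apart from the `-e` short flag mentioned above, every name here (`--encoding`, `build_parser`, `read_text`, the `utf-8` default) is an assumption for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical command line: an input file plus an optional encoding."""
    parser = argparse.ArgumentParser(description="Read a GraphML file as text.")
    parser.add_argument("infile", help="path to the GraphML input file")
    parser.add_argument(
        "-e", "--encoding",
        default="utf-8",  # assumed default, not confirmed by the thread
        help="text encoding of the input file (for example utf-8 or cp932)",
    )
    return parser


def read_text(path: str, encoding: str) -> str:
    """Read the whole file with an explicit encoding instead of the locale default."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


# Parse a sample argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["graph.graphml", "-e", "cp932"])
print(args.infile, args.encoding)
```

Since the traceback names cp932 as the codec that failed, and the reporter found that reading the file as 'utf-8' works, `-e utf-8` may also be worth trying.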

beebs-systap commented 5 years ago

Closing for now. Please re-open if this is still an issue.