archesproject / arches

Arches is a web platform for creating, managing, & visualizing geospatial data. Arches was inspired by the needs of the Cultural Heritage community, particularly the widespread need of organizations to build & manage cultural heritage inventories
GNU Affero General Public License v3.0

JSON-LD Import and Export fails for unicode characters #5137

Closed: azaroth42 closed this issue 5 years ago

azaroth42 commented 5 years ago

Describe the bug

When either uploading or downloading content that contains Unicode characters, the JSON-LD code raises an unhandled exception. I can't tell what the exception is, due to #5116 :(

It is, however, trivial to reproduce:

To Reproduce

Tagging as high, as this is a blocker for any real data.

benosteen commented 5 years ago

There are several bugs that I've found:

Firstly, the API call invokes logger.exception() with no arguments, which fails because the call does not include a message to send to the logger. It does this twice, and in one case the call sits inside a catch-all exception handler, which is what masks all the other errors. I recommend expanding this handling, and removing the catch-all except Exception block entirely.

(https://github.com/archesproject/arches/blob/master/arches/app/views/api.py#L303)
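To illustrate the logging problem described above, here is a minimal standard-library sketch. It is not the Arches view code: the logger name and the parse_payload and handle functions are hypothetical stand-ins. It shows that logger.exception() called without a message raises a TypeError of its own, and what a narrower handler that logs and re-raises might look like.

```python
import logging

# Hypothetical logger; silenced so the sketch prints nothing.
logger = logging.getLogger("jsonld_api_sketch")
logger.addHandler(logging.NullHandler())
logger.propagate = False


def parse_payload(raw: bytes) -> str:
    """Hypothetical stand-in for the request handling in the API view."""
    return raw.decode("utf-8")


def call_without_message() -> str:
    """Reproduce the failure mode: logger.exception() needs a message."""
    try:
        raise ValueError("original problem")
    except Exception:
        try:
            logger.exception()  # missing the required 'msg' argument
        except TypeError as err:
            # The logging call itself fails, hiding the original error.
            return type(err).__name__
    return "no error"


def handle(raw: bytes) -> str:
    """Catch only the error we expect, log it with a message, re-raise."""
    try:
        return parse_payload(raw)
    except UnicodeDecodeError:
        logger.exception("Could not decode request body as UTF-8")
        raise
```

With a handler like this, any other exception propagates with its traceback intact instead of being swallowed by a catch-all block.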

The next part of the fix is simple and does correct the encoding in the RDFLib Graph object (it involves a few datatype edits).

This is what the serialization of the RDF Graph looks like, once the datatype fixes are in:

<http://localhost:8000/resources/281d85dc-c377-11e9-8d2f-0242ac170004> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E4_Period> .
<http://localhost:8000/resources/281d85dc-c377-11e9-8d2f-0242ac170004> <http://purl.org/dc/terms/relation> "Rottl\\u00E4nder, R." .

(NB: if sent to stdout by a logger command, the double backslash \\ will turn into a single \, e.g.)

>>> g.serialize(format='nt')
'<http://example.org/1> <http://www.w3.org/2000/01/rdf-schema#label> "Rottl\\u00E4nger" .\n\n'
>>> print(g.serialize(format='nt'))
<http://example.org/1> <http://www.w3.org/2000/01/rdf-schema#label> "Rottl\u00E4nger" .
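The difference between the two outputs above can be reproduced with plain Python strings, without RDFLib. This is a sketch under the assumption that the literal holds one backslash followed by u00E4, as N-Triples escaping would produce. The last line shows one possible way to turn such an escape back into the real character, which is only safe when the rest of the text is ASCII.

```python
# One literal backslash in the data; the source code needs "\\" to write it.
value = "Rottl\\u00E4nder"

# repr() is what the interactive prompt shows: the backslash is doubled.
shown_by_repr = repr(value)

# str() is what print() writes: the single backslash as stored.
shown_by_print = str(value)

# Assumption: the text is pure ASCII with \uXXXX escapes, so the
# unicode_escape codec can decode it back to the intended character.
decoded = value.encode("ascii").decode("unicode_escape")
```

So a doubled backslash at the prompt or inside a logged dict is only a display convention. The stored string still holds the escape sequence as text, not the character ä.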

And this is what it looks like after importing this into pyld via the from_rdf command:

logger.debug(js) <-- the pyld object

[{'http://purl.org/dc/terms/relation': [{'@value': 'Rottl\\u00E4nder, R.'}], '@id': 'http://localhost:8000/resources/281d85dc-c377-11e9-8d2f-0242ac170004', '@type': ['http://www.cidoc-crm.org/cidoc-crm/E4_Period']}]

Note the doubled backslash when this is logged to the command line. There is a line in pyld that is worrying me: https://github.com/digitalbazaar/pyld/blob/master/lib/pyld/jsonld.py#L2997 Its use of str could be problematic in this case.

azaroth42 commented 5 years ago

In PyLD, _is_string is just a wrapper for return isinstance(v, basestring) ... which is mapped to just str in 3.x ... which should catch any real string/unicode values, I think.
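As a rough check of that reasoning, here is a stand-alone sketch of such a test on Python 3. The function name is_string_sketch is hypothetical and this is not PyLD's actual code; it only mirrors the isinstance check described above.

```python
def is_string_sketch(v) -> bool:
    """Hypothetical mirror of an isinstance-based string check on Python 3."""
    return isinstance(v, str)


real_text = "Rottl\u00e4nder"          # contains the actual character
escaped_text = "Rottl\\u00E4nder"      # contains the escape as plain text
raw_bytes = real_text.encode("utf-8")  # bytes, not str

results = (
    is_string_sketch(real_text),
    is_string_sketch(escaped_text),
    is_string_sketch(raw_bytes),
)
```

Both text values pass the check, so a test of this kind accepts Unicode strings but cannot tell a decoded value from one that still carries an escape sequence.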

Can you put the fixes for the datatypes into a branch for testing?

Thanks Ben!

benosteen commented 5 years ago

https://github.com/archesproject/arches/pull/5181

benosteen commented 5 years ago

Should be resolved by the merge of #5181.