drobilla / serd

A lightweight C library for RDF syntax
https://gitlab.com/drobilla/serd
ISC License

Serd parses prefixed IRIs that contain illegal Unicode characters #5

Closed: wouterbeek closed this issue 6 years ago

wouterbeek commented 6 years ago

Serd parses prefixed IRIs that contain illegal Unicode characters in their local name.

For example, the following Turtle snippet appears in an actual data file. Note that the dash in 2006–08 is not an ASCII hyphen but EN DASH (U+2013), which is not allowed in a local name:

@prefix dbp: <http://dbpedia.org/property/> .
@prefix dbr: <http://dbpedia.org/resource/> .
dbr:Germany_at_the_2006–08_European_Nations_Cup dbp:stadium     dbr:Amsterdam .

Serdi parses this snippet, but it should raise an error:

serdi unicode.ttl
<http://dbpedia.org/resource/Germany_at_the_2006\u201308_European_Nations_Cup> <http://dbpedia.org/property/stadium> <http://dbpedia.org/resource/Amsterdam> .

Tested with Serd 0.28.0.
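
For reference, the character check involved here can be sketched in C. This is a minimal sketch based on the PN_CHARS_BASE and PN_CHARS productions of the Turtle grammar, not Serd's actual implementation; the names Range, is_pn_chars_base, and is_pn_chars are hypothetical. It tests single code points only and leaves out the other parts of PN_LOCAL (':', '.', percent-encoded and backslash-escaped sequences, and the stricter rule for the first character).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of Unicode code points (hypothetical helper type). */
typedef struct {
  uint32_t lo;
  uint32_t hi;
} Range;

/* Ranges of the PN_CHARS_BASE production in the Turtle grammar. */
static const Range pn_chars_base[] = {
  {'A', 'Z'},         {'a', 'z'},         {0x00C0, 0x00D6},
  {0x00D8, 0x00F6},   {0x00F8, 0x02FF},   {0x0370, 0x037D},
  {0x037F, 0x1FFF},   {0x200C, 0x200D},   {0x2070, 0x218F},
  {0x2C00, 0x2FEF},   {0x3001, 0xD7FF},   {0xF900, 0xFDCF},
  {0xFDF0, 0xFFFD},   {0x10000, 0xEFFFF}
};

/* Return true if code point c matches PN_CHARS_BASE. */
static bool
is_pn_chars_base(uint32_t c)
{
  const size_t n = sizeof(pn_chars_base) / sizeof(pn_chars_base[0]);
  for (size_t i = 0; i < n; ++i) {
    if (c >= pn_chars_base[i].lo && c <= pn_chars_base[i].hi) {
      return true;
    }
  }
  return false;
}

/* Return true if code point c matches PN_CHARS, that is:
   PN_CHARS_BASE | '_' | '-' | [0-9] | U+00B7
   | [U+0300-U+036F] | [U+203F-U+2040]. */
static bool
is_pn_chars(uint32_t c)
{
  return is_pn_chars_base(c) || c == '_' || c == '-' ||
         (c >= '0' && c <= '9') || c == 0x00B7 ||
         (c >= 0x0300 && c <= 0x036F) || (c >= 0x203F && c <= 0x2040);
}
```

With these ranges, is_pn_chars(0x2013) is false because U+2013 falls in the gap between U+200D and U+203F, while is_pn_chars('_') and is_pn_chars('-') are both true. A strict parser would therefore stop at the EN DASH when reading the local name of dbr:Germany_at_the_2006–08_European_Nations_Cup.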

wouterbeek commented 6 years ago

+1 for a distinction between strict and lax mode.

BTW I do not see the rationale for disallowing the Unicode dash in this position. What could be the reasoning behind this in the standard?

drobilla commented 6 years ago

Fixed in https://github.com/drobilla/serd/commit/1cd321825c52eddd4175cb4ec58ae8d7ad2da48d