Serd parses prefixed IRIs that contain illegal Unicode characters

wouterbeek commented 7 years ago

Serd parses prefixed IRIs that contain illegal Unicode characters in their local name.

For example, the following Turtle snippet appears in an actual data file (notice that the dash in "2006–08" is the illegal Unicode character EN DASH (U+2013), not an ASCII hyphen):

@prefix dbp: <http://dbpedia.org/property/> .
@prefix dbr: <http://dbpedia.org/resource/> .
dbr:Germany_at_the_2006–08_European_Nations_Cup dbp:stadium     dbr:Amsterdam .

Serdi parses this snippet, but it should raise an error:

serdi unicode.ttl
<http://dbpedia.org/resource/Germany_at_the_2006\u201308_European_Nations_Cup> <http://dbpedia.org/property/stadium> <http://dbpedia.org/resource/Amsterdam> .

Tested with Serd 0.28.0.
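
To make the complaint concrete, here is a rough sketch of the check I would expect a strict parser to apply to the local part of a prefixed name. It is not Serd's code: the ranges are copied from the PN_CHARS_BASE, PN_CHARS_U and PN_CHARS productions of the Turtle grammar, and the function names are made up for illustration. It is deliberately simplified, since it also accepts ':' and '.' anywhere and ignores percent and backslash escapes as well as the rules for the first and last character.

```python
# Code point ranges from the Turtle grammar production PN_CHARS_BASE.
PN_CHARS_BASE = [
    (0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]

# Additions made by PN_CHARS_U ('_') and PN_CHARS ('-', digits, U+00B7, ...).
PN_CHARS_EXTRA = [
    (0x5F, 0x5F), (0x2D, 0x2D), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def is_pn_char(ch):
    """Return True if ch matches the PN_CHARS production (hypothetical helper)."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PN_CHARS_BASE + PN_CHARS_EXTRA)


def illegal_local_name_chars(local):
    """List (index, character) pairs in a local name that a strict parser
    should reject. Simplification: ':' and '.' are accepted at any position,
    and escape sequences are not handled at all."""
    return [
        (i, ch)
        for i, ch in enumerate(local)
        if not (is_pn_char(ch) or ch in ":.")
    ]


if __name__ == "__main__":
    # Local name from the snippet above; \u2013 is the EN DASH.
    name = "Germany_at_the_2006\u201308_European_Nations_Cup"
    for index, ch in illegal_local_name_chars(name):
        print(f"illegal character U+{ord(ch):04X} at index {index}")
```

Run on the local name from the snippet, this flags only the EN DASH. The ASCII hyphen (U+002D) would pass, because it is listed explicitly in PN_CHARS.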

wouterbeek commented 7 years ago

+1 for a distinction between strict and lax mode.

BTW, I do not see the rationale behind disallowing the Unicode dash in this position. What could be the rationale for this in the standard?

drobilla commented 7 years ago

Fixed in https://github.com/drobilla/serd/commit/1cd321825c52eddd4175cb4ec58ae8d7ad2da48d

Serd parses prefixed IRIs that contain illegal Unicode characters #5