Wimmics / corese

Software platform implementing and extending the standards of the Semantic Web.
https://project.inria.fr/corese/

base keyword breaks hash based URIs #179

Open NicoRobertIn opened 3 months ago

NicoRobertIn commented 3 months ago

Issue Description:

The parser ruins the URI part passed to the @base keyword if this URI part is hash based

Bug Details:

When a URI ending with # is passed to the @base keyword in a Turtle file, part of that URI is lost during parsing, which breaks every URI in the graph that relies on this base.

Steps to Reproduce:

  1. Create a Turtle file using a base with a hash-based URI and add a triple with a URI using this base. For example:
@base <https://example.org/route/disappeared#> .

<BrokenURI> a owl:Class .
  2. Query the graph with a CONSTRUCT request that retrieves that URI, for example:
construct {?s ?p ?o} where {?s ?p ?o }

Expected Behavior:

The broken URI should be <https://example.org/route/disappeared#BrokenURI>

Actual Behavior:

The following turtle is returned

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ns1: <https://example.org/route/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

ns1:BrokenURI a owl:Class .

rdf:type a rdf:Property .

From this result we can conclude that the BrokenURI is now <https://example.org/route/BrokenURI>, which is different from <https://example.org/route/disappeared#BrokenURI>

Note to Developers:

This behaviour was tested on Corese python, Corese command and Corese GUI on different computers.

The same URI modification can also be seen with a simple select * where {?s ?p ?o} request


FabienGandon commented 3 months ago

I think the base URI must be an absolute URI, i.e. a URI with no fragment, hence no #.

frmichel commented 3 months ago

Indeed, RFC3986 says: "A base URI must conform to the <absolute-URI> syntax rule (Section 4.3)." Section 4.3 itself is not so easy to follow, but at least it says this: "defining a base URI for later use by relative references calls for an absolute-URI syntax rule that does not allow a fragment."

@NicoRobertIn, I think that a base URI is not like a prefix: a prefix just yields a URI by simple string concatenation, while the base URI is used to resolve relative URIs, and this is not only string concatenation.
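
To make the difference concrete, here is a minimal sketch in Python. It uses the standard library's `urllib.parse` as a stand-in for an RFC 3986 resolver; it is not Corese's parser, and the helper names `expand_prefix` and `resolve` are hypothetical, introduced only for this illustration.

```python
from urllib.parse import urldefrag, urljoin

HASH_BASE = "https://example.org/route/disappeared#"


def expand_prefix(namespace: str, local_name: str) -> str:
    """Prefix expansion: plain string concatenation."""
    return namespace + local_name


def resolve(base: str, reference: str) -> str:
    """Base resolution: drop any fragment from the base, then resolve
    the relative reference against what is left."""
    base_without_fragment, _fragment = urldefrag(base)
    return urljoin(base_without_fragment, reference)


# Used as a prefix, the '#' survives.
print(expand_prefix(HASH_BASE, "BrokenURI"))
# Used as a base, the fragment is dropped and the last path segment
# ("disappeared") is replaced by the relative reference.
print(resolve(HASH_BASE, "BrokenURI"))
# A fragment-only reference keeps the whole path of the base.
print(resolve("https://example.org/route/disappeared", "#BrokenURI"))
```

Under these assumptions, the second call gives the same URI that the issue reports as the actual behavior, which suggests the parser is applying relative reference resolution rather than concatenation.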

frmichel commented 3 months ago

Additional: to fix your problem, you should set

@base <https://example.org/route/disappeared> .
<#BrokenURI> a owl:Class .

MaillPierre commented 3 months ago

The Turtle syntax says that "@base" should be followed by an IRIREF. An IRIREF must have the following form:
'<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
which I translate as the regex:
<([^\x00\-\x20<>"{}|^`\\]|\X)*>
This regex validates <https://example.org/route/disappeared#BrokenURI>.

Unless my regex is wrong, the Turtle recommendation says that the base URL used in OP's file is correct.

frmichel commented 3 months ago

Hmmm.... indeed. But there's something weird. When I add '#' to the set of forbidden characters, the regex still matches:
<([^#\x00\-\x20<>"{}|^`\\]|\X)*>
How come?

MaillPierre commented 3 months ago

@frmichel good catch, I updated the regex by decomposing the UCHAR regex: https://regex101.com/r/05Bh3v/3
<([^\x00\-\x20<>"{}|^`\\]|(\\u|\\U)([0-9]|[A-F]|[a-f]))*>
It still validates <https://example.org/route/disappeared#BrokenURI>.
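
For anyone who wants to check this outside regex101, the sketch below is one possible Python translation of the IRIREF production. It is an independent rewrite, not the regex from the link above: it assumes the `#x00-#x20` in the grammar is a character range, and it spells out UCHAR as `\u` plus 4 hex digits or `\U` plus 8 hex digits, as the Turtle grammar defines it. The function name `is_iriref` is hypothetical.

```python
import re

HEX = "[0-9A-Fa-f]"
# UCHAR ::= '\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
UCHAR = r"\\u" + HEX + "{4}" + r"|\\U" + HEX + "{8}"
# IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
IRIREF = re.compile(r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r")*>")


def is_iriref(text: str) -> bool:
    """Return True if the whole string matches the IRIREF production."""
    return IRIREF.fullmatch(text) is not None


# '#' is not in the excluded character set, so both of these match.
print(is_iriref("<https://example.org/route/disappeared#>"))
print(is_iriref("<https://example.org/route/disappeared#BrokenURI>"))
# A space (#x20) is excluded, so this one does not.
print(is_iriref("<https://example.org/route/has space>"))
```

This only checks the surface syntax of the token. Whether an IRIREF with a fragment is acceptable as a base for resolution is the separate question raised by RFC3986.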