Open ricroberts opened 7 years ago
Rick commented elsewhere:
Ok probably a grafter issue to do with the regex preprocessing. Grafter should probably strip all comment lines before doing that processing.
I have found the cause of this issue. It's caused by some slightly naive logic in Sesame.
Nevertheless, the easy solution is to strip comments in Grafter, as Rick has suggested, rather than alter Sesame. I have a patch almost ready and am currently writing tests to provide confidence and cover edge cases.
Nice one @scottlowe,
Is the sesame issue fixed in RDF4j? Because if it is it might be worth postponing until we can use an RDF4j grafter build. If it's not fixed in RDF4j, we should file a bug report.
There are some edge cases to consider with #
occuring inside either a URI PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
which obviously should not strip characters following #
. Also similarly for strings e.g. "foo #blah"
should not be stripped.
Finally a complicating case is multiline strings containing where #
's should not be stripped though in practice queries like this are highly unlikely:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> # only strip this not the uri #
# strip
SELECT * WHERE { # strip
?s ?p "foo # dont strip
# dont strip
" . # strip .
}
There are some edge cases to consider with
#
My test case looks like this, with hashes in URIs and it works:
#This is the first comment line; it should be removed
# and this second line with a space after the hash should be removed
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> # <- uri with hash should be left alone, but this comment removed
PREFIX foi: <http://publishmydata.com/def/ontology/foi/>
SELECT * WHERE {
# This should also be removed
GRAPH ?g { #and this #dsf/># also
?s ?p ?o .
}
}
The result of applying the comment-stripper was:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foi: <http://publishmydata.com/def/ontology/foi/>
SELECT * WHERE {
GRAPH ?g {
?s ?p ?o .
}
}
I was quite pleased with the positive results, given the relatively simple regex:
(\s+#\s*[^\n]+)|(^#\s*[^\n]+)
Finally a complicating case is multiline strings containing where
#
's
I appreciate your thoroughness @RickMoynihan, especially given that I didn't consider multiline strings, but that might throw a spanner in the works.
My initial impression is that leaving multiline strings containing hashes intact isn't feasible with the current, simple regex/replace approach. I'll have a quick look into it, but covering this case may require parsing and tokenising (?), or else a thoroughly nightmarish and potentially unpredictable regex. Could this be expanding the scope further than is practicable without doing a lot more work?
Some options would be to:
RE: option 4: I will compose a follow-up comment with details of the Sesame issue. I didn't post this yesterday because it was late and too much to write at the time - wasn't intending to seem vague. Please stand by!
Is the sesame issue fixed in RDF4j?
Don't know! Investigating ...
Some background to the Sesame issue.
Although we believe the comments sometimes cause issues elsewhere within a SPARQL file, the following information specifically relates to a problem seen with SPARQL files that include a SELECT with multiline comments at the start of the file, e.g.
# First line of comments
# Second line of comments
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foi: <http://publishmydata.com/def/ontology/foi/>
SELECT * WHERE {
?s ?p ?o .
}
The error message resulting from this query was particularly odd, given that one would expect some sort of SPARQL parsing or syntax exception:
RepositoryException Unsupported Accept for query type org.openrdf.http.client.SparqlSession.execute (SparqlSession.java:1103)
Why would there be an HTTP error concerning the Accept mime format, given that nothing else had changed about the query, except for a couple of comment lines?
In Grafter, we call prepareQuery here:
(defn prepare-query
"Low level function to prepare (parse, but not process) a sesame RDF
query. Takes a repository a query string and an optional sesame
Dataset to act as a query restriction.
Prepared queries still need to be evaluated with evaluate."
([repo sparql-string] (prepare-query repo sparql-string nil))
([repo sparql-string restriction]
(let [conn (->connection repo)]
(let [pq (.prepareQuery conn
QueryLanguage/SPARQL
sparql-string)]
(when restriction (.setDataset pq restriction))
pq))))
For SELECT
queries, which return result sets of key-value tuples, Sesame determines that a TupleQuery
should be prepared. For CONSTRUCT
queries, Sesame will prepare a GraphQuery
object. When the query is made over HTTP, Sesame adds the correct Accept
headers relating to acceptable RDF mime types, for either a GraphQuery
or a TupleQuery
- this is where the problem lies.
But how does Sesame determine the correct query type object to prepare and instantiate?
First, inside prepareQuery
, Sesame calls the removeSPARQLQueryProlog
method which strips a leading line of comments if found, and then all the prefix declarations, followed by any trailing comment line after the now removed prefix declarations.
Unfortunately, this means that a comment line will be left at the top of the file, if there was more than one line of comments.
public static String removeSPARQLQueryProlog(String queryString) {
String normalizedQuery = queryString;
// strip all prefix declarations
Pattern pattern = Pattern.compile("prefix[^:]+:\\s*<[^>]*>\\s*", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(queryString);
int startIndexCorrection = 0;
while (matcher.find()) {
normalizedQuery = normalizedQuery.substring(matcher.end() - startIndexCorrection,
normalizedQuery.length());
startIndexCorrection += (matcher.end() - startIndexCorrection);
}
normalizedQuery = normalizedQuery.trim();
// strip leading comment before base (if present)
pattern = Pattern.compile("^#[^\n]+");
matcher = pattern.matcher(normalizedQuery);
if (matcher.find()) {
normalizedQuery = normalizedQuery.substring(matcher.end(), normalizedQuery.length());
normalizedQuery = normalizedQuery.trim();
}
// strip base declaration (if present)
pattern = Pattern.compile("^base\\s+<[^>]*>\\s*", Pattern.CASE_INSENSITIVE);
matcher = pattern.matcher(normalizedQuery);
if (matcher.find()) {
normalizedQuery = normalizedQuery.substring(matcher.end(), normalizedQuery.length());
}
// strip leading comment after base (if present)
pattern = Pattern.compile("^#[^\n]+");
matcher = pattern.matcher(normalizedQuery);
if (matcher.find()) {
normalizedQuery = normalizedQuery.substring(matcher.end(), normalizedQuery.length());
}
return normalizedQuery.trim();
}
}
What happens next is that Sesame simply looks to see if the remaining SPARQL string starts with a SELECT
(instantiate a TupleQuery
), or ASK
(BooleanQuery
), or else prepare the default GraphQuery
.
Because we still have a comment line left at the beginning of our SPARQL string, Sesame can't determine the query type and defaults to a GraphQuery
for our SELECT
, and that is how we end up with the wrong Accept HTTP headers.
public Query prepareQuery(QueryLanguage ql, String query, String base)
throws RepositoryException, MalformedQueryException
{
if (SPARQL.equals(ql)) {
String strippedQuery = QueryParserUtil.removeSPARQLQueryProlog(query).toUpperCase();
if (strippedQuery.startsWith("SELECT")) {
return prepareTupleQuery(ql, query, base);
}
else if (strippedQuery.startsWith("ASK")) {
return prepareBooleanQuery(ql, query, base);
}
else {
return prepareGraphQuery(ql, query, base);
}
}
throw new UnsupportedOperationException("Unsupported query language " + ql);
}
Thanks @scottlowe.
I've seen comments at the top of files cause problems, but I think I've also seen single line comments in the body of the query cause problems too (when there's a values
clause) - but sounds like you might have covered this off with your fix in https://github.com/Swirrl/grafter/issues/118#issuecomment-349654190 (?).
@RicSwirrl Yes, that's my experience too... comments occasionally cause issues elsewhere, but by stripping them all, we will have that secondary issue covered as well, as you've righty surmised.
Ok good explanation @scottlowe, and makes perfect sense about what's happening.
If you're looking at testing this with RDF4j you can check by using this branch/ns
https://github.com/Swirrl/grafter/blob/0.12.x-SNAPSHOT/src/grafter/rdf4j/sparql.clj
That branch should contain the most recent RDF4j with some grafter changes to support it...
Though IIRC the test suite is a broken currently on that branch, the code itself should work so you should be able to verify if the bug is still present easy enough.
@RickMoynihan Will do. I'll look at that 0.12.x version of Grafter. I don't need the tests; I can run some code in the REPL.
I've looked at the latest RDF4j code. It looks to be better structured overall and is architecturally different to Sesame, however, when you get into the details the Sesame logic relating to our issue persists:
The way the prolog is removed is substantially different though, via a lexer rather than regexes:
Whilst the effect of leaving a comment hanging is slightly different in this case, with the first comment line truncated differently to the Sesame version, the resulting bug is effectively the same though:
ed
# and this second line with a space after the hash
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foi: <http://publishmydata.com/def/ontology/foi/>
SELECT * WHERE {
GRAPH ?g {
?s ?p ?o .
}
}
Regardless of how the code looks, I've actually run this RDF4j code from within a Grafter REPL and it fails with the same error as expected, so it's still "broken".
🤔 Just wondering if there is a W3C rule somewhere saying that you can only have a single comment on the first line, before prefixes. Haven't found any evidence supporting that though, so I'll assume that we're right at this time.
I will submit a comment-stripping patch for consideration, shortly.
Some sparql query files file to interpolate properly. This query
failed with this error:
But when I got rid of the leading comment it worked OK.
Note: some other queries seem ok with comments. Maybe it's the VALUES clause in combination with a comment that is the problem(?).