Swirrl / grafter

Linked Data & RDF Manufacturing Tools in Clojure
Eclipse Public License 1.0
188 stars 17 forks source link

VALUES Sparql interpolation with comments fails #118

Open RicSwirrl opened 6 years ago

RicSwirrl commented 6 years ago

Some sparql query files file to interpolate properly. This query

# Given a list of collection_uris, gets features in those collections
# which are parents of the passed feature_uri

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foi: <http://publishmydata.com/def/ontology/foi/>

SELECT ?ancestor_uri ?collection_uri WHERE {

   VALUES ?collection_uri {}

   ?ancestor_uri foi:memberOf ?collection_uri .
   ?feature_uri foi:within ?ancestor_uri .
}

failed with this error:

RepositoryException org.openrdf.http.client.SparqlSession.execute (SparqlSession.java:1103)

But when I got rid of the leading comment it worked OK.

Note: some other queries seem ok with comments. Maybe it's the VALUES clause in combination with a comment that is the problem(?).

scottlowe commented 6 years ago

Rick commented elsewhere:

Ok probably a grafter issue to do with the regex preprocessing. Grafter should probably strip all comment lines before doing that processing.

I have found the cause of this issue. It's caused by some slightly naive logic in Sesame.

Nevertheless, the easy solution is to strip comments in Grafter, as Rick has suggested, rather than alter Sesame. I have a patch almost ready and am currently writing tests to provide confidence and cover edge cases.

RickMoynihan commented 6 years ago

Nice one @scottlowe,

Is the sesame issue fixed in RDF4j? Because if it is it might be worth postponing until we can use an RDF4j grafter build. If it's not fixed in RDF4j, we should file a bug report.

There are some edge cases to consider with # occuring inside either a URI PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> which obviously should not strip characters following #. Also similarly for strings e.g. "foo #blah" should not be stripped.

Finally a complicating case is multiline strings containing where #'s should not be stripped though in practice queries like this are highly unlikely:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> # only strip this not the uri #

# strip
SELECT * WHERE { # strip
  ?s ?p "foo # dont strip
# dont strip
" . # strip .
}
scottlowe commented 6 years ago

There are some edge cases to consider with #

My test case looks like this, with hashes in URIs and it works:

#This is the first comment line; it should be removed
# and this second line with a space after the hash should be removed

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> # <- uri with hash should be left alone, but this comment removed
PREFIX foi: <http://publishmydata.com/def/ontology/foi/>

SELECT * WHERE {
  # This should also be removed
  GRAPH ?g { #and this #dsf/># also
    ?s ?p ?o .
  }
}

The result of applying the comment-stripper was:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foi: <http://publishmydata.com/def/ontology/foi/>

SELECT * WHERE {
  GRAPH ?g {
    ?s ?p ?o .
  }
}

I was quite pleased with the positive results, given the relatively simple regex:

(\s+#\s*[^\n]+)|(^#\s*[^\n]+)

However

Finally a complicating case is multiline strings containing where #'s

I appreciate your thoroughness @RickMoynihan, especially given that I didn't consider multiline strings, but that might throw a spanner in the works.

My initial impression is that leaving multiline strings containing hashes intact isn't feasible with the current, simple regex/replace approach. I'll have a quick look into it, but covering this case may require parsing and tokenising (?), or else a thoroughly nightmarish and potentially unpredictable regex. Could this be expanding the scope further than is practicable without doing a lot more work?

Some options would be to:

  1. Simply ignore the complicated multiline string edge-case
  2. Strip comments using the simple regex approach in our client app and leave Grafter alone, but that wouldn't help Grafter users.
  3. Allow a per-query option to be passed to Grafter that disables comment stripping for gnarly edge-case multiline string queries.
  4. Do nothing for now, and avoid multiline comments at the beginning of SPARQL files

RE: option 4: I will compose a follow-up comment with details of the Sesame issue. I didn't post this yesterday because it was late and too much to write at the time - wasn't intending to seem vague. Please stand by!

Is the sesame issue fixed in RDF4j?

Don't know! Investigating ...

scottlowe commented 6 years ago

Some background to the Sesame issue.

Although we believe the comments sometimes cause issues elsewhere within a SPARQL file, the following information specifically relates to a problem seen with SPARQL files that include a SELECT with multiline comments at the start of the file, e.g.

# First line of comments
# Second line of comments

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foi: <http://publishmydata.com/def/ontology/foi/>

SELECT * WHERE {
   ?s ?p ?o .
}

The error message resulting from this query was particularly odd, given that one would expect some sort of SPARQL parsing or syntax exception:

RepositoryException Unsupported Accept for query type org.openrdf.http.client.SparqlSession.execute (SparqlSession.java:1103)

Why would there be an HTTP error concerning the Accept mime format, given that nothing else had changed about the query, except for a couple of comment lines?

In Grafter, we call prepareQuery here:

(defn prepare-query
  "Low level function to prepare (parse, but not process) a sesame RDF
  query.  Takes a repository a query string and an optional sesame
  Dataset to act as a query restriction.

  Prepared queries still need to be evaluated with evaluate."
  ([repo sparql-string] (prepare-query repo sparql-string nil))
  ([repo sparql-string restriction]
     (let [conn (->connection repo)]
       (let [pq (.prepareQuery conn
                               QueryLanguage/SPARQL
                               sparql-string)]

         (when restriction (.setDataset pq restriction))
         pq))))

For SELECT queries, which return result sets of key-value tuples, Sesame determines that a TupleQuery should be prepared. For CONSTRUCT queries, Sesame will prepare a GraphQuery object. When the query is made over HTTP, Sesame adds the correct Accept headers relating to acceptable RDF mime types, for either a GraphQuery or a TupleQuery - this is where the problem lies.

But how does Sesame determine the correct query type object to prepare and instantiate?

First, inside prepareQuery, Sesame calls the removeSPARQLQueryProlog method which strips a leading line of comments if found, and then all the prefix declarations, followed by any trailing comment line after the now removed prefix declarations.

Unfortunately, this means that a comment line will be left at the top of the file, if there was more than one line of comments.

public static String removeSPARQLQueryProlog(String queryString) {
   String normalizedQuery = queryString;

   // strip all prefix declarations
   Pattern pattern = Pattern.compile("prefix[^:]+:\\s*<[^>]*>\\s*", Pattern.CASE_INSENSITIVE);
   Matcher matcher = pattern.matcher(queryString);

   int startIndexCorrection = 0;
   while (matcher.find()) {
     normalizedQuery = normalizedQuery.substring(matcher.end() - startIndexCorrection,
         normalizedQuery.length());
     startIndexCorrection += (matcher.end() - startIndexCorrection);
   }

   normalizedQuery = normalizedQuery.trim();

   // strip leading comment before base (if present)
   pattern = Pattern.compile("^#[^\n]+");
   matcher = pattern.matcher(normalizedQuery);
   if (matcher.find()) {
     normalizedQuery = normalizedQuery.substring(matcher.end(), normalizedQuery.length());
     normalizedQuery = normalizedQuery.trim();
   }

   // strip base declaration (if present)
   pattern = Pattern.compile("^base\\s+<[^>]*>\\s*", Pattern.CASE_INSENSITIVE);
   matcher = pattern.matcher(normalizedQuery);
   if (matcher.find()) {
     normalizedQuery = normalizedQuery.substring(matcher.end(), normalizedQuery.length());
   }

   // strip leading comment after base (if present)
   pattern = Pattern.compile("^#[^\n]+");
   matcher = pattern.matcher(normalizedQuery);
   if (matcher.find()) {
     normalizedQuery = normalizedQuery.substring(matcher.end(), normalizedQuery.length());
   }

   return normalizedQuery.trim();
 }
}

What happens next is that Sesame simply looks to see if the remaining SPARQL string starts with a SELECT (instantiate a TupleQuery), or ASK (BooleanQuery), or else prepare the default GraphQuery.

Because we still have a comment line left at the beginning of our SPARQL string, Sesame can't determine the query type and defaults to a GraphQuery for our SELECT, and that is how we end up with the wrong Accept HTTP headers.

public Query  prepareQuery(QueryLanguage ql, String query, String base)
  throws RepositoryException, MalformedQueryException
{
  if (SPARQL.equals(ql)) {
    String strippedQuery = QueryParserUtil.removeSPARQLQueryProlog(query).toUpperCase();
    if (strippedQuery.startsWith("SELECT")) {
      return prepareTupleQuery(ql, query, base);
    }
    else if (strippedQuery.startsWith("ASK")) {
      return prepareBooleanQuery(ql, query, base);
    }
    else {
      return prepareGraphQuery(ql, query, base);
    }
  }
  throw new UnsupportedOperationException("Unsupported query language " + ql);
}
RicSwirrl commented 6 years ago

Thanks @scottlowe.

I've seen comments at the top of files cause problems, but I think I've also seen single line comments in the body of the query cause problems too (when there's a values clause) - but sounds like you might have covered this off with your fix in https://github.com/Swirrl/grafter/issues/118#issuecomment-349654190 (?).

scottlowe commented 6 years ago

@RicSwirrl Yes, that's my experience too... comments occasionally cause issues elsewhere, but by stripping them all, we will have that secondary issue covered as well, as you've righty surmised.

RickMoynihan commented 6 years ago

Ok good explanation @scottlowe, and makes perfect sense about what's happening.

If you're looking at testing this with RDF4j you can check by using this branch/ns

https://github.com/Swirrl/grafter/blob/0.12.x-SNAPSHOT/src/grafter/rdf4j/sparql.clj

That branch should contain the most recent RDF4j with some grafter changes to support it...

Though IIRC the test suite is a broken currently on that branch, the code itself should work so you should be able to verify if the bug is still present easy enough.

scottlowe commented 6 years ago

@RickMoynihan Will do. I'll look at that 0.12.x version of Grafter. I don't need the tests; I can run some code in the REPL.

scottlowe commented 6 years ago

I've looked at the latest RDF4j code. It looks to be better structured overall and is architecturally different to Sesame, however, when you get into the details the Sesame logic relating to our issue persists:

https://github.com/eclipse/rdf4j/blob/f654b03707657ded963479dc9c8c697458de6e70/core/repository/sparql/src/main/java/org/eclipse/rdf4j/repository/sparql/SPARQLConnection.java#L383-L399

The way the prolog is removed is substantially different though, via a lexer rather than regexes:

https://github.com/eclipse/rdf4j/blob/a283e758fb79e7b62e94b669547e6e64c64106cf/core/queryparser/api/src/main/java/org/eclipse/rdf4j/query/parser/QueryPrologLexer.java#L154-L189

Whilst the effect of leaving a comment hanging is slightly different in this case, with the first comment line truncated differently to the Sesame version, the resulting bug is effectively the same though:

ed
# and this second line with a space after the hash

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foi: <http://publishmydata.com/def/ontology/foi/>

SELECT * WHERE {
  GRAPH ?g {
    ?s ?p ?o .
  }
}

Regardless of how the code looks, I've actually run this RDF4j code from within a Grafter REPL and it fails with the same error as expected, so it's still "broken".

🤔 Just wondering if there is a W3C rule somewhere saying that you can only have a single comment on the first line, before prefixes. Haven't found any evidence supporting that though, so I'll assume that we're right at this time.

I will submit a comment-stripping patch for consideration, shortly.