SANSA-Stack / Archived-SANSA-Query

Problem with a SPARQL query containing DISTINCT and ORDER BY #35

Closed: jmaharramzade closed this issue 4 years ago

jmaharramzade commented 5 years ago

I'll demonstrate the problem using the Sparqlify example: https://github.com/SANSA-Stack/SANSA-Examples/blob/develop/sansa-examples-spark/src/main/scala/net/sansa_stack/examples/spark/query/Sparqlify.scala.

Run the Sparqlify class in server/endpoint mode, pointing to rdf.nt as the input (--input src/main/resources/rdf.nt).
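
A minimal launch sketch (hypothetical; only the --input flag is taken from this report, and the example's entry point may need further arguments to start the endpoint mode):

// Hypothetical direct invocation of the example's main method;
// adjust the arguments to whatever the Sparqlify example actually expects.
net.sansa_stack.examples.spark.query.Sparqlify.main(
  Array("--input", "src/main/resources/rdf.nt"))

Then execute the following query: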

SELECT DISTINCT ?x ?y WHERE {
    ?x <http://xmlns.com/foaf/0.1/givenName> ?y .
}
ORDER BY ?y

Observe the following error in the server console:

Exception in thread "Thread-31" java.lang.RuntimeException: java.lang.RuntimeException: org.apache.spark.sql.AnalysisException: cannot resolve '`a_1.o`' given input columns: [o, o_2, s, l, o_1]; line 4 pos 9;
'Sort ['a_1.o ASC NULLS FIRST, 'a_1.l ASC NULLS FIRST], true
+- Distinct
   +- Project [o#55 AS o#296, l#56 AS l#297, s#54 AS s#298, o#55 AS o_1#299, l#56 AS o_2#300]
      +- SubqueryAlias `a_1`
         +- SubqueryAlias `http://xmlns.com/foaf/0.1/givenname_xmlschema#string_lang`
            +- LogicalRDD [s#54, o#55, l#56], false

    at org.aksw.jena_sparql_api.web.utils.RunnableAsyncResponseSafe.run(RunnableAsyncResponseSafe.java:29)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.spark.sql.AnalysisException: cannot resolve '`a_1.o`' given input columns: [o, o_2, s, l, o_1]; line 4 pos 9;
'Sort ['a_1.o ASC NULLS FIRST, 'a_1.l ASC NULLS FIRST], true
+- Distinct
   +- Project [o#55 AS o#296, l#56 AS l#297, s#54 AS s#298, o#55 AS o_1#299, l#56 AS o_2#300]
      +- SubqueryAlias `a_1`
         +- SubqueryAlias `http://xmlns.com/foaf/0.1/givenname_xmlschema#string_lang`
            +- LogicalRDD [s#54, o#55, l#56], false

    at org.aksw.jena_sparql_api.web.servlets.SparqlEndpointBase$3.run(SparqlEndpointBase.java:352)
    at org.aksw.jena_sparql_api.web.utils.RunnableAsyncResponseSafe.run(RunnableAsyncResponseSafe.java:26)
    ... 1 more
Caused by: org.apache.spark.sql.AnalysisException: cannot resolve '`a_1.o`' given input columns: [o, o_2, s, l, o_1]; line 4 pos 9;
'Sort ['a_1.o ASC NULLS FIRST, 'a_1.l ASC NULLS FIRST], true
+- Distinct
   +- Project [o#55 AS o#296, l#56 AS l#297, s#54 AS s#298, o#55 AS o_1#299, l#56 AS o_2#300]
      +- SubqueryAlias `a_1`
         +- SubqueryAlias `http://xmlns.com/foaf/0.1/givenname_xmlschema#string_lang`
            +- LogicalRDD [s#54, o#55, l#56], false

    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:110)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:104)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$2.apply(QueryPlan.scala:121)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:121)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:93)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:107)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
    at net.sansa_stack.query.spark.sparqlify.QueryExecutionUtilsSpark.createQueryExecution(QueryExecutionUtilsSpark.java:23)
    at net.sansa_stack.query.spark.sparqlify.QueryExecutionSparqlifySpark.executeCoreSelect(QueryExecutionSparqlifySpark.java:38)
    at org.aksw.jena_sparql_api.core.QueryExecutionBaseSelect.execSelect(QueryExecutionBaseSelect.java:407)
    at org.aksw.jena_sparql_api.web.servlets.ProcessQuery.processQuery(ProcessQuery.java:117)
    at org.aksw.jena_sparql_api.web.servlets.ProcessQuery.processQuery(ProcessQuery.java:75)
    at org.aksw.jena_sparql_api.web.servlets.SparqlEndpointBase$3.run(SparqlEndpointBase.java:349)
    ... 2 more
Aklakan commented 4 years ago

The generated query has the following form, which looks valid:

SELECT DISTINCT `a_1`.`o` `o`, `a_1`.`l` `l`, `a_1`.`s` `s`, `a_1`.`o` `o_1`, `a_1`.`l` `o_2`
FROM
  `http://xmlns.com/foaf/0.1/name_XMLSchema#string_langsbn` `a_1`
ORDER BY `a_1`.`o`, `a_1`.`l`

Running it on PostgreSQL (with adjusted escaping) executes without error:

CREATE TABLE "http://xmlns.com/foaf/0.1/name_XMLSchema#string_langsbn" (
  "s" text,
  "o" text,
  "l" text
);

SELECT DISTINCT "a_1"."o" "o", "a_1"."l" "l", "a_1"."s" "s", "a_1"."o" "o_1", "a_1"."l" "o_2"
FROM
  "http://xmlns.com/foaf/0.1/name_XMLSchema#string_langsbn" "a_1"
ORDER BY "a_1"."o", "a_1"."l";

 o | l | s | o_1 | o_2 
---+---+---+-----+-----
(0 rows)

So I am afraid this is an issue with Catalyst: in the Spark plan snippet below, one can see that the sort operation was moved above the distinct; however, the column references were not adjusted to the 'foo#bar' notation and instead still use the aliases from the original query string.

org.apache.spark.sql.AnalysisException: cannot resolve '`a_1.o`' given input columns: [s, l, o_2, o_1, o]; line 4 pos 9;

'Sort ['a_1.o ASC NULLS FIRST, 'a_1.l ASC NULLS FIRST], true
+- Distinct
   +- Project [o#4 AS o#32, l#5 AS l#33, s#3 AS s#34, o#4 AS o_1#35, l#5 AS o_2#36]
      +- SubqueryAlias `a_1`
         +- SubqueryAlias `http://xmlns.com/foaf/0.1/name_xmlschema#string_langsbn`
            +- LogicalRDD [s#3, o#4, l#5], false
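
The behavior can be reproduced in plain Spark SQL without SANSA. A minimal sketch (the view name t, the sample rows, and the object name are made up for illustration; observed behavior corresponds to Spark 2.x, which the stack trace above comes from):

import org.apache.spark.sql.SparkSession

object DistinctOrderByRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-orderby-repro")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the generated triple table with columns (s, o, l).
    Seq(("s1", "Alice", "en"), ("s2", "Bob", "en"))
      .toDF("s", "o", "l")
      .createOrReplaceTempView("t")

    // DISTINCT plus ORDER BY on table-qualified columns: the Sort is
    // resolved against the output of Distinct, whose attributes carry
    // only the unqualified aliases, so `a_1.o` cannot be resolved and
    // the analyzer fails just like in the report above.
    spark.sql(
      """SELECT DISTINCT a_1.o AS o, a_1.l AS l
        |FROM t a_1
        |ORDER BY a_1.o, a_1.l""".stripMargin
    ).show()

    spark.stop()
  }
}

Dropping DISTINCT, or rewriting the ORDER BY to reference the projected aliases (ORDER BY o, l) rather than the qualified names, analyzes without error, which supports the reading that the generated SQL is sound and only Catalyst's resolution of qualified sort references above Distinct is at fault.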

For now I am closing this issue as won't-fix, as I think our generated SQL is correct but Catalyst gets it wrong.