jmaharramzade closed 4 years ago
The generated query has the following form, which looks valid:
SELECT DISTINCT `a_1`.`o` `o`, `a_1`.`l` `l`, `a_1`.`s` `s`, `a_1`.`o` `o_1`, `a_1`.`l` `o_2`
FROM
`http://xmlns.com/foaf/0.1/name_XMLSchema#string_langsbn` `a_1`
ORDER BY `a_1`.`o`, `a_1`.`l`
Running it on Postgres (with adjusted escaping) against an empty table succeeds:
CREATE TABLE "http://xmlns.com/foaf/0.1/name_XMLSchema#string_langsbn" (
"s" text,
"o" text,
"l" text
);
SELECT DISTINCT "a_1"."o" "o", "a_1"."l" "l", "a_1"."s" "s", "a_1"."o" "o_1", "a_1"."l" "o_2"
FROM
"http://xmlns.com/foaf/0.1/name_XMLSchema#string_langsbn" "a_1"
ORDER BY "a_1"."o", "a_1"."l";
o | l | s | o_1 | o_2
---+---+---+-----+-----
(0 rows)
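The same check can be repeated without a Postgres instance. The following is only a minimal sketch: it assumes SQLite (through Python's standard `sqlite3` module) as a stand-in for Postgres, and the helper name `run_query` and the sample rows are made up for illustration. It shows that an ordinary SQL engine accepts table-qualified column references in `ORDER BY` together with `SELECT DISTINCT`.

```python
import sqlite3

# Table name taken from the issue; double quotes are the standard SQL identifier escape.
TABLE = "http://xmlns.com/foaf/0.1/name_XMLSchema#string_langsbn"

# The generated query from the issue, with backticks replaced by double quotes.
QUERY = f'''
SELECT DISTINCT "a_1"."o" "o", "a_1"."l" "l", "a_1"."s" "s", "a_1"."o" "o_1", "a_1"."l" "o_2"
FROM
"{TABLE}" "a_1"
ORDER BY "a_1"."o", "a_1"."l"
'''


def run_query(rows):
    """Create the table in an in-memory SQLite database (a stand-in for
    Postgres), insert the given (s, o, l) rows, and run the query.
    Returns the output column names and the result rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(f'CREATE TABLE "{TABLE}" ("s" text, "o" text, "l" text)')
        conn.executemany(
            f'INSERT INTO "{TABLE}" ("s", "o", "l") VALUES (?, ?, ?)', rows
        )
        cursor = conn.execute(QUERY)
        columns = [description[0] for description in cursor.description]
        return columns, cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical sample data; the third row duplicates the first.
    sample = [("s1", "Bob", "en"), ("s2", "Alice", "en"), ("s1", "Bob", "en")]
    print(run_query(sample))
```

With an empty table the query returns the five aliased columns and no rows, matching the Postgres output above. With the sample rows, the duplicate is dropped and the results are ordered by `o`, then `l`.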
So I am afraid this is an issue with Catalyst: in the Spark algebra snippet below, one can see that the sort operation was moved above the distinct; however, the column references were not adjusted to the 'foo#bar' notation and instead still use the aliases of the original query string.
org.apache.spark.sql.AnalysisException: cannot resolve '`a_1.o`' given input columns: [s, l, o_2, o_1, o]; line 4 pos 9;
'Sort ['a_1.o ASC NULLS FIRST, 'a_1.l ASC NULLS FIRST], true
+- Distinct
   +- Project [o#4 AS o#32, l#5 AS l#33, s#3 AS s#34, o#4 AS o_1#35, l#5 AS o_2#36]
      +- SubqueryAlias `a_1`
         +- SubqueryAlias `http://xmlns.com/foaf/0.1/name_xmlschema#string_langsbn`
            +- LogicalRDD [s#3, o#4, l#5], false
For now I am closing this issue as won't-fix, as I think our generated SQL is correct but Catalyst gets it wrong.
I'll demonstrate the problem using the Sparqlify example: https://github.com/SANSA-Stack/SANSA-Examples/blob/develop/sansa-examples-spark/src/main/scala/net/sansa_stack/examples/spark/query/Sparqlify.scala.
Run the Sparqlify class in server/endpoint mode with rdf.nt as input (--input src/main/resources/rdf.nt). Execute the following query:
Observe the error in the server console.