PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.79k stars 212 forks source link

"Leaking" of `_expr_*` columns into result relations #4719

Open snth opened 1 month ago

snth commented 1 month ago

What happened?

prqlc produces SQL which uses _expr_* columns internally for internal intermediate calculation results. These should never end up in the resulting output relation.

The current situation is understandable since not all SQL dialects support the SELECT EXCLUDE _expr_* syntax and without schema information about the input relations, the compiler cannot explicitly enumerate the output columns.

However if we think about PRQL from first principles as "a language for Relational Algebra" then I hazard the guess that we all agree that such internal implementation details should not form part of the output.

As pointed out in https://github.com/PRQL/prql/issues/4633#issuecomment-2220485438, such extraneous columns can also cause errors in downstream calcuations.

One of my personal ambitions for PRQL is to extend it beyond the current SQL backends to also include other "relational" languages and query engines such as Pandas, Polars, (Power Query) M Language, etc... to name just a few. In those cases the _expr_* columns should also not show up and ideally PRQL should produce consistent results between different backends.

PRQL input

from [{a=1}]
derive a=2

SQL output

WITH table_0 AS (
  SELECT
    1 AS a
)
SELECT
  a AS _expr_0,
  2 AS a
FROM
  table_0

-- Generated by PRQL compiler version:0.12.2 (https://prql-lang.org)

Expected SQL output

WITH table_0 AS (
  SELECT
    1 AS a
)
SELECT
  2 AS a
FROM
  table_0

MVCE confirmation

Anything else?

I'm not sure what the solution would be so just raising this as a tracking issue and discussion point.

There are some more examples in https://github.com/PRQL/prql/issues/3130#issuecomment-2222335851.

aljazerzen commented 1 month ago

I think the current result is correct.

from [{a = 3}] has type [{a = int}] (array of tuples with a single field named a)

derive a = 2 will add a new field and because name a already exists in the tuple, it will remove that name, but not the whole field. What remains is this: [{int, a = int}] (an array of tuples with two fields: an unnamed int and an int named a).

We express that in SQL as SELECT ... AS _expr_0, ... AS a.

kgutwin commented 1 month ago

derive a = 2 will add a new field and because name a already exists in the tuple, it will remove that name, but not the whole field. What remains is this: [{int, a = int}] (an array of tuples with two fields: an unnamed int and an int named a).

This reasoning makes sense to me, but might be a bit unexpected to a user, as it seems the most obvious result of derive a = 2 in this case would be to overwrite the existing a field in the tuple. Is it necessary to keep the unnamed field around?

Alternatively, it also makes sense that unnamed fields could be (by default?) filtered out of relations at the end of pipelines.

aljazerzen commented 1 month ago

Yes, valid points. I don't remember when we decided on this behavior, but I was a discussed decision, not just something that happened accidentally.

Before opening up discussion about these questions, it would be wise to find the issue where we decided on current semantics of derive. Maybe @snth rememeber where was that?