PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.9k stars 217 forks source link

Fix Quoting issue for BigQuery #852

Closed blackary closed 2 years ago

blackary commented 2 years ago

Though #838 may have resolved issues with other dialects, it hasn't fixed the BigQuery case

prql dialect:bigquery
from db.schema.table

Now yields the result with no quotes at all

SELECT
  db.schema.table.*
FROM
  db.schema.table

But, this is invalid. In BigQuery, it actually needs to have `s around the full name in the select and from.

SELECT
  `db.schema.table`.*
FROM
  `db.schema.table`
max-sixty commented 2 years ago

As long as the identifiers are valid, BQ didn't seem to require backticks.

Do you have a specific case where it raises and error?

Thank you

blackary commented 2 years ago

Here's a fully-functioning example:

prql dialect:bigquery
from `bigquery-public-data.pypi.file_downloads`
filter timestamp > '2024-08-01'

->

SELECT
  `bigquery-public-data`.pypi.file_downloads.*
FROM
  `bigquery-public-data`.pypi.file_downloads
WHERE
  timestamp > '2024-08-01'

Note that it puts backticks around biquery-public-data because of the hyphens, and I get the error

Unrecognized name: `bigquery-public-data` at [2:3]

I get the same error if the db name doesn't have hyphens in it.

If I change it so that there are backticks around the whole db.schema.table, it works (though, given the date, returns no results)

SELECT
  `bigquery-public-data.pypi.file_downloads`.*
FROM
  `bigquery-public-data.pypi.file_downloads`
WHERE
  timestamp > '2024-08-01'
max-sixty commented 2 years ago

Ah, sorry about that @blackary , mea culpa.

IIUC, the initial example would be OK — i.e., without the hyphen, it would be fine without the backticks. But if any part of the table expression has an invalid character, the whole identifier needs backticks. Lmk if that's inconsistent with how you're thinking about it.

Ironically I originally tested this with BQ, so the very first approach worked like this. But it's incompatible with Postgres (#822), which requires each part to be quoted separately.

So we'll have to split the implementation up further between dialects, I'll prioritize this and get to it later this week.

blackary commented 2 years ago

Sorry, @max-sixty, that wasn't the best example, due to the hyphenated project name -- I used that one because it's using a public dataset. Here's something closer to the real example I'm working with:

prql dialect:bigquery
from company.apps.users
SELECT
  company.apps.users.*
FROM
  company.apps.users

Even with no hyphens, BigQuery still complains about the select part

  Unrecognized name: company at [2:3]

If I put backticks around both, it works fine

SELECT
  `company.apps.users`.*
FROM
  `company.apps.users`

Thanks for being so responsive! I'm excited to try prql, but am not able to currently due to this issue.

max-sixty commented 2 years ago

I see — so it seems that:

So I think a rule is both fairly simple and complete is: "Any ident containing >2 periods should be quoted". We can write a BQ-specific implementation without too much trouble, I think. It over-quotes idents in the FROM clause, but that's OK.

An alternative is to quote every identifier, which would definitely be complete. But then it makes the SQL more "machine-generated" and less friendly — in this case less welcoming to newcomers, and in other cases more difficult to debug.

Are there any gaps you can perceive in that rule @blackary?

Thanks for being so responsive! I'm excited to try prql, but am not able to currently due to this issue.

Thanks for your feedback and patience — I would have hoped the patience would be expended on more conceptual issues, but inevitably it's these small differences between dialects which are the difficult cases 🙏

blackary commented 2 years ago

@max-sixty I think that rule would cover all the cases. If I was hand-writing the SQL, I would do your second example (just using the table name in the select, without schema or db), but I figure that might add other issues with potential name collisions.

Also, regarding over-quoting, it seems that if you quote the SELECT case, you must also quote the FROM.

e.g. this fails

SELECT
  `company.apps.users`.*
FROM
  company.apps.users

But this works

SELECT
  `company.apps.users`.*
FROM
  `company.apps.users`

Thanks for being willing to work on the fiddly dialect edge-cases.

max-sixty commented 2 years ago

https://github.com/prql/prql/pull/858 almost gets us there, but then I just spotted another case!

prql dialect:bigquery
from pj.ds.tbl

works well:

SELECT
  `pj.ds.tbl`.*
FROM
  `pj.ds.tbl`

...but because idents are currently required to be fully enclosed by backticks, the code that dbt generates doesn't work:

prql dialect:bigquery
from `pj`.`ds`.`tbl`
SELECT
  `pj``.``ds``.``tbl`.*
FROM
  `pj``.``ds``.``tbl`  # invalid

This is mostly a problem for dbt; it's rare that someone really requires this when manually writing it out. Currently the parser (i.e. the very first stage) doesn't understand ident "parts". I'll see whether I can try and inch towards a solution without building any complexity there for the moment.

max-sixty commented 1 year ago

FYI I looked at whether we could get rid of the BQ specialization after 0.3.0. I don't think that's possible, since we use full identifiers in the SELECT, and from above:

We always need them in the SELECT clause where there's a dataset / two periods

So leaving the BQ specialization in for the moment. Possibly in the future we can remove the dataset etc from the SELECT statement