TobikoData / sqlmesh

Efficient data transformation and modeling framework that is backwards compatible with dbt.
https://sqlmesh.com
Apache License 2.0
1.84k stars 161 forks source link

`sqlmesh format` is rewritting `::` to `CAST`, but only in some cases #3406

Closed plaflamme closed 3 days ago

plaflamme commented 4 days ago

Bumping from 0.129.0 to 0.133.0 causes the following diffs in our models:

SELECT
  foo::STRING,
  JSON_VALUE(j, '$.bar')::NUIMERIC AS bar
FROM ...

rewritten to:

SELECT
  foo::STRING,
  CAST(JSON_VALUE(j, '$.bar') AS NUIMERIC) AS bar
FROM ...

Can we avoid this? We much prefer the shorthand casts than the normal BigQuery syntax.

georgesittas commented 4 days ago

Yeah, this looks like a bug. Thanks for reporting.

FYI @VaggelisD

georgesittas commented 3 days ago

So the fix I proposed in that PR wasn't really complete. The crux of the problem is that :: is added only for expressions where it can't mess up precedence. It just so happens that JSON_VALUE is parsed into JSONExtractScalar in bigquery, which is both a binary expression and a function, and binaries are unsafe to add :: to because precedence can be changed, so this one's an exception where we fallback to the safer CAST syntax.

Fixing this generically (i.e. taking other similar expressions into account) is a PITA and after discussing this internally we concluded that we won't address this right now.

plaflamme commented 3 days ago

@georgesittas that's unfortunate.

FMI: what cases is JSON_VALUE considered a binary expression? From the BigQuery documentation, it's a function that always returns STRING.

georgesittas commented 3 days ago

It's a binary expression in dialects like duckdb, postgres, mysql etc: x -> some_json_path. This syntax and the function one are represented by the same expression to facilitate transpilation.

plaflamme commented 3 days ago

This syntax and the function one are represented by the same expression to facilitate transpilation.

I see.

Fixing this generically (i.e. taking other similar expressions into account) is a PITA and after discussing this internally we concluded that we won't address this right now.

Understood. Should I report an issue to sqlglot about this?

georgesittas commented 3 days ago

I think it's not a sqlglot issue, it's only a sqlmesh one. For now we'll probably keep the current behavior, cc @tobymao.

plaflamme commented 2 days ago

I think it's not a sqlglot issue, it's only a sqlmesh one

Right, though presumably, some support from sqlglot would be necessary? For example, nodes that are both Binary and Func (such as JSONExtractScalar) could expose what they transpile to in different dialects, e.g.:

class BinaryOrFunc(Binary, Func):
  """Nodes that either transpile to binary expressions or functions depending on the dialect"""
  def is_func(self, dialect) -> bool:
    """True when this will always transpile to a function call in the specified dialect"""
    pass

Perhaps this isn't how you'd want to solve this though, not sure...

georgesittas commented 2 days ago

There could be some new abstraction there to help, sure. But it wouldn't really be useful for SQLGlot, and so there's a risk of it being refactored/removed in the future without a clear indication of its impact. So ideally, since the formatter lives in SQLMesh land, a possible solution should be self-contained without creating such implicit dependencies between the two packages. And currently, a solution in SQLMesh would kinda be a PITA to maintain. Possible solutions we discussed involved hardcoding dialects (e.g. a mapping of dialect to "castable expressions"), which is also a smell, since this mapping can become out-of-date.

So the tradeoff here was that: for these rare cases, it should be ok to have a small discrepancy in how the formatter behaves. This is already true today, e.g. binary and unary expressions aren't cast using the :: syntax.

plaflamme commented 2 days ago

Yeah, that makes sense, I appreciate not wanting to bloat sqlglot for sqlmesh purposes. That said, since sqlglot is a transpiler, it does seem like it could be useful to know what an ambiguous expression will transpile to. Maybe there's only one use-case (formatting SQL), but perhaps having an open issue on sqlglot repo would surface other use cases?

georgesittas commented 2 days ago

We generally avoid keeping open-ended open issues in sqlglot, because we use the issue list to as an indication of what's actively being developed. If you want you can open an issue, cross-reference this SQLMesh issue and then we can close it as not planned, so that it's documented.