dbt-labs / dbt-spark

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks
https://getdbt.com
Apache License 2.0

[Bug] Complex types are truncated during `describe extended` #1107

Open mikealfare opened 2 months ago

mikealfare commented 2 months ago

Is this a new bug in dbt-spark?

Current Behavior

Complex types are truncated when running this macro: https://github.com/dbt-labs/dbt-spark/blob/3fc624cb99488e803956304c9dea2c10facab08d/dbt/include/spark/macros/adapters.sql#L281-L286

This happens because the macro relies on `DESCRIBE EXTENDED`, which truncates complex type definitions in its output before returning them.
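For illustration, here is a minimal PySpark sketch of the kind of truncation involved. The table is made up, and attributing the behavior to `spark.sql.debug.maxToStringFields` (default 25) is my assumption about the mechanism, not something confirmed for the reported case.

```python
# Hypothetical reproduction; the table name and field count are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("describe-truncation-repro").getOrCreate()

# Create a table whose struct column has more fields than
# spark.sql.debug.maxToStringFields allows (default 25).
pairs = ", ".join(f"'f{i}', {i}" for i in range(30))
spark.sql("DROP TABLE IF EXISTS default.wide_struct_repro")
spark.sql(
    "CREATE TABLE default.wide_struct_repro AS "
    f"SELECT named_struct({pairs}) AS struct_col"
)

# DESCRIBE renders the column type as a display string, so a wide struct can
# come back as something like 'struct<f0:int,f1:int,... 5 more fields>'
# instead of the full definition.
for row in spark.sql("DESCRIBE TABLE EXTENDED default.wide_struct_repro").collect():
    print(row["col_name"], "->", row["data_type"])
```

If that is indeed the mechanism, the column types dbt parses out of these rows can never be trusted to be complete for wide structs, no matter how the macro post-processes them.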

Expected Behavior

The types should be complete.

Steps To Reproduce

Relevant log output

No response

Environment

- OS:
- Python:
- dbt-core:
- dbt-spark:

Additional Context

No response

amychen1776 commented 2 months ago

@benc-db This is the issue we were talking about yesterday regarding the Databricks Metadata API. Is this just a Databricks-specific issue?

benc-db commented 2 months ago

It is Databricks specific, but may affect dbt-spark as well.

benc-db commented 2 months ago

lol, I didn't see where I was commenting. So, I do not know the extent to which describe extended is standard Spark vs Databricks, which is probably what you're asking here.

amychen1776 commented 2 months ago

@benc-db yup :)

@mikealfare did you find this bug running on Databricks then?

mikealfare commented 1 month ago

@amychen1776 Apologies for the late reply; my GH notifications have been out of control. I believe this was reported by a Cloud customer that was running dbt-spark with Databricks.

benc-db commented 1 month ago

I'll summarize here what I'm doing in dbt-databricks: in 1.9 I'm introducing a behavior flag to use the information schema to get column types for UC tables. The reason I'm guarding it with a flag is that I learned in testing that the information schema is not always synced up with reality, and to ensure that it is, I have to run a repair table operation before gathering columns, which adds overhead. I'm hopeful that I can remove the flag when information schema sync gets better, because in my testing I hit missing columns between successive dbt runs, with sync lag on the order of minutes...too long for me to feel comfortable trusting it for this.
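For anyone following along, here is a rough sketch of what that flag-guarded path amounts to, outside of dbt. This is not the actual dbt-databricks implementation; the `REPAIR TABLE ... SYNC METADATA` syntax and the `information_schema` column names are my reading of the Databricks docs and should be treated as assumptions.

```python
# Rough sketch of the information_schema approach described above; identifiers
# are hypothetical and the statements are assumptions, not dbt-databricks code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

catalog, schema, table = "main", "analytics", "my_model"  # hypothetical

# The step that adds overhead: force the information schema to catch up with
# the table's current columns before reading from it.
spark.sql(f"REPAIR TABLE {catalog}.{schema}.{table} SYNC METADATA")

# full_data_type should carry the complete struct definition, unlike the
# truncated string that DESCRIBE EXTENDED returns.
columns = spark.sql(
    f"""
    SELECT column_name, full_data_type
    FROM {catalog}.information_schema.columns
    WHERE table_schema = '{schema}' AND table_name = '{table}'
    ORDER BY ordinal_position
    """
).collect()

for c in columns:
    print(c["column_name"], c["full_data_type"])
```

The trade-off is visible here: the repair/sync call is what guards against stale information_schema results, and it is also the part that adds per-model overhead.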

tinolyuu commented 1 month ago

Hi, not sure if I encountered the same issue. I got a runtime error when adding a struct column to an incremental model on dbt-spark. Here's the error:

```
Runtime Error

 [PARSE_SYNTAX_ERROR] Syntax error at or near ','.(line 7, pos 34)

 == SQL ==
 /* {"app": "dbt", "dbt_version": "1.8.6", "profile_name": "main_spark", "target_name": "dev", "node_id": "model.main.evens_only"} */

     alter table test_db.evens_only_spark

         add columns

                struct_test struct<,... 1 more fields>
 ----------------------------------^^^
```

It seems the data type read in the parse_describe_extended func is `[<agate.Row: ('id', 'int', None)>, <agate.Row: ('struct_test', 'struct<,... 1 more fields>', None)>]`. I don't know why the struct type doesn't show its internal fields.
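One way to check whether this comes from the Spark session itself (my guess, not something confirmed here) is to run the same DESCRIBE outside dbt and look at `spark.sql.debug.maxToStringFields`, which governs how wide struct types are rendered:

```python
# Diagnostic sketch, run outside dbt against the same warehouse/session config;
# the conf-based explanation is a guess on my part.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print(spark.conf.get("spark.sql.debug.maxToStringFields"))  # 25 by default

rows = spark.sql("DESCRIBE TABLE EXTENDED test_db.evens_only_spark").collect()
for r in rows:
    if "more fields" in (r["data_type"] or ""):
        print("truncated:", r["col_name"], r["data_type"])

# If the truncation shows up here too, raising the limit for the session
# (SET spark.sql.debug.maxToStringFields = 1000) may be a workaround until
# the macro stops depending on DESCRIBE output.
```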

hongtron commented 6 days ago

This impacts unit testing as well. I can't provide test values for my complex type because the `,... $N more fields>` artifact gets compiled into the generated `cast` statement.

I'm not using Databricks.