dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.62k stars 1.59k forks source link

[Bug] `--empty` flag fails with dbt.ref in snowpark #10517

Open canderson-luminatedata opened 1 month ago

canderson-luminatedata commented 1 month ago

Is this a new bug in dbt-core?

Current Behavior

When running dbt run --empty with a snowpark model that includes dbt.ref, it falls over.

Here is a minimal snowpark model:

import pandas as pd

def model(dbt, session):
    data = dbt.ref("stg_ftv__tvt_arena")
    return pd.DataFrame({"x":[1,2,3], "y":['a','b','c']})

and dbt run --empty fails with

16:08:50    Database Error in model pymodel (models/pymodel.py)
  100357 (P0000): Python Interpreter Error:
  Traceback (most recent call last):
    File "_udf_code.py", line 97, in main
    File "_udf_code.py", line 12, in model
    File "_udf_code.py", line 61, in <lambda>
    File "_udf_code.py", line 28, in ref
    File "/usr/lib/python_udf/8c636e84b1538a317283a9cebb185cc59b812b3e7c12390bf024519dc9298cf3/lib/python3.8/site-packages/snowflake/snowpark/session.py", line 1857, in table
      validate_object_name(name)
    File "/usr/lib/python_udf/8c636e84b1538a317283a9cebb185cc59b812b3e7c12390bf024519dc9298cf3/lib/python3.8/site-packages/snowflake/snowpark/_internal/utils.py", line 218, in validate_object_name
      raise SnowparkClientExceptionMessages.GENERAL_INVALID_OBJECT_NAME(name)
  snowflake.snowpark.exceptions.SnowparkInvalidObjectNameException: (1500): The object name '(select * from DATASCIENCE_DB.SANDBOX_S.stg_ftv__tvt_arena where false limit 0)' is invalid.
   in function PYMODEL__DBT_SP with handler main
  compiled Code at target/run/dbt_ftv/models/pymodel.py
16:08:50

Expected Behavior

that it should run through all dbt models quickly without failing

Steps To Reproduce

with

dbt-core==1.8.4
dbt-extractor==0.5.1
dbt-snowflake==1.8.3
dbt-semantic-interfaces==0.5.1
snowflake-snowpark-python==1.12.0
snowflake-connector-python[pandas]==3.7.0

and minimal model

import pandas as pd

def model(dbt, session):
    data = dbt.ref("[some model name]")
    return pd.DataFrame({"x":[1,2,3], "y":['a','b','c']})

Relevant log output

It is complaining about the following autogenerated code:

def ref(*args, **kwargs):
    refs = {"stg_ftv__tvt_arena": "(select * from DATASCIENCE_DB.SANDBOX_S.stg_ftv__tvt_arena where false limit 0)"}
    key = '.'.join(args)
    version = kwargs.get("v") or kwargs.get("version")
    if version:
        key += f".v{version}"
    dbt_load_df_function = kwargs.get("dbt_load_df_function")
    return dbt_load_df_function(refs[key])

Environment

- OS: 143.1
- Python: 3.11.9
- dbt: 1.8.4

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

dbeatty10 commented 1 month ago

Thanks for reporting this @canderson-luminatedata !

I got the same error as you with the following files and commands:

### Reprex Files: `models/some_sql_model.sql` ```sql select 1 as id ``` `models/snowpark_model.py` ```python import pandas as pd def model(dbt, session): data = dbt.ref("some_sql_model") return pd.DataFrame({"x":[1,2,3], "y":['a','b','c']}) ``` Commands: ``` dbt run --empty ``` Output: ``` (dbt_1.8) $ dbt run --empty 23:44:57 Running with dbt=1.8.3 23:44:58 Registered adapter: snowflake=1.8.3 23:44:58 Found 2 models, 444 macros 23:44:58 23:45:01 Concurrency: 10 threads (target='snowflake') 23:45:01 23:45:01 1 of 2 START sql view model dbt_dbeatty.some_model ............................. [RUN] 23:45:02 1 of 2 OK created sql view model dbt_dbeatty.some_model ........................ [SUCCESS 1 in 0.96s] 23:45:02 2 of 2 START python table model dbt_dbeatty.snowpark_model ..................... [RUN] 23:45:05 2 of 2 ERROR creating python table model dbt_dbeatty.snowpark_model ............ [ERROR in 3.31s] 23:45:05 23:45:05 Finished running 1 view model, 1 table model in 0 hours 0 minutes and 6.62 seconds (6.62s). 23:45:05 23:45:05 Completed with 1 error and 0 warnings: 23:45:05 23:45:05 Database Error in model snowpark_model (models/snowpark_model.py) 100357 (P0000): Python Interpreter Error: Traceback (most recent call last): File "_udf_code.py", line 95, in main File "_udf_code.py", line 10, in model File "_udf_code.py", line 59, in File "_udf_code.py", line 26, in ref File "/usr/lib/python_udf/8c636e84b1538a317283a9cebb185cc59b812b3e7c12390bf024519dc9298cf3/lib/python3.8/site-packages/snowflake/snowpark/session.py", line 1857, in table validate_object_name(name) File "/usr/lib/python_udf/8c636e84b1538a317283a9cebb185cc59b812b3e7c12390bf024519dc9298cf3/lib/python3.8/site-packages/snowflake/snowpark/_internal/utils.py", line 218, in validate_object_name raise SnowparkClientExceptionMessages.GENERAL_INVALID_OBJECT_NAME(name) snowflake.snowpark.exceptions.SnowparkInvalidObjectNameException: (1500): The object name '(select * from analytics_dev.dbt_dbeatty.some_model where false limit 0)' is invalid. in function SNOWPARK_MODEL__DBT_SP with handler main compiled Code at target/run/my_project/models/snowpark_model.py 23:45:05 23:45:05 Done. PASS=1 WARN=0 ERROR=1 SKIP=0 TOTAL=2 ``` `target/run/my_project/models/snowpark_model.py` ``` import pandas as pd def model(dbt, session): data = dbt.ref("some_model") return pd.DataFrame({"x":[1,2,3], "y":['a','b','c']}) # This part is user provided model code # you will need to copy the next section to run the code # COMMAND ---------- # this part is dbt logic for get ref work, do not modify def ref(*args, **kwargs): refs = {"some_model": "(select * from analytics_dev.dbt_dbeatty.some_model where false limit 0)"} key = '.'.join(args) version = kwargs.get("v") or kwargs.get("version") if version: key += f".v{version}" dbt_load_df_function = kwargs.get("dbt_load_df_function") return dbt_load_df_function(refs[key]) def source(*args, dbt_load_df_function): sources = {} key = '.'.join(args) return dbt_load_df_function(sources[key]) config_dict = {} class config: def __init__(self, *args, **kwargs): pass @staticmethod def get(key, default=None): return config_dict.get(key, default) class this: """dbt.this() or dbt.this.identifier""" database = "analytics_dev" schema = "dbt_dbeatty" identifier = "snowpark_model" def __repr__(self): return 'analytics_dev.dbt_dbeatty.snowpark_model' class dbtObj: def __init__(self, load_df_function) -> None: self.source = lambda *args: source(*args, dbt_load_df_function=load_df_function) self.ref = lambda *args, **kwargs: ref(*args, **kwargs, dbt_load_df_function=load_df_function) self.config = config self.this = this() self.is_incremental = False # COMMAND ---------- # To run this in snowsight, you need to select entry point to be main # And you may have to modify the return type to text to get the result back # def main(session): # dbt = dbtObj(session.table) # df = model(dbt, session) # return df.collect() # to run this in local notebook, you need to create a session following examples https://github.com/Snowflake-Labs/sfguide-getting-started-snowpark-python # then you can do the following to run model # dbt = dbtObj(session.table) # df = model(dbt, session) def materialize(session, df, target_relation): # make sure pandas exists import importlib.util package_name = 'pandas' if importlib.util.find_spec(package_name): import pandas if isinstance(df, pandas.core.frame.DataFrame): session.use_database(target_relation.database) session.use_schema(target_relation.schema) # session.write_pandas does not have overwrite function df = session.createDataFrame(df) df.write.mode("overwrite").save_as_table('analytics_dev.dbt_dbeatty.snowpark_model', table_type='transient') def main(session): dbt = dbtObj(session.table) df = model(dbt, session) materialize(session, df, dbt.this) return "OK" ```