databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com

dbt Python models receive date variables with different data types #583

Open tade0726 opened 7 months ago

tade0726 commented 7 months ago

Describe the feature

I noticed that date variables arrive in the Python notebook as two different data types.

The Python notebook mentioned here is the one that dbt-databricks submits to Databricks to run Python models. It contains a config object that is generated from the vars configured in dbt_project.yml.

As in:


# This part is user provided model code
# you will need to copy the next section to run the code
# COMMAND ----------
# this part is dbt logic for get ref work, do not modify

import datetime  # assumed import, required by config_dict below

def ref(*args, **kwargs):
    refs = {}
    key = '.'.join(args)
    version = kwargs.get("v") or kwargs.get("version")
    if version:
        key += f".v{version}"
    dbt_load_df_function = kwargs.get("dbt_load_df_function")
    return dbt_load_df_function(refs[key])

def source(*args, dbt_load_df_function):
    sources = {"prod_staging.ciqfininstance": "`savana_prod`.`dbt_staging`.`ciqfininstance`"}
    key = '.'.join(args)
    return dbt_load_df_function(sources[key])

config_dict = {'start_date': datetime.date(2018, 7, 1), 'end_date': ''}

class config:
    def __init__(self, *args, **kwargs):
        pass

    @staticmethod
    def get(key, default=None):
        return config_dict.get(key, default)

class this:
    """dbt.this() or dbt.this.identifier"""
    database = "savana_prod"
    schema = "dbt_gold_pipelines"
    identifier = "int_date_table"

    def __repr__(self):
        return '`savana_prod`.`dbt_gold_pipelines`.`int_date_table`'

class dbtObj:
    def __init__(self, load_df_function) -> None:
        self.source = lambda *args: source(*args, dbt_load_df_function=load_df_function)
        self.ref = lambda *args, **kwargs: ref(*args, **kwargs, dbt_load_df_function=load_df_function)
        self.config = config
        self.this = this()
        self.is_incremental = False

# COMMAND ----------

# how to execute python model in notebook
# dbt = dbtObj(spark.table)
# df = model(dbt, spark)
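
For illustration, here is a minimal sketch (not from the issue itself) of a model consuming this config; the print is only there to show where the inconsistency surfaces:

def model(dbt, spark):
    start_date = dbt.config.get("start_date")
    # start_date is a datetime.date when the var comes from dbt_project.yml,
    # but a str when it comes from --vars on the CLI.
    print(type(start_date))
    return spark.createDataFrame([(str(start_date),)], ["start_date"])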

Case 1:

When a date variable is set in dbt_project.yml, the Python notebook in Databricks receives it as a datetime.date object.

As in the config dictionary:

{"start_date": datetime.date(2021, 1, 1)}

Case 2:

When the same variable is passed with --vars "{'start_date': '2021-01-01'}" on the dbt command line, it arrives as a plain string:

{"start_date": '2021-01-01'}

Describe alternatives you've considered

I suggest delivering the date as a plain string in both cases.
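
A hypothetical sketch of that normalization, placed wherever the adapter renders config_dict into the notebook (normalize_var and raw_vars are made-up names, not dbt-databricks internals):

import datetime

def normalize_var(value):
    # Render date-like vars as ISO-8601 strings so that vars from
    # dbt_project.yml and from --vars arrive with the same type.
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    return value

# Example stand-in for whatever mapping the adapter collects from vars:
raw_vars = {"start_date": datetime.date(2018, 7, 1), "end_date": ""}
config_dict = {key: normalize_var(value) for key, value in raw_vars.items()}
# -> {'start_date': '2018-07-01', 'end_date': ''}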

Additional context

I am implementing an integration with Dagster + Databricks cloud + dbt, and I need both kinds of configuration to return the variable with the same type; otherwise, the logic that reconciles the difference has to be copied and pasted into every Python model.

This is because dbt Python models cannot currently reuse shared, abstracted helper functions.
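
For reference, this is roughly the reconciliation logic that currently has to be repeated in every model (as_date is a hypothetical helper name):

import datetime

def as_date(value):
    # Accept either representation until the adapter is consistent.
    if isinstance(value, str):
        return datetime.date.fromisoformat(value)
    return value

# inside a model:
# start_date = as_date(dbt.config.get("start_date"))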

Who will this benefit?

Model authors will no longer need to write extra logic to distinguish the date types.

An integration such as Dagster may depend on the --vars CLI flag to pass configuration, but debugging in the dbt project is easier when that setup behaves identically to the same variables defined in the yml file.

Are you interested in contributing to this feature?

Yes, I would love to.

leo-schick commented 7 months ago

@tade0726 you can ensure that by doing a little code adjustment to case 1:

instead of

{"start_date", datetime.date(2021, 1, 1)}

write

{"start_date", datetime.date(2021, 1, 1).strftime("%d.%m.%Y")}

see also: Python Dates

tade0726 commented 7 months ago

Hi @leo-schick ,

This is not about a user-defined dictionary. The dictionary comes from the config template that dbt-databricks generates.

I also updated the issue description for clarity.

See the generated template already quoted in the issue description above.

benc-db commented 7 months ago

@tade0726 Please go ahead and submit a PR :). I agree that it makes the most sense for the data type to be consistent regardless of source.

github-actions[bot] commented 1 month ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.