fivetran / dbt_marketo

Fivetran's Marketo dbt package
https://fivetran.github.io/dbt_marketo/
Apache License 2.0

FEATURE - partition_by does not work for Databricks #21

Closed tfayyaz closed 1 year ago

tfayyaz commented 2 years ago

Is there an existing issue for this?

Describe the issue

Trying to run models that have partition_by in the config fails on Databricks as the format is different for Databricks.

Currently, the config is:

{{ 
    config(
        materialized='incremental',
        partition_by = {'field': 'date_day', 'data_type': 'date'},
        unique_key='lead_day_id'
        ) 
}}
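The `{'field': ..., 'data_type': ...}` dict is the BigQuery form of `partition_by`; on Databricks (via the dbt-spark adapter), `partition_by` instead takes a plain list of column names. A sketch of what the equivalent config would look like for Databricks:

```jinja
{{
    config(
        materialized='incremental',
        partition_by=['date_day'],
        unique_key='lead_day_id'
    )
}}
```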

And if you run the model using:

% dbt run --select marketo.intermediate.marketo__change_data_details 

You get the error below

Relevant error log or model output

15:31:16  Running with dbt=1.1.0
15:31:17  Found 62 models, 69 tests, 0 snapshots, 0 analyses, 563 macros, 1 operation, 0 seed files, 26 sources, 0 exposures, 0 metrics
15:31:17  
15:31:33  
15:31:33  Running 1 on-run-start hook
15:31:33  1 of 1 START hook: marketo.on-run-start.0 ...................................... [RUN]
15:31:33  1 of 1 OK hook: marketo.on-run-start.0 ......................................... [OK in 0.00s]
15:31:33  
15:31:33  Concurrency: 2 threads (target='dev')
15:31:33  
15:31:33  1 of 1 START incremental model tfayyaz_dbt_marketing_marketo.marketo__change_data_details  [RUN]
15:31:40  1 of 1 ERROR creating incremental model tfayyaz_dbt_marketing_marketo.marketo__change_data_details  [ERROR in 7.54s]
15:31:42  
15:31:42  Finished running 1 incremental model, 1 hook in 25.15s.
15:31:42  
15:31:42  Completed with 1 error and 0 warnings:
15:31:42  
15:31:42  Runtime Error in model marketo__change_data_details (models/intermediate/marketo__change_data_details.sql)
15:31:42    Query execution failed.
15:31:42    Error message: org.apache.spark.sql.AnalysisException: Couldn't find column field in:
15:31:42    root
15:31:42     |-- lead_id: long (nullable = true)
15:31:42     |-- date_day: date (nullable = true)
15:31:42     |-- lead_day_id: string (nullable = false)
15:31:42  
15:31:42  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
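The `Couldn't find column field` message suggests the adapter treated the BigQuery-style dict literally and emitted its key `field` as a partition column name. Roughly, the compiled DDL would look like the following (an illustrative sketch, not the exact compiled output):

```sql
-- approximate compiled DDL (illustrative only)
create table tfayyaz_dbt_marketing_marketo.marketo__change_data_details
using delta
partitioned by (field)  -- 'field' is not a column in the model, hence the AnalysisException
as
select lead_id, date_day, lead_day_id
from ...
```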

Expected behavior

It should work on Databricks by default, perhaps by adding a conditional to the config like the one below, which fixes this:

{{
    config(
        materialized='incremental',
        partition_by=['date_day'] if target.type == 'databricks' else {'field': 'date_day', 'data_type': 'date'},
        unique_key='lead_day_id'
    )
}}

or by adding docs telling Databricks users to add the following to their dbt_project.yml file:

models:
  marketo:
    intermediate:
      marketo__change_data_details:
        +partition_by: ['date_day']

dbt Project configurations

name: 'marketing_lakehouse'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'marketing_lakehouse'

# These configurations specify where dbt should look for different types of files.
# The model-paths config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`

models:
  +file_format: delta

seeds:
  +file_format: delta

snapshots:
  +file_format: delta

dispatch:

vars:
  marketo_source:
    marketo_database: marketing_marketo
    marketo_schema: marketing_marketo
  salesforce_source:
    salesforce_database: marketing_sfdc_fivetran
    salesforce_schema: marketing_sfdc_fivetran

models:
  marketo:
    intermediate:
      marketo__change_data_details:
        +partition_by: ['date_day']

Package versions

packages:

What database are you using dbt with?

databricks

dbt Version

1.1.0

Additional Context

No response

Are you willing to open a PR to help address this issue?

fivetran-joemarkiewicz commented 2 years ago

Hi @tfayyaz thanks for opening this issue!

It looks like the issue is tied to the fact that the package compatibility does not include Databricks yet. In addition to the partition_by needing to be adjusted, I imagine there are a few other components of this package that will need to be adjusted in order for the package to be compatible with Databricks.
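One common pattern for making a package config adapter-aware is to branch on `target.type` inside a macro. The macro below is a hypothetical sketch of that approach — the name `marketo_partition_by` and the branching logic are assumptions for illustration, not the package's actual implementation:

```jinja
{# Hypothetical sketch: return an adapter-appropriate partition_by value. #}
{% macro marketo_partition_by() %}
    {% if target.type in ('databricks', 'spark') %}
        {# Spark/Databricks expects a list of column names #}
        {{ return(['date_day']) }}
    {% else %}
        {# BigQuery expects a dict with field and data_type #}
        {{ return({'field': 'date_day', 'data_type': 'date'}) }}
    {% endif %}
{% endmacro %}
```

The model config could then call `partition_by=marketo.marketo_partition_by()` once, keeping the per-adapter logic in a single place.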

I noticed you mentioned you would be open to creating a PR for this change. If so, I would recommend taking a look at some of our other packages and see how we allowed for Databricks compatibility. See how we did it in the below PRs for dbt_microsoft_ads:

In the meantime, I will leave this issue open to see if others would like to see Databricks compatibility supported in the package. If more people show interest, we can prioritize this feature request! Likewise, if you decide to open a PR, I would be happy to help in any way I can 😄

tfayyaz commented 2 years ago

@fivetran-joemarkiewicz thanks for getting back on this. I have identified a few other issues, so I will try to combine them all and open a PR in a few weeks with the changes and tests.

fivetran-joemarkiewicz commented 2 years ago

@tfayyaz that sounds great! Please let me know if you need any assistance when working on the PR!

fivetran-joemarkiewicz commented 1 year ago

Transitioning this issue from a bug to a feature.

Our team will tackle converting this dbt package for Databricks compatibility in our next sprint.

fivetran-joemarkiewicz commented 1 year ago

Closing this issue, as the above PR adds Databricks compatibility and the original issue should be addressed.