dbt-labs / dbt-spark

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks
https://getdbt.com
Apache License 2.0

[ADAP-672] [Bug] Spark only writing the metadata on storage, data is not visible #822

Closed luanmorenomaciel closed 10 months ago

luanmorenomaciel commented 1 year ago

Is this a new bug in dbt-spark?

Current Behavior

Following the getting-started instructions, reading and writing Parquet doesn't work properly: the output metadata is written to storage, but the data is not.

Expected Behavior

Read files and write them out as Parquet to another folder using the dbt-spark framework.

Steps To Reproduce

1 - Stand up the docker-compose environment available at https://github.com/dbt-labs/dbt-spark: `docker-compose up -d`

2 - Install package dependencies using packages.yml:

```yaml
packages:
```

3 - Add connectivity to profiles.yml with the following configuration:

```yaml
spark:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: localhost
      port: 10000
      schema: default
```
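As a quick sanity check that the Thrift endpoint configured above is actually reachable, a generic TCP probe can be used (this is not a dbt feature, just a connectivity check):

```python
import socket

def thrift_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Check whether a TCP connection to the Thrift server can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Matches the profile above (host: localhost, port: 10000):
# print(thrift_port_open("localhost", 10000))
```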

4 - Add the external config according to https://github.com/dbt-labs/dbt-external-tables/blob/main/sample_sources/spark.yml:

```yaml
version: 2

sources:
```
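For reference, the linked spark.yml sample defines external tables along these lines (a sketch only; the source and table names and the location below are hypothetical, and the `external` block keys follow dbt-external-tables' Spark conventions):

```yaml
version: 2

sources:
  - name: bronze
    tables:
      - name: users
        external:
          using: parquet
          location: '/path/to/lakehouse/bronze/parquet/users'
```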

5 - Set up the dbt_project.yml file and build a simple SQL statement that persists data:

```yaml
name: 'spark'
version: '1.0.0'
config-version: 2

profile: 'spark'

model-paths: ["models"]

target-path: "target"
clean-targets:

models:
  +materialized: table
  +file_format: parquet
```

```sql
SELECT * FROM {{ source('bronze', 'users') }}
```

6 - Once I execute the command below, I get a success message and the metadata shows up on the HMS server: `dbt -d run-operation stage_external_sources --vars "ext_full_refresh: true" --profiles-dir /Users/luanmorenomaciel/GitHub/owshq-dbt-core/dbt/`

7 - Executing the command again, I get a green status as well: `dbt -d run-operation stage_external_sources --vars "ext_full_refresh: true" --profiles-dir /Users/luanmorenomaciel/GitHub/owshq-dbt-core/dbt/`

Relevant log output

```
11:51:31.862712 [debug] [MainThread]: Spark adapter: Poll status: 2, query complete
11:51:31.862987 [debug] [MainThread]: SQL status: OK in 0.18 seconds
11:51:31.868058 [info ] [MainThread]: 1 of 1 (2) OK
11:51:31.868360 [info ] [MainThread]: 1 of 1 (3) create table bronze.users (                    average_stars double,            ...  
11:51:31.869214 [debug] [MainThread]: Using spark connection "macro_stage_external_sources"
11:51:31.869531 [debug] [MainThread]: On macro_stage_external_sources: /* {"app": "dbt", "dbt_version": "1.3.4", "profile_name": "spark", "target_name": "dev", "connection_name": "macro_stage_external_sources"} */

    create table bronze.users (

            average_stars double,
            compliment_cool bigint,
            compliment_cute bigint,
            compliment_funny bigint,
            compliment_hot bigint,
            compliment_list bigint,
            compliment_more bigint,
            compliment_note bigint,
            compliment_photos bigint,
            compliment_plain bigint,
            compliment_profile bigint,
            compliment_writer bigint,
            cool bigint,
            elite string,
            fans bigint,
            friends string,
            funny bigint,
            name string,
            review_count bigint,
            useful bigint,
            user_id string,
            yelping_since string
    )  using parquet
    location '/Users/luanmorenomaciel/GitHub/owshq-dbt-core/storage/lakehouse/bronze/parquet/users'

11:51:31.952593 [debug] [MainThread]: Spark adapter: Poll status: 2, query complete
11:51:31.952865 [debug] [MainThread]: SQL status: OK in 0.08 seconds
11:51:31.957255 [info ] [MainThread]: 1 of 1 (3) OK
11:51:31.957543 [debug] [MainThread]: On macro_stage_external_sources: ROLLBACK
11:51:31.957799 [debug] [MainThread]: Spark adapter: NotImplemented: rollback
11:51:31.958038 [debug] [MainThread]: On macro_stage_external_sources: Close
11:51:31.967680 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10b302d60>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10b58f790>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10b58f850>]}
11:51:31.968116 [debug] [MainThread]: Flushing usage events
11:51:32.422742 [debug] [MainThread]: Connection 'macro_stage_external_sources' was properly closed.

11:55:26.953982 [info ] [MainThread]: Finished running 1 table model in 0 hours 0 minutes and 1.38 seconds (1.38s).
11:55:26.954433 [debug] [MainThread]: Connection 'master' was properly closed.
11:55:26.954786 [debug] [MainThread]: Connection 'model.spark.users' was properly closed.
11:55:27.029574 [info ] [MainThread]: 
11:55:27.030014 [info ] [MainThread]: Completed successfully
11:55:27.030448 [info ] [MainThread]: 
11:55:27.030737 [info ] [MainThread]: Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
11:55:27.031127 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10e6b0550>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10e7ebaf0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10e8f3fd0>]}
11:55:27.031467 [debug] [MainThread]: Flushing usage events
```
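One way to confirm the symptom above (metadata created, but no data persisted) is to inspect the table location for actual Parquet part files. A minimal sketch, assuming local-filesystem storage; the path to check is the `location` from the create table statement in the logs:

```python
import os

def has_parquet_data(location: str) -> bool:
    """Return True if the table location contains Parquet part files,
    not just metadata markers such as _SUCCESS."""
    try:
        entries = os.listdir(location)
    except FileNotFoundError:
        return False
    return any(
        name.endswith(".parquet") or name.startswith("part-")
        for name in entries
    )

# Example: point this at the `location` from the CREATE TABLE statement above.
# print(has_parquet_data("/Users/.../storage/lakehouse/bronze/parquet/users"))
```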

Environment

- OS: macOS Ventura 13.4.1 (22F82)
- Python: Python 3.11.4
- dbt-core: 1.3.4
- dbt-spark: 1.3.2

Additional Context

I've been trying this for a week now and it doesn't seem to be working properly. I'd appreciate it if somebody could help!

github-actions[bot] commented 10 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 10 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.