dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0
297 stars 119 forks source link

[Spark] Fix partitions #116

Closed JCZuurmond closed 2 years ago

JCZuurmond commented 2 years ago

Description & motivation

The partitions were not working properly. Fixes #108 .

First, there was a typo, therefor the partition clause was not added.

After doing some testing we learned that the partitions should be recovered after creating the table. This is also needed for a non-full-refresh, because new partitions are only detected after running the RECOVER PARTITIONS command.

Checklist

JCZuurmond commented 2 years ago

Would be awesome if we cut a release after this PR is merged. We rely on external tables + incremental strategies. Now, due to the typo, the external tables have no partitions, thus an incremental strategy that uses the partition to filter data still has to read the full table - which is a waste of resources

JCZuurmond commented 2 years ago

Actually, we should be a bit careful here. Apparently you need to run a repair command after the table is created. As the page explains, tables created from existing data will return no rows. There is a separate command to repair the table.

jtcohen6 commented 2 years ago

@JCZuurmond Aha! That seems to answer the finding by @guillesd a few weeks ago: https://github.com/dbt-labs/dbt-external-tables/pull/113#issuecomment-964210274

I'm working separately on fixing the integration tests in this package, so we can officially declare v1.0 compatibility

JCZuurmond commented 2 years ago

This PR needs more work than a typo fix. I will do some additional testing with the use of the repair table command, after that I will propose an approach

JCZuurmond commented 2 years ago

Databricks CI still fails

jtcohen6 commented 2 years ago

Oh whoops! Looks like the error is:

13:38:36  Encountered an error while running operation: Runtime Error
  Runtime Error
    ('42000', '[42000] [Simba][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException: \nDataType varchar is not supported.(line 14, pos 28)\n\n== SQL ==\n/* {"app": "dbt", "dbt_version": "1.0.0", "profile_name": "integration_ (80) (SQLExecDirectW)')

Which should be fixable by changing this from varchar to string: https://github.com/dbt-labs/dbt-external-tables/blob/d17892ba756c5746847cbafd050e0ae3a252131c/integration_tests/models/plugins/spark/spark_external.yml#L38-L40

JCZuurmond commented 2 years ago

All lights are green!