dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0

Running RECOVER PARTITIONS without defining partitions #126

Closed ferdyh closed 1 year ago

ferdyh commented 2 years ago

Describe the feature

In Databricks you can recover partitions from existing Parquet files when you create a table without declaring its partitions. If you define the partitions in the dbt source, you also need to define the schema of the table. If you define neither a schema nor partitions, recovering partitions still works on the table itself, but dbt_external_tables skips the recover-partitions step because no partitions are defined in the source.

Describe alternatives you've considered

Run it manually afterwards. I think a nice fix would be to expose it as a parameter, or to make it possible to run it from the CLI afterwards.

Additional context

I assume this is only relevant to Spark / Databricks.

Who will this benefit?

Anyone using partitioned Parquet sources who doesn't want to supply a schema in the dbt source file.

jarno-r commented 2 years ago

Running ALTER TABLE RECOVER PARTITIONS or MSCK REPAIR TABLE on a table that does not have partitions causes an error. So always running it is not an option.

I've created a quick fix for this issue: it runs ALTER TABLE RECOVER PARTITIONS if external.recover_partitions is true. This means that you can do this in your sources.yml:

external:
  recover_partitions: true
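
The decision the quick fix makes can be sketched roughly as follows. This is only an illustration in Python, not the package's actual Jinja macro: the function names `should_recover_partitions` and `recover_partitions_sql` are hypothetical, and it assumes the new flag is combined with the existing "partitions are defined" check rather than replacing it.

```python
from typing import Any, Mapping


def should_recover_partitions(external: Mapping[str, Any]) -> bool:
    """Decide whether a recover-partitions statement should be issued.

    `external` stands in for the `external:` block of a source table.
    Assumption: recovery runs when partitions are declared (existing
    behaviour described in this issue) OR when the flag is explicitly true.
    """
    has_partitions = bool(external.get("partitions"))
    explicit_flag = external.get("recover_partitions") is True
    return has_partitions or explicit_flag


def recover_partitions_sql(relation: str) -> str:
    """Build the statement for a given relation name (hypothetical helper)."""
    return f"ALTER TABLE {relation} RECOVER PARTITIONS"


if __name__ == "__main__":
    config = {"recover_partitions": True}  # no partitions, no schema
    if should_recover_partitions(config):
        print(recover_partitions_sql("raw.events"))
```

With neither `partitions` nor `recover_partitions` set, the sketch returns `False`, so the statement is never issued against an unpartitioned table by default.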

An even better alternative would be to use 'DESCRIBE TABLE' (or something similar) to determine if the table has partitions and run ALTER TABLE RECOVER PARTITIONS accordingly. This would require changes to dbt-spark.
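
As a rough sketch of that detection idea (again in Python rather than the package's Jinja, and not how dbt-spark does anything today): it assumes the DESCRIBE TABLE result arrives as rows of `(col_name, data_type, comment)` and that partitioned tables include a `# Partition Information` marker row, as Spark's output typically does. The function name `table_has_partitions` is hypothetical.

```python
from typing import Iterable, Optional, Sequence

# Assumption: Spark marks the partition section of DESCRIBE TABLE output
# with a row whose first column is this text.
PARTITION_MARKER = "# Partition Information"


def table_has_partitions(
    describe_rows: Iterable[Sequence[Optional[str]]],
) -> bool:
    """Return True if any row's first column is the partition marker."""
    for row in describe_rows:
        if not row:
            continue  # skip empty rows
        first = (row[0] or "").strip()
        if first == PARTITION_MARKER:
            return True
    return False


if __name__ == "__main__":
    rows = [
        ("id", "bigint", None),
        ("dt", "date", None),
        ("# Partition Information", "", ""),
        ("# col_name", "data_type", "comment"),
        ("dt", "date", None),
    ]
    print(table_has_partitions(rows))
```

A check like this would let the package skip the statement on unpartitioned tables without any extra config, at the cost of one additional query per table.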

I can create a PR of my quick fix, if that is sufficient.

jtcohen6 commented 2 years ago

@jarno-r Definitely open to a PR for this one!

I like the idea of an explicit option for specifying that dbt + Databricks should recover/infer the partitions, when partitions is not itself defined.

I'm not strictly opposed to the cleverer approach, where dbt uses describe table to determine this on the user's behalf... but an explicit config feels in keeping with the approach on other databases that can infer partitions. As a general rule, I try to keep this package as a lightweight lens into each database's capabilities, without too many magic tricks behind the scenes.

jarno-r commented 2 years ago

I've created a PR. Link above.

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

jarno-r commented 1 year ago

@jtcohen6 This is still relevant for us. We've been running our own version of this package just to have this feature. It is cumbersome to maintain. Could the PR be merged?