dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

HIVE Partitions in AWS Athena #555

Open sh-rp opened 1 year ago

sh-rp commented 1 year ago

We want to enable the user to make use of AWS Athena partitions. For this we will most likely need to change the schema a bit and add the option for the user to define partitions on the resources.
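
As an illustration of resource-level partition definitions, here is a minimal sketch using dlt's existing `partition` column hint (honored today by e.g. the BigQuery destination, not yet by Athena); the resource and column names are purely illustrative:

```python
import dlt

# sketch: declare the partition column on the resource itself via the existing
# "partition" column hint; teaching the Athena destination to act on this hint
# is what this issue proposes (names below are illustrative)
@dlt.resource(
    write_disposition="append",
    columns={"event_date": {"data_type": "date", "partition": True}},
)
def events():
    yield {"id": 1, "event_date": "2023-06-01"}

pipeline = dlt.pipeline(
    pipeline_name="athena_partitions",
    destination="athena",
    dataset_name="raw",
)
# pipeline.run(events())  # requires Athena credentials and a staging bucket
```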

Analyzing the discussion on this topic, we can implement this in 3 steps of increasing complexity (3 PRs expected):

1. **Support Iceberg tables.** Partitioning is available out of the box via the `CREATE TABLE` command; partitions are created based on the data in the tables.

2. **Support regular (Hive) tables.** After loading a file, `ADD PARTITION` must be executed to map the file into a partition. Partitions are created based on metadata that must be available via file layout placeholders, so if a given partition column is declared in the adapter, the same column name must be a placeholder in the file layout (or be present in the placeholders list). See the sketch after this list.

3. **Hive partitioning based on table data.** Once #1382 is merged we can easily add more table formats (e.g. Hive) and use them for partitioning. From our discussion we know that `MSCK REPAIR TABLE` costs too much for incremental loads, so I do not yet see value in it.
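
For step 2, a rough sketch of the DDL that would need to run after each load to map a hive-style folder to a partition (hypothetical helper; table, column, and bucket names are illustrative):

```python
# hypothetical helper: build the Athena DDL that registers a loaded file's
# folder as a partition (step 2 above); not part of dlt today
def add_partition_sql(table: str, column: str, value: str, location: str) -> str:
    return (
        f"ALTER TABLE {table} "
        f"ADD IF NOT EXISTS PARTITION ({column} = '{value}') "
        f"LOCATION '{location}'"
    )

print(add_partition_sql(
    "events",
    "event_date",
    "2023-06-01",
    "s3://my-bucket/events/event_date=2023-06-01/",  # illustrative path
))
```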


This is from an older discussion. Tasks:

To think about:

rudolfix commented 1 year ago

@sh-rp let's take a look at the native partitioning that pyarrow has. OFC in that case we'll need to deal with several files in folders, so our jobs mechanism gets more complicated (but why not have jobs that are folders ;> OMG)
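
For reference, a minimal sketch of pyarrow's native hive-style partitioned writes, which is what turns a single load job into a folder of files; column names and paths are illustrative:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# illustrative table; "event_date" is the partition column
table = pa.table({
    "id": [1, 2, 3],
    "event_date": ["2023-06-01", "2023-06-01", "2023-06-02"],
})

# pyarrow writes one folder per partition value (event_date=.../part-*.parquet),
# so a load job becomes a directory tree rather than a single file
ds.write_dataset(
    table,
    base_dir="events",  # local path for the sketch; an S3 URI in practice
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("event_date", pa.string())]), flavor="hive"
    ),
)
```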

In the Weaviate PR we are adding the concept of a destination adapter that injects additional information into the schema, to be interpreted by the loader/normalizer (the data writer also has access to it).

So it all fits together, but it still looks like a bigger project.

gamgi commented 10 months ago

Documenting a potential use case: I'm loading files from S3 + SFTP + API sources to an S3 destination.

I'd like to use properties from the source (not the data itself) to design the partition layout in the destination. The properties may be resource arguments or, indirectly, source metadata such as the object path and/or file name.

Examples:

How others have (not) done it:

Implications for this ticket:

TL;DR I'm looking to partition based on metadata, not the data itself.

MRocholl commented 7 months ago

Just wanted to add a thing to think about.

Iceberg tables make use of hidden partitioning and keep track of changes to partitioning over time. This might be much simpler to implement than classical Hive tables.

https://trino.io/blog/2021/05/03/a-gentle-introduction-to-iceberg#hidden-partitions

Considering this issue is about AWS Athena partitions, I do not believe this is out of scope; it might be the "simpler" part of the issue.

Best

Pipboyguy commented 7 months ago

(Partly) subsumed by https://github.com/dlt-hub/dlt/issues/930

nicor88 commented 5 months ago

In case native Iceberg tables are used in Athena, the partition implementation can be delegated to Iceberg directly (hidden partitioning), and it's possible to use an implementation similar to what is done in BigQuery, where the PARTITION BY clause is added via SQL. Pretty much as @MRocholl mentioned.

For pure Parquet writing, of course, what I mention above doesn't work, because partitions are part of the object path of the data written to S3.

@sh-rp @rudolfix @sultaniman as I'm interested in using the Athena/Iceberg destination with partitions, do you see anything against what I proposed above? It's just about adding the right SQL for Iceberg managed tables.

sultaniman commented 5 months ago

@nicor88 I've been working on extending layout placeholders for the filesystem destination; at the moment we don't yet support syncing or taking date & time from parquet files for partitioning.
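
For context, a hedged sketch of what layout placeholders on the filesystem destination look like; the `layout` and `extra_placeholders` options and the placeholder names may differ between dlt versions, and `source_system` is a user-defined placeholder for illustration only:

```python
import dlt
from dlt.destinations import filesystem

# hedged sketch: hive-style folders driven by the file layout, not by table data;
# option names and placeholders may differ between dlt versions
dest = filesystem(
    bucket_url="s3://my-bucket/raw",  # illustrative bucket
    layout="{table_name}/source={source_system}/load_id={load_id}/{file_id}.{ext}",
    extra_placeholders={"source_system": "sftp"},  # user-defined placeholder
)

pipeline = dlt.pipeline(
    pipeline_name="partitioned_fs",
    destination=dest,
    dataset_name="events",
)
```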

nicor88 commented 5 months ago

@sultaniman Layout placeholders for the filesystem destination, from what I can see, won't work with Athena-managed Iceberg tables.

For example, an Iceberg table can be created like this:

```sql
CREATE TABLE events (id bigint, event_date date)
  PARTITIONED BY (event_date, bucket(16, id))
  LOCATION 's3://DOC-EXAMPLE-BUCKET/your-folder/'
  TBLPROPERTIES ( 'table_type' = 'ICEBERG' )
```

The writing to the file system is handled under the hood by Athena itself in the case of an Iceberg table, not by the writer (dlt); this is because of Iceberg hidden partitioning.

Therefore the only thing to do with Athena Iceberg tables is to specify the partition definition on table creation or, as a post-hook (e.g. using pyiceberg), to modify the partition spec of a table.
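
For the post-hook route, a hedged sketch of evolving the partition spec with pyiceberg after the table exists; the catalog and table names are illustrative, and this assumes a pyiceberg version that supports partition spec updates:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.transforms import DayTransform

# illustrative: load the table through a configured catalog (e.g. AWS Glue)
catalog = load_catalog("glue")            # catalog name is an assumption
table = catalog.load_table("raw.events")  # namespace.table, illustrative

# add a day partition on event_date; with hidden partitioning Athena derives
# partition values at write time, so dlt does not need a special path layout
with table.update_spec() as update:
    update.add_field("event_date", DayTransform(), "event_date_day")
```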

That said, I was planning to propose a PR for this, because it's the only limitation blocking us from using dlt in production.

All our tables are partitioned by ingestion date, for example, to reduce data scans downstream.

Happy to have a chat with more details if the above is not clear.

sultaniman commented 5 months ago

@nicor88 sure, we can have a chat about it. I think once we merge the original PR I will be more than happy to allow Iceberg support 🙂