Feature description
It is currently possible to set a partitioning strategy when writing to deltalake, on cloud or local storage. This is achieved by passing the relevant parameters to the resource decorator / applying column hints. However, the filesystem destination does not appear to natively support more complex partitioning strategies. Example: given a resource that emits timestamps, it is not possible to achieve granular partitioning by year, month, and day.
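For reference, this is roughly how partitioning is configured today: a `partition` column hint on an existing column (a minimal sketch, assuming the delta table format on the filesystem destination; the resource and column names are illustrative):

```python
import dlt

# The partition hint can only reference a column that already exists in the
# emitted data; there is no way to request a derived year/month/day layout.
@dlt.resource(
    table_format="delta",
    columns={"created_at": {"partition": True}},  # partitions on the raw timestamp
)
def events():
    yield {"created_at": "2024-01-01T12:00:00+00:00", "value": 1}

# The filesystem destination reads bucket_url etc. from config/secrets.
pipeline = dlt.pipeline(destination="filesystem", dataset_name="events")
pipeline.run(events())
```

Partitioning on the raw timestamp effectively creates one partition per distinct value, which is why it is unusable for a granular date-based layout.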
Considering how common the above use case is, native support would be very useful. A current workaround involves creating year, month, and day columns in the resource itself and partitioning on those (sketched in the first snippet below). However, for smaller tables (as is often the case with time-series data), that incurs needless storage and compute costs.

Compaction would also be nice to have, since near real-time tables tend to receive very frequent writes, with each file being small. This quickly makes the data very hard to query when using something like polars or duckdb to read directly from the deltalake (the second snippet below shows an out-of-band workaround).
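The workaround, for concreteness (a sketch, assuming the `partition` column hint and the resource `add_map` transform; names are illustrative):

```python
import dlt
from datetime import datetime

def add_date_parts(record):
    # Materialize the partition columns: this is the redundant storage
    # and compute cost described above.
    ts = datetime.fromisoformat(record["created_at"])
    record["year"], record["month"], record["day"] = ts.year, ts.month, ts.day
    return record

@dlt.resource(
    table_format="delta",
    columns={
        "year": {"partition": True},
        "month": {"partition": True},
        "day": {"partition": True},
    },
)
def events():
    yield {"created_at": "2024-01-01T12:00:00+00:00", "value": 1}

pipeline = dlt.pipeline(destination="filesystem", dataset_name="events")
pipeline.run(events().add_map(add_date_parts))
```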
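Until compaction is built in, it can be run out-of-band with the deltalake (delta-rs) package, which I believe dlt builds its delta support on (the table path is a placeholder):

```python
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # placeholder path or bucket URL
dt.optimize.compact()             # rewrite many small files into fewer large ones
dt.vacuum(retention_hours=168, dry_run=False)  # then drop the superseded files
```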
Are you a dlt user?
None
Use case
We're trying to find exactly this: a way to partition our data without appending additional fields, as described above.
Proposed solution
I am unsure what the syntax for this would look like, but considering how common this use case is, datetime-specific partition transforms (deriving year, month, and day from a timestamp column) could probably be integrated.
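Purely as a strawman, the existing hint could carry a transform. Nothing like this exists in dlt today; `partition_transform` is an invented key:

```python
import dlt

@dlt.resource(
    table_format="delta",
    columns={
        "created_at": {
            "partition": True,
            "partition_transform": ["year", "month", "day"],  # hypothetical hint
        }
    },
)
def events():
    yield {"created_at": "2024-01-01T12:00:00+00:00", "value": 1}
```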
Related issues
No response