apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] How to configure Hudi table TTL in S3? Can the table metadata be separated into its own directory? #10316

Open zyclove opened 8 months ago

zyclove commented 8 months ago

How can I configure a TTL policy for a Hudi data table? Can the metadata (.hoodie) be separated into its own directory? Then I would only need to configure the appropriate TTL for the data directory, so that data cleaning can also use hierarchical storage and different life cycles, and the data can be cleaned automatically by the object storage service at no cost.

EG:

s3://big-data-eu/hudi/data/bi_ods/table_name/dt=20231211/data

< 30 days: STANDARD S3
> 30 days: delete by TTL with no cost.

< 15 days: STANDARD S3
> 15 days: GLACIER_IR
> 105 days: delete by TTL with no cost.

s3://big-data-eu/hudi/table_meta/bi_ods/table_name/.hoodie

As mentioned above, if there are many tables under the data storage path and they share the same retention period, I can just configure the retention period on the directory and rely on the object storage service to clean up historical data automatically at no cost. EG:

s3://big-data-eu/hudi/data/30days/bi_ods/table_name/dt=20231211/data

< 30 days: STANDARD S3
> 30 days: delete by TTL with no cost.

s3://big-data-eu/hudi/data/90days/bi_ods/table_name/dt=20231211/data

< 30 days: STANDARD S3
30 days to 90 days: GLACIER_IR
> 90 days: delete by TTL with no cost.

....

< 15 days: STANDARD S3
> 15 days: GLACIER_IR
> 105 days: delete by TTL with no cost.

s3://big-data-eu/hudi/table_meta/bi_ods/table_name/.hoodie
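For readers following along, the per-prefix policy described in this comment could be sketched as an S3 lifecycle configuration roughly like the one below. This is only an illustration, not something Hudi provides: the helper name `lifecycle_rule` and the rule IDs are made up here, and the prefixes are taken from the example paths. Note that S3 lifecycle rules act on object age rather than on the `dt=` value in the path, and Hudi itself would not know that the files were removed.

```python
def lifecycle_rule(prefix, expire_days, transition_days=None, storage_class="GLACIER_IR"):
    """Build one S3 lifecycle rule dict for a key prefix (hypothetical helper)."""
    rule = {
        # Rule ID derived from the prefix; the naming scheme is an assumption.
        "ID": f"ttl-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Objects are deleted once they are older than expire_days.
        "Expiration": {"Days": expire_days},
    }
    if transition_days is not None:
        # Move objects to a colder storage class before they expire.
        rule["Transitions"] = [{"Days": transition_days, "StorageClass": storage_class}]
    return rule


# Prefixes follow the layout proposed in the comment above.
rules = [
    lifecycle_rule("hudi/data/30days/", expire_days=30),
    lifecycle_rule("hudi/data/90days/", expire_days=90, transition_days=30),
]
lifecycle_configuration = {"Rules": rules}

# With boto3 installed and credentials configured, this could be applied with:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="big-data-eu", LifecycleConfiguration=lifecycle_configuration)
```

No rule covers `hudi/table_meta/`, so under this layout the `.hoodie` metadata would never expire, which is the separation the question asks about.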

ad1happy2go commented 8 months ago

@zyclove I don't think there is a way to point to a different directory outside the table directory, or any such TTL configuration.

zyclove commented 8 months ago

> @zyclove I don't think there is a way to point to a different directory outside the table directory, or any such TTL configuration.

@ad1happy2go @yihua Why can't we consider storing metadata and data files independently? The data TTL could then be more flexible and convenient. Could this be raised in a future planning meeting? Thanks

ad1happy2go commented 8 months ago

Maybe a good idea, but I guess we may have already explored it. Adding @nsivabalan @yihua @danny0405 @codope

stream2000 commented 7 months ago

> so that data cleaning can also use hierarchical storage and different life cycles

Can you elaborate on this in more detail? I am working on partition TTL management and hope to understand your need!

It seems like you just want to configure different TTLs for the data and metadata directories? Is there any requirement to configure different TTLs for different data partitions? For example, a 30-day TTL for product_id=1/dt=2023xxx/ and a 15-day TTL for product_id=2/dt=2023xxx/ ?
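To make the per-partition case concrete, here is a minimal sketch of how expired partitions might be picked when different partition prefixes carry different TTLs. It is not Hudi's partition TTL implementation: the function names, the `ttl_days_by_prefix` mapping and the longest-prefix-wins rule are all assumptions made for illustration, and it assumes the partition path contains a `dt=YYYYMMDD` segment as in the examples above.

```python
from datetime import date, datetime, timedelta


def partition_date(partition_path):
    """Parse the dt=YYYYMMDD segment of a path like 'product_id=1/dt=20231211'."""
    for part in partition_path.strip("/").split("/"):
        if part.startswith("dt="):
            return datetime.strptime(part[3:], "%Y%m%d").date()
    raise ValueError(f"no dt= segment in {partition_path!r}")


def expired_partitions(partitions, ttl_days_by_prefix, default_ttl_days, today):
    """Return the partitions whose dt value is older than their TTL."""
    expired = []
    for path in partitions:
        # The longest matching prefix decides the TTL; otherwise use the default.
        matches = [p for p in ttl_days_by_prefix if path.startswith(p)]
        ttl = ttl_days_by_prefix[max(matches, key=len)] if matches else default_ttl_days
        if today - partition_date(path) > timedelta(days=ttl):
            expired.append(path)
    return expired


# Hypothetical rules matching the example in the comment above.
ttl_rules = {"product_id=1/": 30, "product_id=2/": 15}
parts = [
    "product_id=1/dt=20231211",
    "product_id=2/dt=20231211",
    "product_id=1/dt=20231201",
]
to_drop = expired_partitions(parts, ttl_rules, default_ttl_days=90, today=date(2024, 1, 5))
```

Unlike an object-store lifecycle rule, this decides expiry from the partition value rather than from file age, so late-arriving writes to an old partition do not reset its clock.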