apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Feature] Multi-Location Management #3627

Open zhongyujiang opened 4 days ago

zhongyujiang commented 4 days ago

Motivation

Currently, Paimon metadata does not store the absolute paths of files; it stores relative paths and constructs the absolute file paths from them. This is nice because it avoids repeating a long, identical path prefix for every file. However, it also rules out scenarios that need more than one location. For example, our warehouse is an internally hosted HDFS cluster, and to save resources we would like to implement tiered storage: keep only the hot data in the internal HDFS cluster and move the cold data to public cloud object storage, which would reduce costs considerably. Without support for table location management, we cannot achieve this in Paimon.
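To make the limitation concrete, here is a minimal sketch of relative-path resolution. It is not Paimon's actual code: the `resolve` helper, the table root, and the file name are all hypothetical and only illustrate that a single root prefix decides where every data file lives.

```python
import posixpath


def resolve(table_root: str, relative_path: str) -> str:
    """Join a single table root with a relative path taken from metadata."""
    return posixpath.join(table_root.rstrip("/"), relative_path.lstrip("/"))


# Hypothetical values, for illustration only.
table_root = "hdfs://internal-cluster/warehouse/db.db/orders"
relative_path = "bucket-0/data-0001.parquet"

# Every file resolves under the same root, so a cold file cannot be
# pointed at object storage without changing the root for all files.
print(resolve(table_root, relative_path))
```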

Solution

Therefore, I suggest introducing the ability to manage relative paths in Paimon, allowing table data to be stored in different locations. This does not include the management of metadata paths such as snapshots and schemas, as this metadata always relies on the warehouse path.
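One possible shape for this, sketched under assumptions: each data file's relative path is paired with a named location, while metadata paths keep resolving against the warehouse path. The `TableLocations` class, the location names `hot` and `cold`, and all URIs below are hypothetical and are not part of any existing Paimon API.

```python
import posixpath
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TableLocations:
    # Metadata (snapshots, schemas) always resolves against this path.
    warehouse_path: str
    # Hypothetical mapping: location name -> root prefix for data files.
    data_locations: Dict[str, str] = field(default_factory=dict)

    def metadata_path(self, relative: str) -> str:
        return posixpath.join(self.warehouse_path, relative)

    def data_path(self, location: str, relative: str) -> str:
        # Raises KeyError if the location name was never registered.
        return posixpath.join(self.data_locations[location], relative)


# Hypothetical tiered-storage setup: hot data on HDFS, cold data on object storage.
locations = TableLocations(
    warehouse_path="hdfs://internal-cluster/warehouse/db.db/orders",
    data_locations={
        "hot": "hdfs://internal-cluster/warehouse/db.db/orders",
        "cold": "s3://archive-bucket/db.db/orders",
    },
)

print(locations.data_path("cold", "bucket-0/data-0001.parquet"))
print(locations.metadata_path("snapshot/snapshot-1"))
```

With a mapping like this, metadata would still store only a short location name plus a relative path, so the saving from not repeating long prefixes is kept.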

Anything else?

No response

zhongyujiang commented 4 days ago

cc @JingsongLi Hi, what do you think?