apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Feature] Multi-Location Management #3627

Open · zhongyujiang opened this issue 4 months ago

zhongyujiang commented 4 months ago

Search before asking

Motivation

Currently, Paimon metadata does not store the absolute paths of files but uses relative paths to construct the absolute file paths instead. This is very cool because it saves a long string of identical path prefixes. However, this also limits scenarios that require the use of different locations. For example, our warehouse is an internally hosted HDFS cluster, but for the purpose of saving resources, we would like to implement tiered storage. This means keeping only the hot data in the internal HDFS cluster and moving the cold data to public cloud object storage, which can save a lot of costs. But without support for table location management, we cannot achieve this in Paimon.
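
To make the current scheme concrete, here is a minimal illustrative sketch; the class and paths below are made up for illustration and are not Paimon's actual API. The only point is that metadata keeps a relative reference and the absolute path is derived by prefixing the single table location.

```java
import org.apache.hadoop.fs.Path;

public class RelativePathExample {
    // Today: every data file resolves against the one table location.
    static Path resolve(Path tableLocation, String relativeFile) {
        return new Path(tableLocation, relativeFile);
    }

    public static void main(String[] args) {
        Path tableLocation = new Path("hdfs://internal-nn/warehouse/db.db/orders");
        // "bucket-0/data-0.orc" is the kind of relative reference kept in metadata.
        System.out.println(resolve(tableLocation, "bucket-0/data-0.orc"));
        // -> hdfs://internal-nn/warehouse/db.db/orders/bucket-0/data-0.orc
    }
}
```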

Solution

Therefore, I suggest introducing the ability to manage relative paths in Paimon (this does not include the management of metadata paths such as snapshots and schemas, since that metadata always relies on the warehouse path), allowing table data to be stored in different locations.

Anything else?

No response

Are you willing to submit a PR?

zhongyujiang commented 4 months ago

cc @JingsongLi Hi, what do you think?

BsoBird commented 4 months ago

@zhongyujiang In fact, I don't think managing multiple locations is necessary if the goal is just to tier hot and cold data.

BsoBird commented 4 months ago

If you are using HDFS tiered storage, you can first mount the object storage as a virtual disk and then configure a storage policy for that disk in HDFS. You then just need to regularly set the HDFS directory to a different storage policy and perform a compaction.
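
For reference, a hedged sketch of that storage-policy approach using Hadoop's `FileSystem.setStoragePolicy` (available in recent Hadoop versions); the path and the `COLD` policy name are assumptions that depend on the cluster setup, and this is not Paimon code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ColdTiering {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // An aged partition directory of the table (path is illustrative).
        Path coldPartition = new Path("/warehouse/db.db/orders/dt=2024-01-01");
        // Ask HDFS to place this directory's blocks on the archival tier; the data is
        // physically moved by the HDFS mover or by a later compaction rewrite.
        fs.setStoragePolicy(coldPartition, "COLD");
    }
}
```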

BsoBird commented 4 months ago

If you don't want to do this, you can also mount the object store into HDFS and read and write to it through a specific path. But if you do this, when to migrate your data to that path becomes a problem. There also seems to be a consistency issue to consider: what if, for some other reason, the same data files appear in both paths? It's an unavoidable question.

zhongyujiang commented 4 months ago

Hi @BsoBird

Tiered storage is one of the application scenarios. Implementing multi-location management has additional benefits, such as enabling a smoother transition of data from Hadoop to cloud object storage. Of course, this also requires support for multi-location metadata management.

You then just need to regularly set the HDFS directory to a different storage policy and perform a compaction.

Are you referring to Paimon data compaction? I think that’s something we want to avoid. Compaction requires a complete rewrite of the data, and the overhead of decoding and encoding is not trivial.

When to migrate your data to that path becomes a problem. There also seems to be a consistency issue to consider: what if, for some other reason, the same data files appear in both paths? It's an unavoidable question.

I believe data administrators can perform data migration operations based on demand. Data consistency is indeed a consideration, but it can be achieved by using atomic swaps. I don't think this is an unavoidable issue.
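
To make "atomic swap" concrete, one hedged reading (all class and method names below are hypothetical placeholders, not Paimon's API): copy the immutable file to the new tier first, publish the new location in a single metadata commit, and only then delete the original, so readers always see a complete file at one of the two paths.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class RelocateOneFile {
    /** Hypothetical hook that atomically rewrites the file reference in table metadata. */
    interface MetadataCommitter {
        void swapLocation(Path from, Path to) throws IOException;
    }

    static void relocate(Path hot, Path cold, MetadataCommitter committer) throws IOException {
        Configuration conf = new Configuration();
        FileSystem hotFs = hot.getFileSystem(conf);
        FileSystem coldFs = cold.getFileSystem(conf);
        // 1. Copy the immutable data file; readers still use the hot copy at this point.
        FileUtil.copy(hotFs, hot, coldFs, cold, /* deleteSource = */ false, conf);
        // 2. Atomically switch the metadata pointer to the cold copy.
        committer.swapLocation(hot, cold);
        // 3. Only after the commit succeeds does the hot copy become garbage.
        hotFs.delete(hot, false);
    }
}
```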

BsoBird commented 4 months ago

I believe data administrators can perform data migration operations based on demand. Data consistency is indeed a consideration, but it can be achieved by using atomic swaps. I don't think this is an unavoidable issue.

So, suppose I want to move data older than three months to object storage every day/week/month/year. Are you saying that the administrator needs to constantly stop the Flink writing job, perform the data migration while writing is stopped, and then resume the Flink writes after the migration is complete? Because each relocation can only process a small amount of data, it's not an automated process. If that's what you mean, I'm sure no data manager would want to take on such a responsibility. That's why I assumed from the outset that you would use compaction for such functionality. @zhongyujiang

zhongyujiang commented 4 months ago

@BsoBird By data administrators, I mean the platform's data administrators, not the data owners. Table owners can set cold-data partition archiving options based on their needs, and the platform service will handle the archiving. Moreover, I believe data migration does not require pausing write operations.

BsoBird commented 4 months ago

@zhongyujiang So, how? Changing the location means submitting new metadata. This operation is always slow. If you don't stop streaming writes, there is a high probability that it will fail.

zhongyujiang commented 4 months ago

Although moving large amounts of data is a heavy load, the load for metadata changes is not high, so metadata commits should be relatively fast. This is similar to concurrent writes: as long as the data involved in the two commits does not conflict, the Sink job can complete the commit through retries.
We have had successful practices with Iceberg in this area. Although I do not have extensive experience with concurrent writing in Paimon, based on the design of Paimon's metadata, I believe it is feasible. @JingsongLi could you please help confirm this? Thanks.
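
A minimal sketch of the retry behavior I have in mind (the interfaces below are hypothetical placeholders, not Paimon's commit API): the writer that loses the race refreshes to the latest snapshot and re-attempts its commit as long as its files don't conflict, so the history stays linear.

```java
public class OptimisticCommitRetry {
    interface Snapshot {}
    interface CommitChanges {}

    /** Hypothetical view of a table's commit endpoint. */
    interface TableCommitter {
        Snapshot latestSnapshot();
        boolean tryCommit(Snapshot base, CommitChanges changes); // false when another commit won the race
    }

    static void commit(TableCommitter committer, CommitChanges changes, int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            Snapshot base = committer.latestSnapshot();
            if (committer.tryCommit(base, changes)) {
                return; // committed on top of the latest snapshot, history stays linear
            }
            // e.g. a relocation commit landed first: refresh and try again.
        }
        throw new IllegalStateException("Commit conflicted too many times");
    }
}
```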


BsoBird commented 4 months ago

Let's say that we currently hold a particular version of metadata, M. Client A performs a normal commit, which produces a new version M1. At the same time, Client B performs a relocation, which produces a new version M2. It looks like this:

M --> M1 : CLIENT-A 
M --> M2 : CLIENT-B 

According to you, we will end up committing both versions M1 and M2 successfully, i.e., there are three version records: M -> M1 -> M2. If that's the case, then we're actually breaking linear commits. This can lead to problems. For example, users may not be able to fetch new data as expected, or users may not be able to query the change history of the data as expected. This may even affect the cleanup of expired snapshots. As expected, M2 should be calculated from M1, but now we have violated this constraint. Personally, I don't think breaking linear commits is a good behaviour.

JingsongLi commented 4 months ago

Hi @zhongyujiang , it looks like a valid requirement. Maybe we can support external location in DataFileMeta.
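
One rough way to read "external location in DataFileMeta" (a sketch only; `DataFileEntry`, `externalPath`, and `resolve` are made-up names, not the real `DataFileMeta` class): each file entry optionally carries an absolute path that overrides the default warehouse-relative resolution.

```java
import java.util.Optional;
import org.apache.hadoop.fs.Path;

final class DataFileEntry {
    final String fileName;               // relative name kept in the manifest, as today
    final Optional<String> externalPath; // absolute URI when the file lives outside the warehouse

    DataFileEntry(String fileName, Optional<String> externalPath) {
        this.fileName = fileName;
        this.externalPath = externalPath;
    }

    Path resolve(Path tableLocation) {
        return externalPath
                .map(Path::new)
                .orElseGet(() -> new Path(tableLocation, fileName));
    }
}
```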

zhongyujiang commented 4 months ago

@BsoBird Well, that’s not what I meant. What I meant is that if Client-A and Client-B commit simultaneously, once one succeeds, the other commit should fail, but it should not cause the job to fail. It should refresh to the latest snapshot and retry the commit based on the latest snapshot.

zhongyujiang commented 4 months ago

@JingsongLi Thanks for replying. We are still in the very early stages of planning this solution, so we'd like to confirm with the community whether this approach is feasible. Once we begin implementation, we will share a detailed design document with the community for review.

JingsongLi commented 4 months ago

@JingsongLi Thanks for replying. We are still in the very early stages of planning this solution, so we'd like to confirm with the community whether this approach is feasible. Once we begin implementation, we will share a detailed design document with the community for review.

Looking forward to your next step!