apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0

[FEATURE] Add JuiceFS support in Fileset #4359

Open theoryxu opened 1 month ago

theoryxu commented 1 month ago

Describe the feature

Fileset is a concept introduced in 0.5.0 to manage non-tabular data; the current implementation uses HCFS (Hadoop Compatible File System) to manage the physical data. At the moment, this HCFS-based implementation doesn't support JuiceFS.

In this issue, we should discuss how to support JuiceFS in Fileset and how to implement that support.

Motivation

JuiceFS is a high-performance, cloud-native distributed file system that is developing rapidly. Supporting it could help Gravitino be used in more scenarios in the future.

Describe the solution

No response

Additional context

No response

xloya commented 1 month ago

As far as I know, the JuiceFS community edition provides a Hadoop Java SDK; please refer to: https://juicefs.com/docs/zh/community/hadoop_java_sdk. So I think JuiceFS can be supported in Fileset through the Hadoop SDK, the same way S3 is.
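
For reference, here is a minimal sketch of what the Hadoop SDK route could look like on the configuration side. The property names `fs.jfs.impl`, `fs.AbstractFileSystem.jfs.impl`, and `juicefs.meta`, and the two class names, are taken from the JuiceFS Hadoop Java SDK documentation as I understand it. The Redis address and the helper `render_core_site` are placeholders for illustration. None of this is Gravitino's actual implementation.

```python
import xml.etree.ElementTree as ET

# Assumed JuiceFS Hadoop SDK settings; the metadata URL is a placeholder.
JUICEFS_HADOOP_PROPS = {
    "fs.jfs.impl": "io.juicefs.JuiceFileSystem",
    "fs.AbstractFileSystem.jfs.impl": "io.juicefs.JuiceFS",
    "juicefs.meta": "redis://127.0.0.1:6379/1",
}


def render_core_site(props):
    """Render a dict of Hadoop properties as core-site.xml text (hypothetical helper)."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_core_site(JUICEFS_HADOOP_PROPS))
```

With properties like these in the Hadoop configuration and the SDK jar on the classpath, a `jfs://` storage location should resolve through the same Hadoop `FileSystem` API that an `s3a://` location uses, which is why no JuiceFS-specific code path may be needed in Fileset itself.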

Suave commented 1 month ago

As @xloya mentions, JuiceFS is compatible with the HDFS API via its Java SDK and also supports the S3 API (ref). But I highly recommend that Gravitino support POSIX for all generic file systems, including JuiceFS, Lustre, CephFS, and more.
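
To illustrate the POSIX suggestion: once a file system such as JuiceFS, Lustre, or CephFS is mounted, it looks like an ordinary directory, so plain file APIs are enough and no per-file-system SDK is involved. A minimal sketch, assuming a hypothetical mount point (a temporary directory stands in for something like `/mnt/jfs`) and a hypothetical helper `list_fileset`. It is not Gravitino code.

```python
import tempfile
from pathlib import Path


def list_fileset(mount_point, fileset_dir):
    """List files under a fileset directory on a POSIX mount (hypothetical helper).

    Returns paths relative to the fileset directory, sorted for stable output.
    """
    base = Path(mount_point) / fileset_dir
    return sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    # A temporary directory stands in for a real mount such as /mnt/jfs.
    with tempfile.TemporaryDirectory() as mount:
        data = Path(mount) / "my_fileset" / "2024"
        data.mkdir(parents=True)
        (data / "part-0.bin").write_bytes(b"\x00")
        (Path(mount) / "my_fileset" / "README.txt").write_text("demo")
        print(list_fileset(mount, "my_fileset"))
```

The same function would work unchanged against any mounted file system, which is the appeal of the POSIX route. The trade-off is that it depends on a mount being present on every node that accesses the data.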

2005hithlj commented 1 month ago

@Suave Is the Hadoop SDK a better choice for big data scenarios?

shaofengshi commented 1 month ago

@theoryxu thanks for creating this issue. What I'm curious about is: what are the pain points or challenges when using JuiceFS that Gravitino Fileset could overcome or solve? If you can share some of them, it will help others understand this feature. Thank you!

Suave commented 1 month ago

> @Suave Is Hadoop SDK a better choice for big data scenarios?

Yes, the JuiceFS Java SDK works better in the Hadoop ecosystem; it's compatible with both Hadoop 2.x and 3.x.