datastrato / gravitino

World's most powerful data catalog service, providing a high-performance, geo-distributed, and federated metadata lake.
https://datastrato.ai/docs/
Apache License 2.0

[#1616] feat(fileset): Add gravitino hadoop file system support #1700

Closed: xloya closed this 2 months ago

xloya commented 4 months ago

What changes were proposed in this pull request?

This PR proposes to add the code skeleton for a Gravitino Hadoop file system that proxies the HDFS file system.

Why are the changes needed?

Fix: #1616

How was this patch tested?

Added unit tests to cover the main interface methods.

xloya commented 4 months ago

@jerryshao @coolderli Please take a look at this patch when you have time, thanks.

jerryshao commented 4 months ago

@xloya thanks a lot for kicking this off. I think we can have more discussion here.

xloya commented 4 months ago

My thought is that when GravitinoFileSystem is initialized, we can complete the interaction with the Gravitino Server through the Java client to obtain the real storage path corresponding to the fileset resource and check authentication. We then create the corresponding FileSystem based on the real storage path, such as the Hadoop Distributed FileSystem (hdfs://), Azure Blob FileSystem (abfs://), or Aliyun OSS FileSystem (oss://), and use GravitinoFileSystem to proxy these FileSystems.
I'm not sure whether we need to check the path for every operation in GravitinoFileSystem. For example, if the user uses a fileset path such as gtfs://fileset/metalake_1/catalog_1/schema_1/fileset_1/xxx/a.parquet to initialize the GravitinoFileSystem, should we check on every operation that the path matches the prefix gtfs://fileset/metalake_1/catalog_1/schema_1/fileset_1?
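
A minimal sketch of this proxy idea, assuming hypothetical resolveActualPath/toActual helpers that stand in for the Gravitino client exchange and the gvfs-to-storage path translation; the PR's actual class may differ:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.util.Progressable;

public class GravitinoFileSystemSketch extends FileSystem {
  private URI uri;            // the gvfs URI this instance was initialized with
  private Path actualPrefix;  // real storage location of the fileset
  private FileSystem proxied; // concrete FS for that location (HDFS, ABFS, OSS, ...)

  @Override
  public void initialize(URI name, Configuration conf) throws IOException {
    super.initialize(name, conf);
    this.uri = name;
    // Hypothetical: ask the Gravitino server via the Java client for the
    // fileset's real storage location; authentication happens in this exchange.
    this.actualPrefix = resolveActualPath(name, conf);
    this.proxied = actualPrefix.getFileSystem(conf);
  }

  // Hypothetical placeholder helpers for the client call and path translation.
  private Path resolveActualPath(URI name, Configuration conf) { return new Path(name); }
  private Path toActual(Path f) { return f; }

  // Every FileSystem operation delegates to the proxied FS on the translated path.
  @Override public URI getUri() { return uri; }
  @Override public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    return proxied.open(toActual(f), bufferSize);
  }
  @Override public FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite,
      int bufferSize, short replication, long blockSize, Progressable progress) throws IOException {
    return proxied.create(toActual(f), permission, overwrite, bufferSize, replication, blockSize, progress);
  }
  @Override public FSDataOutputStream append(Path f, int bufferSize, Progressable progress)
      throws IOException { return proxied.append(toActual(f), bufferSize, progress); }
  @Override public boolean rename(Path src, Path dst) throws IOException {
    return proxied.rename(toActual(src), toActual(dst));
  }
  @Override public boolean delete(Path f, boolean recursive) throws IOException {
    return proxied.delete(toActual(f), recursive);
  }
  @Override public FileStatus[] listStatus(Path f) throws IOException {
    return proxied.listStatus(toActual(f));
  }
  @Override public void setWorkingDirectory(Path dir) { proxied.setWorkingDirectory(toActual(dir)); }
  @Override public Path getWorkingDirectory() { return proxied.getWorkingDirectory(); }
  @Override public boolean mkdirs(Path f, FsPermission permission) throws IOException {
    return proxied.mkdirs(toActual(f), permission);
  }
  @Override public FileStatus getFileStatus(Path f) throws IOException {
    return proxied.getFileStatus(toActual(f));
  }
}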

jerryshao commented 4 months ago

I'm not so sure; maybe you can give it a try. From my understanding, each FS implementation only knows its own scheme and accepts only legal schemes. For example, if we use the Gravitino FS to handle an HDFS path, there may be an issue.
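
For background, Hadoop routes a Path to an FS implementation by its URI scheme (the fs.<scheme>.impl config key), so each scheme is served by exactly one class. A sketch of that dispatch; the gvfs implementation class name below is an assumption:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SchemeDispatchDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical registration of the gvfs scheme; the class name is assumed.
    conf.set("fs.gvfs.impl",
        "com.datastrato.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
    // Each lookup is routed by scheme: the gvfs path goes to the Gravitino FS,
    // while an hdfs:// path resolves to the HDFS implementation separately.
    FileSystem gvfs = FileSystem.get(
        URI.create("gvfs://fileset/metalake/catalog/schema/fileset_1"), conf);
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/tmp"), conf);
  }
}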

xloya commented 4 months ago

I added the path check and disabled the FS cache to ensure that, on the client side, each fileset uses a separate FS for reading and writing.
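
A minimal sketch of that per-operation check, assuming a hypothetical filesetPrefix captured during initialize(); the PR's actual validation logic may differ:

import org.apache.hadoop.fs.Path;

public class FilesetPathChecker {
  // The fileset prefix this FS instance was initialized with, e.g.
  // gvfs://fileset/metalake_1/catalog_1/schema_1/fileset_1 (hypothetical field).
  private final Path filesetPrefix;

  public FilesetPathChecker(Path filesetPrefix) {
    this.filesetPrefix = filesetPrefix;
  }

  // Called at the start of every FileSystem operation: reject any path that
  // does not belong to the fileset bound to this instance.
  public void check(Path path) {
    if (!path.toString().startsWith(filesetPrefix.toString())) {
      throw new IllegalArgumentException(
          "Path " + path + " does not belong to the fileset " + filesetPrefix);
    }
  }
}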

qqqttt123 commented 3 months ago

Maybe we should have a document to tell users how to use this.

qqqttt123 commented 3 months ago

Now we bind a fileset-level proxy file system as the default file system. If we initialize with one fileset but need to access another fileset, will it cause any issues?

xloya commented 3 months ago

We should not allow one FS to access multiple different fileset resources, because users need to be authenticated before accessing a resource. Since we have turned off the cache, the user can obtain a new FileSystem instance to access the corresponding fileset resources.

qqqttt123 commented 3 months ago

I mean that users may use file systems as follows.

fileSystem.initialize(URI.create("gvfs://fileset/metalake/schema/fileset1"), conf);
fileSystem.open(new Path("gvfs://fileset/metalake/schema/fileset2"));

xloya commented 3 months ago

Currently, an exception will be thrown, because these paths belong to two different fileset resources.

xloya commented 3 months ago

Users can use it like this:

Path path1 = new Path("gvfs://fileset/metalake/catalog/schema/fileset_1");
FileSystem fs1 = path1.getFileSystem(conf);
fs1.open(path1);

Path path2 = new Path("gvfs://fileset/metalake/catalog/schema/fileset_2");
FileSystem fs2 = path2.getFileSystem(conf);
fs2.open(path2);

qqqttt123 commented 3 months ago

This usage differs from other file systems; it won't be convenient for users.

xloya commented 3 months ago

@qqqttt123 I'm not sure whether this scenario exists in the community. Internally, not all users have permissions for a given fileset. We need to ensure that users cannot gain unauthorized access to resources simply by obtaining an FS. Do you have any better suggestions for this problem?

qqqttt123 commented 3 months ago

We should authorize the fileset request in the Gravitino server.

xloya commented 3 months ago

In a distributed computing environment such as Spark, this may significantly increase the authentication QPS against the Gravitino server while maintaining multiple fileset-to-proxy-filesystem mappings, and I am not sure that is reasonable.

qqqttt123 commented 3 months ago

We can have a default proxy file system, like the Hadoop FileSystem.

xloya commented 3 months ago

Maybe my understanding is wrong; could you describe the solution in detail? I don't actually understand how a default file system would solve the authentication problems mentioned above.

qqqttt123 commented 3 months ago

Simply, we can cache a file system once we create it. One Java client has one user, so we can cache the filesets this user has permission for.

xloya commented 3 months ago

I might need to check an internal scenario to see if that's a problem. One internal scenario provides a public proxy layer that reads HDFS data on behalf of users and returns it. In this case, the proxy layer is a single Java client but corresponds to multiple users. If we cache a default FS, this may cause read problems for some users.

jerryshao commented 3 months ago

@xloya Do we support proxying to different FSs with gvfs? For example, if we have fileset1 on HDFS and fileset2 on S3, can we use one gvfs to support both HDFS and S3 at once?

xloya commented 3 months ago

Yes, this is supported. After we get the user token configuration in the initialize method, as long as the fileset accessed by the user can obtain metadata through the current token, the user is allowed to operate on the StorageLocation corresponding to the fileset. We maintain a Map that stores the accessible filesets and their corresponding FSs. In the example you gave, if the current user token has access rights to fileset1 and fileset2, the Map will store two records: <fileset1's identifier, <Fileset, HDFS FileSystem>> and <fileset2's identifier, <Fileset, S3 FileSystem>>.
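
A sketch of that mapping with assumed field and type names; Fileset below is a stand-in for Gravitino's fileset metadata object, and Map.Entry stands in for whatever tuple type the PR uses:

import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.Map.Entry;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.fs.FileSystem;

public class FilesetFsCacheSketch {
  // Hypothetical stand-in for Gravitino's fileset metadata type.
  static class Fileset { String storageLocation; }

  // Keyed by the fileset identifier (e.g. "metalake.catalog.schema.fileset_1");
  // each value pairs the fileset metadata with the concrete FS created for its
  // storage location (an HDFS FS for fileset1, an S3 FS for fileset2, ...).
  private final Map<String, Entry<Fileset, FileSystem>> filesetFsMap = new ConcurrentHashMap<>();

  void register(String identifier, Fileset fileset, FileSystem storageFs) {
    filesetFsMap.put(identifier, new SimpleImmutableEntry<>(fileset, storageFs));
  }

  FileSystem fsFor(String identifier) {
    Entry<Fileset, FileSystem> entry = filesetFsMap.get(identifier);
    return entry == null ? null : entry.getValue();
  }
}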

jerryshao commented 2 months ago

@xloya Is it ready for review?

xloya commented 2 months ago

@jerryshao Yeah, I have finished the unit tests. Please take a look when you have time. Thanks.

xloya commented 2 months ago

I decided to give the user the choice of whether to set fs.gvfs.impl.disable.cache to true. If there is no multi-token scenario, the user can reuse the same FS; if there is a multi-token scenario, we can recommend that the user set fs.gvfs.impl.disable.cache=true in the configuration to initialize a different FS for each token.
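
A usage sketch of that toggle, which follows Hadoop's standard fs.<scheme>.impl.disable.cache convention; the paths are illustrative:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class GvfsCacheToggleDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Multi-token scenario: disable the gvfs cache so every lookup returns a
    // fresh FileSystem instance bound to its own token.
    conf.setBoolean("fs.gvfs.impl.disable.cache", true);
    FileSystem fs1 = FileSystem.get(
        URI.create("gvfs://fileset/metalake/catalog/schema/fileset_1"), conf);
    FileSystem fs2 = FileSystem.get(
        URI.create("gvfs://fileset/metalake/catalog/schema/fileset_2"), conf);
    // With the cache disabled, fs1 and fs2 are distinct instances.
  }
}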

qqqttt123 commented 2 months ago

@jerryshao @yuqi1129 Do we need Javadoc for this module? I suggest that we add it. WDYT?

yuqi1129 commented 2 months ago

It seems that it will be directly accessible to users; I believe we need to add Javadoc for it.

qqqttt123 commented 2 months ago

We also need to add a Javadoc check. Could you give some guidance?

yuqi1129 commented 2 months ago

Please see: https://github.com/datastrato/gravitino/blob/5ceb2b82571de968737ac72475f9c661ef2cf9b6/build.gradle.kts#L368

xloya commented 2 months ago

@jerryshao @qqqttt123 @yuqi1129 I have addressed the comments; please take another look, thanks!

qqqttt123 commented 2 months ago

Could you add a document to explain how to use this client?

xloya commented 2 months ago

Added another issue for this (#2640); will do it later.

qqqttt123 commented 2 months ago

LGTM. Let @yuqi1129 confirm the Javadoc part and the Maven publish.

yuqi1129 commented 2 months ago

Can you confirm that the Javadoc works locally?

[screenshots attached]

xloya commented 2 months ago

Yes, see the attached screenshots.

yuqi1129 commented 2 months ago

OK, LGTM

jerryshao commented 2 months ago

I think the follow-up is to add e2e IT tests for gvfs.