xloya closed this pull request 2 months ago.
@jerryshao @coolderli Please take a look at this patch when you have time, thanks.
@xloya thanks a lot for kicking this off. I think we can have more discussion here.
My thought is that when GravitinoFileSystem is initialized, we can complete the interaction with the Gravitino Server through the Java client to obtain the real storage path corresponding to the fileset resource and check the authentication. We then create the corresponding FileSystem based on the real storage path, such as Hadoop Distributed FileSystem (hdfs://), Azure Blob FileSystem (abfs://), or Aliyun OSS FileSystem (oss://), and we use GravitinoFileSystem to proxy these FileSystems.
I'm not sure if we need to check the path for every operation in GravitinoFileSystem. For example, if the user uses a fileset path such as gtfs://fileset/metalake_1/catalog_1/schema_1/fileset_1/xxx/a.parquet to initialize the GravitinoFileSystem, should we check on every operation that the path matches the prefix gtfs://fileset/metalake_1/catalog_1/schema_1/fileset_1?
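To make the per-operation prefix check concrete, here is a minimal, self-contained sketch in plain Java (no Hadoop dependency; the class and method names are illustrative, not part of the actual patch) of extracting the fileset prefix from a gvfs-style path and validating operation paths against it:

```java
import java.net.URI;

public class FilesetPrefixCheck {
  // Hypothetical helper, not the real patch: extract the fileset prefix
  // "gvfs://fileset/<metalake>/<catalog>/<schema>/<fileset>" from a path.
  static String filesetPrefix(String path) {
    URI uri = URI.create(path);
    // uri.getPath() is "/<metalake>/<catalog>/<schema>/<fileset>/...",
    // so splitting on "/" yields an empty first element.
    String[] parts = uri.getPath().split("/");
    if (parts.length < 5) {
      throw new IllegalArgumentException("Not a fileset path: " + path);
    }
    return uri.getScheme() + "://" + uri.getHost() + "/"
        + parts[1] + "/" + parts[2] + "/" + parts[3] + "/" + parts[4];
  }

  // Per-operation check: does an operation path stay inside the fileset
  // that the file system was initialized with?
  static boolean belongsTo(String initPrefix, String opPath) {
    return filesetPrefix(opPath).equals(initPrefix);
  }

  public static void main(String[] args) {
    String prefix = filesetPrefix(
        "gvfs://fileset/metalake_1/catalog_1/schema_1/fileset_1/xxx/a.parquet");
    System.out.println(prefix); // gvfs://fileset/metalake_1/catalog_1/schema_1/fileset_1
    System.out.println(belongsTo(prefix,
        "gvfs://fileset/metalake_1/catalog_1/schema_1/fileset_1/b.parquet")); // true
    System.out.println(belongsTo(prefix,
        "gvfs://fileset/metalake_1/catalog_1/schema_1/fileset_2/b.parquet")); // false
  }
}
```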
I'm not so sure, maybe you can give it a try. From my understanding, each FS implementation can only know its own scheme and accepts only legal schemes. For example, if we use Gravitino FS to handle an HDFS Path, maybe there's an issue.
I added the path checking and disabled the FS cache to ensure that on the client side, each fileset uses a separate FS instance for reading and writing.
Maybe we should have a document to tell users how to use this.
Now, we bind a fileset-level proxy file system as the default file system. If we initialize with one fileset but need to access another fileset, will it cause some issues?
We should not allow one FS to access multiple different fileset resources, because users need to be authenticated before accessing the resources. Since we have turned off the cache, the user can obtain a new FileSystem instance to access the corresponding fileset resources.
I mean that users may use file systems as follows.
fileSystem.initialize("gvfs://fileset/metalake/schema/fileset1", conf);
fileSystem.open("gvfs://fileset/metalake/schema/fileset2");
Currently an exception will be thrown, because these belong to two different fileset resources.
Users can use it like this:
Path path1 = new Path("gvfs://fileset/metalake/catalog/schema/fileset_1");
FileSystem fs1 = path1.getFileSystem(conf);
fs1.open(path1);
Path path2 = new Path("gvfs://fileset/metalake/catalog/schema/fileset_2");
FileSystem fs2 = path2.getFileSystem(conf);
fs2.open(path2);
The cache usage may differ from other file systems. It won't be convenient for users.
@qqqttt123 I'm not sure if this scenario exists in the community. Internally, not all users may have permissions for a certain fileset. We need to ensure that users cannot gain unauthorized access to resources simply by obtaining an FS. Do you have any better suggestions for this problem?
We should authorize the file set request in the Gravitino server.
In a distributed computing environment such as Spark, this may significantly increase the authentication QPS against the Gravitino server, while also maintaining multiple fileset-to-proxy-filesystem mappings, which I am not sure is reasonable.
We can have a default proxy file system, like the Hadoop FileSystem.
Maybe my understanding is wrong; I think you can describe the solution in detail, because I actually do not understand why a default file system would solve the authentication problems mentioned above.
Simply, we can cache a file system when we create one. One Java client has one user, so we can cache some filesets for this user; the user has the permission for those filesets.
I might need to check a scenario internally and see if that's a problem. One internal scenario provides a public proxy layer, which helps users read HDFS data and return it. In this case, the proxy layer is equivalent to one Java client but corresponds to multiple users. If cached as the default FS, this may cause reading problems for some users.
@xloya Do we support proxying to different FSes for gvfs? For example, if we have fileset1 on HDFS and fileset2 on S3, can we use one gvfs to support both HDFS and S3 at once?
Yes, this is supported. After we get the configuration of the user token in the initialize method, as long as the fileset accessed by the user can obtain metadata through the current token, the user is allowed to operate the StorageLocation corresponding to the fileset. We maintain a Map to store accessible filesets and their corresponding FS. In the example you gave, if the current user token has access rights to fileset1 and fileset2, the Map will store two records: <fileset1's identifier, <Fileset, HDFS FileSystem>> and <fileset2's identifier, <Fileset, S3 FileSystem>>.
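A rough sketch of the mapping described above, using stand-in types instead of the real Gravitino Fileset and Hadoop FileSystem classes (the class and method names here are hypothetical):

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class GvfsCacheSketch {
  // Stand-in for Gravitino's fileset metadata; the real entry would also
  // hold the proxied Hadoop FileSystem instance for this fileset.
  record Fileset(String identifier, String storageLocation) {}

  // Map from fileset identifier to its metadata / backing file system.
  private final Map<String, Fileset> cache = new HashMap<>();

  // In the real client this metadata would come from the Gravitino server
  // via the user's token; here we register it directly for illustration.
  void register(String identifier, String storageLocation) {
    cache.put(identifier, new Fileset(identifier, storageLocation));
  }

  // The backing FileSystem type is chosen from the storage location's scheme.
  String backingScheme(String identifier) {
    Fileset fileset = cache.get(identifier);
    if (fileset == null) {
      throw new IllegalStateException("No accessible fileset: " + identifier);
    }
    return URI.create(fileset.storageLocation()).getScheme();
  }

  public static void main(String[] args) {
    GvfsCacheSketch gvfs = new GvfsCacheSketch();
    gvfs.register("metalake.catalog.schema.fileset1", "hdfs://namenode:8020/data/fileset1");
    gvfs.register("metalake.catalog.schema.fileset2", "s3a://bucket/fileset2");
    System.out.println(gvfs.backingScheme("metalake.catalog.schema.fileset1")); // hdfs
    System.out.println(gvfs.backingScheme("metalake.catalog.schema.fileset2")); // s3a
  }
}
```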
@xloya Is it ready for review?
@jerryshao Yeah, I have finished the unit tests. Please take a look when you have time. Thanks.
I decided to give the user the choice of whether to set fs.gvfs.impl.disable.cache to true. If there is no multi-token scenario, the user can reuse the same FS; and if there is a multi-token scenario, we can recommend the user set fs.gvfs.impl.disable.cache=true in the configuration to initialize a different FS for each token.
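In a Hadoop configuration file, that recommendation would look roughly like this (a sketch of a core-site.xml fragment; only the property name fs.gvfs.impl.disable.cache comes from this discussion):

```xml
<!-- Disable the FileSystem cache for the gvfs scheme so each token
     gets its own GravitinoFileSystem instance. -->
<property>
  <name>fs.gvfs.impl.disable.cache</name>
  <value>true</value>
</property>
```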
@jerryshao @yuqi1129 Do we need javadoc for this module? I suggest that we should add it. WDYT?
It seems that it will be directly accessible to users; I believe we need to add javadoc for it.
We also need to add the javadoc check. Could you give some guidance?
Please see: https://github.com/datastrato/gravitino/blob/5ceb2b82571de968737ac72475f9c661ef2cf9b6/build.gradle.kts#L368
@jerryshao @qqqttt123 @yuqi1129 I have fixed the comments, please take a look again, thanks!
Could you add a document to explain how to use this client?
LGTM. Let @yuqi1129 confirm the javadoc part and the Maven publish.
Can you confirm if the Java doc works locally?
Yes, see:
OK, LGTM
I think the follow-up task is to add an e2e IT test for gvfs.
What changes were proposed in this pull request?
This PR proposes to add the code skeleton for the Gravitino Hadoop file system to proxy the HDFS file system.
Why are the changes needed?
Fix: #1616
How was this patch tested?
Added UTs to cover the main interface methods.