Open coolderli opened 3 months ago
@jerryshao @shaofengshi @xunliu Can you share your thoughts? Thanks. cc @xloya @YxAc @zhoukangcn
Thanks @coolderli for bringing this up. I think the HCFS API is not POSIX-compliant, so using the HCFS API with FUSE has many limitations. I don't know how well fsspec supports POSIX; we need to investigate.
My concern is the performance of implementing FUSE in Python. FUSE requires many context switches between user space and kernel space, which hurts performance considerably, and a dynamic language will make it worse.
Currently I don't have a better solution; we should investigate further.
Hi Peidian, I don't have much knowledge about FUSE. For solution 2, is it only available for Python? That is, does fsspec mount remote storage as a local path only within a Python application, so that the user can read and write it from Python code?
@shaofengshi Not exactly. Users need to run a piece of Python code to perform a mount operation first, and then they can use other applications to access it.
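A minimal sketch of that mount step, assuming the `fsspec.fuse.run` helper from fsspec (which needs the `fusepy` package and a working FUSE install). The function names `mount_filesystem` and `demo`, the in-memory filesystem, and the mount point `/tmp/fsspec_demo` are illustrative assumptions, not part of gvfs:

```python
import os


def mount_filesystem(fs, remote_path, mount_point):
    """Expose `remote_path` of an fsspec filesystem at `mount_point` via FUSE.

    The call blocks while the mount is active (foreground=True).
    """
    # Imported lazily because fsspec.fuse depends on fusepy and libfuse.
    from fsspec.fuse import run

    run(fs, remote_path, mount_point, foreground=True)


def demo():
    """Illustrative only: mount an in-memory filesystem (stand-in for gvfs)."""
    import fsspec

    fs = fsspec.filesystem("memory")
    fs.pipe("/demo/hello.txt", b"hello from fsspec\n")

    mount_point = "/tmp/fsspec_demo"  # hypothetical mount point
    os.makedirs(mount_point, exist_ok=True)
    mount_filesystem(fs, "/demo/", mount_point)
```

While the Python process running `demo()` stays alive, any other program (for example `cat /tmp/fsspec_demo/hello.txt` from a shell) should be able to read the mounted files through the normal file API.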
Got it, thanks for the input, Peidian. I'm not sure how compatible and stable it is, for example across operating systems and Python versions. If it is not very good, we may not be able to persuade a large group of users to adopt it; this is my concern.
Hi @coolderli, I heard that you encountered some issues with fsspec. I'd like to understand the details. Could you list the specific problems?
The first issue comes from fusepy: a TypeError raised because a datetime can't be converted to an int. I have fixed it.
For now, fsspec fuse works. I have submitted a draft PR for fsspec fuse; you can take a look and try it out. It still needs more tests. https://github.com/apache/gravitino/pull/4634
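The thread does not show how the datetime TypeError was fixed. One possible approach, sketched here under assumptions, is to normalize timestamp fields to integer epoch seconds before they reach the FUSE layer. The helper names and the `mtime`/`created` keys are hypothetical, and treating naive datetimes as UTC is an assumption:

```python
from datetime import datetime, timezone


def to_epoch_seconds(value):
    """Return `value` as integer seconds since the Unix epoch."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumption: naive datetimes are interpreted as UTC.
            value = value.replace(tzinfo=timezone.utc)
        return int(value.timestamp())
    # Plain ints and floats are truncated to whole seconds.
    return int(value)


def normalize_times(info, keys=("mtime", "created")):
    """Return a copy of an info dict with the given time fields as ints.

    The key names are illustrative; other fields are left untouched.
    """
    result = dict(info)
    for key in keys:
        if result.get(key) is not None:
            result[key] = to_epoch_seconds(result[key])
    return result
```

Normalizing in one place keeps the conversion out of each individual FUSE operation, at the cost of dropping sub-second precision.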
@diqiu50 I'm not sure whether fsspec fuse is a good approach. In most cases we need to use it in a container environment (k8s), so we need to implement not only fuse but also a k8s CSI driver. https://juicefs.com/docs/zh/csi/introduction/
On the other hand, using CSI in a cloud environment may not be feasible because it requires some configuration in the k8s cluster, although this is not a problem in our private cluster. Using fsspec fuse can help us avoid this situation, as we do not need to make any modifications to the existing k8s cluster. You can take a look at this article: https://mp.weixin.qq.com/s/j6AlSqKxKInAKeBfADJdOA. The Juice team has also raised similar concerns about CSI.
@diqiu50 I have tested fsspec fuse again. I mounted an HDFS directory and found that writing fails, while listing and reading succeed.
>>> with open('/tmp/fileset/fileset_xx/tmp/test_hdfs_test/test_file.txt', 'w') as file:
...     file.write(content)  # write the content
...
DEBUG:fsspec.fuse:getattr /test_file.txt
DEBUG:fuse:FUSE operation getattr raised a <class 'fuse.FuseOSError'>, returning errno 2.
Traceback (most recent call last):
File "/home/mi/IdeaProjects/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fsspec/fuse.py", line 33, in getattr
info = self.fs.info(path)
File "/home/mi/IdeaProjects/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/gravitino/filesystem/gvfs.py", line 192, in info
actual_info: Dict = context_pair.filesystems()[0].info(
File "/home/mi/IdeaProjects/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fsspec/implementations/arrow.py", line 90, in info
return self._make_entry(info)
File "/home/mi/IdeaProjects/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fsspec/implementations/arrow.py", line 109, in _make_entry
raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), info.path)
FileNotFoundError: [Errno 2] No such file or directory: '/xxxxxx/test_file.txt'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/mi/IdeaProjects/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fuse.py", line 739, in _wrapper
return func(*args, **kwargs) or 0
File "/home/mi/IdeaProjects/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fuse.py", line 779, in getattr
return self.fgetattr(path, buf, None)
File "/home/mi/IdeaProjects/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fuse.py", line 1032, in fgetattr
attrs = self.operations('getattr', self._decode_optional_path(path), fh)
File "/home/mi/IdeaProjects/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fuse.py", line 1081, in __call__
return getattr(self, op)(*args)
File "/home/mi/IdeaProjects/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fsspec/fuse.py", line 35, in getattr
raise FuseOSError(ENOENT) from exc
fuse.FuseOSError: [Errno 2] No such file or directory
DEBUG:fsspec.fuse:create ('/test_file.txt', 33204)
DEBUG:fsspec.fuse:getattr /test_file.txt
DEBUG:fuse:FUSE operation ioctl raised a <class 'fuse.FuseOSError'>, returning errno 25.
Traceback (most recent call last):
File "/home/mi/IdeaProjects/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fuse.py", line 739, in _wrapper
return func(*args, **kwargs) or 0
File "/home/mi/IdeaProjects/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fuse.py", line 1065, in ioctl
return self.operations('ioctl', path.decode(self.encoding),
File "/home/mi/IdeaProjects/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fuse.py", line 1081, in __call__
return getattr(self, op)(*args)
File "/home/mi/IdeaProjects/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fuse.py", line 1148, in ioctl
raise FuseOSError(errno.ENOTTY)
fuse.FuseOSError: [Errno 25] Inappropriate ioctl for device
@coolderli Does the problem occur only when using the HDFS file system?
We track the development process through issue #5504
Describe the feature
Implement fuse for gvfs to support mounting filesets to local directories. By default, the instance mounts fileset://fileset/fileset_catalog/schema/fileset_name to /fileset/fileset_catalog/schema/fileset_name, so we can access it via the POSIX protocol. In addition, we can support mounting to user-defined directories, so that users do not need to modify any code.
Motivation
In AI scenarios, users often use the POSIX protocol to access data. The data is stored in systems such as JuiceFS, NAS, or CPFS, and then mounted to a local directory. Using these storage systems directly has the following disadvantages:
Describe the solution
1. Use the underlying fuse directly.
   This means that fileset needs to manage a local directory. I think this is not a good solution; users will bypass gvfs without any benefits.
2. Use fsspec fuse to implement gvfs fuse.
   fsspec provides a fuse feature that forwards fuse operations to fsspec fs operations: https://filesystem-spec.readthedocs.io/en/latest/features.html#mount-anything-with-fuse We could do some optimization based on fsspec fuse to support gvfs fuse.
3. Implement gvfs fuse using JNI to call GravitinoVirtualFileSystem.
At present, Solution 2 and Solution 3 are similar. Solution 2 is implemented by calling Python gvfs, and Solution 3 is implemented by calling Java gvfs.
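Whichever solution is chosen, the default mount mapping described in the feature section (a fileset:// URI mapped to the same path under /) could be sketched as follows. The function name `default_mount_path` and the `mount_root` parameter (standing in for a user-defined directory) are hypothetical and not an existing gvfs API:

```python
import posixpath
from urllib.parse import urlsplit


def default_mount_path(fileset_uri, mount_root="/"):
    """Map a fileset:// URI to the local directory it would be mounted on.

    `mount_root` is an assumed knob for user-defined mount directories.
    """
    parts = urlsplit(fileset_uri)
    if parts.scheme != "fileset":
        raise ValueError(f"expected a fileset:// URI, got {fileset_uri!r}")
    # urlsplit puts the leading "fileset" component in netloc and the
    # remaining /catalog/schema/name components in path.
    relative = posixpath.join(parts.netloc, parts.path.lstrip("/"))
    return posixpath.normpath(posixpath.join(mount_root, relative))


if __name__ == "__main__":
    uri = "fileset://fileset/fileset_catalog/schema/fileset_name"
    print(default_mount_path(uri))
    print(default_mount_path(uri, mount_root="/mnt/data"))
```

Keeping the mapping in one pure function would let both the Python (Solution 2) and JNI (Solution 3) variants agree on where a fileset appears locally.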
Additional context
No response