hateyouinfinity opened 1 year ago
Looking at the docstring of `fsspec.implementations.smb.SMBFileSystem` (link), I noticed it talks about using the class via `fsspec.core.open(URI)`, in which case `URI` must contain a netloc. `fsspec.core.open` calls `fsspec.core.open_files`, which calls `fsspec.core.get_fs_token_paths`. `get_fs_token_paths` does roughly the following (link):

- resolves the implementation class from the scheme (e.g. for `sftp://foo:pass@localhost:2222/upload/filename`, `cls = fsspec.implementations.sftp.SFTPFileSystem`)
- calls `cls._get_kwargs_from_urls` to extract params used to instantiate the class from the url (params not contained in the url are to be passed to `get_fs_token_paths` as `storage_options`)
- calls `cls._strip_protocol` on its input, producing a valid filepath

Contrast the example below with the OP.
```shell
docker run --name sftp -p 2222:22 -d atmoz/sftp foo:pass:::upload
```

```python
from fsspec.core import get_fs_token_paths
from fsspec.core import open as fsopen

fs, token, filepath = get_fs_token_paths("sftp://foo:pass@localhost:2222/upload/filename")
print(fs.host, fs.ssh_kwargs, filepath)
# localhost {'port': 2222, 'username': 'foo', 'password': 'pass'} ['/upload/filename']

with fsopen(filepath[0], "wb") as f:
    f.write(b"Bye!")

print(fs.ls('.'))
# ['./upload']
print(fs.ls('./upload'))
# ['./upload/filename']
print(fs.cat("./upload/filename"))
# b'test'
```
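The lost netloc above can be illustrated with the standard library alone. This is only a rough stand-in for what `_strip_protocol` does, not fsspec's actual code:

```python
from urllib.parse import urlsplit

parts = urlsplit("sftp://foo:pass@localhost:2222/upload/filename")
# The credentials, host and port all live in the netloc...
print(parts.netloc)  # foo:pass@localhost:2222
# ...so once the scheme and netloc are stripped, the remaining path says
# nothing about the remote server, and fsspec.core.open treats it as a
# plain local path.
print(parts.path)    # /upload/filename
```

This is why writing to `filepath[0]` above touches the local filesystem rather than the SFTP server.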
I reckon fsspec implementations fall into two groups:

- those that handle a scheme-prefixed path when `makedirs` is called and treat the first non-scheme segment of `_strip_protocol`'s input as part of the filepath;
- those that do no preprocessing when `makedirs` is called and treat the first non-scheme segment of `_strip_protocol`'s input as a netloc.

`prefect.filesystems.RemoteFileSystem` assumes that all implementations fall into the first group, leading to the problems described above.
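The second group's behaviour can be seen directly in `fsspec.utils.infer_storage_options`. The expected outputs below are from my reading of the fsspec source, so treat them as an assumption and verify against your version:

```python
from fsspec.utils import infer_storage_options

# Second group: the first non-scheme segment ("home") is parsed as a
# netloc/host and removed from the path.
print(infer_storage_options("webhdfs://home/user/project"))
# -> {'protocol': 'webhdfs', 'path': '/user/project', 'host': 'home'}

# s3/gcs are special-cased: the first segment stays in the path, which is
# why implementations in the first group are unaffected.
print(infer_storage_options("s3://home/user/project"))
# -> {'protocol': 's3', 'path': 'home/user/project'}
```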
### First check

### Bug summary
Currently it does not seem possible to use `RemoteFileSystem` with WebHDFS as the underlying implementation. There are 2 problems afaict. Assume you define your filesystem block as follows:

```python
myfs = RemoteFileSystem(basepath="webhdfs://home/user/project", settings={"host": "example.com"})
```
Calling `write_path` fails due to an improperly formatted url. `myfs.write_path("filename", b"content")` calls `myfs.filesystem.makedirs("webhdfs://home/user/project")`, but the underlying implementation doesn't do any preprocessing and basically appends `path` to the base url, producing something like `https://example.com/webhdfs/v1webhdfs%3A//home/user/project?op=MKDIRS`. Calling the above url fails. Had this url been generated properly, it would look like this: `https://example.com/webhdfs/v1/home/user/project?op=MKDIRS`. That is, the `path` param that gets passed to `WebHDFS._call` should begin with a slash and have no scheme.

This doesn't seem to be a problem for other `RemoteFileSystem` methods, since all of them call (be it directly or implicitly) `fs.filesystem.open`, which (in the case of WebHDFS) calls `fsspec.utils.infer_storage_options`, stripping the scheme. However, `infer_storage_options` causes another problem.

The first segment of `fs.basepath` gets stripped, leading to accessing incorrect remote paths. `fs.filesystem.open` calls `fs.filesystem._strip_protocol` (link). Filesystem implementations commonly override `_strip_protocol`. WebHDFS's implementation of `_strip_protocol` calls `fsspec.utils.infer_storage_options`. As far as I can infer, `infer_storage_options` expects its input either to have no scheme (in which case the whole path is returned) or to have a netloc following the scheme (in which case the netloc is stripped away along with the scheme). As a result, `/user/project` gets accessed instead of `/home/user/project`.

One can work around this by prepending an extra segment to basepath (e.g. `basepath="webhdfs://fakehost/home/user/project"`), but that requires knowing how a particular implementation behaves (and is ugly to boot). Of note here is that `infer_storage_options` treats s3/gcs schemes as special cases (it doesn't strip the first segment), so the above method can't be used blindly. I'd like to mention that the current docs have usage examples only for cloud storage providers, which are seemingly immune to this issue. As an aside, it's not clear why an implementation that takes hostname/port as parameters expects `path` to contain a netloc at all.

The previous section got me thinking: is WebHDFS unique, or are there other implementations with the same problem? So I picked some implementations and wrote a script to compare what `_strip_protocol` outputs for the same input path.

Script
```python
#!/usr/bin/env python3
# pip install adlfs gcsfs ocifs paramiko pyarrow s3fs smbprotocol webdav4
from typing import Dict, List

from fsspec import get_filesystem_class

DEFAULT_SCHEMES = [
    "arrow_hdfs",
    "az",
    "dbfs",
    "file",
    "ftp",
    "github",
    "gs",
    "hdfs",
    "http",
    "https",
    "oci",
    "s3",
    "sftp",
    "smb",
    "webdav",
    "webhdfs",
]


def get_resolved_path(scheme: str, path: str) -> str:
    return get_filesystem_class(scheme)._strip_protocol(path)


def main(
    schemes_to_check: List[str] = DEFAULT_SCHEMES, no_scheme_path="/home/user/file"
) -> Dict[str, str]:
    res = {}
    for scheme in schemes_to_check:
        try:
            res[scheme] = get_resolved_path(
                scheme=scheme, path=f"{scheme}://{no_scheme_path.lstrip('/')}"
            )
        except Exception as e:
            print(e)
    return res


if __name__ == "__main__":
    for scheme, resolved_path in main().items():
        print(f"{scheme: <20}{resolved_path}")
```

Here's what I get if I run it:
Going by this, a few other filesystems might have something similar going on. SFTP seems reasonably easy to test.
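For SFTP specifically, the check can be run without a live server or paramiko, since (as of my reading of the fsspec source, so verify against your version) `SFTPFileSystem._strip_protocol` just returns `infer_storage_options(path)["path"]`:

```python
from fsspec.utils import infer_storage_options

# With a scheme, the first segment ("home") is consumed as a netloc:
print(infer_storage_options("sftp://home/user/file")["path"])   # /user/file
# Without a scheme, the path comes back untouched:
print(infer_storage_options("/home/user/file")["path"])         # /home/user/file
```

If the first call drops the `home` segment, SFTP falls into the second group as well.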
WebHDFS doesn't seem to be the only problematic implementation.
### Reproduction

### Error
No response
### Versions
```
Version:             2.4.2
API version:         0.8.0
Python version:      3.10.4
Git commit:          65807e84
Built:               Fri, Sep 23, 2022 10:43 AM
OS/Arch:             win32/AMD64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.37.2
```
### Additional context
No response