fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License
1.02k stars 358 forks source link

"File name too long" with HTTP presigned requests #1504

Open cgrass opened 9 months ago

cgrass commented 9 months ago

Discussed in https://github.com/fsspec/filesystem_spec/discussions/1490

Originally posted by **cgrass** January 4, 2024 Hello, I'm new to python and fsspec, so hopefully there is an obvious answer to my question. Thanks for the help! ### Issue I need to fetch a file from an s3 bucket using https and a pre-signed URL. The URL produced is quite long, commonly over 1500 characters (mostly query params). If I use the code below, I get an OS exception: `File name too long`. ``` def test_this(): kwargs = {} spec = fsspec.filesystem("https", **kwargs) frompath = {longurl} topath = "/var/" ### in prod this is a real dest response = spec.get(frompath, topath, recursive=False, **kwargs) print(response) ``` I walked through the code and can see that the `rpath` is typically appended to the `lpath` for storage in `asyn.py`: ``` lpaths = other_paths( rpaths, lpath, exists=exists, flatten=not source_is_str, ) def other_paths( paths: list[str], path2: str | list[str], exists: bool = False, flatten: bool = False, ) -> list[str]: if isinstance(path2, str): path2 = path2.rstrip("/") if flatten: path2 = ["/".join((path2, p.split("/")[-1])) for p in paths] else: cp = common_prefix(paths) if exists: cp = cp.rsplit("/", 1)[0] if not cp and all(not s.startswith("/") for s in paths): path2 = ["/".join([path2, p]) for p in paths] else: path2 = [p.replace(cp, path2, 1) for p in paths] else: assert len(paths) == len(path2) return path2 ``` ### Question Am I misusing the lib or protocol? Is there a way to configure fsspec to scrub query params (or use some arbitrary string) for the appended destination string? ### Setup MacOS Ventura M1 Pro fsspec-2023.12.2 Python 3.9, 3.10 (tried both)

Linked bug report

cgrass commented 9 months ago

In the original discussion I suggested fixing the bug by breaking up the other_paths use cases into distinct implementations. in that design the local path could be constructed using a hash of the source filename. e.g., get() for single files should be simple: 1) destination (path2/lpath) is file -> write rpath contents to destination file. because user defines lpath, it's up to them to ensure reasonable length. 2) destination (path2/lpath) is dir -> hash (md5) filename to construct lpath. this prevents long rpath filenames from being used directly to construct lpath.

i think it would also be helpful to return the constructed lpath from get().

martindurant commented 9 months ago

We assume that the local filename should match the remote one when copying to inside a directory - this is what any copy operation would guarantee. The question is, what part of the remote URL we consider the "filename". Specifically for HTTP, it's not obvious whether query parameters are or are not part; but maybe the correct information is available in the headers.

cgrass commented 9 months ago

I don't think you can guarantee that behavior; the source system might have fundamentally different path limits or requirements than the dest system.

if we assume that the query param is removed entirely and a legal 1000 character filename is part of the url, what can/should fsspec do to copy the file locally?

martindurant commented 9 months ago

I think that in cases where the local filesystem can't handle a name, the caller should supply explicit names to write to using list inputs, or use get_file() instead of get().

cgrass commented 9 months ago

the caller should supply explicit names to write to using list inputs

it seems awkward to force callers to create a list if they are interacting with a single file. especially if that requirement is only valid when rpath is longer than the local filesystem can handle.

or use get_file() instead of get()

I created a couple unit tests today and it works great when rpath and lpath are file locations. thanks for pointing out that alternative method! should it work for all fs implementations? i saw that it's unimplemented in async.py, but it's not clear to me if that will be a problem or not in a live env.

I found that passing in a dir for lpath failed with: IsADirectoryError: [Errno 21] Is a directory: '/var/myloc/'. If lpath must be a file location you might want to update the docs/comments.

here is the test that shows the behavior:

def test_http_output():
    kwargs = {}
    fs = fsspec.implementations.http.HTTPFileSystem(fsspec.filesystem("https", **kwargs))
    expected_output_path = "/var/myloc/"
    rpath = {longS3SignedUrl}
    lpath = expected_output_path
    fs.get_file(rpath, lpath)
martindurant commented 9 months ago

if that requirement is only valid when rpath is longer than the local filesystem can handle

I don't think there is a general solution to this

should it work for all fs implementations?

Yes!

I found that passing in a dir for lpath failed

Indeed, this copies from a file path to a file path, which is why it gets around the case of auto-generated names