Open cgrass opened 9 months ago
In the original discussion I suggested fixing the bug by breaking up the other_paths use cases into distinct implementations. In that design the local path could be constructed using a hash of the source filename. E.g., get() for single files should be simple:
1) destination (path2/lpath) is a file -> write rpath contents to the destination file. Because the user defines lpath, it's up to them to ensure a reasonable length.
2) destination (path2/lpath) is a dir -> hash (md5) the filename to construct lpath. This prevents long rpath filenames from being used directly to construct lpath.
I think it would also be helpful to return the constructed lpath from get().
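The two cases above could be sketched roughly as follows. This is only an illustration of the proposal, not fsspec code: `resolve_lpath` is a hypothetical helper name, and taking the last `/`-separated component of rpath as the "filename" is an assumption.

```python
import hashlib
import os


def resolve_lpath(rpath: str, lpath: str) -> str:
    """Return the local file path a single-file get() would write to.

    Hypothetical helper illustrating the proposal; not part of fsspec.
    """
    if not os.path.isdir(lpath):
        # Case 1: the caller named the destination file, so use it unchanged.
        return lpath
    # Case 2: the destination is a directory. Hash the remote filename so the
    # local name has a fixed length (32 hex characters for md5).
    filename = rpath.rstrip("/").rsplit("/", 1)[-1]
    digest = hashlib.md5(filename.encode("utf-8"), usedforsecurity=False).hexdigest()
    return os.path.join(lpath, digest)
```

Returning the resolved path is what lets the caller find the file afterwards, since the hashed name cannot be guessed from the directory alone.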
We assume that the local filename should match the remote one when copying into a directory - this is what any copy operation would guarantee. The question is which part of the remote URL we consider the "filename". Specifically for HTTP, it's not obvious whether query parameters are part of it; but maybe the correct information is available in the headers.
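For illustration, one possible reading of "filename" is the last path segment of the URL with the query string and fragment dropped. The sketch below uses only the standard library; `url_filename` is a hypothetical name, and it does not consult response headers such as Content-Disposition, which a server may also use to advertise a name.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def url_filename(url: str) -> str:
    """Last path segment of a URL, ignoring the query string and fragment."""
    path = urlsplit(url).path  # excludes "?query" and "#fragment"
    return unquote(posixpath.basename(path))
```

Under this reading, a signed URL such as `https://bucket.example.com/data/report.csv?X-Amz-Signature=abc` (a made-up example) would map to `report.csv`, so the long signature would never reach the local name.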
I don't think you can guarantee that behavior; the source system might have fundamentally different path limits or requirements than the dest system.
If we assume that the query param is removed entirely and a legal 1000-character filename is part of the URL, what can/should fsspec do to copy the file locally?
I think that in cases where the local filesystem can't handle a name, the caller should supply explicit names to write to using list inputs, or use get_file() instead of get().
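A caller who wants to decide up front whether an explicit name is needed could check the remote name against the local limit. This is a sketch only: `fits_local_name_limit` is a hypothetical helper, the 255-byte fallback is an assumption for platforms where `os.pathconf` is unavailable, and the check covers a single path component, not the full path length.

```python
import os


def fits_local_name_limit(name: str, directory: str = ".", default_limit: int = 255) -> bool:
    """Check whether one path component fits the local filesystem's name limit."""
    try:
        # POSIX only: maximum length in bytes of a filename in `directory`.
        limit = os.pathconf(directory, "PC_NAME_MAX")
    except (AttributeError, ValueError, OSError):
        limit = default_limit  # assumed fallback, e.g. on Windows
    if limit <= 0:
        limit = default_limit
    return len(os.fsencode(name)) <= limit
```

If the check fails, the caller can fall back to get_file() with a name of their own choosing.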
> the caller should supply explicit names to write to using list inputs
It seems awkward to force callers to create a list if they are interacting with a single file, especially if that requirement is only valid when rpath is longer than the local filesystem can handle.
> or use get_file() instead of get()
I created a couple of unit tests today and it works great when rpath and lpath are file locations. Thanks for pointing out that alternative method! Should it work for all fs implementations? I saw that it's unimplemented in async.py, but it's not clear to me whether that will be a problem in a live env.
I found that passing in a dir for lpath failed with: IsADirectoryError: [Errno 21] Is a directory: '/var/myloc/'. If lpath must be a file location you might want to update the docs/comments.
here is the test that shows the behavior:
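import fsspec
import fsspec.implementations.http

def test_http_output():
    kwargs = {}
    fs = fsspec.implementations.http.HTTPFileSystem(fsspec.filesystem("https", **kwargs))
    expected_output_path = "/var/myloc/"  # a directory, not a file path
    rpath = {longS3SignedUrl}  # placeholder for a long pre-signed S3 URL
    lpath = expected_output_path
    fs.get_file(rpath, lpath)  # raises IsADirectoryError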
> if that requirement is only valid when rpath is longer than the local filesystem can handle
I don't think there is a general solution to this.
> should it work for all fs implementations?
Yes!
> I found that passing in a dir for lpath failed
Indeed, this copies from a file path to a file path, which is why it gets around the case of auto-generated names.
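Since get_file() copies file path to file path, a caller who starts with a directory has to build the full local file path first. A minimal sketch, assuming `fs` is any object exposing `get_file(rpath, lpath)`; the wrapper `get_file_into_dir` and its `name` parameter are hypothetical, not fsspec API.

```python
import os
import posixpath
from typing import Optional
from urllib.parse import urlsplit


def get_file_into_dir(fs, rpath: str, ldir: str, name: Optional[str] = None) -> str:
    """Copy rpath into directory ldir via fs.get_file and return the file path.

    Hypothetical wrapper: pass `name` to choose a short local name explicitly.
    """
    if name is None:
        # Default to the last URL path segment, without the query string.
        name = posixpath.basename(urlsplit(rpath).path)
    if not name:
        raise ValueError("cannot derive a local filename; pass name= explicitly")
    target = os.path.join(ldir, name)
    fs.get_file(rpath, target)  # file path to file path, never a directory
    return target
```

With the test above, calling `get_file_into_dir(fs, rpath, "/var/myloc/", name="download.bin")` would hand get_file() a file location instead of the directory that triggered the IsADirectoryError.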
Discussed in https://github.com/fsspec/filesystem_spec/discussions/1490
Linked bug report