fsspec / universal_pathlib

pathlib api extended to use fsspec backends
MIT License
210 stars 36 forks source link

Behaviour of joinpath when the second path has protocol #213

Open KuribohG opened 2 months ago

KuribohG commented 2 months ago
UPath('a').joinpath(UPath('s3:///b'))

This call will return PosixUPath('/b'), but

UPath('a').joinuri(UPath('s3:///b'))

will return S3Path('s3:///b').

Maybe in joinpath, we should return just the second path when the second path has protocol (like in joinuri)?

ap-- commented 2 months ago

Hello @KuribohG

Thank you for opening the issue.

Could you explain your initial use case? The example you provide seems a bit unusual regarding two points:

  1. You're trying to join a local relative path (as returned by UPath without protocol) to an s3 path. What was your intent?
  2. Your s3 uri has an empty string as a bucket.

The solution here should be that (1) we should raise an exception if joinpath is used with paths of different protocols. (2) we should raise an exception for s3 uris with empty buckets (empty netloc)

Andreas

Extra notes:

s3fs raises ValueError: Attempt to open non key-like path: /b when a 'empty-bucket-path' is trying to be accessed. We should catch this early in UPath.

>>> import fsspec
>>> fsspec.open("s3:///b")
<OpenFile '/b'>
>>> x=_
>>> x.open()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
           ^^^^^^^^^^^^^^^^
  File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/fsspec/core.py", line 103, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/fsspec/spec.py", line 1293, in open
    f = self._open(
        ^^^^^^^^^^^
  File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/s3fs/core.py", line 685, in _open
    return S3File(
           ^^^^^^^
  File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/s3fs/core.py", line 2152, in __init__
    raise ValueError("Attempt to open non key-like path: %s" % path)
ValueError: Attempt to open non key-like path: /b
KuribohG commented 2 months ago

Using empty buckets for s3 uri is my mistake, but in this issue I am focusing on joining paths from different protocol.

In my case, I want to implement a dataset, which can be constructed by a parameter for (1) relative path, (2) absolute path, (3) path with protocols. Each dataset has a working directory, so I need to join the working dir path with this path parameter.

Does all the paths with protocols means somewhat an "absolute" path? Do we allow a path like s3:///../a/b/c? If not, joining two paths path1, path2 from different protocol may be easier, because we can just return path2 since it is an "absolute" uri. But if we do allow this kind of path, throwing an exception seems to be a meaningful behaviour.

ap-- commented 2 months ago

In my case, I want to implement a dataset, which can be constructed by a parameter for (1) relative path, (2) absolute path, (3) path with protocols. Each dataset has a working directory, so I need to join the working dir path with this path parameter.

If I understand correctly, you want paths within your dataset to be pointing to relative locations, and exchange the root of the dataset?

rel_a = UPath("path/to/fileA.txt")
rel_b = UPath("path/to/somewhere/else/fileB.txt")

root_x = UPath("s3://mybucket/somepath")
root_y = UPath("file:///opt/someotherpath")

# and now
x_a = root_x.joinpath(rel_a)
x_b = root_x.joinpath(rel_b)

y_a = root_y.joinpath(rel_a)
y_b = root_y.joinpaht(rel_b)

You might also want to checkout https://intake.readthedocs.io/en/latest/ if you want to describe your datasets declaratively and load from different locations.

Does all the paths with protocols means somewhat an "absolute" path?

Yes. All UPath's (with the exception of PosixUPath and WindowsUPath) are absolute.

Do we allow a path like s3:///../a/b/c?

I will interpret this as s3://mybucket/../a/b/c and the answer is that this is currently undefined behavior. UPath("s3://mybucket/../a/b/c").resolve() will evaluate to UPath("s3://mybucket/a/b/c"), because mybucket/ is considered the path anchor and pathlib resolves /../a/b/c to /a/b/c

If not, joining two paths path1, path2 from different protocol may be easier, because we can just return path2 since it is an "absolute" uri.

I still think .joinpath should raise an error if the two protocols differ. I currently don't see what the reason for joining absolute paths of different protocols would be. (I can understand that use case if you want urljoin behavior, but that's what .joinuri is for)

KuribohG commented 2 months ago

Sorry I didn't explain my case clearly enough.

We have an argument for specifying the dataset location (like using cmd args). When we run the script, we need to join the workdir and this argument.

Say, the workdir is /home/username/datasets. User may specify the location as some/relative/path/data, or s3:///path/with/protocol/data. We want to get the final location of the dataset, so I used joinpath in my code. So the expectation is

UPath(`/home/username/datasets`).joinpath(UPath('some/relative/path/data')) -> UPath('/home/username/datasets/some/relative/path/data')
UPath(`/home/username/datasets`).joinpath(UPath('s3:///path/with/protocol/data')) -> UPath('s3:///path/with/protocol/data')

This may be implemented by UPath.resolve or UPath.absolute, but will be unavailable if workdir is specified by some environment variable. Also, joinuri is seem not for this case. I currently just check whether the user arg has a protocol, and if so use that arg as the final location.

ap-- commented 2 months ago

Thank you for the clarification. If the workdir is the current working directory, you can just do:

def cli(arg):
    pth = UPath(arg).absolute()

This will correctly prepend the cwd if the path is a relative local path, and because all other UPaths are absolute, will just return for example the S3 path if the user provides that.

In the case that workdir is also configurable by the user you could do:


def cli(arg, workdir):
    pth = UPath(arg)
    if not pth.is_absolute():
         pth = UPath(workdir).joinpath(pth)

Let me know if that solves your issue, Cheers, Andreas

KuribohG commented 2 months ago

Thanks, this will solve the problem in my use case.

I think joinpath should be clarified for joining paths from different protocol, and I agree that an exception could be thrown.