Open KuribohG opened 2 months ago
Hello @KuribohG
Thank you for opening the issue.
Could you explain your initial use case? The example you provide seems a bit unusual regarding two points:
UPath
without protocol) to an s3 path. What was your intent?The solution here should be that (1) we should raise an exception if joinpath is used with paths of different protocols. (2) we should raise an exception for s3 uris with empty buckets (empty netloc)
Andreas
s3fs
raises ValueError: Attempt to open non key-like path: /b
when a 'empty-bucket-path' is trying to be accessed. We should catch this early in UPath.
>>> import fsspec
>>> fsspec.open("s3:///b")
<OpenFile '/b'>
>>> x=_
>>> x.open()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/fsspec/core.py", line 135, in open
return self.__enter__()
^^^^^^^^^^^^^^^^
File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/fsspec/core.py", line 103, in __enter__
f = self.fs.open(self.path, mode=mode)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/fsspec/spec.py", line 1293, in open
f = self._open(
^^^^^^^^^^^
File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/s3fs/core.py", line 685, in _open
return S3File(
^^^^^^^
File "/Users/andreaspoehlmann/Development/loot/venv/lib/python3.11/site-packages/s3fs/core.py", line 2152, in __init__
raise ValueError("Attempt to open non key-like path: %s" % path)
ValueError: Attempt to open non key-like path: /b
Using empty buckets for s3 uri is my mistake, but in this issue I am focusing on joining paths from different protocol.
In my case, I want to implement a dataset, which can be constructed by a parameter for (1) relative path, (2) absolute path, (3) path with protocols. Each dataset has a working directory, so I need to join the working dir path with this path parameter.
Does all the paths with protocols means somewhat an "absolute" path? Do we allow a path like s3:///../a/b/c
? If not, joining two paths path1, path2
from different protocol may be easier, because we can just return path2
since it is an "absolute" uri. But if we do allow this kind of path, throwing an exception seems to be a meaningful behaviour.
In my case, I want to implement a dataset, which can be constructed by a parameter for (1) relative path, (2) absolute path, (3) path with protocols. Each dataset has a working directory, so I need to join the working dir path with this path parameter.
If I understand correctly, you want paths within your dataset to be pointing to relative locations, and exchange the root of the dataset?
rel_a = UPath("path/to/fileA.txt")
rel_b = UPath("path/to/somewhere/else/fileB.txt")
root_x = UPath("s3://mybucket/somepath")
root_y = UPath("file:///opt/someotherpath")
# and now
x_a = root_x.joinpath(rel_a)
x_b = root_x.joinpath(rel_b)
y_a = root_y.joinpath(rel_a)
y_b = root_y.joinpaht(rel_b)
You might also want to checkout https://intake.readthedocs.io/en/latest/ if you want to describe your datasets declaratively and load from different locations.
Does all the paths with protocols means somewhat an "absolute" path?
Yes. All UPath's (with the exception of PosixUPath and WindowsUPath) are absolute.
Do we allow a path like
s3:///../a/b/c
?
I will interpret this as s3://mybucket/../a/b/c
and the answer is that this is currently undefined behavior. UPath("s3://mybucket/../a/b/c").resolve()
will evaluate to UPath("s3://mybucket/a/b/c")
, because mybucket/
is considered the path anchor and pathlib resolves /../a/b/c
to /a/b/c
If not, joining two paths
path1, path2
from different protocol may be easier, because we can just returnpath2
since it is an "absolute" uri.
I still think .joinpath
should raise an error if the two protocols differ. I currently don't see what the reason for joining absolute paths of different protocols would be. (I can understand that use case if you want urljoin behavior, but that's what .joinuri
is for)
Sorry I didn't explain my case clearly enough.
We have an argument for specifying the dataset location (like using cmd args). When we run the script, we need to join the workdir and this argument.
Say, the workdir is /home/username/datasets
. User may specify the location as some/relative/path/data
, or s3:///path/with/protocol/data
. We want to get the final location of the dataset, so I used joinpath
in my code. So the expectation is
UPath(`/home/username/datasets`).joinpath(UPath('some/relative/path/data')) -> UPath('/home/username/datasets/some/relative/path/data')
UPath(`/home/username/datasets`).joinpath(UPath('s3:///path/with/protocol/data')) -> UPath('s3:///path/with/protocol/data')
This may be implemented by UPath.resolve
or UPath.absolute
, but will be unavailable if workdir is specified by some environment variable. Also, joinuri
is seem not for this case. I currently just check whether the user arg has a protocol, and if so use that arg as the final location.
Thank you for the clarification. If the workdir is the current working directory, you can just do:
def cli(arg):
pth = UPath(arg).absolute()
This will correctly prepend the cwd if the path is a relative local path, and because all other UPaths are absolute, will just return for example the S3 path if the user provides that.
In the case that workdir is also configurable by the user you could do:
def cli(arg, workdir):
pth = UPath(arg)
if not pth.is_absolute():
pth = UPath(workdir).joinpath(pth)
Let me know if that solves your issue, Cheers, Andreas
Thanks, this will solve the problem in my use case.
I think joinpath
should be clarified for joining paths from different protocol, and I agree that an exception could be thrown.
This call will return
PosixUPath('/b')
, butwill return
S3Path('s3:///b')
.Maybe in
joinpath
, we should return just the second path when the second path has protocol (like injoinuri
)?