Closed joouha closed 9 months ago
Personally I would expect it to behave like
urljoin
.
I would agree. Is there actually a use case for double slashes in the middle of a url path?
Is there actually a use case for double slashes in the middle of a url path?
Most web servers will treat a double slash the same as a single slash, but a web server could respond with different responses, e.g. these two URIs point to different pages:
https://en.wikipedia.org/wiki/Python
https://en.wikipedia.org/wiki//Python
I guess double slashes would then need to be constructed explicitly. Happy to review a PR, if you want to give the urljoin
behaviour a try.
I've been thinking about this for a bit, and I wonder what's the best way to address this.
For me it is easier to think about this in "pathlib-terms" if I rephrase this to: "Should specific file systems support empty path parts?"
If we assume some filesystem that supports "double slashes" I think an intuitive "pathlib-style" way to produce a double slash would be:
>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")
Thinking this through might be a little more involved though, since a lot of users might expect paths to handle similar between different file systems. For example on posix and windows because directories can't have the same name as a file, users (or at least me :sweat_smile:) usually expect:
UPath("protocol://somepath") == UPath("protocol://somepath/") == UPath("protocol://somepath//")
which is why stdlib pathlib currently normalizes those paths to the same. So I guess for supporting empty parts we would actually need to implement behavior like:
>>> UPath("protocol://somepath") / ""
UPath("protocol://somepath//")
>>> assert UPath("protocol://somepath") == UPath("protocol://somepath/")
>>> assert UPath("protocol://somepath") != UPath("protocol://somepath//")
# but on a webserver
>>> UPath("protocol://somepath/a/b") != UPath("protocol://somepath/a/b/")
# --> so we should not normalize trailing slashes on those filesystems, I guess
And regarding the switch to urljoin
: I usually find the urljoin behavior unintuitive. For example just check the behavior below:
from urllib.parse import urljoin
roots = [
"http://example.com",
"http://example.com/",
"http://example.com/c",
"http://example.com/c/",
]
paths = [
"",
"a/b",
"/a/b",
"//a/b",
"///a/b",
"////a/b",
"/////a/b",
]
for root in roots:
for path in paths:
print(f"urljoin({root!r}, {path!r})".ljust(44), "==", repr(urljoin(root, path)))
# output of the above script
urljoin('http://example.com', '') == 'http://example.com'
urljoin('http://example.com', 'a/b') == 'http://example.com/a/b'
urljoin('http://example.com', '/a/b') == 'http://example.com/a/b'
urljoin('http://example.com', '//a/b') == 'http://a/b'
urljoin('http://example.com', '///a/b') == 'http://example.com/a/b'
urljoin('http://example.com', '////a/b') == 'http://example.com//a/b'
urljoin('http://example.com', '/////a/b') == 'http://example.com///a/b'
urljoin('http://example.com/', '') == 'http://example.com/'
urljoin('http://example.com/', 'a/b') == 'http://example.com/a/b'
urljoin('http://example.com/', '/a/b') == 'http://example.com/a/b'
urljoin('http://example.com/', '//a/b') == 'http://a/b'
urljoin('http://example.com/', '///a/b') == 'http://example.com/a/b'
urljoin('http://example.com/', '////a/b') == 'http://example.com//a/b'
urljoin('http://example.com/', '/////a/b') == 'http://example.com///a/b'
urljoin('http://example.com/c', '') == 'http://example.com/c'
urljoin('http://example.com/c', 'a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c', '/a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c', '//a/b') == 'http://a/b'
urljoin('http://example.com/c', '///a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c', '////a/b') == 'http://example.com//a/b'
urljoin('http://example.com/c', '/////a/b') == 'http://example.com///a/b'
urljoin('http://example.com/c/', '') == 'http://example.com/c/'
urljoin('http://example.com/c/', 'a/b') == 'http://example.com/c/a/b'
urljoin('http://example.com/c/', '/a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c/', '//a/b') == 'http://a/b'
urljoin('http://example.com/c/', '///a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c/', '////a/b') == 'http://example.com//a/b'
urljoin('http://example.com/c/', '/////a/b') == 'http://example.com///a/b'
I think we should go through all of this using a concrete example and define the behavior beforehand. I would also check and see how fsspec handles this for http filesystems to make sure that this all is supported upstream, before introducing special functionality in universal_pathlib. @joouha where did this issue pop up initially?
Hi,
For a bit of background, I encountered this issue when trying to load resources from web-pages. I wanted a universal interface to be able to load resources from a range of protocols, so universal pathlib seemed like a good option.
Say I load the page http://www.example.com/a/b/index.html
with the following content:
<img src="image.png">
<img src="../image.png">
<img src="/image.png">
<img src="ftp://other.com/image.png">
<img src="//other.com/image.png">
I would expect to be able to join the page's URL with any resource link using the /
operator,
and end up at the same resources which a browser would load (which is also urljoin
's behaviour):
>>> UPath("http://www.example.com/a/b/index.html") / "image.png?version=1"
HTTPPath("http://www.example.com/page/image.png?version=1")
>>> UPath("http://www.example.com/a/b/index.html") / "../image.png"
HTTPPath("http://www.example.com/a/image.png")
>>> UPath("http://www.example.com/a/b/index.html") / "/image.png"
HTTPPath("http://www.example.com/image.png")
>>> UPath("http://www.example.com/a/b/index.html") / "ftp://other.com/image.png"
UPath("ftp://other.com/image.png")
>>> UPath("http://www.example.com/a/b/index.html") / "//other.com/image.png"
HTTPPath("http://other.com/image.png")
Since upath
works with URIs, I would expect its behaviour to follow the the standards for the URI protocol defined in RCF3986.
I would expect UPath normalization and joining rules to differ from pathlib
, since pathlib
works with POSIX and Windows paths. These are not URIs - they follow their own behaviour patterns defined elsewhere.
So as a user, I would expect the following posix paths to be equivalent:
PosixPath("/somepath") == PosixPath("//somepath/") == PosixPath("//somepath//")
but I would not expect the following URIs to be equivalent, because RFC3986 states that they might point to different resources:
HTTPPath("https://en.wikipedia.org/wiki/Film") != HTTPPath("https://en.wikipedia.org/wiki/Film/") != HTTPPath("https://en.wikipedia.org/wiki//Film") != HTTPPath("https://en.wikipedia.org/wiki//Film/")
(which they actually do).
RFC3986 defines how many of the methods in universal pathlib should be implemented when dealing with URIs, such a joining URI paths, normalizing URIs, and URI equivalence.
Also, I like this as a way of constructing URI paths with double slashes - very elegant!
>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")
Hello!
Should
UPath._make_child
replicate the behaviour of likepathlib.PurePath._make_child
as it does currently, or should it behave likeurllib.parse.urljoin
?Personally I would expect it to behave like
urljoin
.Thoughts?