fsspec / universal_pathlib

pathlib api extended to use fsspec backends
MIT License
251 stars 44 forks source link

Use `urllib.parse.urljoin` when joining paths #88

Closed joouha closed 9 months ago

joouha commented 1 year ago

Hello!

Should UPath._make_child replicate the behaviour of like pathlib.PurePath._make_child as it does currently, or should it behave like urllib.parse.urljoin?

>>> UPath("http://example.com/a") / "b/c"
HTTPPath('http://example.com/a/b/c')

>>> UPath("http://example.com/a/") / "b/c"
HTTPPath('http://example.com/a//b/c')  # I think this one is a bug...

>>> urljoin("http://example.com/a", "b/c")
'http://example.com/b/c'

>>> urljoin("http://example.com/a/", "b/c")
'http://example.com/a/b/c'

Personally I would expect it to behave like urljoin.

Thoughts?

normanrz commented 1 year ago

Personally I would expect it to behave like urljoin.

I would agree. Is there actually a use case for double slashes in the middle of a url path?

joouha commented 1 year ago

Is there actually a use case for double slashes in the middle of a url path?

Most web servers will treat a double slash the same as a single slash, but a web server could respond with different responses, e.g. these two URIs point to different pages:

https://en.wikipedia.org/wiki/Python
https://en.wikipedia.org/wiki//Python
normanrz commented 1 year ago

I guess double slashes would then need to be constructed explicitly. Happy to review a PR, if you want to give the urljoin behaviour a try.

ap-- commented 1 year ago

I've been thinking about this for a bit, and I wonder what's the best way to address this.

For me it is easier to think about this in "pathlib-terms" if I rephrase this to: "Should specific file systems support empty path parts?"

If we assume some filesystem that supports "double slashes" I think an intuitive "pathlib-style" way to produce a double slash would be:

>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")

Thinking this through might be a little more involved though, since a lot of users might expect paths to handle similar between different file systems. For example on posix and windows because directories can't have the same name as a file, users (or at least me :sweat_smile:) usually expect:

UPath("protocol://somepath") == UPath("protocol://somepath/") == UPath("protocol://somepath//")

which is why stdlib pathlib currently normalizes those paths to the same. So I guess for supporting empty parts we would actually need to implement behavior like:

>>> UPath("protocol://somepath") / ""
UPath("protocol://somepath//")

>>> assert UPath("protocol://somepath") == UPath("protocol://somepath/")
>>> assert UPath("protocol://somepath") != UPath("protocol://somepath//")

# but on a webserver
>>> UPath("protocol://somepath/a/b") != UPath("protocol://somepath/a/b/")

# --> so we should not normalize trailing slashes on those filesystems, I guess

And regarding the switch to urljoin: I usually find the urljoin behavior unintuitive. For example just check the behavior below:

from urllib.parse import urljoin

roots = [
    "http://example.com",
    "http://example.com/",
    "http://example.com/c",
    "http://example.com/c/",
]

paths = [
    "",
    "a/b",
    "/a/b",
    "//a/b",
    "///a/b",
    "////a/b",
    "/////a/b",
]

for root in roots:
    for path in paths:
        print(f"urljoin({root!r}, {path!r})".ljust(44), "==", repr(urljoin(root, path)))

# output of the above script
urljoin('http://example.com', '')            == 'http://example.com'
urljoin('http://example.com', 'a/b')         == 'http://example.com/a/b'
urljoin('http://example.com', '/a/b')        == 'http://example.com/a/b'
urljoin('http://example.com', '//a/b')       == 'http://a/b'
urljoin('http://example.com', '///a/b')      == 'http://example.com/a/b'
urljoin('http://example.com', '////a/b')     == 'http://example.com//a/b'
urljoin('http://example.com', '/////a/b')    == 'http://example.com///a/b'
urljoin('http://example.com/', '')           == 'http://example.com/'
urljoin('http://example.com/', 'a/b')        == 'http://example.com/a/b'
urljoin('http://example.com/', '/a/b')       == 'http://example.com/a/b'
urljoin('http://example.com/', '//a/b')      == 'http://a/b'
urljoin('http://example.com/', '///a/b')     == 'http://example.com/a/b'
urljoin('http://example.com/', '////a/b')    == 'http://example.com//a/b'
urljoin('http://example.com/', '/////a/b')   == 'http://example.com///a/b'
urljoin('http://example.com/c', '')          == 'http://example.com/c'
urljoin('http://example.com/c', 'a/b')       == 'http://example.com/a/b'
urljoin('http://example.com/c', '/a/b')      == 'http://example.com/a/b'
urljoin('http://example.com/c', '//a/b')     == 'http://a/b'
urljoin('http://example.com/c', '///a/b')    == 'http://example.com/a/b'
urljoin('http://example.com/c', '////a/b')   == 'http://example.com//a/b'
urljoin('http://example.com/c', '/////a/b')  == 'http://example.com///a/b'
urljoin('http://example.com/c/', '')         == 'http://example.com/c/'
urljoin('http://example.com/c/', 'a/b')      == 'http://example.com/c/a/b'
urljoin('http://example.com/c/', '/a/b')     == 'http://example.com/a/b'
urljoin('http://example.com/c/', '//a/b')    == 'http://a/b'
urljoin('http://example.com/c/', '///a/b')   == 'http://example.com/a/b'
urljoin('http://example.com/c/', '////a/b')  == 'http://example.com//a/b'
urljoin('http://example.com/c/', '/////a/b') == 'http://example.com///a/b'

I think we should go through all of this using a concrete example and define the behavior beforehand. I would also check and see how fsspec handles this for http filesystems to make sure that this all is supported upstream, before introducing special functionality in universal_pathlib. @joouha where did this issue pop up initially?

joouha commented 1 year ago

Hi,

For a bit of background, I encountered this issue when trying to load resources from web-pages. I wanted a universal interface to be able to load resources from a range of protocols, so universal pathlib seemed like a good option.

Say I load the page http://www.example.com/a/b/index.html with the following content:

<img src="image.png">
<img src="../image.png">
<img src="/image.png">
<img src="ftp://other.com/image.png">
<img src="//other.com/image.png">

I would expect to be able to join the page's URL with any resource link using the / operator, and end up at the same resources which a browser would load (which is also urljoin's behaviour):

>>> UPath("http://www.example.com/a/b/index.html") / "image.png?version=1"
HTTPPath("http://www.example.com/page/image.png?version=1")

>>> UPath("http://www.example.com/a/b/index.html") / "../image.png"
HTTPPath("http://www.example.com/a/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "/image.png"
HTTPPath("http://www.example.com/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "ftp://other.com/image.png"
UPath("ftp://other.com/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "//other.com/image.png"
HTTPPath("http://other.com/image.png")

Since upath works with URIs, I would expect its behaviour to follow the the standards for the URI protocol defined in RCF3986.

I would expect UPath normalization and joining rules to differ from pathlib, since pathlib works with POSIX and Windows paths. These are not URIs - they follow their own behaviour patterns defined elsewhere.

So as a user, I would expect the following posix paths to be equivalent:

PosixPath("/somepath") == PosixPath("//somepath/") == PosixPath("//somepath//")

but I would not expect the following URIs to be equivalent, because RFC3986 states that they might point to different resources:

HTTPPath("https://en.wikipedia.org/wiki/Film") != HTTPPath("https://en.wikipedia.org/wiki/Film/") != HTTPPath("https://en.wikipedia.org/wiki//Film") != HTTPPath("https://en.wikipedia.org/wiki//Film/")

(which they actually do).

RFC3986 defines how many of the methods in universal pathlib should be implemented when dealing with URIs, such a joining URI paths, normalizing URIs, and URI equivalence.


Also, I like this as a way of constructing URI paths with double slashes - very elegant!

>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")