fsspec / universal_pathlib

pathlib api extended to use fsspec backends
MIT License
251 stars 44 forks source link

The prefix file:// results in ignoring the first path component. #108

Closed pedro-r-marques closed 9 months ago

pedro-r-marques commented 1 year ago

I'm seeing this issue in both macos and linux

PATHS = ['/var/tmp', 'file:/var/tmp', 'file://var/tmp', 'file:///var/tmp']
{p:list(upath.UPath(p).iterdir()) for p in PATHS}

The variant "file://var/tmp" gives me the contents of "/tmp". The other variants work as expected.

ap-- commented 1 year ago

Hi @pedro-r-marques

thank you for opening an issue!

I do have a pretty good idea where the error is occurring here, but I'd love to have some more information from you beforehand.

  1. Could you tell me what the first thing you tried was, before you were investigating the behavior of the different amount of slashes after file:, i.e. how did you generate your URIs?
  2. Could you have a short read of this: https://en.wikipedia.org/wiki/File_URI_scheme#How_many_slashes? and let me know if it is clear to you why the behavior is different
  3. Would a warning have helped to address this?

Basically file://var/tmp is interpreted as host=var and path=/tmp and we currently seem to ignore the host for local paths. The question is, how we should address this in the best way. Very likely with both: better documentation AND actually using the host and erroring with a clear error msg.

Cheers, Andreas 😃

pedro-r-marques commented 1 year ago

I initially tried "file://" expecting it to follow other url-like parameters. If the implementation interprets the first string as hostname it should make sure that the value is usable and throw an error otherwise. i.e. if hostname != 'localhost' and hostname != '' then it should imho raise an ArgumentError.

ap-- commented 1 year ago

~Notes:~

~We should provide a correct error message for now, and in case pathlib or we decide to support parsing relative path uri's we can switch, see:~

moved to new issue #142

barneygale commented 1 year ago

file://foo/bar does not represent a relative path on Windows - it represents a UNC path \\foo\bar.

In my CPython PR I made it work the same on POSIX, and so:

>>> pathlib.Path.from_uri('file://var/tmp')
PosixPath('//var/tmp')

Note that paths with two leading slashes are allowed by POSIX:

A pathname consisting of a single slash shall resolve to the root directory of the process. A null pathname shall not be successfully resolved. A pathname that begins with two successive slashes may be interpreted in an implementation-defined manner, although more than two leading slashes shall be treated as a single slash.

ap-- commented 1 year ago

Thanks for the clarification @barneygale!

So does that mean file://localhost/absolute/path/file.txt would actually not map to /absolute/path/file.txt on posix systems, but to the path //localhost/absolute/path/file.txt?

how does that fit together with rfc8089? Are "file" ":" "//" "auth-part" -style URIs with a host not compatible with posix?

barneygale commented 1 year ago

I've never come across file URIs that use file://localhost/ in the wild, but they should probably be handled specially. I'll adjust the PR.

The RFC also says:

   The "host" is the fully qualified domain name of the system on which
   the file is accessible.  This allows a client on another system to
   know that it cannot access the file system, or perhaps that it needs
   to use some other local mechanism to access the file.

... which works fine on Windows, but not on other systems :(

ap-- commented 9 months ago

Fixed in universal_pathlib==0.2.0. Behavior for "file://path" uris is now identical to fsspec.

>>> import os
>>> os.getcwd()
'/home/user'
>>> import upath
>>> PATHS = ['/tmp/test', 'file:/tmp/test', 'file://tmp/test', 'file:///tmp/test']
>>> {p: upath.UPath(p) for p in PATHS}
{
  '/tmp/buh': PosixUPath('/tmp/buh'),
  'file:/tmp/buh': FilePath('file:///tmp/buh'),
  'file://tmp/buh': FilePath('file:///home/user/tmp/buh'),
  'file:///tmp/buh': FilePath('file:///tmp/buh'),
}