fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License
1.05k stars 362 forks source link

Inconsistencies regarding quoting / escaping in URLs / paths #1713

Open mxmlnkn opened 1 month ago

mxmlnkn commented 1 month ago

I think that the interface regarding path inputs should be more consistent(ly defined).

E.g., consider this case where ls even returns some mix of escaped and unescaped symbols, which seems to be clearly a bug:

Server set up with:

mkdir -p /tmp/served
echo foo > /tmp/served/'#not-a-good-name!?'
ruby -run -e httpd /tmp/served/ --port 8000 --bind-address=127.0.0.1
import pprint
import urllib
from fsspec.implementations.http import HTTPFileSystem as HFS

url = "http://127.0.0.1:8000"
fs = HFS(url)
# What I would have expected to work:
# result = fs.ls("/")
results = fs.ls(url)
pprint.pprint(results)
print()

for result in results:
    if '/?' not in result['name']:
        # Neither do work
        # pprint.pprint(fs.stat(urllib.parse.quote(result['name'])))
        pprint.pprint(fs.stat(result['name']))

Output:

[{'name': 'http://127.0.0.1:8000/%23not-a-good-name!?',
  'size': None,
  'type': 'file'},
 {'name': 'http://127.0.0.1:8000/?N=D', 'size': None, 'type': 'file'},
 {'name': 'http://127.0.0.1:8000/?S=D', 'size': None, 'type': 'file'},
 {'name': 'http://127.0.0.1:8000/?M=D', 'size': None, 'type': 'file'}]

{'ETag': '4013b0-4-670a77e8',
 'mimetype': 'application/octet-stream',
 'name': 'http://127.0.0.1:8000/%23not-a-good-name%21%3F',
 'size': 4,
 'type': 'file',
 'url': 'http://127.0.0.1:8000/%23not-a-good-name!%3F'}

Traceback (most recent call last):
  File "~/.local/lib/python3.12/site-packages/fsspec/implementations/http.py", line 422, in _info
    await _file_info(
  File "~/.local/lib/python3.12/site-packages/fsspec/implementations/http.py", line 832, in _file_info
    r = await session.get(url, allow_redirects=ar, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/aiohttp/client.py", line 590, in _request
    raise err_exc_cls(url)
aiohttp.client_exceptions.InvalidUrlClientError: http://127.0.0.1:8000/%2523not-a-good-name!%3F

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/d/Myself/projects/ratarmount/worktrees/1/test-http.py", line 15, in <module>
    pprint.pprint(fs.stat(urllib.parse.quote(result['name'])))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/fsspec/spec.py", line 1605, in stat
    return self.info(path, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "~/.local/lib/python3.12/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "~/.local/lib/python3.12/site-packages/fsspec/implementations/http.py", line 435, in _info
    raise FileNotFoundError(url) from exc
FileNotFoundError: http%3A//127.0.0.1%3A8000/%2523not-a-good-name%21%3F

The # is escaped as %23 but ? is not escaped leading to the observed FileNotFoundError.

Using urllib.parse.quote also does not work because it would escape % again even after going through the trouble of parsing the whole URL because we do not want to escape the special characters in http://127.0.0.1:8000 only those in the path. Even this would result in %2523not-a-good-name%21%3F, which also leads to a FileNotFoundError.

HTTPFileSystem returning those sorting actions such as ?N=D is another matter altogether, but because it parses generated HTML listing from arbitrary HTTP servers, I am not surprised that this is not stable and might be impossible to fix for all cases.

Possibly related issues:

martindurant commented 1 month ago

Some file systems return full paths in ls, some relative paths.

Where do you see an example of returning relative paths?

martindurant commented 1 month ago

for HTTPFileSystem, I have to specify the whole URL

HTTP is probably the only case of this, and the reason is, that the filesystem is responsible for both "http" and "https", but knowing which is imperative for getting data back. This is unfortunate...

It might be reasonable to have a separate http and https FS instance, each of which can only respond to the given protocol, and only return ls() results for links of the same protocol (the same_scheme= argument already exists on HTTPFileSystem, we would have it always True).

Another thought, is the have the "in FS name" and "canonical name" be separate fields in ls()/find()/info(). The question becomes: