PyFilesystem / pyfilesystem2

Python's Filesystem abstraction layer
https://www.pyfilesystem.org
MIT License

Feature request: directly open a file url? #336

Open longern opened 5 years ago

longern commented 5 years ago

Is there a method that supports directly opening a file URL, like smart-open? https://pypi.org/project/smart-open/

open('s3://commoncrawl/robots.txt')

chfw commented 5 years ago

I have the same need and already have some code for it.

chfw commented 5 years ago

I have a set of similar use cases here:

https://github.com/moremoban/moban/blob/dev/moban/file_system.py

where you can find:

  1. read_text(a_fs_url)
  2. read_binary(a_fs_url)

I would also need equivalents of these os.path functions:

  1. os.path.exists -> the_file_system.path_exists(a_fs_url)
  2. os.path.isfile -> the_file_system.is_file(a_fs_url)
  3. os.path.isdir -> the_file_system.is_dir(a_fs_url), and so on.

But I thought I was the only one with this need, and I am not sure whether such use cases fit pyfs2's concept: always open the parent directory first, then open a file.

willmcgugan commented 5 years ago

Not as such, but there is the fs.opener.open function, which splits the path from the FS URL.

>>> from fs.opener import open
>>> zip_fs, path = open("zip://foo.zip!/bar/egg")
>>> zip_fs.readtext(path)

longern commented 5 years ago

However, fs.opener.open won't work for a nonexistent path.

willmcgugan commented 5 years ago

@longern What would you expect to happen for a nonexistent path?

longern commented 5 years ago

Some methods can take a nonexistent path as their argument, such as mkdir, exists, and sometimes writing to a new file. Is there a shortcut for them?

exists('s3://commoncrawl/robots.txt')
mkdir('ftp://some-url/some-path/dirname')

willmcgugan commented 5 years ago

I'm not sure I follow. Are you looking for something like this?

from fs import open_fs

with open_fs("s3://commoncrawl") as fs:
    robots_exists = fs.exists("robots.txt")

longern commented 5 years ago

Sometimes the file URL comes from user input, so I need to split it into an FS URL and a path for every operation. I'm looking for methods that operate directly on a file URL.

willmcgugan commented 5 years ago

You can use this method to parse FS URLs.

lurch commented 5 years ago

@willmcgugan The documentation for ParseResult mentions a path part, but https://pyfilesystem2.readthedocs.io/en/latest/openers.html doesn't document how to include the path in an FS URL.
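
For illustration, the zip:// example earlier in this thread suggests that the path part follows a "!" after the resource. Below is a rough stand-in for that split, using only str.partition rather than the library's own parser. The helper name split_fs_url is made up for this sketch and is not part of PyFilesystem2.

```python
from typing import Optional, Tuple


def split_fs_url(fs_url: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: split 'protocol://resource!path' at the first '!'.

    Returns (filesystem URL, path), with path set to None when no '!' is present.
    This is a simplified stand-in, not the parser PyFilesystem2 uses.
    """
    base, sep, path = fs_url.partition("!")
    return base, (path if sep else None)


print(split_fs_url("zip://foo.zip!/bar/egg"))  # ('zip://foo.zip', '/bar/egg')
print(split_fs_url("zip://foo.zip"))           # ('zip://foo.zip', None)
```

Credentials and query parameters are left untouched inside the first element here, so this only shows where the path part goes.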

chfw commented 5 years ago

And it doesn't say how to open a file, only a path.

chfw commented 5 years ago

I can turn my module into an independent library if there is enough interest.

https://github.com/moremoban/moban/blob/dev/moban/file_system.py

chfw commented 5 years ago

Or I can upstream it into PyFilesystem2 if it fits its mission.

CMCDragonkai commented 4 years ago

I was looking for this, but I found that fs.opener.open didn't work for a file in the current directory. It just keeps saying that the root path does not exist.

CMCDragonkai commented 4 years ago

Seems like we just have to use:

import os

(fspath, filename) = os.path.split('s3://commoncrawl/a/b/c/robots.txt')
# note that this keeps the query parameter in the filename

Not sure if query parameters matter here.
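
To make the query-parameter caveat concrete, here is a small sketch using only the standard library. It uses posixpath (what os.path is on POSIX systems) so the result is the same on every platform. The query string versionId=123 and the helper split_file_url are invented for illustration. Moving the query onto the directory URL is just one possible choice, not something PyFilesystem2 prescribes.

```python
import posixpath
from urllib.parse import urlsplit

url = "s3://commoncrawl/a/b/c/robots.txt?versionId=123"

# A plain path split keeps the query string glued to the file name.
print(posixpath.split(url))
# ('s3://commoncrawl/a/b/c', 'robots.txt?versionId=123')

# It also mangles the scheme when there is no directory component.
print(posixpath.split("s3://robots.txt"))
# ('s3:', 'robots.txt')


def split_file_url(file_url):
    """Hypothetical helper: return (directory URL, file name).

    The query string, if any, is moved onto the directory URL.
    URLs without a path component (e.g. 's3://robots.txt') are not handled.
    """
    parts = urlsplit(file_url)
    dir_path, name = posixpath.split(parts.path)
    dir_url = f"{parts.scheme}://{parts.netloc}{dir_path}"
    if parts.query:
        dir_url += f"?{parts.query}"
    return dir_url, name


print(split_file_url(url))
# ('s3://commoncrawl/a/b/c?versionId=123', 'robots.txt')
```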

dargueta commented 4 years ago

The problem is some file system abstractions like s3 and gs use the first component of the URL as the bucket and don't expose it as part of the abstraction. It's an argument to the constructor, basically. You'd have to have file systems implement a classmethod to open an arbitrary URL to get around this.

CMCDragonkai commented 4 years ago

Example? Are you saying the s3 fs impl cannot open the path including the directory?

dargueta commented 4 years ago

Sorry, that was a bad example.

CMCDragonkai commented 4 years ago

I wrote something like this:

import contextlib
import urllib.parse
from typing import IO, Iterator, Tuple

import fs


def parse_file_url(url: str) -> Tuple[str, str]:
    fs_url = ''
    file_path = ''
    url_parsed = urllib.parse.urlparse(url)
    # if there's no scheme, it's a filesystem path
    if not url_parsed.scheme:
        fs_url += 'osfs://'
        # if it is an absolute path, the fs_url must start at the root
        if url_parsed.path.startswith('/'):
            fs_url += '/'
        # remove any leading slashes
        file_path += url_parsed.path.lstrip('/')
        if url_parsed.params:
            file_path += f';{url_parsed.params}'
        if url_parsed.fragment:
            file_path += f'#{url_parsed.fragment}'
    else:
        if not url_parsed.path:
            fs_url += f'{url_parsed.scheme}://'
            if url_parsed.query:
                fs_url += f'?{url_parsed.query}'
            file_path += url_parsed.netloc
        else:
            fs_url += f'{url_parsed.scheme}://'
            if url_parsed.netloc:
                fs_url += url_parsed.netloc
            if url_parsed.query:
                fs_url += f'?{url_parsed.query}'
            file_path += url_parsed.path
            if url_parsed.params:
                file_path += f';{url_parsed.params}'
            if url_parsed.fragment:
                file_path += f'#{url_parsed.fragment}'
    return (fs_url, file_path)

@contextlib.contextmanager
def open_file_url(url: str,
                  mode: str = 'r',
                  buffering=-1,
                  encoding=None,
                  errors=None,
                  newline='') -> Iterator[IO]:
    (fs_url, file_path) = parse_file_url(url)
    with fs.open_fs(fs_url) as fs_:
        with fs_.open(file_path, mode, buffering, encoding, errors,
                      newline) as file:
            yield file

mezhaka commented 3 years ago

I ended up with something like this:

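import os
from contextlib import contextmanager
from typing import IO, Iterator, Optional

from fs import open_fs


@contextmanager
def open_file(url: str,
              mode: str = "r",
              create: bool = False,
              buffering: int = -1,
              encoding: Optional[str] = None,
              errors: Optional[str] = None,
              newline: str = "",
              **options) -> Iterator[IO]:
    # Any mode that can modify the file needs a writeable filesystem.
    writeable = any(flag in mode for flag in "wax+")
    # Open the parent directory as the filesystem, then the file inside it.
    dir_url, file_name = os.path.split(url)
    with open_fs(dir_url, writeable, create) as fs_:
        with fs_.open(file_name, mode, buffering, encoding, errors, newline, **options) as file_:
            yield file_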