geopython / pygeoapi

pygeoapi is a Python server implementation of the OGC API suite of standards. The project emerged as part of the next generation OGC API efforts in 2018 and provides the capability for organizations to deploy a RESTful OGC API endpoint using OpenAPI, GeoJSON, and HTML. pygeoapi is open source and released under an MIT license.
https://pygeoapi.io

[WIP] File system/Data storage abstraction #1640

Open MTachon opened 2 months ago

MTachon commented 2 months ago

Overview

This PR intends to abstract away the local or remote file system, or byte storage, where the data used by the providers is stored. The implementation relies on the file-system interface provided by fsspec. fsspec already supports various file systems and cloud storage services through built-in implementations. It also allows using other known implementations, as well as implementing and registering new backends.

The data storage abstraction happens mostly in the implementation of the BaseProvider, where a new instance attribute fs is introduced. This attribute, an fsspec file-system interface, is inherited by the other providers, which can use it in their implementation to access the data.
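As a rough illustration of the inheritance described above, the following minimal sketch uses hypothetical stand-ins: the simplified BaseProvider constructor, the TextProvider subclass and the /demo/hello.txt path are not taken from the PR. fsspec's in-memory file system plays the role of a remote store.

```python
import fsspec


class BaseProvider:
    """Hypothetical, stripped-down stand-in for pygeoapi's BaseProvider."""

    def __init__(self, provider_def, fs):
        self.data = provider_def["data"]
        # fsspec file-system interface, shared with every subclass
        self.fs = fs


class TextProvider(BaseProvider):
    """Hypothetical provider that only relies on the inherited `fs`."""

    def read(self):
        # No backend-specific code: the provider only talks to `self.fs`
        with self.fs.open(self.data, mode="rt") as f:
            return f.read()


# In-memory file system, used here as a stand-in for remote storage
memory_fs = fsspec.filesystem("memory")
memory_fs.pipe("/demo/hello.txt", b"hello from fsspec")

provider = TextProvider({"data": "/demo/hello.txt"}, fs=memory_fs)
content = provider.read()
print(content)
```

The subclass never needs to know which backend sits behind fs, which is the point of the abstraction.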

The instantiation of fsspec file-system objects may use configuration variables set in the providers section (some backends used by fsspec may also use environment variables).

This PR introduces a new optional file_system section in the providers section of pygeoapi's runtime configuration. If the file_system section is omitted, the fs attribute of the BaseProvider is an instance of the fsspec.implementations.local.LocalFileSystem class. In that case, other providers can use the fs attribute to access and read files on the local file system. In the provider implementations, calls to the built-in open function can then be replaced by calls to the open method of the LocalFileSystem:


# Calls of 'open' builtin function...
with open(self.data, mode='rt') as f:
    ...

# ... can be replaced with calls of 'open' method of LocalFileSystem instance
with self.fs.open(self.data, mode='rt') as f:
    ...
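To check that this replacement preserves behaviour for local files, a small sketch such as the one below can compare both calls on a temporary file. The file name and its content are made up for the illustration.

```python
import os
import tempfile

from fsspec.implementations.local import LocalFileSystem

fs = LocalFileSystem()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")  # hypothetical data file
    with open(path, mode="wt", encoding="utf-8") as f:
        f.write("id,name\n1,foo\n")

    # Read with the built-in function...
    with open(path, mode="rt", encoding="utf-8") as f:
        builtin_content = f.read()

    # ... and with the LocalFileSystem method
    with fs.open(path, mode="rt", encoding="utf-8") as f:
        fsspec_content = f.read()

print(builtin_content == fsspec_content)
```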

The file_system section, if given in pygeoapi's configuration, has one mandatory field, protocol. Its value must be one of the protocols supported by fsspec (see fsspec.available_protocols()). For faster access, the data can be cached locally (e.g. when the data is on remote storage). This is not suitable for very large datasets, because the data has to be downloaded on the first query, which then takes both a long time and a lot of disk space. To cache the data locally, one can configure pygeoapi's runtime as follows:


providers:
    - type: ...
      ...
      data: <my-bucket>/<key>
      file_system:  # optional
          protocol: gs  # mandatory, anything from the `fsspec.available_protocols()` list
          storage_options:  # optional
              # Credentials and other keywords parameters, specific to implementations supported by fsspec.
              # See https://filesystem-spec.readthedocs.io/en/latest/api.html#implementations and https://filesystem-spec.readthedocs.io/en/latest/api.html#external-implementations
              ...
          cache_storage: /path/to/cached/data  # optional, if not given, a temporary directory (cleaned up when process ends) will be used
          cache_options:  # optional
              # see https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.cached.WholeFileCacheFileSystem
              expiry_time: ...
              ...
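One possible way to turn such a section into an fsspec object is sketched below. The build_filesystem helper is hypothetical and not the PR's actual code, and so is the rule that caching is enabled only when a cache_storage or cache_options key is present. The fsspec calls themselves (fsspec.filesystem, fsspec.available_protocols and the "filecache" protocol backed by WholeFileCacheFileSystem) are existing API. The in-memory protocol stands in for a remote bucket so the sketch runs without credentials.

```python
import tempfile

import fsspec


def build_filesystem(file_system_def=None):
    """Hypothetical helper: map a `file_system` section to an fsspec object."""
    if not file_system_def:
        # Section omitted: fall back to the local file system
        return fsspec.filesystem("file")

    protocol = file_system_def["protocol"]  # the only mandatory field
    if protocol not in fsspec.available_protocols():
        raise ValueError(f"Unsupported fsspec protocol: {protocol!r}")

    storage_options = file_system_def.get("storage_options") or {}

    # Assumption: caching is requested only when a cache_* key is present
    wants_cache = ("cache_storage" in file_system_def
                   or "cache_options" in file_system_def)
    if not wants_cache:
        return fsspec.filesystem(protocol, **storage_options)

    # "filecache" wraps the target file system and caches whole files locally.
    # "TMP" is fsspec's default and means a temporary cache directory.
    return fsspec.filesystem(
        "filecache",
        target_protocol=protocol,
        target_options=storage_options,
        cache_storage=file_system_def.get("cache_storage", "TMP"),
        **(file_system_def.get("cache_options") or {}),
    )


# Stand-in for a remote object: a file in fsspec's in-memory file system
fsspec.filesystem("memory").pipe("/my-bucket/data.txt", b"cached bytes")

cache_dir = tempfile.mkdtemp()
fs = build_filesystem({
    "protocol": "memory",
    "cache_storage": cache_dir,
    "cache_options": {"expiry_time": 3600},
})

# The first read copies the file into cache_dir; later reads use the copy
with fs.open("/my-bucket/data.txt", mode="rb") as f:
    data = f.read()
print(data)
```

With a real backend such as gs, the corresponding fsspec implementation (for example gcsfs) has to be installed and credentials passed through storage_options or environment variables.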

Related Issue / discussion

https://github.com/geopython/pygeoapi/issues/824

Additional information

Dependency policy (RFC2)

Updates to public demo

Contributions and licensing

(as per https://github.com/geopython/pygeoapi/blob/master/CONTRIBUTING.md#contributions-and-licensing)

tomkralidis commented 3 weeks ago

@MTachon I see this PR is marked as WIP. Is this still the case or is it ready for review? Thanks.