pygeoapi is a Python server implementation of the OGC API suite of standards. The project emerged as part of the next generation OGC API efforts in 2018 and provides the capability for organizations to deploy a RESTful OGC API endpoint using OpenAPI, GeoJSON, and HTML. pygeoapi is open source and released under an MIT license.
This PR abstracts away the local or remote file system (or byte storage) that holds the data used by the providers. The implementation relies on the file-system interface provided by fsspec. fsspec already supports various file systems and cloud storage services through built-in implementations. It also allows using other known implementations, as well as implementing and registering new backends.
The data storage abstraction happens mostly in the implementation of the BaseProvider, where a new instance attribute fs is introduced. This attribute, an fsspec file-system interface, is inherited by the other providers, which can use it in their implementation to access the data.
The instantiation of fsspec file-system objects may use configuration variables set in the providers section (some backends used by fsspec may also use environment variables).
This PR introduces a new optional file_system section in the providers section of pygeoapi's runtime configuration. If the file_system section is omitted, the fs attribute of the BaseProvider is an instance of the fsspec.implementations.local.LocalFileSystem class. In that case, other providers can use the fs attribute to access and read files on the local file system. In the provider implementations, calls to the builtin open function can then be replaced by calls to the open method of the LocalFileSystem:
    # Calls to the builtin 'open' function...
    with open(self.data, mode='rt') as f:
        ...

    # ... can be replaced with calls to the 'open' method of the LocalFileSystem instance
    with self.fs.open(self.data, mode='rt') as f:
        ...
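To illustrate why this substitution matters, here is a minimal sketch of a provider-like class that reads its data only through `self.fs`. The class `SketchCSVProvider` and its `rows` method are hypothetical and are not part of pygeoapi; the sketch only assumes that a provider holds a `data` path and an fsspec file system in `fs`, defaulting to `LocalFileSystem`.

```python
import csv
import os
import tempfile

import fsspec
from fsspec.implementations.local import LocalFileSystem


class SketchCSVProvider:
    """Hypothetical, stripped-down provider; not pygeoapi's actual BaseProvider."""

    def __init__(self, provider_def, fs=None):
        self.data = provider_def['data']
        # Default to the local file system, as when 'file_system' is omitted
        self.fs = fs if fs is not None else LocalFileSystem()

    def rows(self):
        # The same call works whatever backend sits behind self.fs
        with self.fs.open(self.data, mode='rt') as f:
            return list(csv.DictReader(f))


# Read from a local file
path = os.path.join(tempfile.mkdtemp(), 'obs.csv')
with open(path, 'w') as f:
    f.write('id,value\n1,42\n')
local_rows = SketchCSVProvider({'data': path}).rows()

# Same provider code against another backend (fsspec's in-memory file system)
mem = fsspec.filesystem('memory')
with mem.open('/demo/obs.csv', 'wt') as f:
    f.write('id,value\n2,7\n')
memory_rows = SketchCSVProvider({'data': '/demo/obs.csv'}, fs=mem).rows()
```

The reading code in `rows` does not change between the two backends; only the file-system object passed in differs.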
The file_system section, if given in pygeoapi's configuration, has one mandatory field, protocol. Its value must be one of the protocols supported by fsspec (see fsspec.available_protocols()). For faster access, the data can be cached locally (e.g. when it lives on remote storage). This is not suitable for very large datasets, because the data must be downloaded on the first query, which costs both time and disk space. To cache the data locally, configure pygeoapi's runtime as follows:
    providers:
      - type: ...
        ...
        data: <my-bucket>/<key>
        file_system:  # optional
          protocol: gs  # mandatory, anything from the `fsspec.available_protocols()` list
          storage_options:  # optional
            # Credentials and other keyword parameters, specific to implementations supported by fsspec.
            # See https://filesystem-spec.readthedocs.io/en/latest/api.html#implementations and https://filesystem-spec.readthedocs.io/en/latest/api.html#external-implementations
            ...
          cache_storage: /path/to/cached/data  # optional, if not given, a temporary directory (cleaned up when process ends) will be used
          cache_options:  # optional
            # see https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.cached.WholeFileCacheFileSystem
            expiry_time: ...
            ...
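As a rough illustration of how such a section could be turned into an fsspec object, the sketch below validates the protocol and, when caching keys are present, wraps the target in fsspec's whole-file cache (protocol `filecache`). The helper `fs_from_config` is hypothetical and is not necessarily how this PR implements it; the in-memory file system stands in for remote storage so the example runs without credentials.

```python
import tempfile

import fsspec


def fs_from_config(fs_def):
    """Hypothetical helper: build an fsspec file system from a 'file_system' section."""
    if fs_def is None:
        return fsspec.filesystem('file')  # local file system by default

    protocol = fs_def['protocol']  # mandatory field
    if protocol not in fsspec.available_protocols():
        raise ValueError(f'Unsupported fsspec protocol: {protocol}')

    storage_options = fs_def.get('storage_options') or {}
    if 'cache_storage' not in fs_def and 'cache_options' not in fs_def:
        return fsspec.filesystem(protocol, **storage_options)

    # Wrap the target file system in fsspec's whole-file cache.
    # 'TMP' is fsspec's default: a temporary directory for the cached files.
    return fsspec.filesystem(
        'filecache',
        target_protocol=protocol,
        target_options=storage_options,
        cache_storage=fs_def.get('cache_storage', 'TMP'),
        **(fs_def.get('cache_options') or {}),
    )


# Demo: the in-memory file system plays the role of the remote store
remote = fsspec.filesystem('memory')
with remote.open('/my-bucket/obs.csv', 'wb') as f:
    f.write(b'id,value\n1,42\n')

fs = fs_from_config({
    'protocol': 'memory',
    'cache_storage': tempfile.mkdtemp(),
    'cache_options': {'expiry_time': 3600},
})
with fs.open('/my-bucket/obs.csv', mode='rt') as f:
    header = f.readline().strip()
```

The first `open` downloads the whole file into the cache directory; later reads within the expiry time are served from the local copy.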
This PR also adapts the implementation of several providers (GeoJSON, CSV, FileSystem, Rasterio, Xarray) to the data storage abstraction, which should give a good starting point for adapting other providers.
For now, the data storage abstraction cannot be used by the OGRProvider, because the Open method of the drivers from osgeo.gdal/osgeo.ogr accepts only a path (str) to the data, not a file-like handle. However, this provider can already access data on remote storage through GDAL's virtual file systems. Any required environment variables must then be set in pygeoapi's parent process.
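For readers unfamiliar with GDAL's virtual file systems, the sketch below shows how a bucket/key path could be rewritten into a GDAL `/vsi...` path that the OGR drivers accept as a plain string. The helper `to_vsi_path` and the protocol mapping are illustrative assumptions, not part of this PR; only the `/vsis3/`, `/vsigs/` and `/vsiaz/` prefixes come from GDAL itself.

```python
# Assumed mapping from fsspec-style protocol names to GDAL virtual file system prefixes
VSI_PREFIXES = {
    's3': '/vsis3/',
    'gs': '/vsigs/',
    'az': '/vsiaz/',
}


def to_vsi_path(protocol, data):
    """Hypothetical helper: turn '<bucket>/<key>' into a GDAL virtual file system path."""
    try:
        prefix = VSI_PREFIXES[protocol]
    except KeyError:
        raise ValueError(f'No GDAL virtual file system known for: {protocol}')
    return prefix + data.lstrip('/')


vsi_path = to_vsi_path('gs', 'my-bucket/data/places.gpkg')
```

The resulting string would be given to the provider as its data path, with credentials supplied through the environment variables GDAL expects for that backend.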
fsspec was added to requirements.txt because it is used in pygeoapi/provider/base.py. fsspec has had an OS package for Ubuntu for some time now, and does not seem to have any known high or critical vulnerabilities.
[X] I'd like to contribute [feature X|bugfix Y|docs|something else] to pygeoapi. I confirm that my contributions to pygeoapi will be compatible with the pygeoapi license guidelines at the time of contribution
[X] I have already previously agreed to the pygeoapi Contributions and Licensing Guidelines
Related Issue / discussion
https://github.com/geopython/pygeoapi/issues/824
Additional information
fsspec comes with built-in implementations. Packages for external implementations supported by fsspec can be added to the requirements-provider.txt file.
netCDF4 was replaced by h5py and h5netcdf to allow reading from remote storage with xarray.
Dependency policy (RFC2)
Updates to public demo
Contributions and licensing
(as per https://github.com/geopython/pygeoapi/blob/master/CONTRIBUTING.md#contributions-and-licensing)