flahn opened this issue 5 years ago
Hi, I have a strong opinion here: it is not useful to model every kind of data as a cube that can contain any kind of object (pixel, voxel, list, table, dictionary, feature, string, ...) as values in a hyper-dimensional array, especially for an exchange format between backend and UDF server. IMHO I see no convincing reasons to do so, except: just because we can, or because we fell in love with the products of our intellects.
A couple of questions regarding the proposed UDF format: What software library implements this format and allows easy conversion to established "runtime" formats like numpy, shapely or xarray? Why do we need to invent new complex formats outside existing approaches, with feature dimensions and irregular spatial dimensions? What software exists that the UDF developer can use to process this format in Python and R?
We should use common and well-established data formats, so it is easy to implement UDFs for the user and easy to exchange data between the backend and the UDF REST server. UDFs should only crunch data as fast as possible. The proposed JSON specification of the current UDF approach has direct equivalents in the Python and R libraries that are used for data processing. The REST JSON specification is only relevant for the backend developer, not for the UDF user! We do not need a new data format; the specification is designed only for data exchange between the backend and the UDF REST service!
Using GeoJSON makes it very clear for the backend what kind of data was processed, and it is directly supported by the most common geo-data processing libraries: GEOS and OGR. The pandas.DatetimeIndex class is used for the time stamps. No need for data conversion. A Python UDF developer uses the libraries geopandas and shapely directly to process the data in a UDF [1]. The R developer can use geojsonR to process the data.
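To illustrate the point, a minimal sketch of such a UDF (function and parameter names are mine for illustration; the actual code in [1] differs in detail):

import json

import geopandas


def buffer_features(feature_collection: dict, buffer_size: float) -> dict:
    """Buffer all features of a GeoJSON FeatureCollection and return GeoJSON."""
    # GeoJSON maps directly onto a GeoDataFrame, no custom parsing required
    gdf = geopandas.GeoDataFrame.from_features(feature_collection["features"])
    gdf["geometry"] = gdf.geometry.buffer(buffer_size)  # shapely does the actual work
    # ... and back to GeoJSON for the response to the backend
    return json.loads(gdf.to_json())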
If you need raster data with more than three dimensions (e.g. two regular time axes), or you have dense data like MODIS (HDF) or climate model data (netCDF), then the hypercube format can be used. This is implemented in Python via xarray, hence the UDF developer can use xarray directly, no data conversion required. Arrays are supported in R directly.
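As a hedged sketch of what "no data conversion required" means in practice (the data, dimension names and coordinates are invented for illustration):

import numpy as np
import xarray

# nested array as it would arrive in the JSON payload
data = [[[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]]]

cube = xarray.DataArray(
    np.asarray(data),
    dims=("t", "y", "x"),
    coords={"t": ["2001-01-01", "2001-01-02"], "y": [0.0, 1.0], "x": [0.0, 1.0]},
    name="temperature",
)
result = cube.mean(dim="t")  # processed with plain xarray, no conversion step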
The RasterCollectionTile is a more specialized version of the hypercube that supports an irregular time dimension and 2D raster slices, which may be better suited for remote sensing data of non-geostationary satellites that is sparse in time and space. The Python counterparts are pandas.DatetimeIndex and numpy arrays, which can be used in the Python UDF code directly. I think the difference between a RasterCollectionTile and a hypercube is small at the moment, so it may be meaningful to use only one format. The difference is that in a RasterCollectionTile a 2D raster slice can be stored for a specific irregular time instance or time interval. However, for computational reasons the spatial density of a single 2D slice is currently exactly the same as in a hypercube: all 2D slices have exactly the same spatial extent and resolution. I plan that each slice can have a different spatial extent, since it represents a single satellite image band at a specific time instance. Most of the NULL data that must be filled in a hypercube approach (the hypercube spatial extent is the union of all 2D slice extents) is then simply outside the spatial extent of the slice. However, this makes the computation a bit more complex, since the UDF algorithm must be aware that each slice (in Python a numpy array) has a different spatial extent and must align them for kernel or aggregation computations. On the other hand, UDFs that apply algorithms (raster statistics, NDVI, ...) to each slice and do not need to know the temporal neighbors in the same RasterCollectionTile can greatly benefit from this approach.
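A sketch of the per-slice case described above (band names and sizes are invented): each 2D slice is a plain numpy array with its own timestamp, and an NDVI-like computation never has to look at temporal neighbors.

import numpy as np
import pandas as pd

timestamps = pd.DatetimeIndex(["2001-01-01", "2001-01-02"])
red_slices = [np.random.rand(3, 3) for _ in timestamps]  # one 2D slice per time instance
nir_slices = [np.random.rand(3, 3) for _ in timestamps]

# slice-wise computation, no alignment with temporal neighbors needed
ndvi_slices = [(nir - red) / (nir + red)
               for red, nir in zip(red_slices, nir_slices)]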
About structured and unstructured data: it is much easier for a UDF developer to store nested data that can contain lists, tables and dictionaries in their native data formats (lists, tables, dictionaries), which have equivalents in many programming languages. Forcing these data into a multi-dimensional array is IMHO not useful, especially when the metadata of the data cube must be extended to support this. Metadata for a specific computation can be stored this way, e.g. statistical information about the resulting hypercube. The backend can, for example, stream this data downstream to an info plotter node that displays metadata about the computed hypercube.
Regarding machine learning models: loading a machine learning model in a UDF every time a slice is sent is computationally very expensive. Machine learning models can be quite large, hence the UDF server should cache the models once they have been requested in a UDF request. The latest UDF implementation exposes endpoints to load pre-trained machine learning models directly into the UDF server and to reference them in the machine learning model definition. The user should not have access to the UDF server directly (data mounting, ...), but the backend provider can deploy any machine learning model on the UDF server and expose its id to the UDF developer. The current UDF approach makes the application of a specific machine learning model very easy for the UDF developer: he just calls the predict() function of the already pre-loaded model and passes the data [2]. Be aware that in parallel processing mode many, many tiles are sent to the UDF server that may all use the same machine learning model. Pre-loading the model is the only option here to avoid extremely expensive overhead, and this can only be supported in the UDF server itself.
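A hedged sketch of the caching idea (not the actual openeo-udf server code; names and paths are illustrative):

import pickle

_MODEL_CACHE = {}  # model id -> loaded model, filled on first request


def get_model(model_id: str, path: str):
    """Return a pre-loaded model; the expensive load happens only once."""
    if model_id not in _MODEL_CACHE:
        with open(path, "rb") as model_file:
            _MODEL_CACHE[model_id] = pickle.load(model_file)
    return _MODEL_CACHE[model_id]

# inside the UDF, as in [2], the developer then only calls predict():
#   model = get_model("rf_1", "/udf_models/rf_1.pkl")
#   prediction = model.predict(samples)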
[1] https://github.com/Open-EO/openeo-udf/blob/master/src/openeo_udf/functions/feature_collections_buffer.py
[2] https://github.com/Open-EO/openeo-udf/blob/master/src/openeo_udf/functions/raster_collections_sklearn_ml.py#L68
I think we should have a discussion in person to figure out how to proceed. I see room for alignment between the API/processes and the UDFs, and both sides have good points, which we should align for the best outcome. A good place to discuss this would be (before?) the 3rd year planning on the 12th of September, so that we can release a "stable" 1.0 of the API at the end of the year, which includes the UDFs, of course.
cc @edzer
@huhabla
> IMHO I see no convincing reasons to do so, except: just because we can, or because we fell in love with the products of our intellects.
For starters, we defined it as our main view on EO data in the glossary (https://open-eo.github.io/openeo-api/glossary/), and from a user's / UDF script developer's view it might get complicated if you have to deal with different data models and don't know when or under what circumstances you get which of them as data in the UDF. If this were somehow specified, I would be happy with that.
> A couple of questions regarding the proposed UDF format: What software library implements this format and allows easy conversion to established "runtime" formats like numpy, shapely or xarray? Why do we need to invent new complex formats outside existing approaches, with feature dimensions and irregular spatial dimensions? What software exists that the UDF developer can use to process this format in Python and R?
The exchange format is still JSON and the `data` part is still a nested array - `MultidimensionalArray = Array[]...` (with as many dimensions as specified). I put `Object` as a placeholder for different data types, because besides float it might be int, boolean, string or something else. Maybe this was not clear enough.
> We should use common and well-established data formats, so it is easy to implement UDFs for the user and easy to exchange data between the backend and the UDF REST server.
I agree, but we currently have defined something that works out of practicability with Python. In my opinion we need a more generic approach for the metadata that ships with the data, so all back-ends and UDF services are able to interpret the data correctly.
Exchanging JSON between back-ends and UDF services is the lowest common denominator for all back-ends, simply because every back-end needs JSON anyway to be compliant with the API. Besides that, we might have an option to use NetCDF, which supports multidimensional data handling, but then we would still have some work to do on agreeing on a common metadata handling. The only downside is that maybe not every back-end supports NetCDF.
> UDFs should only crunch data as fast as possible. The proposed JSON specification of the current UDF approach has direct equivalents in the Python and R libraries that are used for data processing.
Yes, that's right, those are built on arrays (please also consider my explanation of what a `MultidimensionalArray` is). But what about the metadata on the indices in those arrays? I do see your point for having those models, but you make some assumptions when calling them something like `spatial_raster_tiles`. Even with an option like that, you might have a spatial dimension, a temporal one and wavelengths. With my approach the data becomes more self-explanatory, which should be the goal for UDFs.
The model, as I suggested it, is a first draft and has room for improvement, of course.
> The REST JSON specification is only relevant for the backend developer, not for the UDF user! We do not need a new data format; the specification is designed only for data exchange between the backend and the UDF REST service!
Yes, a back-end developer has to know it. But the UDF script developer might not be a developer of the back-end and hence needs more explanatory data, or at least a common way to know what data to expect.
Also, as a UDF service developer for R, I have to know what data comes in, in order to translate it into something an R user / UDF script developer can understand, e.g. `stars` or an array with sufficient metadata.
> Using GeoJSON makes it very clear for the backend what kind of data was processed, and it is directly supported by the most common geo-data processing libraries: GEOS and OGR. The pandas.DatetimeIndex class is used for the time stamps. No need for data conversion. A Python UDF developer uses the libraries geopandas and shapely directly to process the data in a UDF [1]. The R developer can use geojsonR to process the data.
GeoJSON is still on the table. For a potential vector data cube, I would model the spatial feature data as GeoJSON and hence as values of an irregularly spaced dimension of the special `type=feature`. This dimension being a `FeatureDimension` would enable me to use geojsonR in the first place.
> If you need raster data with more than three dimensions (e.g. two regular time axes), or you have dense data like MODIS (HDF) or climate model data (netCDF), then the hypercube format can be used. This is implemented in Python via xarray, hence the UDF developer can use xarray directly, no data conversion required. Arrays are supported in R directly.
Yes, that's right again: `data` can be translated into an array directly. Having sufficient metadata for the dimensions is the other thing. Maybe see my suggested data model as a metadata extension for your hypercube format. Maybe we can start from there.
> The RasterCollectionTile is a more specialized version of the hypercube that supports an irregular time dimension and 2D raster slices, which may be better suited for remote sensing data of non-geostationary satellites that is sparse in time and space. The Python counterparts are pandas.DatetimeIndex and numpy arrays, which can be used in the Python UDF code directly. I think the difference between a RasterCollectionTile and a hypercube is small at the moment, so it may be meaningful to use only one format. The difference is that in a RasterCollectionTile a 2D raster slice can be stored for a specific irregular time instance or time interval. However, for computational reasons the spatial density of a single 2D slice is currently exactly the same as in a hypercube: all 2D slices have exactly the same spatial extent and resolution. I plan that each slice can have a different spatial extent, since it represents a single satellite image band at a specific time instance. Most of the NULL data that must be filled in a hypercube approach (the hypercube spatial extent is the union of all 2D slice extents) is then simply outside the spatial extent of the slice. However, this makes the computation a bit more complex, since the UDF algorithm must be aware that each slice (in Python a numpy array) has a different spatial extent and must align them for kernel or aggregation computations. On the other hand, UDFs that apply algorithms (raster statistics, NDVI, ...) to each slice and do not need to know the temporal neighbors in the same RasterCollectionTile can greatly benefit from this approach.
OK, but why wouldn't we split this into separate UDF calculation requests, one for each tile time series?
> About structured and unstructured data: it is much easier for a UDF developer to store nested data that can contain lists, tables and dictionaries in their native data formats (lists, tables, dictionaries), which have equivalents in many programming languages. Forcing these data into a multi-dimensional array is IMHO not useful, especially when the metadata of the data cube must be extended to support this. Metadata for a specific computation can be stored this way, e.g. statistical information about the resulting hypercube. The backend can, for example, stream this data downstream to an info plotter node that displays metadata about the computed hypercube.
The multidimensional array would be a one-dimensional one in this case. A list would have an index dimension (regular, with attributes offset=0 or 1 and delta=1); if you have a named list, use an irregular one with values.
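A sketch of that mapping, written as plain Python dicts (the attribute names are illustrative, not an agreed schema):

# a plain list becomes a 1-D array with a regular index dimension
plain_list = {
    "data": [10, 20, 30],
    "dimensions": [{"name": "index", "regular": True, "offset": 0, "delta": 1}],
}

# a named list becomes a 1-D array with an irregular dimension carrying the keys
named_list = {
    "data": [10, 20, 30],
    "dimensions": [{"name": "key", "regular": False, "values": ["a", "b", "c"]}],
}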
> Regarding machine learning models: loading a machine learning model in a UDF every time a slice is sent is computationally very expensive. Machine learning models can be quite large, hence the UDF server should cache the models once they have been requested in a UDF request. The latest UDF implementation exposes endpoints to load pre-trained machine learning models directly into the UDF server and to reference them in the machine learning model definition. The user should not have access to the UDF server directly (data mounting, ...), but the backend provider can deploy any machine learning model on the UDF server and expose its id to the UDF developer. The current UDF approach makes the application of a specific machine learning model very easy for the UDF developer: he just calls the predict() function of the already pre-loaded model and passes the data [2]. Be aware that in parallel processing mode many, many tiles are sent to the UDF server that may all use the same machine learning model. Pre-loading the model is the only option here to avoid extremely expensive overhead, and this can only be supported in the UDF server itself.
I understand and would agree that this should somehow be covered.
I agree with @m-mohr that a face-to-face meeting would be a good idea to discuss this further. I'm looking forward to it.
@huhabla
> IMHO I see no convincing reasons to do so, except: just because we can, or because we fell in love with the products of our intellects.
> For starters, we defined it as our main view on EO data in the glossary (https://open-eo.github.io/openeo-api/glossary/), and from a user's / UDF script developer's view it might get complicated if you have to deal with different data models and don't know when or under what circumstances you get which of them as data in the UDF. If this were somehow specified, I would be happy with that.
I do not understand this. Why would the UDF developer not know what to expect? There is a Python API specifically designed for Python UDF developers with well-known data formats. The UDF developer expects a specific format as input, since he designs his algorithms for this format. If the backend does not provide this format in the UdfData object, then the UDF raises an exception stating what it expects, for example: a hypercube with one temporal and two spatial dimensions and vector points as a FeatureCollectionTile, because its job is to sample hypercubes with vector points to generate new time-stamped vector points with new attributes as output.
The UDF developer can tell the backend with in-code keywords what the UDF expects and what it produces, so that the backend can check whether the user provides the correct input for the UDF in the process graph and whether the UDF output is compatible with downstream nodes in the process graph.
The designer of the process graph is responsible for knowing what data formats are expected by the UDF that he wants to use. Hence, the UDF node must have data source nodes as inputs that provide the required formats.
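A minimal sketch of that contract (UdfData is the name used in this thread; the attribute names below are assumptions for illustration):

def sample_cube_with_points(udf_data):
    """A UDF that samples a hypercube with vector points."""
    # the attribute names hypercubes/feature_collections are assumed here
    if not udf_data.hypercubes or not udf_data.feature_collections:
        raise ValueError("This UDF expects a hypercube with dimensions (t, y, x) "
                         "and a FeatureCollectionTile with vector points as input")
    cube = udf_data.hypercubes[0]
    points = udf_data.feature_collections[0]
    # ... sample the cube at the point locations and return time-stamped points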
> A couple of questions regarding the proposed UDF format: What software library implements this format and allows easy conversion to established "runtime" formats like numpy, shapely or xarray? Why do we need to invent new complex formats outside existing approaches, with feature dimensions and irregular spatial dimensions? What software exists that the UDF developer can use to process this format in Python and R?
> The exchange format is still JSON and the `data` part is still a nested array - `MultidimensionalArray = Array[]...` (with as many dimensions as specified). I put `Object` as a placeholder for different data types, because besides float it might be int, boolean, string or something else. Maybe this was not clear enough.
No, that was perfectly clear. You want to have a multi-dimensional array with metadata, any kind of object as array values, and axis definitions. And that is in my opinion not a good approach, because it is not a common format that you can use for processing. If you have arbitrary objects as data, then the processing library that is used in the UDF must support these kinds of objects: it must know what operators can be applied to them (+, -, *, /, ...). There is no software available that supports arbitrary objects. For example: what is the point of having features as objects in a multi-dimensional array if there is no software that can handle it? In Python you have to convert this into a geopandas dataframe or a list of shapely objects, and you need additional metadata that describes the vector format (data types, projection, extent, ...). The GeoJSON format includes all of this, no need to reinvent the wheel. Otherwise the UDF server must convert multi-dimensional feature arrays back into a format that is supported by common processing libraries, which is IMHO unnecessary overhead. The backend should provide the data for processing in formats that can be directly transformed into a common format in the UDF server -> in a Python UDF this is geopandas.GeoDataFrame, numpy, xarray and pandas.DatetimeIndex.
In Python these data formats have self-describing metadata like the number of features, the number of dimensions, dimension names, units, the shape of the multi-dimensional array, and so on.
In addition, we can put all metadata that is necessary for the algorithms as a dictionary into the UdfData object. I have no problem with that.
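For example, the kind of self-describing metadata meant here (toy data):

import numpy as np
import geopandas
import xarray
from shapely.geometry import Point

gdf = geopandas.GeoDataFrame({"id": [1]}, geometry=[Point(0, 0)], crs="EPSG:4326")
print(len(gdf), gdf.crs)  # number of features and projection

cube = xarray.DataArray(np.zeros((2, 3, 3)), dims=("t", "y", "x"))
print(cube.ndim, cube.dims, cube.shape)  # dimension count, names and shape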
I think we should have a telephone conference with Edzer and Matthias before the 12th of September to discuss this topic.
To keep this issue up-to-date: Internally it was discussed to have a closer look at https://covjson.org/spec/
Apache Parquet was also considered as an exchange file format.
@flahn Is Parquet still considered or was it not good enough?
Apache Arrow does not support geopandas or xarray, hence it cannot be used to process hypercubes or vector data efficiently.
CoverageJSON does not support vector data well enough to be used as an exchange format between backends and UDF REST services.
Apache Parquet is a columnar data format designed for Hadoop applications. We can consider it as an exchange format; however, we would need to specify what to store in it.
I would like to suggest an exchange format that suits all requirements (single file, JSON, easy to serialize, data cube support, vector data support, image collection support) and that can be implemented as a JSON format or as a relational table structure.
It supports in a single file:
- data cubes
- image collections
- simple feature collections
- topological feature collections
- shared field collections (attribute data)
- time stamps and time intervals
Here is a simple data cube example:
{
"type": "TopologicalDataCollection",
"crs": {
"EPSG": 4326,
"WKT": null,
"temporal": "gregorian"
},
"metadata": {
"name": "Datacollection",
"description": "New collection",
"number_of_object_collections": 1,
"number_of_geometries": 0,
"number_of_field_collections": 2,
"number_of_time_stamps": 1,
"creator": "Soeren",
"creation_time": "2001-01-01T10:00:00",
"modification_time": "2001-01-01T10:00:00",
"source": null,
"link": null,
"userdata": null
},
"object_collections": {
"data_cubes": [
{
"name": "Data Cube",
"description": "This is a data cube",
"dim": ["t", "y", "x"],
"dimensions": [
{
"name": "t",
"unit": "ISO:8601",
"size": 3,
"coordinates": ["2001-01-01T00:00:00", "2001-01-01T00:01:00", "2001-01-01T00:02:00"]
},
{
"name": "x",
"unit": "degree",
"size": 3,
"coordinates": [0, 1, 2]
},
{
"name": "y",
"unit": "degree",
"size": 3,
"coordinates": [0, 1, 2]
}
],
"field_collection": 0,
"timestamp": 0
}
],
"image_collections": [],
"simple_feature_collections": [],
"topological_feature_collections": []
},
"geometry_collection": [],
"field_collections": [
{
"name": "Climate data",
"size": [3, 3, 3],
"number_of_fields": 2,
"attributes": [
{
"name": "Temperature",
"description": "Temperature",
"unit": "degree celsius",
"values": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0,
10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0,
19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0],
"labels": []
},
{
"name": "Precipitation",
"description": "Precipitation",
"unit": "mm",
"values": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0,
10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0,
19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0],
"labels": []
}
]
}
],
"timestamps": [
[
"2001-01-01T10:00:00",
"2001-01-01T00:02:00"
]
]
}
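To show that this maps onto the established runtime formats, here is a hedged sketch that decodes the data cube example above into xarray (only names that occur in the example are used; the file name is illustrative):

import json

import numpy as np
import xarray

with open("datacube.json") as f:  # the example document from above
    doc = json.load(f)

cube = doc["object_collections"]["data_cubes"][0]
fields = doc["field_collections"][cube["field_collection"]]
coords = {d["name"]: d["coordinates"] for d in cube["dimensions"]}

# values are stored flat and are reshaped along the order given by "dim"
temperature = xarray.DataArray(
    np.asarray(fields["attributes"][0]["values"]).reshape(fields["size"]),
    dims=cube["dim"],
    coords={name: coords[name] for name in cube["dim"]},
    name=fields["attributes"][0]["name"],
)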
A simple feature example:
{
"type": "TopologicalDataCollection",
"crs": {
"EPSG": 4326,
"WKT": null,
"temporal": "gregorian"
},
"metadata": {
"name": "SimpleFeatureCollection",
"description": "New collection of 3 simple features that all point to the same field",
"number_of_object_collections": 1,
"number_of_geometries": 3,
"number_of_field_collections": 1,
"number_of_time_stamps": 1,
"creator": "Soeren",
"creation_time": "2001-01-01T10:00:00",
"modification_time": "2001-01-01T10:00:00",
"source": null,
"link": null,
"userdata": null
},
"object_collections": {
"data_cubes": [],
"image_collections": [],
"simple_feature_collections": [
{
"name": "Boundary of three lines",
"description": "Boundary of three lines",
"number_of_features": 3,
"bbox": {"min_x": 0.0, "max_x": 3.0, "min_y": 0.0, "max_y": 2.0, "min_z": 0.0, "max_z": 0.0},
"features": [
{
"type": "LineString",
"predecessors": [],
"geometry": 0,
"field": [0, 0],
"timestamp": 0
},
{
"type": "LineString",
"predecessors": [],
"geometry": 1,
"field": [0, 0],
"timestamp": 0
},
{
"type": "LineString",
"predecessors": [],
"geometry": 2,
"field": [0, 0],
"timestamp": 0
}
]
}
],
"topological_feature_collections": []
},
"geometry_collection": [
"LineString (2 0, 2 2)",
"LineString (2 2, 0 1, 2 0)",
"LineString (2 2, 3 1, 2 0)"
],
"field_collections": [
{
"name": "Border",
"size": [1],
"number_of_fields": 1,
"attributes": [
{
"name": "Landuse",
"description": "Landuse",
"unit": "category",
"values": [],
"labels": ["Border"]
}
]
}
],
"timestamps": [["2001-01-01T10:00:00", null]]
}
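Similarly, a hedged sketch that turns the simple feature example above into a GeoDataFrame via the WKT strings in geometry_collection (file name illustrative):

import json

import geopandas
from shapely import wkt

with open("features.json") as f:  # the example document from above
    doc = json.load(f)

sfc = doc["object_collections"]["simple_feature_collections"][0]
geometries = [wkt.loads(doc["geometry_collection"][feature["geometry"]])
              for feature in sfc["features"]]
gdf = geopandas.GeoDataFrame(geometry=geometries,
                             crs="EPSG:%d" % doc["crs"]["EPSG"])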
Attached are the JSON schemas and the Python classes that define the data format:
from typing import List, Tuple, Union

from pydantic import BaseModel, Field
# A model class named "Field" is defined below and shadows pydantic's Field
# helper, therefore the helper is additionally imported as "pyField" and used
# in every class that is defined after the "Field" model.
from pydantic import Field as pyField


class CoordinateReferenceSystems(BaseModel):
    """Coordinate reference systems for spatial and temporal coordinates"""
    EPSG: int = Field(None, description="EPSG code", examples=[{"EPSG": 4326}])
    WKT: str = Field(None, description="The WKT description string, if there is no EPSG code")
    temporal: str = Field(None, description="The temporal calendar", examples=[{"temporal": "gregorian"}])


class ObjectCollection(BaseModel):
    """Object collection that contains data cubes, image collections,
    simple feature collections and topological feature collections"""
    # The referenced models are defined further down, hence the string forward
    # references that are resolved by the update_forward_refs() calls below.
    data_cubes: List["DataCube"] = Field(None, description="A list of data cubes")
    image_collections: List["ImageCollection"] = Field(None, description="A list of image collections")
    simple_feature_collections: List["SimpleFeatureCollection"] = Field(
        None, description="A list of simple feature collections")
    topological_feature_collections: List["TopologicalFeatureCollection"] = Field(
        None, description="A list of topological feature collections")


class TopologicalDataCollection(BaseModel):
    """Topological data collection"""
    type: str = "TopologicalDataCollection"
    crs: CoordinateReferenceSystems = Field(..., description="The coordinate reference systems")
    metadata: "Metadata" = Field(..., description="The metadata object for the topological data collection")
    object_collections: ObjectCollection = Field(
        ..., description="A collection of different data objects like data cubes, "
                         "image collections, simple feature collections "
                         "and topological feature collections")
    geometry_collection: List[str] = Field(
        ..., description="A list of WKT geometry strings that are referenced by the "
                         "objects in the object collection.")
    field_collections: List["FieldCollection"] = Field(..., description="A list of field collections")
    timestamps: List[Tuple[str, Union[str, None]]] = Field(..., description="A list of timestamp tuples as strings.")


class Metadata(BaseModel):
    """Metadata description of the topological data collection"""
    name: str = Field(...,
                      description="The name of the topological data collection. "
                                  "Allowed characters [a-z][A-Z][0-9][_].",
                      examples=[{"name": "Climate_data_collection_1984"}])
    description: str = Field(..., description="Description of the topological data collection.")
    number_of_object_collections: int = Field(..., description="Number of all collections "
                                                               "(data cubes, image collections, "
                                                               "simple feature collections, "
                                                               "topological feature collections).")
    number_of_geometries: int = Field(..., description="Number of all geometries.")
    number_of_field_collections: int = Field(..., description="Number of all field collections.")
    number_of_time_stamps: int = Field(..., description="Number of time stamps.")
    creator: str = Field(None, description="The name of the creator.")
    creation_time: str = Field(None, description="Time of creation.")
    modification_time: str = Field(None, description="Time of last modification.")
    source: str = Field(None, description="The source of the data collections.")
    link: str = Field(None, description="URL link to a specific web source.")
    userdata: dict = Field(None, description="A dictionary of additional metadata (STAC).")


class SimpleFeature(BaseModel):
    """A simple feature definition that may contain (multi)points, (multi)lines or (multi)polygons"""
    type: str = Field(...,
                      description="The type of the simple feature: Point, LineString, "
                                  "Polygon, MultiPoint, MultiLineString, MultiPolygon.")
    predecessors: List[int] = Field(None, description="A list of predecessors from which this feature was created.")
    geometry: int = Field(..., description="The index of a geometry from the geometry collection.")
    field: List[int] = Field(None, description="The index of the assigned "
                                               "field collection and the value/label index.")
    timestamp: int = Field(None, description="The index of the assigned timestamp.")


class SimpleFeatureCollection(BaseModel):
    """Simple feature collection: (multi)points, (multi)lines or (multi)polygons"""
    name: str = Field(...,
                      description="The unique name of the simple feature collection."
                                  " Allowed characters [a-z][A-Z][0-9][_].",
                      examples=[{"name": "borders_1984"}])
    description: str = Field(None, description="Description.")
    number_of_features: int = Field(..., description="The number of features.")
    bbox: "SpatialBoundingBox" = Field(..., description="The bounding box of all features.")
    features: List[SimpleFeature] = Field(..., description="A list of features.")


class SpatialBoundingBox(BaseModel):
    """Spatial bounding box definitions"""
    min_x: float = pyField(..., description="The minimum x coordinate of the 3d bounding box.")
    max_x: float = pyField(..., description="The maximum x coordinate of the 3d bounding box.")
    min_y: float = pyField(..., description="The minimum y coordinate of the 3d bounding box.")
    max_y: float = pyField(..., description="The maximum y coordinate of the 3d bounding box.")
    min_z: float = pyField(..., description="The minimum z coordinate of the 3d bounding box.")
    max_z: float = pyField(..., description="The maximum z coordinate of the 3d bounding box.")


class Dimension(BaseModel):
    """Description of a data cube dimension"""
    name: str = Field(..., description="The name/identifier of the dimension.")
    unit: str = Field(...,
                      description="The unit of the dimension in SI units.",
                      examples=[{"unit": "seconds"}, {"unit": "m"}, {"unit": "hours"},
                                {"unit": "days"}, {"unit": "mm"}, {"unit": "km"}])
    size: int = Field(..., description="The size of the dimension.")
    coordinates: List[Union[int, float, str]] = Field(..., description="A list of coordinates for this dimension")


class DataCube(BaseModel):
    """A multidimensional representation of a data cube"""
    name: str = Field(...,
                      description="The unique name of the data cube. Allowed characters [a-z][A-Z][0-9][_].",
                      examples=[{"name": "Climate_data_cube_1984"}])
    description: str = Field(None, description="Description of the data cube.")
    dim: List[str] = Field(...,
                           description="An ordered list of dimension names of the data cube. The dimensions "
                                       "are applied in the provided order.",
                           examples=[{"dim": ["t", "y", "x"]}])
    dimensions: List[Dimension] = Field(..., description="A list of dimension descriptions.")
    field_collection: int = Field(None, description="The integer index of the field collection. All fields and their "
                                                    "values of this collection are assigned to the "
                                                    "data cube and must have the same size")
    timestamp: int = Field(None, description="The integer index of the assigned timestamp from the timestamp array")


class Field(BaseModel):
    """This represents a field definition with values and labels"""
    name: str = pyField(..., description="Name of the attribute.")
    description: str = pyField(None, description="Description of the attribute.")
    unit: str = pyField(...,
                        description="The unit of the field.",
                        examples=[{"unit": "m"}, {"unit": "NDVI"}, {"unit": "Watt"}])
    values: List[Union[float, int]] = pyField(...,
                                              description="The field values that must be numeric.",
                                              examples=[{"values": [1, 2, 3]}])
    labels: List[str] = pyField(...,
                                description="Label for each field value.",
                                examples=[{"labels": ["a", "b", "c"]}])


class FieldCollection(BaseModel):
    """A collection of fields that all have the same size"""
    name: str = pyField(..., description="Name of the field collection.")
    size: List[int] = pyField(..., description="The size of the field collection. Each field of "
                                               "this collection must have the same size. The size of "
                                               "the fields can be multi-dimensional. However, fields are stored "
                                               "as a one-dimensional array and must be "
                                               "re-shaped into the multi-dimensional form for processing.",
                              examples=[{"size": [100]}, {"size": [3, 3, 3]}])
    number_of_fields: int = pyField(..., description="The number of fields in this collection.")
    attributes: List[Field] = pyField(..., description="A list of fields with the same size.",
                                      alias="fields")


class Image(BaseModel):
    """Description of a raster image. The link to the field collection allows a raster image to have multiple
    values like bands or climate data."""
    name: str = pyField(..., description="The name of the image.")
    bbox: SpatialBoundingBox = pyField(..., description="The bounding box of this image.")
    number_of_rows: int = pyField(..., description="The number of rows of this image.")
    number_of_cols: int = pyField(..., description="The number of columns of this image.")
    field_collection: int = pyField(..., description="The field collection that is associated to this image. "
                                                     "These field collections must have the same size as this image "
                                                     "and may contain several fields (bands, channels) for this image.")
    timestamp: int = pyField(None, description="The index of the assigned timestamp.")


class ImageCollection(BaseModel):
    """An image collection that contains a list of timestamped images that may have multiple bands as data"""
    name: str = pyField(...,
                        description="The unique name of the image collection. Allowed characters [a-z][A-Z][0-9][_].",
                        examples=[{"name": "Landsat_image_collection_1984"}])
    description: str = pyField(None, description="Description of the image collection.")
    number_of_images: int = pyField(..., description="Number of images in this collection.")
    images: List[Image] = pyField(..., description="The list of images.")
    timestamp: int = pyField(None, description="The index of the assigned timestamp that represents the "
                                               "full temporal extent of the image collection.")


class Polygon(BaseModel):
    """The definition of topological polygons that reference arcs"""
    arcs: List[int] = pyField(..., description="The index of the arcs that define the polygon.")
    predecessors: List[int] = pyField(None, description="A list of predecessors from which this feature was created.")
    field: List[int] = pyField(None, description="The index of the assigned field collection and value.",
                               examples=[{"field": [0, 15]}])
    timestamp: int = pyField(None, description="The index of the assigned timestamp.")


class Arc(BaseModel):
    """The definition of a topological arc that is composed of a single LineString"""
    predecessors: List[int] = pyField(None, description="The index of predecessors of the same type.")
    geometry: int = pyField(..., description="The index of the geometry from the geometry collection.")
    field: List[int] = pyField(None, description="The index of the assigned field collection and value.",
                               examples=[{"field": [1, 30]}])
    timestamp: int = pyField(None, description="The index of the assigned timestamp.")
    left_polygon: int = pyField(None, description="The index of the left side polygon.")
    right_polygon: int = pyField(None, description="The index of the right side polygon.")
    shared_arcs_begin: List[Tuple[int, float]] = pyField(
        ..., description="The index/angle tuples of all arcs that are "
                         "shared at the start node in clockwise direction.")
    shared_arcs_end: List[Tuple[int, float]] = pyField(
        ..., description="The index/angle tuples of all arcs that are "
                         "shared at the end node in clockwise direction.")


class TopologicalFeatureCollection(BaseModel):
    """Topological feature collection that may contain arcs and polygons"""
    name: str = pyField(...,
                        description="The unique name of the topological feature collection. "
                                    "Allowed characters [a-z][A-Z][0-9][_].",
                        examples=[{"name": "area_and_borders_1984"}])
    description: str = pyField(None, description="Description")
    number_of_arcs: int = pyField(..., description="The number of arcs.")
    number_of_polygons: int = pyField(..., description="The number of polygons.")
    bbox: SpatialBoundingBox = pyField(..., description="The bounding box of all features.")
    polygons: List[Polygon] = pyField(None, description="A list of topological polygons.")
    arcs: List[Arc] = pyField(..., description="A list of topological arcs.")


# Resolve the string forward references used above.
ObjectCollection.update_forward_refs()
TopologicalDataCollection.update_forward_refs()
SimpleFeatureCollection.update_forward_refs()
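A minimal usage sketch for these models (assuming pydantic v1, which the code above targets): build a collection programmatically and serialize it to JSON.

tdc = TopologicalDataCollection(
    crs=CoordinateReferenceSystems(EPSG=4326, temporal="gregorian"),
    metadata=Metadata(
        name="Datacollection", description="New collection",
        number_of_object_collections=0, number_of_geometries=0,
        number_of_field_collections=0, number_of_time_stamps=0,
    ),
    object_collections=ObjectCollection(),
    geometry_collection=[],
    field_collections=[],
    timestamps=[("2001-01-01T10:00:00", None)],
)
print(tdc.json(indent=2))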
{
"title": "TopologicalDataCollection",
"description": "Topological data collection",
"type": "object",
"properties": {
"type": {
"title": "Type",
"default": "TopologicalDataCollection",
"type": "string"
},
"crs": {
"title": "Crs",
"description": "The coordinate reference systems",
"allOf": [
{
"$ref": "#/definitions/CoordinateReferenceSystems"
}
]
},
"metadata": {
"title": "Metadata",
"description": "The metadata object for the topological data collection",
"allOf": [
{
"$ref": "#/definitions/Metadata"
}
]
},
"object_collections": {
"title": "Object Collections",
"description": "A collection of different data objects like data cubes, image collections, simple feature collections and topological feature collections",
"allOf": [
{
"$ref": "#/definitions/ObjectCollection"
}
]
},
"geometry_collection": {
"title": "Geometry Collection",
"description": "A list of WKT geometry strings that are referenced by the objects in the object collection.",
"type": "array",
"items": {
"type": "string"
}
},
"field_collections": {
"title": "Field Collections",
"description": "A list of field collections",
"type": "array",
"items": {
"$ref": "#/definitions/FieldCollection"
}
},
"timestamps": {
"title": "Timestamps",
"description": "A list of timestamp tuples as strings.",
"type": "array",
"items": {
"type": "array",
"items": [{"type": "string"}, {"type": "string"}]
}
}
},
"required": ["crs", "metadata", "object_collections", "geometry_collection", "field_collections", "timestamps"],
"definitions": {
"CoordinateReferenceSystems": {
"title": "CoordinateReferenceSystems",
"description": "Coordinate reference systems for spatial and temporal coordinates",
"type": "object",
"properties": {
"EPSG": {
"title": "Epsg",
"description": "EPSG code",
"examples": [{"EPSG": 4326}],
"type": "integer"
},
"WKT": {
"title": "Wkt",
"description": "The WKT description string, if there is no EPSG code",
"type": "string"
},
"temporal": {
"title": "Temporal",
"description": "The temporal calender",
"examples": [{"temporal": "gregorian"}],
"type": "string"
}
}
},
"Metadata": {
"title": "Metadata",
"description": "Metadata description of the topological data collection",
"type": "object",
"properties": {
"name": {
"title": "Name",
"description": "The name of topological data collection. Allowed characters [a-z][A-Z][0-9][_].",
"examples": [{"name": "Climate_data_collection_1984"}],
"type": "string"
},
"description": {
"title": "Description",
"description": "Description of the topological data collection.",
"type": "string"
},
"number_of_object_collections": {
"title": "Number Of Object Collections",
"description": "Number of all collections (data cubes, image collection, simple feature collections,topological feature collections).",
"type": "integer"
},
"number_of_geometries": {
"title": "Number Of Geometries",
"description": "Number of all geometries.",
"type": "integer"
},
"number_of_field_collections": {
"title": "Number Of Field Collections",
"description": "Number of all field collections.",
"type": "integer"
},
"number_of_time_stamps": {
"title": "Number Of Time Stamps",
"description": "Number of time tamps.",
"type": "integer"
},
"creator": {
"title": "Creator",
"description": "The name of the creator.",
"type": "string"
},
"creation_time": {
"title": "Creation Time",
"description": "Time of creation.",
"type": "string"
},
"modification_time": {
"title": "Modification Time",
"description": "Time of last modification.",
"type": "string"
},
"source": {
"title": "Source",
"description": "The source of the data collections.",
"type": "string"
},
"link": {
"title": "Link",
"description": "URL link to a specific web source.",
"type": "string"
},
"userdata": {
"title": "Userdata",
"description": "A dictionary of additional metadata (STAC).",
"type": "object"
}
},
"required": ["name", "description", "number_of_object_collections", "number_of_geometries", "number_of_field_collections", "number_of_time_stamps"]
},
"Dimension": {
"title": "Dimension",
"description": "Description of a data cube dimension",
"type": "object",
"properties": {
"name": {
"title": "Name",
"description": "The name/identifier of the dimension.",
"type": "string"
},
"unit": {
"title": "Unit",
"description": "The unit of the dimension in SI units.",
"examples": [{"unit": "seconds"}, {"unit": "m"}, {"unit": "hours"}, {"unit": "days"}, {"unit": "mm"}, {"unit": "km"}],
"type": "string"
},
"size": {
"title": "Size",
"description": "The size of the dimension.",
"type": "integer"
},
"coordinates": {
"title": "Coordinates",
"description": "A list of coordinates for this dimension",
"type": "array",
"items": {
"anyOf": [{"type": "integer"}, {"type": "number"}, {"type": "string"}]
}
}
},
"required": ["name", "unit", "size", "coordinates"]
},
"DataCube": {
"title": "DataCube",
"description": "A multidimensional representation of a data cube",
"type": "object",
"properties": {
"name": {
"title": "Name",
"description": "The unique name of the data cube. Allowed characters [a-z][A-Z][0-9][_].",
"examples": [{"name": "Climate_data_cube_1984"}],
"type": "string"
},
"description": {
"title": "Description",
"description": "Description of the data cube.",
"type": "string"
},
"dim": {
"title": "Dim",
"description": "A an ordered list of dimension names of the data cube. The dimensions are applied in the provided order.",
"examples": [{"dim": ["t", "y", "x"]}],
"type": "array",
"items": {
"type": "string"
}
},
"dimensions": {
"title": "Dimensions",
"description": "A list of dimension descriptions.",
"type": "array",
"items": {
"$ref": "#/definitions/Dimension"
}
},
"field_collection": {
"title": "Field Collection",
"description": "The integer index of the field collection. All fields and their values of this collection are assigned to the data cube and must have the same size",
"type": "integer"
},
"timestamp": {
"title": "Timestamp",
"description": "The integer index of the assigned timestamp from the timestamp array",
"type": "integer"
}
},
"required": ["name", "dim", "dimensions"]
},
"SpatialBoundingBox": {
"title": "SpatialBoundingBox",
"description": "Spatial bounding box definitions",
"type": "object",
"properties": {
"min_x": {
"title": "Min X",
"description": "The minimum x coordinate of the 3d bounding box.",
"type": "number"
},
"max_x": {
"title": "Max X",
"description": "The maximum x coordinate of the 3d bounding box.",
"type": "number"
},
"min_y": {
"title": "Min Y",
"description": "The minimum y coordinate of the 3d bounding box.",
"type": "number"
},
"max_y": {
"title": "Max Y",
"description": "The maximum y coordinate of the 3d bounding box.",
"type": "number"
},
"min_z": {
"title": "Min Z",
"description": "The minimum z coordinate of the 3d bounding box.",
"type": "number"
},
"max_z": {
"title": "Max Z",
"description": "The maximum z coordinate of the 3d bounding box.",
"type": "number"
}
},
"required": ["min_x", "max_x", "min_y", "max_y", "min_z", "max_z"]
},
"Image": {
"title": "Image",
"description": "Description of a raster image. The link to the field collection allows a raster image to have multiple\nvalues like bands or climate data.",
"type": "object",
"properties": {
"name": {
"title": "Name",
"description": "The name of the image.",
"type": "string"
},
"bbox": {
"title": "Bbox",
"description": "The bounding box of this image.",
"allOf": [
{
"$ref": "#/definitions/SpatialBoundingBox"
}
]
},
"number_of_rows": {
"title": "Number Of Rows",
"description": "The number of rows of this image.",
"type": "integer"
},
"number_of_cols": {
"title": "Number Of Cols",
"description": "The number of columns of this image.",
"type": "integer"
},
"field_collection": {
"title": "Field Collection",
"description": "The field collection that is associated to this image. These field collections must have the same size as this image and may contain several fields (bands, channels) for this image.",
"type": "integer"
},
"timestamp": {
"title": "Timestamp",
"description": "The index of the assigned timestamp.",
"type": "integer"
}
},
"required": ["name", "bbox", "number_of_rows", "number_of_cols", "field_collection"]
},
"ImageCollection": {
"title": "ImageCollection",
"description": "An image collection that contains a list of timestamped images that may have multiple bands as data",
"type": "object",
"properties": {
"name": {
"title": "Name",
"description": "The unique name of the image collection. Allowed characters [a-z][A-Z][0-9][_].",
"examples": [{"name": "Landsat_image_collection_1984"}],
"type": "string"
},
"description": {
"title": "Description",
"description": "Description of the image collection.",
"type": "string"
},
"number_of_images": {
"title": "Number Of Images",
"description": "Number of images in this collection.",
"type": "integer"
},
"images": {
"title": "Images",
"description": "The list of images.",
"type": "array",
"items": {
"$ref": "#/definitions/Image"
}
},
"timestamp": {
"title": "Timestamp",
"description": "The index of the assigned timestamp that represents the full temporal extent of the image collection.",
"type": "integer"
}
},
"required": ["name", "number_of_images", "images"]
},
"SimpleFeature": {
"title": "SimpleFeature",
"description": "A simple feature definition that may contain (multi)points, (multi)lines or (multi)polygons",
"type": "object",
"properties": {
"type": {
"title": "Type",
"description": "The type of the simple feature: Point, LineString, Polygon, MultiPoint, MultiLine, MultiPolygon.",
"type": "string"
},
"predecessors": {
"title": "Predecessors",
"description": "A list of predecessors from which this feature was created.",
"type": "array",
"items": {
"type": "integer"
}
},
"geometry": {
"title": "Geometry",
"description": "The index of a geometry from the geometry collection.",
"type": "integer"
},
"field": {
"title": "Field",
"description": "The index of the assigned field collection and the value/label index.",
"type": "array",
"items": {
"type": "integer"
}
},
"timestamp": {
"title": "Timestamp",
"description": "The index of the assigned timestamp.",
"type": "integer"
}
},
"required": [
"type",
"geometry"
]
},
"SimpleFeatureCollection": {
"title": "SimpleFeatureCollection",
"description": "Simple feature collection: (multi)points, (multi)lines or (multi)polygons",
"type": "object",
"properties": {
"name": {
"title": "Name",
"description": "The unique name of the simple feature collection. Allowed characters [a-z][A-Z][0-9][_].",
"examples": [
{
"name": "borders_1984"
}
],
"type": "string"
},
"description": {
"title": "Description",
"description": "Description.",
"type": "string"
},
"number_of_features": {
"title": "Number Of Features",
"description": "The number of features.",
"type": "integer"
},
"bbox": {
"title": "Bbox",
"description": "The bounding box of all features.",
"allOf": [
{
"$ref": "#/definitions/SpatialBoundingBox"
}
]
},
"features": {
"title": "Features",
"description": "A list of features.",
"type": "array",
"items": {
"$ref": "#/definitions/SimpleFeature"
}
}
},
"required": ["name", "number_of_features", "bbox", "features"]
},
"Polygon": {
"title": "Polygon",
"description": "The definition of topological polygons that reference arcs",
"type": "object",
"properties": {
"arcs": {
"title": "Arcs",
"description": "The index of the arcs that define the polygon.",
"type": "array",
"items": {
"type": "integer"
}
},
"predecessors": {
"title": "Predecessors",
"description": "A list of predecessors from which this feature was created.",
"type": "array",
"items": {
"type": "integer"
}
},
"field": {
"title": "Field",
"description": "The index of the assigned field collection and value.",
"examples": [{"field": [0, 15]}],
"type": "array",
"items": {
"type": "integer"
}
},
"timestamp": {
"title": "Timestamp",
"description": "The index of the assigned timestamp.",
"type": "integer"
}
},
"required": [
"arcs"
]
},
"Arc": {
"title": "Arc",
"description": "The definition of a topological arc that is composed of a single LineString",
"type": "object",
"properties": {
"predecessors": {
"title": "Predecessors",
"description": "The index of predecessors of the same type.",
"type": "array",
"items": {
"type": "integer"
}
},
"geometry": {
"title": "Geometry",
"description": "The index of the geometry from the geometry collection.",
"type": "integer"
},
"field": {
"title": "Field",
"description": "The index of the assigned field collection and value.",
"examples": [{"field": [1, 30]}],
"type": "array",
"items": {
"type": "integer"
}
},
"timestamp": {
"title": "Timestamp",
"description": "The index of the assigned timestamp.",
"type": "integer"
},
"left_polygon": {
"title": "Left Polygon",
"description": "The index of the left side polygon.",
"type": "integer"
},
"right_polygon": {
"title": "Right Polygon",
"description": "The index of the right side polygon.",
"type": "integer"
},
"shared_arcs_begin": {
"title": "Shared Arcs Begin",
"description": "The indexes and angle tuple of all arcs that are shared at the start node in clock wise direction.",
"type": "array",
"items": {
"type": "array",
"items": [{"type": "integer"}, {"type": "number"}]
}
},
"shared_arcs_end": {
"title": "Shared Arcs End",
"description": "The indexes and angle tuple of all arcs that are shared at the end node in clock wise direction.",
"type": "array",
"items": {
"type": "array",
"items": [{"type": "integer"}, {"type": "number"}]
}
}
},
"required": [
"geometry",
"shared_arcs_begin",
"shared_arcs_end"
]
},
"TopologicalFeatureCollection": {
"title": "TopologicalFeatureCollection",
"description": "Topological feature collection that may contain arcs and polygons",
"type": "object",
"properties": {
"name": {
"title": "Name",
"description": "The unique name of the topological feature collection. Allowed characters [a-z][A-Z][0-9][_].",
"examples": [
{
"name": "area_and_borders_1984"
}
],
"type": "string"
},
"description": {
"title": "Description",
"description": "Description",
"type": "string"
},
"number_of_arcs": {
"title": "Number Of Arcs",
"description": "The number of arcs.",
"type": "integer"
},
"number_of_polygons": {
"title": "Number Of Polygons",
"description": "The number of polygons.",
"type": "integer"
},
"bbox": {
"title": "Bbox",
"description": "The bounding box of all features.",
"allOf": [
{
"$ref": "#/definitions/SpatialBoundingBox"
}
]
},
"polygons": {
"title": "Polygons",
"description": "A list of topological polygons.",
"type": "array",
"items": {
"$ref": "#/definitions/Polygon"
}
},
"arcs": {
"title": "Arcs",
"description": "A list of topological arcs.",
"type": "array",
"items": {
"$ref": "#/definitions/Arc"
}
}
},
"required": ["name", "number_of_arcs", "number_of_polygons", "bbox", "arcs"]
},
"ObjectCollection": {
"title": "ObjectCollection",
"description": "Object collection that contains data cubes, image collections,\nsimple feature collections and topological feature collections",
"type": "object",
"properties": {
"data_cubes": {
"title": "Data Cubes",
"description": "A list of data cubes",
"type": "array",
"items": {
"$ref": "#/definitions/DataCube"
}
},
"image_collections": {
"title": "Image Collections",
"description": "A list of image collections",
"type": "array",
"items": {
"$ref": "#/definitions/ImageCollection"
}
},
"simple_feature_collections": {
"title": "Simple Feature Collections",
"description": "A list of simple features collections",
"type": "array",
"items": {
"$ref": "#/definitions/SimpleFeatureCollection"
}
},
"topological_feature_collections": {
"title": "Topological Feature Collections",
"description": "A list of topological feature collections",
"type": "array",
"items": {
"$ref": "#/definitions/TopologicalFeatureCollection"
}
}
}
},
"Field": {
"title": "Field",
"description": "This represents a field definition with values and labels",
"type": "object",
"properties": {
"name": {
"title": "Name",
"description": "Name of the attribute.",
"type": "string"
},
"description": {
"title": "Description",
"description": "Description of the attribute.",
"type": "string"
},
"unit": {
"title": "Unit",
"description": "The unit of the field.",
"examples": [{"unit": "m"}, {"unit": "NDVI"}, {"unit": "Watt"}],
"type": "string"
},
"values": {
"title": "Values",
"description": "The field values that must be numeric.",
"examples": [{"values": [1, 2, 3]}],
"type": "array",
"items": {
"anyOf": [{"type": "number"}, {"type": "integer"}]
}
},
"labels": {
"title": "Labels",
"description": "Label for each field value.",
"examples": [{"labels": ["a", "b", "c"]}],
"type": "array",
"items": {
"type": "string"
}
}
},
"required": ["name", "unit", "values", "labels"]
},
"FieldCollection": {
"title": "FieldCollection",
"description": "A collection of fields that all have the same size",
"type": "object",
"properties": {
"name": {
"title": "Name",
"description": "Name of the field collection.",
"type": "string"
},
"size": {
"title": "Size",
"description": "The size of the field collection. Each field of this collection must have the same size. The size of the fields can be mutli-dimensional. However, fields are stored as one dimensional array and must be re-shaped in the multi-dimensional form for processing.",
"examples": [{"size": [100]}, {"size": [3, 3, 3]}],
"type": "array",
"items": {
"type": "integer"
}
},
"number_of_fields": {
"title": "Number Of Fields",
"description": "The number of fields in this collection.",
"type": "integer"
},
"fields": {
"title": "Fields",
"description": "A list of fields with the same size.",
"type": "array",
"items": {
"$ref": "#/definitions/Field"
}
}
},
"required": ["name", "size", "number_of_fields", "fields"]
}
}
}
Haven't looked into it in detail and I'm not into the recent discussions, but some questions from my side (without any judgement, just for clarification):
It looks quite good for a JSON model. But I'm concerned about the referencing. In the CovJSON specification there is a part under domain called "referencing" which creates a relation between named single dimensions and the reference system. I think this might be useful to avoid confusion with the axis order (e.g. lat/lon vs. lon/lat). If this is predefined for space (x, y) and time (t), we should document this somewhere.
> Haven't looked into it in detail and I'm not into the recent discussions, but some questions from my side (without any judgement, just for clarification):
> - Have we settled on JSON?
No. The suggested schema can be implemented using a table structure as well: SQLite?
> - What of this will be visible to a user? Anything that would leak through (e.g. metadata passed in a parameter or so)? If that's the case we should try to align with the API to give a consistent user experience.
The user should not see the communication between the backend and a UDF REST server, except for Python or R exceptions if the code fails. But these will show only the Python or R API errors, not the exchange format.
> - Why do we have separate image/vector collections and data cubes?
Because these are different data types. We should not force everything into a data cube and lose important features, like feature-specific time stamps or sparse image collections with intersecting time intervals.
For the UDFs we can simply focus on data cubes and simple features; that should be absolutely sufficient. However, the suggested format provides many more features.
> It looks quite good for a JSON model. But I'm concerned about the referencing. In the CovJSON specification there is a part under domain called "referencing" which creates a relation between named single dimensions and the reference system. I think this might be useful to avoid confusion with the axis order (e.g. lat/lon vs. lon/lat). If this is predefined for space (x, y) and time (t), we should document this somewhere.
The order of the dimensions for a data cube is set with the dim option:
"dim": ["t", "y", "x"]
This makes it clear to me in what order the dimensions are stored in the field data. We can simply state this as the expected default.
> - What of this will be visible to a user? Anything that would leak through (e.g. metadata passed in a parameter or so)? If that's the case we should try to align with the API to give a consistent user experience.
> The user should not see the communication between the backend and a UDF REST server, except for Python or R exceptions if the code fails. But these will show only the Python or R API errors, not the exchange format.
Agreed. I'm just asking because there's some metadata that could also be interesting for a user, and if that's (partially) passed into the UDF, it should be aligned with how the rest of the API works anyway. But that's mostly naming and some structures, and I guess you are open to aligning that, as it wouldn't really change the way your proposal would work.
> - Why do we have separate image/vector collections and data cubes?
> Because these are different data types. We should not force everything into a data cube and lose important features, like feature-specific time stamps or sparse image collections with intersecting time intervals.
> For the UDFs we can simply focus on data cubes and simple features; that should be absolutely sufficient. However, the suggested format provides many more features.
Oh, I see. The last paragraph was important. So the user would only use data cubes (and SF?) in the UDF, but the UDF server and back-end could use different types of exchange formats. That is important, because the data that is passed through the processes of a process graph is never an image collection (vector data is to be discussed, I guess), but always a data cube.
I tried to make it a bit more graphical, which may make it easier to discuss.
To address my concern, I would suggest the following change: allow multiple CoordinateReferenceSystems objects on TopologicalDataCollection and add an attribute axis to the crs object which allows an array of strings that correspond to the name in Dimension. Also make this attribute optional, so it does not interfere with the other collection types.
This way you can allow multiple CRS that refer to the appropriate dimensions. As a use case example, consider that you might have heights as an additional dimension, which can also have a reference system, or even a geocentric coordinate reference system which offers x, y, z from the Earth's center. It also solves the problem of WGS84 lat/lon vs. lon/lat use, because you can simply see this from the order of the axes.
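An illustrative sketch of that change as a Python literal (the axis attribute is the proposal, not an agreed schema; EPSG:5714 is mean sea level height):

crs = [
    {"EPSG": 4326, "axis": ["y", "x"]},        # horizontal CRS with explicit axis order
    {"EPSG": 5714, "axis": ["z"]},             # vertical reference system for heights
    {"temporal": "gregorian", "axis": ["t"]},  # temporal reference
]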
I was wondering whether the UDF data schema can be remodelled. In my opinion we can model the data as data cubes. The reason is that a raster or feature collection tile is simply a very special case of your current hypercube model. Even the structured data can be modeled as 1-dimensional or 2-dimensional data.
The reason for this generalization is simply to make it clearer for both a back-end developer and a UDF service developer how the data is structured and how it can be translated into language-dependent objects, e.g. stars in R or geopandas in Python, and how the UDF results have to be interpreted by the back-end. The current UDF request model looks like this:
By revisiting the implementation of Edzer's stars package for R and the thoughts @m-mohr put into the dimension extension in STAC, I would put up for discussion something like this:

I'm not sure about the machine learning models. Does it really have to be part of a general UDF API, or might it be better suited to load those in the UdfCode? As I see it, the critical part is that they need to be loaded from the local file system of the UDF service, which might be solved by uploading such data into the back-end's personal workspace and mounting it in the UDF service instance.