fiboa / specification

Field Boundaries for Agriculture (fiboa) - a specification that describes important properties of field boundaries
Apache License 2.0

Relation between collection-level metadata and STAC #4

Closed by m-mohr 3 months ago

m-mohr commented 10 months ago

Currently the spec requires collection-level metadata (such as the version and extensions) to be a STAC Collection.

We need to discuss whether this is a good idea. It predefines a couple of fields for us that we don't need to care about any longer, but also requires a temporal extent for example.

Additionally, the current wording of the spec requires embedding the STAC Collection in GeoParquet file-level metadata and in the GeoJSON FeatureCollection. The aim is to keep files complete without requiring an external dependency (except for extensions?).

The embedding feels a bit weird to me, so I recommend to also provide the STAC Collections separately with an asset pointing to the GeoParquet / GeoJSON FeatureCollection.

cc @cholmes

m-mohr commented 8 months ago

There was a good bit of discussion in the Slack:

@calebrob6 wrote:

[H]ow should we represent collection level metadata in fiboa? E.g. here is a sample that should represent a single patch in the ASU South Africa dataset (i.e. imagery and set of intersecting field boundaries). In the "collection_3.json" file we need to have pointers to imagery that covers the patch.

I wrote:

Extending Collections is actually an open point that is up for discussion. While I had foreseen Properties for Collections in the fiboa extension template, it actually doesn't have any schema for it. The Collection extensions would actually be STAC extensions. I guess this needs another revision. Focus so far was on Feature Properties, but we need an intuitive and simple way to extend the Collections. @cholmes, any thoughts from your side? Should we just ask people to create STAC extensions for additions to the Collections or should that better be part of the fiboa extension template? I'm unsure right now... The other more general question is whether we keep the STAC Collection as overarching entity.

@andyjenkinson wrote:

Yes, to me the more fundamental issue to confirm is: does the specification depend on STAC, and if so why? I say this because, as a potential implementor of the format as both data provider and data consumer, I do not have any other demand to provide anything via STAC. It seems to make sense that a STAC collection might exist for the AI reference dataset use case (because there are actual images involved), but as a representation of a field boundary (rather than some other asset connected to that boundary) the GeoJSON/GeoParquet features are the assets, not metadata describing the spatiotemporal extent of some other asset. Hence I would be providing a STAC collection containing zero items, purely to satisfy the constraint that one must exist. There is also no actual mechanism to link them together anyway (i.e. how does the data consumer know that the features in GeoJSON resource/file A are part of the collection described by STAC resource/file B?). Arguably using the FeatureCollection is a less imposing and simpler to implement constraint. By which I mean, add some properties at the root object that describe the format, extensions, versions etc and dataset-level metadata. It makes it tractable to correctly evaluate the uniqueness constraint for feature IDs - both the collection context and the features are in the same file - whereas if these are in separate resources you'd have to include a back-reference to the collection(s) each feature is part of anyway. What we could do is to include an optional URI at the root of the FeatureCollection for a 'related' STAC collection, should one exist, which should have defined semantics (i.e. is it that the geometries within this FIBOA collection are the same set of geometries in the STAC collection of items?).

@cholmes wrote:

Yeah, I've been thinking a decent bit about 'collections' recently, and questioning whether directly using STAC right now is the best. The 'ideal' that we never quite got to is that there's a 'dataset metadata' (funny, I actually started this message before Andy's reply, but it aligns exactly - was trying to look up the OGC references where this was discussed a bit.) Like STAC should just be a specialized class of 'dataset-level metadata', but OGC never quite got there. I think the thing to do is to define our own dataset-level metadata, but to align it exactly with the STAC collection metadata - id, title, description, providers, extent, etc. If people want the STAC tooling to work they can just slap on stac_version. There's probably some subtle trade offs here, but it feels cleaner to not have to bring STAC conceptually into the mix.

Hence I would be providing a STAC collection containing zero items, purely to satisfy the constraint that one must exist.

And to be clear, that was the idea - but it's less to satisfy the constraint one may exist, it was simply to make use of the dataset level metadata. There are a number of STAC users who do that, like Google Earth Engine. And you can have collection level 'assets', with 0 items. But yeah, I think it's cleaner to not have to explain these subtleties - anyone who uses STAC should come and see how the collection / dataset level metadata makes sense (is the same). But anyone who doesn't use / know STAC shouldn't have to try to understand it and how we're using it.

I wrote:

So we always assume the presence of FeatureCollections for the GeoJSON encoding? How to handle this if we have a bunch of GeoJSON Features in files? Then you'd have to have a collection (whether it's STAC or not) JSON again (e.g. if served via OGC API - Features). Whether we have a STAC Collection or just a JSON with a bunch of properties is not so different, I think. STAC offers us the ecosystem (validation, extensions), while if we start fresh it's just a lot of reinventing (new spec, new validation, new extension mechanism). Could partially be covered with what we have, but still...

andyjenkinson commented 8 months ago

An OGC Features API call response is a FeatureCollection anyway, is it not? Plus, the API contract makes single features always part of a hierarchical resource that contains the ID of the collection: /collections/foo/items/1234 In this case, the identifier of the resource is a URI that includes the collection ID and the feature ID, from which you can also resolve the collection ID to a collection resource.
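A minimal sketch of that last point, assuming the path layout quoted above (`/collections/{collectionId}/items/{featureId}`). The function name `split_item_uri` and the example host are hypothetical and only serve as illustration:

```python
from urllib.parse import urlparse


def split_item_uri(uri: str) -> tuple[str, str]:
    """Return (collection_id, feature_id) from an OGC API - Features item URI."""
    segments = [s for s in urlparse(uri).path.split("/") if s]
    # The last four segments must be: collections/{collectionId}/items/{featureId}
    if len(segments) < 4 or segments[-4] != "collections" or segments[-2] != "items":
        raise ValueError(f"not an item URI: {uri!r}")
    return segments[-3], segments[-1]


# Hypothetical host, path taken from the comment above
print(split_item_uri("https://example.com/collections/foo/items/1234"))
```

This only works while the feature is addressed through the API. Once the feature is stored elsewhere, the URI (and with it the collection ID) is no longer available.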

If you only had a single Feature object in a separate file it would similarly be decoupled from any STAC collection anyway (this was my comment about backlinks from the feature to the collection).

It seems to me that the description of a collection (if indeed there even needs to be one as a mandatory ingredient?) can be expressed in multiple forms: as an OGC collection, as a FeatureCollection object or a STAC Collection. Of these, the FeatureCollection is the least deviant from the minimum dependency (i.e. just GeoJSON).

m-mohr commented 8 months ago

An OGC Features API call response is a FeatureCollection anyway, is it not?

/collections/.../items is a FeatureCollection, /collections/.../items/itemId is a Feature.

Plus, the API contract makes single features always part of a hierarchical resource that contains the ID of the collection: /collections/foo/items/1234

Yes, until you download/extract/... individual Features. Then your relation to the Collection is gone. So ideally the collection ID is in the Features and we introduced the collection property in fiboa for this.
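As a rough sketch of what that could look like when features are extracted: each Feature is copied out of its FeatureCollection and gets the collection ID written into it. The helper name `detach_features` is hypothetical, and placing `collection` inside the GeoJSON `properties` member is an assumption made for this example:

```python
import copy


def detach_features(feature_collection: dict, collection_id: str) -> list[dict]:
    """Copy each Feature out of a FeatureCollection, recording its parent collection."""
    detached = []
    for feature in feature_collection.get("features", []):
        standalone = copy.deepcopy(feature)  # leave the source document untouched
        properties = standalone.get("properties") or {}  # GeoJSON allows null here
        properties["collection"] = collection_id  # assumed location of the back-reference
        standalone["properties"] = properties
        detached.append(standalone)
    return detached
```

With the ID stored in the feature itself, a consumer that only has the single file can still look up the collection-level metadata.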

If you only had a single Feature object in a separate file it would similarly be decoupled from any STAC collection anyway (this was my comment about backlinks from the feature to the collection).

Indeed, we don't necessarily need the Collection to be a STAC Collection. STAC adds a bit of overhead, but also gives us the ecosystem and extension support. Otherwise we need to do that again in fiboa. I'm relatively neutral on which path to go.

collection (if indeed there even needs to be one as a mandatory ingredient?)

I think yes, we need a place to specify the fiboa version and extensions. Also it's a good place to expose "global" data such as license, provider, etc.

Of these, the FeatureCollection is the least deviant from the minimum dependency (i.e. just GeoJSON).

What about GeoParquet?

andyjenkinson commented 8 months ago

Does the question about GeoParquet not also apply to a STAC collection? Whether the metadata appears in a STAC collection or a GeoJSON feature collection would have no effect on GeoParquet would it? Both of them would need to be mapped.

Regarding a collection property on a feature, is it an array? Because as I mentioned, a feature that is not presented in context of a collection (ie a FeatureCollection or discovered via a parent collection of some other kind such as in an API) can be part of multiple collections.

m-mohr commented 8 months ago

Does the question about GeoParquet not also apply to a STAC collection? Whether the metadata appears in a STAC collection or a GeoJSON feature collection would have no effect on GeoParquet would it? Both of them would need to be mapped.

So the collection metadata would be living inside the GeoParquet file in the metadata. For a FeatureCollection, it would be a bit weird. Either it would be an empty Feature Collection (+ metadata) or it would be a duplication of all the Features (+ metadata), although we just need the metadata. The container format is just weird in the GeoParquet context. So another container for the metadata might be better. I started with STAC to have something we don't need to define ourselves at the beginning. The STAC Collection (or whatever else) could be embedded into the FeatureCollection, too.

{
  "type": "FeatureCollection",
  "features": [...],
  "collection": {
    "fiboa_version": "0.1.0",
    ...
  }
}

The Collection Object is also embedded in GeoParquet, but for GeoJSON Features it probably always lives externally and we should probably explain to implementors how to connect them...

Regarding a collection property on a feature, is it an array?

No, it's a string (the collection ID). Multiple collections would lead to a potential conflict in the metadata, e.g. differences in versions or extensions. I'd like to avoid that and only allow a single Collection as responsible parent (although it could be part of multiple collections). But as always, this can all be discussed and changed, of course.
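One consequence of a single string-valued `collection` is that the uniqueness of feature IDs can be checked per collection. A minimal sketch, assuming the collection ID sits in `properties.collection` and the feature ID in the top-level `id` member (the function name is hypothetical):

```python
from collections import Counter


def duplicate_feature_ids(features: list[dict]) -> list[tuple]:
    """Return the (collection, id) pairs that occur more than once."""
    counts = Counter(
        ((f.get("properties") or {}).get("collection"), f.get("id"))
        for f in features
    )
    # Counter keeps first-seen order, so the result order is deterministic
    return [key for key, count in counts.items() if count > 1]
```

The same feature ID in two different collections is not reported, only a repeat within one collection.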

andyjenkinson commented 8 months ago

Ok here's what I'm trying to say:

Having a JSON collection object only makes sense for JSON features in the first place, Geoparquet has an entirely different structure. So its relevance here is about how to convert from one to the other - you need to get the Collection metadata from somewhere to embed into the parquet header? The geoparquet issue you described for FeatureCollection (ie needing to have an empty collection) is the same as the one I raised for STAC (an empty STAC collection), the only difference is that the actual features are already in GeoJSON format so typically it's not an empty FeatureCollection at all, it's the FeatureCollection containing the actual FIBOA features. It's weird to not take this collection-level metadata from the same file as the features themselves using the object in the GeoJSON spec that's designed for this very purpose, and then force you to provide another file following a different spec that is designed for containing a different type of object (an asset, not a vector).

The reason this is particularly important is because a pure GeoJSON implementation is much neater, and makes standard GeoJSON files and the OGC Features API specification natively compatible with FIBOA - you simply ensure the required properties are included in your existing data and now you have a FIBOA implementation. You also don't need to add a back-reference collection ID inside each one of millions of Feature objects, because it's already specified in the parent FeatureCollection inside the same file/API resource. This potential for adapting existing data is a great opportunity to make FIBOA very easy to implement that didn't really exist with STAC. You can make your existing API endpoints and data distributions FIBOA compliant without standing up new separate endpoints. As soon as you force a STAC collection to exist you break that, and I see no good reason for it - just copy the syntax of the subset of STAC collection metadata you need into FIBOA and you're done. Conversion to Geoparquet is very easy - it's one input file (or if you like, a merge of any number of files whose collection IDs are the same).
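The merge mentioned at the end could be sketched roughly as follows. It assumes the collection ID is a root-level member named `id` on each FeatureCollection and that files with the same ID carry the same collection-level metadata. Both are assumptions for this example, not something the spec states:

```python
def merge_by_collection(documents: list[dict]) -> dict[str, dict]:
    """Merge FeatureCollections that share the same root-level "id"."""
    merged: dict[str, dict] = {}
    for doc in documents:
        collection_id = doc["id"]  # assumed: the collection ID is a root member named "id"
        if collection_id not in merged:
            # Keep the first document's root metadata, start with an empty feature list
            merged[collection_id] = {**doc, "features": []}
        merged[collection_id]["features"].extend(doc.get("features", []))
    return merged
```

Each merged document could then serve as the single input for a GeoParquet conversion.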

Now, the only instances where this "FeatureCollection in the same file" solution for providing collection-level metadata doesn't work are those where a GeoJSON Feature is only available serialised into a file with a single feature, ie out of context of any FeatureCollection. I'm not aware of any distributions that do that but maybe it's a valid use case. Your options here are therefore one of:

Basically, I don't see any reason to force creating a second JSON file when the data is already in a FeatureCollection, but if you really don't have any FeatureCollection even though all the features are GeoJSON then you need to solve two problems - a second file to hold it, and a link inside every single feature to reference it (or some other convention about how to autonomously find it like a specific filename at the root of a directory where the feature json files are...). For this extra file in fact any JSON file would do, it doesn't have to be STAC or FeatureCollection but you could design it so it's allowed to be if the author wants (ie a STAC collection can serve also as a FIBOA collection, and so can a FeatureCollection). All of the properties are FIBOA-specified anyway, it's just right now you're doing it 'by proxy'. You want it to have an ID, version etc so just say so directly rather than saying "I require a STAC collection, because that spec requires it to have an ID".

andyjenkinson commented 8 months ago

Also to be clear, the featurecollection IS the collection, it does not need to contain one like in your example. The collection is an object and it has a member called "features", which holds the actual boundary objects. This payload is typically used to represent API resources that are themselves collections (like /collections/foo/items which gives BOTH the collection AND the items in its response)

Just make fiboa_version a property of the collection and you're done.

m-mohr commented 8 months ago

It feels like we are misunderstanding each other. Potentially better to discuss this in the fiboa call? Anyway, I'll try to clarify below.

Generally, I'm happy to have global/collection-level metadata in the FeatureCollection.

Also to be clear, the featurecollection IS the collection, it does not need to contain one like in your example.

It felt better to have the collection properties clearly separated, also makes it easier in conversion between formats, I believe. But there's not a big difference.

So any of the following work for me, no strong preference from my side:

(1) fiboa Collection combined with a JSON FeatureCollection:

{
  "type": "FeatureCollection",
  "features": [...],
  "fiboa_version": "0.1.0",
  "fiboa_extensions": "0.1.0",
  "license": "CC-0"
}

and/or (2) fiboa Collection inside GeoParquet:

fiboa as JSON FeatureCollection (i.e. remove GeoJSON properties):

{
  "fiboa_version": "0.1.0",
  "fiboa_extensions": "0.1.0",
  "license": "CC-0",
  ...
}

and/or (3) STAC Collection integrated into a JSON FeatureCollection:

{
  "type": "FeatureCollection",
  "features": [...],
  "collection": {
    "stac_version": "1.0.0",
    "type": "Collection",
    "fiboa_version": "0.1.0",
    "fiboa_extensions": "0.1.0",
    "license": "CC-0",
    ...
  }
}

You can't combine STAC Collections and JSON FeatureCollections into a single object though because the type property conflicts (type: Collection in STAC, type: FeatureCollection in GeoJSON). In this case you need the separation as pointed out in variant 3.
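The conflict is easy to reproduce. The sketch below uses abbreviated objects (a real STAC Collection has more required fields, which are left out here) and a hypothetical helper `embed_collection` for the nesting in variant 3:

```python
# Abbreviated example objects, not complete STAC or fiboa documents
stac_collection = {"stac_version": "1.0.0", "type": "Collection", "fiboa_version": "0.1.0"}
feature_collection = {"type": "FeatureCollection", "features": []}

# Naive merge: both objects define "type", so one value silently overwrites the other
flat = {**stac_collection, **feature_collection}
print(flat["type"])  # FeatureCollection


def embed_collection(feature_collection: dict, collection: dict) -> dict:
    """Variant 3: keep the Collection as a nested object so both "type" values survive."""
    return {**feature_collection, "collection": dict(collection)}


nested = embed_collection(feature_collection, stac_collection)
print(nested["type"], nested["collection"]["type"])  # FeatureCollection Collection
```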

The advantage of a STAC Collection is to have the pre-defined fields and ecosystem. The disadvantage is probably the added complexity. We can discuss this with the group, as I said, I'm pretty much happy with all of the variants.

Having a JSON collection object only makes sense for JSON features in the first place, Geoparquet has an entirely different structure.

I don't agree, we embed data that is valid for the whole file into the GeoParquet metadata, similar to what GeoParquet does with its geo-related metadata. This is used to explain and validate the GeoParquet file, e.g. define the fiboa version, add the list of extensions, and provide additional metadata that you don't want to repeat in every single row. For example license, provider etc.
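Parquet file-level metadata is a set of key-value pairs, and GeoParquet stores its own metadata as JSON text under the key `geo`. A collection object could be handled the same way. The sketch below only shows the serialization step with the standard library. The key name `fiboa` and both function names are assumptions for this example:

```python
import json


def to_parquet_metadata(collection: dict, key: bytes = b"fiboa") -> dict[bytes, bytes]:
    """Serialize collection metadata as one key-value pair holding JSON text."""
    return {key: json.dumps(collection).encode("utf-8")}


def from_parquet_metadata(metadata: dict[bytes, bytes], key: bytes = b"fiboa") -> dict:
    """Read the collection metadata back from file-level key-value metadata."""
    return json.loads(metadata[key].decode("utf-8"))


collection = {"fiboa_version": "0.1.0", "license": "CC-0"}
metadata = to_parquet_metadata(collection)
print(from_parquet_metadata(metadata) == collection)  # True
```

With pyarrow, a mapping like this could be merged into the existing schema metadata and applied with `Table.replace_schema_metadata` before writing. Merging matters because that call replaces all existing keys, including `geo`.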

I'm not aware of any distributions that do that but maybe it's a valid use case.

We should clarify that. If we don't need individual features, we can disallow them and enforce FeatureCollections always. Makes life simpler, indeed.

you need to get the Collection metadata from somewhere to embed into the parquet header?

Indeed, currently the tooling asks you to provide a JSON file that contains the collection metadata during GeoParquet creation.

(an empty STAC collection)

What is an empty STAC Collection?

OGC Features API specification natively compatible with FIBOA

That's already the case as far as I know.

You can make your existing API endpoints and data distributions FIBOA compliant without standing up new separate endpoints.

I don't get it. I've never asked to implement separate endpoints?!

just copy the syntax of the subset of STAC collection metadata you need into FIBOA and you're done

Isn't that what I've proposed before and again in the examples above?

PowerChell commented 8 months ago

Let's make a breakout meeting for this discussion for sometime in the next few weeks.

m-mohr commented 7 months ago

I've created a proposal for this in PR #21, maybe this can already be accepted as a compromise.

Would love to hear feedback.

m-mohr commented 7 months ago

Would you consider putting basic universal properties of a dataset/collection like ID, name, description, license not in a separate special FIBOA-specific object called "fiboa" but as normal properties (like they are for GeoJSON features)?

One reason is I am thinking we should try not to reinvent the wheel, eg align to existing dataset publishing standards that are well adopted in related communities that have thought about the domain much more, for example the Dublin Core covers I think all of the terms in the example (license, publisher, description): https://www.dublincore.org/specifications/dublin-core/dcmi-terms/ It makes FIBOA less 'demanding' if it limits spreading the word "FIBOA" over everything that isn't specific to FIBOA (eg "fiboa_version" is more reasonable than "fiboa->license" when you can just call it "license") and name it as an optional field in FIBOA that happens to be the same as the Dublin core term. Then I can make a single object that is compliant with JSON-LD, fiboa, GeoJSON and DCMI all in the same file/API endpoint.

The other reason is to avoid wherever possible forcing people to create separate fiboa-specific implementations of data distributions they already provide. My target really one of: a provider already has an API that provides FeatureCollections (eg we have one, Digifarm has one, planet has one, anyone who has implemented OGC Features API compliant API has one), or datasets that they already export. So, my assessment will be: what is the minimal I have to do to make what I already provide FIBOA compliant, and how proprietary is that? That is why it's different from STAC: here we're not starting from a green field, we have existing standards and existing implementations it would be advantageous to align to, because it gives us adoption very quickly. I would like to be able to make the native API and downloads we already provide to be FIBOA compliant, rather than create special FIBOA variants, which is what happens if you force me to repeat properties inside a special "fiboa" container. If we need to go down that route of separating out the fiboa domain, I would rather do it either namespaces (so make fiboa core its own namespace)

Originally posted by @andyjenkinson in https://github.com/fiboa/specification/issues/21#issuecomment-2045795577

m-mohr commented 7 months ago

Would you consider putting basic universal properties of a dataset/collection like ID, name, description, license not in a separate special FIBOA-specific object called "fiboa" but as normal properties (like they are for GeoJSON features)?

I think it's not a good idea as then it's not clear (especially for individual features) which properties are collection-level and which are not. It would require a definitive set of fields in the files, which I think we don't want to aim for. For example, OGC API - Features adds additional properties to the FeatureCollection, which are not collection level metadata (numberMatched, numberReturned, pagination links). Just moving around a single object is much simpler when migrating between file formats for example. Anyway, you can also link to an external file in an OGC API compliant way and then you don't need to embed them in a fiboa-specific object.
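To make the difference concrete, here is a sketch of both ways to pull collection-level metadata out of an OGC API - Features response. The exclusion list is an assumption (it covers the GeoJSON members plus the response members named above and `timeStamp`) and would need to grow with every API that adds root-level members. The function names are hypothetical:

```python
GEOJSON_MEMBERS = {"type", "features", "bbox"}
OGC_API_MEMBERS = {"links", "timeStamp", "numberMatched", "numberReturned"}


def metadata_from_root(response: dict) -> dict:
    """Root-level approach: keep everything that is not a known structural member."""
    skip = GEOJSON_MEMBERS | OGC_API_MEMBERS
    return {k: v for k, v in response.items() if k not in skip}


def metadata_from_object(response: dict, key: str = "fiboa") -> dict:
    """Dedicated-object approach: a single lookup, no exclusion list needed."""
    return dict(response.get(key, {}))
```

Any response member missing from the exclusion list would be picked up as collection metadata by the first function, which is the ambiguity described above.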

One reason is I am thinking we should try not to reinvent the wheel, eg align to existing dataset publishing standards that are well adopted in related communities that have thought about the domain much more, for example the Dublin Core covers I think all of the terms in the example (license, publisher, description): https://www.dublincore.org/specifications/dublin-core/dcmi-terms/

I think we should recommend one standard for metadata. I'm happy to discuss which that might be, whether it's STAC, Dublin Core, OGC APIs, DCAT or whatever. I've paved the way to allow this by just requiring the fiboa_version and fiboa_extensions fields. Everything else is right now open to implementors. But as we should guide users to something for now, I recommended STAC. But if discussions across fiboa participants lead to something else, happy to switch. I'd say open an issue and propose a different standard for collection-level metadata to start the discussion...
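Under that proposal, a conformance check for the collection-level metadata reduces to a presence test for two fields. A minimal sketch (the function name is hypothetical, and value types are deliberately not checked here):

```python
REQUIRED_FIELDS = ("fiboa_version", "fiboa_extensions")


def missing_required_fields(collection: dict) -> list[str]:
    """Return the required fiboa fields that are absent from the collection metadata."""
    return [name for name in REQUIRED_FIELDS if name not in collection]


# Extra fields from any other metadata standard are simply ignored
print(missing_required_fields({"fiboa_version": "0.1.0", "license": "CC-0"}))
```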

It makes FIBOA less 'demanding' if it limits spreading the word "FIBOA" over everything that isn't specific to FIBOA (eg "fiboa_version" is more reasonable than "fiboa->license" when you can just call it "license") and name it as an optional field in FIBOA that happens to be the same as the Dublin core term. Then I can make a single object that is compliant with JSON-LD, fiboa, GeoJSON and DCMI all in the same file/API endpoint.

Isn't that already possible with this proposal as long as you can link to an external collection that includes the two required fiboa fields? Look at the individual-features example. There it's just license, not fiboa -> license...

The other reason is to avoid wherever possible forcing people to create separate fiboa-specific implementations of data distributions they already provide. My target really one of: a provider already has an API that provides FeatureCollections (eg we have one, Digifarm has one, planet has one, anyone who has implemented OGC Features API compliant API has one), or datasets that they already export. So, my assessment will be: what is the minimal I have to do to make what I already provide FIBOA compliant, and how proprietary is that?

I'd need to look at the specific APIs above, but this proposal is OGC API - Features compliant AFAIK. Any pointers where I can find documentation about the other APIs?

That is why it's different from STAC: here we're not starting from a green field, we have existing standards and existing implementations it would be advantageous to align to, because it gives us adoption very quickly.

STAC didn't start from a green field either ;-) OpenSearch, OGC CSW, ISO 19115, even Dublin Core was in the discussion.

I would like to be able to make the native API and downloads we already provide to be FIBOA compliant, rather than create special FIBOA variants, which is what happens if you force me to repeat properties inside a special "fiboa" container.

Is that feasible at all if the (non-Collection) metadata is not already fully aligned? Right now you'd need to align the features themselves and add the two fiboa_* properties or a link to something that includes these two fields.

andyjenkinson commented 7 months ago

I don't really understand your comments as all the discussion is about collection level metadata and you're giving examples of other collection level metadata defined by OGC API, but then saying it is non-collection metadata. I don't see how it is possible to confuse these things. Everything expressed at the root of the FeatureCollection is about the collection, and everything inside a feature is about the feature. A STAC collection is compatible with an OGC Features API collection - it's expressly stated to be so in the STAC spec. That's all I'm advocating - to take the same approach here. An OGC Features API collection is a GeoJSON featurecollection, these are not separate concepts - it has an id, title, description etc which, coincidentally, are also part of the STAC collection object schema. It just has also a "features" member containing all the individual features. STAC doesn't put its properties in a separate "stac" object inside the collection, it just defines the properties at the root of the object - some of which are the same ones defined by the OGC spec (id, title). I don't see why we can't call it a FIBOA collection and take the same approach, just don't choose properties that clash with OGC Features API, reuse them, and prefix anything that's expressly only applicable to FIBOA itself like you already are (fiboa_version). Job done. FIBOA doesn't need to directly depend on STAC or refer to a second separate collection, it just needs to define a JSON document with fields that are largely the same as STAC, whilst also being compatible with an OGC API FeatureCollection like STAC is. No external files. If you don't want to mandate that a FIBOA collection is a FeatureCollection fair enough, you'll need to include a reference to a separate collection inside every feature but that's no problem.

I'm also confused by the suggestion of referencing additional files which is the exact thing that isn't already part of the existing APIs I'm suggesting to try to be compatible with - you have to create a new endpoint just to provide some other file to describe the same thing you're already describing - the (feature)collection. Since we, and anyone else implementing GeoJSON FeatureCollections containing our boundaries (including OGC Features API) already have this collection object implemented, we can just add a few properties to it - just like we add them to the Feature part of the specification (and we don't do that in a special "fiboa" property, we just add them).

It just seems like we are totally missing each other and don't actually have a common understanding of what the problem even being solved is and the basic objects in the specification (like, what even IS a FIBOA collection if it's not a collection of FIBOA features).

If you want to find the documentation for the GFID API it's in the data survey, I think digifarm's is too but not sure. I think that we can quite easily make our API (specifically the GET /boundaries and GET /boundary-references) FIBOA-compliant without adding a separate collection concept or 'file' to contain it. The FeatureCollection would just have two or three extra properties. The download exports are in the same format so again, can already be compliant as a pure FeatureCollection. We can also easily make an OGC Features API using exactly the same payload structure with some further properties without any clashes, and if we wanted to we could make downloads that split the collection and features into a separate collection JSON file with the same exact metadata as the featurecollection alongside millions of other individual JSON Feature files.

m-mohr commented 7 months ago

It just seems like we are totally missing each other

Indeed. It feels like it would be more time-efficient to talk about this in one of the next fiboa calls so that we can clarify individual questions and misunderstandings directly with examples. I really want to get us on the same page here. It might be that we disagree in certain parts, but I think we are actually not that far apart... :-)

m-mohr commented 3 months ago

Results from the discussion yesterday:

No relation with STAC (removed any mentions in the spec, see https://github.com/fiboa/specification/commit/fbb0b7684b1d072227156194a671855ccc5eb5c3), we can re-use properties if we see a fit and they are scalar, but that applies to all existing standards, not just STAC. We generally try to keep property values simple (i.e. scalars), which e.g. for providers in STAC is not the case (array of objects), so it's not a good fit. For provider we'll create an extension. We'll only allow one value generally unless there's a common use case to provide multiple values. For provider for example we don't necessarily see a need.
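The "scalars only" rule of thumb can be checked mechanically when evaluating whether a property from another standard is a fit. A minimal sketch, assuming strings, numbers, booleans and null count as scalar (the function name and that definition are assumptions for this example):

```python
SCALAR_TYPES = (str, int, float, bool, type(None))


def non_scalar_properties(properties: dict) -> list[str]:
    """Return the names of properties whose values are arrays or objects."""
    return [
        name for name, value in properties.items()
        if not isinstance(value, SCALAR_TYPES)
    ]


# STAC-style "providers" is an array of objects, so it is flagged
candidate = {"license": "CC-0", "providers": [{"name": "Example Org"}]}
print(non_scalar_properties(candidate))  # ['providers']
```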

The general discussion around collection level properties vs. feature properties will be held in https://github.com/fiboa/schema/issues/3 and #26.