fiboa / schema

The schema language for fiboa
Apache License 2.0

Schema for Collection Properties #3

Open m-mohr opened 2 months ago

m-mohr commented 2 months ago

The AI Ecosystem Extension has a Collection Property defined: https://github.com/fiboa/ai-ecosystem-extension

There's no way yet to define Collection Properties in the fiboa schemas. The spec currently differentiates between Collection Properties and Feature Properties, and Collection Properties can only be validated via STAC extensions right now (even though STAC Collections are not required).

Related to: https://github.com/fiboa/specification/issues/13

m-mohr commented 2 months ago

The most elegant solution seems to not distinguish between Collection Properties and Feature Properties. We just define Properties. By default we assume Properties are Feature Properties and reside there, but whenever all values for a property are the same, the field can be moved up to Collection Properties if desired. Example: https://github.com/fiboa/ai-ecosystem-extension/blob/main/examples/geojson/ai-ecosystem-example.json?short_path=4d7428f#L7

Validation would need to check whether it only exists in one place and then dynamically check whether something is in the Collection Properties or in the Feature Properties and validate it there.
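The validation step described above could look roughly like the following. This is only a sketch under assumptions: collection-level and feature-level properties are modelled as plain dicts, and `validate_dataset` and the per-property validator callables are hypothetical names, not part of any existing fiboa tooling.

```python
from typing import Any, Callable

# Hypothetical: one validator callable per property name.
Validator = Callable[[Any], bool]


def validate_dataset(
    collection: dict[str, Any],
    features: list[dict[str, Any]],
    schema: dict[str, Validator],
) -> list[str]:
    """Validate each property wherever it lives, rejecting duplicates."""
    errors = []
    for name, is_valid in schema.items():
        in_collection = name in collection
        in_features = any(name in f for f in features)

        # A property may only exist in one place.
        if in_collection and in_features:
            errors.append(f"{name}: present at both collection and feature level")
            continue

        # Same validator, applied at whichever level holds the value.
        if in_collection:
            if not is_valid(collection[name]):
                errors.append(f"{name}: invalid collection value")
        else:
            for i, feature in enumerate(features):
                if name in feature and not is_valid(feature[name]):
                    errors.append(f"{name}: invalid value in feature {i}")
    return errors
```

The point of the sketch is that a single schema entry serves both levels; only the lookup location changes.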

This would also solve https://github.com/fiboa/specification/issues/13 as the schemas for Collection and Feature Properties are the same and they can be placed wherever it fits best. On the other hand, the Collection Properties would be less "advanced". For example, the STAC Provider Object is a list of providers with url, name, ... while we'd need separate properties for a single provider already (e.g. provider_name, provider_url). What to do with multiple providers is open...

I assume properties by default would be allowed for Collections unless we explicitly set a new collection keyword in the schemas to false. This would for example apply to id, bbox, area, perimeter and geometry.

This would also mean that fiboa_version and fiboa_extensions could potentially be used on a per Feature level unless we explicitly allow them to be only in collections. I guess we also need a feature: false flag in the schema then.
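To make the two flags concrete, here is a minimal sketch of how a placement check might read them. The `collection` and `feature` keywords are the ones proposed above and default to true; the schema layout, the `type` values, and the function names are assumptions for illustration only.

```python
from typing import Any


def allowed_levels(definition: dict[str, Any]) -> set[str]:
    """Levels a property may live at; both are allowed unless set to False."""
    levels = set()
    if definition.get("collection", True):
        levels.add("collection")
    if definition.get("feature", True):
        levels.add("feature")
    return levels


def placement_errors(
    schema: dict[str, dict[str, Any]],
    collection: dict[str, Any],
    features: list[dict[str, Any]],
) -> list[str]:
    """Report properties that sit at a level their schema forbids."""
    errors = []
    for name, definition in schema.items():
        levels = allowed_levels(definition)
        if name in collection and "collection" not in levels:
            errors.append(f"{name}: not allowed at collection level")
        if "feature" not in levels and any(name in f for f in features):
            errors.append(f"{name}: not allowed at feature level")
    return errors


# Hypothetical schema fragment using the proposed keywords.
SCHEMA = {
    "id": {"type": "string", "collection": False},
    "geometry": {"type": "geometry", "collection": False},
    "fiboa_version": {"type": "string", "feature": False},
    "license": {"type": "string"},  # allowed at either level
}
```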

m-mohr commented 2 months ago

Comment from @cholmes from a chat conversation:

I guess one problem I see is that some things you want at the collection level. Like 'license' - it'd be good to look in one place. So like if you combined a few datasets then the collection level is 'various', but then if you merged that collection with another then how do you ensure that 'various' doesn't get written at the feature level? That's maybe not the best example - but just if we're automerging and moving everything from collection to feature level I could see some things go wrong.

My response:

The license example would remove license from the collection level metadata in the "various" case. A property couldn't be both at the collection and feature level, i.e. that would be reported as invalid to avoid that issue. With the concept I have in mind and described in the issue, I don't see any risks right now. Which do you see?

cholmes commented 2 months ago

The license example would remove license from the collection level metadata in the "various" case. A property couldn't be both at the collection and feature level, i.e. that would be reported as invalid to avoid that issue.

Ah, ok - I hadn't made the leap that the license would be removed from collection level metadata and that a property couldn't be at both the collection and feature level. So the idea is that any property that has more than one value has to be represented at the feature level.
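Read the other way round, that rule means a property can only be moved up when every feature carries the same value. A rough sketch of such a "hoist" step, assuming plain dicts and a hypothetical `exclude` list taken from the feature-only examples mentioned earlier in the thread:

```python
from typing import Any

# Properties that should never move to the collection (from the examples above).
FEATURE_ONLY = ("id", "bbox", "area", "perimeter", "geometry")


def hoist_constant_properties(
    collection: dict[str, Any],
    features: list[dict[str, Any]],
    exclude: tuple[str, ...] = FEATURE_ONLY,
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Move properties with one shared value up to the collection level."""
    new_collection = dict(collection)
    new_features = [dict(f) for f in features]
    if not features:
        return new_collection, new_features

    for name, first in features[0].items():
        if name in exclude:
            continue
        # Only hoist when every feature has the property with an equal value.
        if all(name in f and f[name] == first for f in features):
            new_collection[name] = first
            for f in new_features:
                del f[name]
    return new_collection, new_features
```

Anything with more than one distinct value, or missing from some features, simply stays where it is.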

I was imagining that it could be nice in some cases to just be able to quickly look at the metadata and literally see 'license = various' instead of having to iterate through everything. Or to know that every license is 'open'. Seems like we start to get into a desire for STAC-style summaries.

I'm also wondering about things that may only make sense at the collection level. The one that comes to mind is the 'bounds' of a collection. I guess the temporal extent is similar. Are these 'special' the other way from your non-collection examples above (id, bbox, etc.)?

cholmes commented 2 months ago

I am leaning towards this idea of making it so there is no difference between collection and feature level properties. Just hoping there's not lots of special cases we'll need to come up with. I tend to prefer to specify things at the 'right' level and not have things be so fluid, but it does seem like many people will combine fiboa datasets and thus 'push' many things that would be 'collection' level in a single dataset to be represented at the feature level when combined.

I do think it'd be good to 'stress test' the idea a little bit, to think through some of the 'merge' use cases where people are bringing different datasets together and make sure the properties would work right. This doesn't have to be super extensive, as we can also just change it again if it doesn't work.
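As a starting point for such a stress test, a merge could be sketched as follows. This is an assumption about how merging might work, not an agreed design: collection-level properties that agree across all inputs stay at the collection level, and everything else is pushed down onto the features of the dataset it came from. `merge_datasets` is a hypothetical helper and the values in the usage example are made up.

```python
from typing import Any

Dataset = tuple[dict[str, Any], list[dict[str, Any]]]  # (collection, features)


def merge_datasets(datasets: list[Dataset]) -> Dataset:
    """Merge datasets, keeping only agreeing properties at collection level."""
    if not datasets:
        return {}, []

    # A property stays at the collection level only if every input
    # collection has it with an equal value.
    first_collection = datasets[0][0]
    shared = {
        name: value
        for name, value in first_collection.items()
        if all(name in c and c[name] == value for c, _ in datasets)
    }

    merged_features = []
    for collection, features in datasets:
        for feature in features:
            new_feature = dict(feature)
            # Push differing collection values down to each feature.
            for name, value in collection.items():
                if name not in shared:
                    new_feature[name] = value
            merged_features.append(new_feature)
    return shared, merged_features
```

With two inputs that differ only in `license`, the license ends up on each feature and disappears from the merged collection, so no 'various' value is ever written:

```python
a = ({"fiboa_version": "0.2.0", "license": "CC-BY-4.0"}, [{"id": "a1"}])
b = ({"fiboa_version": "0.2.0", "license": "CC0-1.0"}, [{"id": "b1"}])
collection, features = merge_datasets([a, b])
```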

One other semi-related thing to think through is using the same schemas when the geometry is just referenced instead of included. The biggest use case is time series data, like daily soil moisture or NDVI - having every date as a column off the geometry gets pretty onerous, but it'd be nice to have a defined way to talk about time series. But this probably deserves its own issue.

m-mohr commented 2 months ago

I am leaning towards this idea of making it so there is no difference between collection and feature level properties.

Agreed.

What do you mean by "fluid"? Feels like there might be a misunderstanding hiding here?

But this probably deserves its own issue.

Indeed.