fiboa / specification

Field Boundaries for Agriculture (fiboa) - a specification that describes important properties of field boundaries
Apache License 2.0

Prefixes for extensions #36

Open m-mohr opened 1 week ago

m-mohr commented 1 week ago

From the fiboa Slack:

@cholmes wrote:

I'm working on a Planet converter (got to a PR that needs some help) and started on an extension for Planet's extra values, to try to do it a bit more 'proper'. I did them as planet:qa and planet:mcid, but I'm questioning if we should bring our stac convention of the : with a prefix over or not. It doesn't work great in SQL (you have to quote it), and I feel like it's a special character in other places too. I'm sorta contemplating just not doing (as much?) prefixing. Like it'll be easier to have Planet's geopackage (that's not on fiboa) have an equivalent geoparquet/fiboa that has mostly the same fields, especially for the ones where we don't have a 'standard'. And then also wondering about just doing like tillage_occurred instead of tillage:occurred for places where a prefix does make some sense.
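
The SQL quoting problem mentioned here can be reproduced with a minimal sketch using Python's built-in sqlite3 module. The table name `fields` and the sample values are made up for illustration, and other SQL engines may differ in the details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A colon is not valid in a bare identifier, so the column name has to be double-quoted.
conn.execute('CREATE TABLE fields (id TEXT, "planet:qa" INTEGER, tillage_occurred INTEGER)')
conn.execute("INSERT INTO fields VALUES ('a', 1, 0)")


def try_query(sql):
    """Return the fetched rows, or None if SQLite rejects the statement."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


unquoted = try_query("SELECT planet:qa FROM fields")        # rejected
quoted = try_query('SELECT "planet:qa" FROM fields')        # works, but needs quotes
underscore = try_query("SELECT tillage_occurred FROM fields")  # works unquoted
```

The underscore variant is the only one of the three that can be written without quoting.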

@m-mohr wrote:

The : is indeed not ideal, but on the other hand _ is too generic and may lead to conflicts with existing fields much more often. Also, the _ might be ambiguous... crop_type_identifier_code - What's the prefix? crop? crop_type? What if someone defines an extension crop with a field type_identifier and an extension crop_type with a field identifier?
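
The ambiguity described above can be made concrete with a small sketch. The helper names `candidate_splits` and `qualified` are hypothetical and exist only for this illustration.

```python
def candidate_splits(name, sep):
    """List every (prefix, field) pair obtained by cutting name at one separator."""
    parts = name.split(sep)
    return [(sep.join(parts[:i]), sep.join(parts[i:])) for i in range(1, len(parts))]


def qualified(extension, field, sep):
    """Build the full property name for a field that belongs to an extension."""
    return f"{extension}{sep}{field}"


# With "_" as separator there are three plausible prefixes; with ":" there is one.
underscore_splits = candidate_splits("crop_type_identifier_code", "_")
colon_splits = candidate_splits("crop_type:identifier_code", ":")

# Two different extension/field pairs collide with "_" but stay distinct with ":".
collides_with_underscore = (
    qualified("crop", "type_identifier", "_") == qualified("crop_type", "identifier", "_")
)
collides_with_colon = (
    qualified("crop", "type_identifier", ":") == qualified("crop_type", "identifier", ":")
)
```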

@andyjenkinson wrote:

I think the colon has a clear place as a URI prefix. However one thing we could do is to actually make the spec JSON-LD. That means the schema properties are full URIs, and an extension is just a URI prefix (namespace) for the properties within that extension. Then you use a JSON-LD context object (which can also be shifted to an external URL) that maps property keys to a full URI. That makes your JSON look like normal JSON (and can even be a way of keeping all the proprietary property keys you already have) but automatically convertible to the common standard. Basically you could largely retrofit FIBOA extension compatibility in an almost invisible way. By the way it also means you can literally reuse concepts that already exist in other ontologies, like DCAT for datasets, PROV-O, agri domain ontologies etc. To me, I think a great target would be to aim for an implementation that can hit FIBOA, JSON-LD and OGC Features API compatibility all at the same time. The actual spec for an extension would be expressed in JSON-LD, but all of the examples would look like plain old JSON with one ‘@context’ property that contains the schema mapping.

cholmes commented 1 week ago

Also, the _ might be ambiguous... crop_type_identifier_code - What's the prefix? crop? crop_type? What if someone defines an extension crop with a field type_identifier and an extension crop_type with a field identifier?

Yeah, I think the essence of the idea to me is more to stop worrying about prefixes. Like letting Planet do 'mcid' instead of 'planet:mcid' or 'planet_mcid'. This may just be a bad idea - but I'm just sorta wondering what we've really gained by having all the prefixes. If you have a data model and want to validate it with a few different extensions then you'll be making choices about what to validate it with. The chances of overlap seem small, and if there's a set of 'known' extensions then people introducing new extensions that might need to be compatible can tweak their names. It would essentially just be a 'looser' approach to the ecosystem - here's a set of attributes that mean these things, but they're not trying to define things for all times.

So like Planet would have 'mcid' defined at https://planet.github.io/fiboa/planet-fb-extension/v0.1.0/schema.yaml, some other org can have 'mcid' (and maybe it means something different...) defined at https://company.com/fiboa/company-extension/v0.1.0/schema.yaml, but if they wanted to make the ecosystem more compatible then they could just establish a new community extension at https://fiboa.org/mcid-extension/v1.0.0/schema.yaml, but the field name would stay 'mcid' - it'd just use the community-built JSON schema to validate.

I think the colon has a clear place as a URI prefix. However one thing we could do is to actually make the spec JSON-LD. That means the schema properties are full URIs, and an extension is just a URI prefix (namespace) for the properties within that extension.

And yeah, I think this is the other extreme of the approach, attempting to have the prefix have 'real' meaning. The original idea of the prefix in STAC was inspired by JSON-LD, with the intent to try to do just what you're saying @andyjenkinson - tie the prefixes to full URIs with well-known meaning in JSON-LD. I think one thing that threw it off is that the 'geo' representation in JSON-LD wasn't great if I remember right, and very few tools had support for it. But it's probably worth taking another run at figuring out if we could fully support it - I agree that hitting 'FIBOA, JSON-LD and OGC Features API compatibility all at the same time' would be really great.

The actual spec for an extension would be expressed in JSON-LD, but all of the examples would look like plain old JSON with one ‘@context’

But for extensions wouldn't you need one 'context' per extension? Unless you put all the extensions into a single 'fiboa' context? Like if you don't put them all in a single context then there'd still be prefixes for all the ones that aren't the primary / default context?

It also seems like when you map from JSON-LD to GeoParquet you'd need to bring the prefixes back in consistently, or else try to fully represent the URIs in Parquet.

andyjenkinson commented 1 week ago

The context is a property included at the root of each payload, the value of which is either a context object or a URL of one, rather like a hypermedia link. So it can be unique to each implementation/dataset, and can include terms from any number of extensions. The extensions would give example contexts that correspond to example payloads, but when you implement FIBOA you can either:

Bear in mind JSON-LD contexts can remap any property to a URI, not only expand a prefix. This would allow Planet to have whatever terms it wanted in its payloads, they wouldn't even have to match the name of the property in the FIBOA spec and don't have to contain any prefixes; the mapping to FIBOA would be done entirely by the context file. All the machine readable stuff like validation, conversion etc can use the processed JSON-LD representation, but the files look completely normal to users and 'FIBOA unaware' software.

Regarding Parquet, yes, either you define them as full URIs, or you would carry forward the JSON-LD context mappings into the headers (and vice versa of course) so that they look ‘normal’ in any existing software that processes Parquet data. Basically the headers would have to say "these are the property names, and these are their equivalent URIs".

The one thing I’m unsure about is the geo stuff. You may know there is a GeoJSON-LD context but I have not looked at it in detail. I would not want to abandon GeoJSON for some other random representation of geometry in JSON-LD, it’s about making a standard GeoJSON payload processable as JSON-LD. In fact that context is a good example explaining what I mean above about using the context object to essentially make JSON-LD look like completely normal JSON unchanged from its original format. All it is is a context, which maps all the original GeoJSON schema items to URIs exactly as they are.

Personally I am not a fan of going to the other end of the scale and just allowing a free-for-all on names. I get that namespaces are annoying, but I can see clashes happening, especially for terms like "crop", and in particular it's useful to distinguish 'uncontrolled' terms. Personally I'm not sure it's necessary to make a Planet extension, as by definition there won't be any terms in common with anyone else. So long as FIBOA allows additional properties, just document your schema; anything proprietary doesn't need a prefix. Then focus on trying to standardise things that seem common in a topic-specific, not vendor-specific, extension. Unless you adopt JSON-LD or something like it, someone's going to have to change their schema anyway.

cholmes commented 5 days ago

@m-mohr - why didn't you use a prefix on flik-extension?

m-mohr commented 5 days ago

Good question, it was my first one and more an example. The fields in the original had no prefix, I guess I either forgot it or thought it's simpler, can't remember 😅

cholmes commented 5 days ago

Cool. Yeah, both reasons to me point a bit to how it could be nicer to not have to think about them.

Curious what you think about JSON-LD and contexts, and if you'd be up to dig into it a bit. Like if there is a way to enable us to pass through the 'schema' information without including the prefix, like all the way through GeoParquet. I'm a bit less sure how much GeoParquet metadata should really handle - I wonder if there are any other examples of JSON-LD -> Parquet. And if it'll work when pulling a few different 'contexts' into one.

m-mohr commented 5 days ago

I'm travelling next week, but I can dig into it afterwards; it will likely take three weeks or so. It doesn't seem to solve the colon / quote issue though. I'm not sure whether we can solve that if the allowed set of characters for SQL names is A-Z, 0-9 and _.

I worry a bit that without prefixes we'll end up with various extensions that use crop_id and maybe even datasets that use no extension and have crop_id. If you want to merge them, what do you do with the field names? You can't do it because the field is defined differently.
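
A rough sketch of the merge problem, under invented assumptions: the two schemas, their data types and the prefixes `ext_a` / `ext_b` are all hypothetical and only illustrate how a shared unprefixed name can hide two different definitions.

```python
# Hypothetical field definitions from two datasets; the data types are invented.
dataset_a = {"id": "string", "crop_id": "int32"}   # crop_id as a numeric code
dataset_b = {"id": "string", "crop_id": "string"}  # crop_id as a text code


def merge_conflicts(a, b):
    """Return the field names present in both schemas with different definitions."""
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])


conflicts = merge_conflicts(dataset_a, dataset_b)

# With (made-up) prefixes the two definitions no longer share a name.
prefixed_a = {"id": "string", "ext_a:crop_id": "int32"}
prefixed_b = {"id": "string", "ext_b:crop_id": "string"}
prefixed_conflicts = merge_conflicts(prefixed_a, prefixed_b)
```

Without prefixes a merge tool would have to read both schemas to notice the mismatch; with prefixes the columns can simply sit side by side.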

In STAC we see that many clients actually don't check the stac_extensions array and just use the fields because they can be sure there are usually no conflicts. Without prefixes, clients would need to be developed with more care; they may even need to read all schemas.

So I tend towards a prefix at least for "common"/debatable field names. If you have names like "flik" that are unlikely to conflict, I could see us allowing them without prefixes. We also sometimes do that in STAC.

JSON-LD is an open question for now.

andyjenkinson commented 5 days ago

It could solve the colon issue because there won’t be any colons in the payload any more, only in the context file which only ever needs to be read when doing things like validating or converting. And they’ll be in values, not keys. You can put whatever property names you like in your implementation, they don’t have to be named the same as the ‘standard’ ones, the context file provides a mapping. So the GeoJSON file literally looks like any normal JSON with one extra property ‘@context’ that links to the context. Everything else can be your native implementation, and if you want that to be a translation of a SQL schema, have at it. Think of the context as a set of instructions of how to convert the GeoJSON Feature/FeatureCollection to a FIBOA JSON-LD schema. It’s pretty much a glorified ‘find and replace’.

So for example in your JSON you could have:

{
 "@context": "https://planet.com/path/to/my/context.json",
 "id": "ABCD1234",
 "date": "2024-06-29"
}

And after mapping it would look something like:

{
 "http://purl.org/dc/terms/identifier": "ABCD1234",
 "https://fiboa.org/schema/core#determination_date": "2024-06-29"
}

Here, I use an example where the FIBOA core specification would define the 'id' property, as a mapping to the existing Dublin core vocabulary term 'identifier', as well as its own unique properties.
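
The 'find and replace' idea can be sketched in a few lines of plain Python. This is a deliberate simplification and not a JSON-LD processor (real expansion produces a more verbose structure), and the contents of `PLANET_CONTEXT` are an assumption about what such a context file might hold.

```python
# Assumed content of the context file; in practice it would be fetched
# from the URL given in the payload's "@context" property.
PLANET_CONTEXT = {
    "id": "http://purl.org/dc/terms/identifier",
    "date": "https://fiboa.org/schema/core#determination_date",
}


def apply_context(payload, context):
    """Rename each key to the URI the context maps it to and drop "@context".

    Keys without a mapping are kept unchanged in this sketch; a real JSON-LD
    processor would drop them during expansion.
    """
    return {context.get(key, key): value
            for key, value in payload.items() if key != "@context"}


payload = {
    "@context": "https://planet.com/path/to/my/context.json",
    "id": "ABCD1234",
    "date": "2024-06-29",
}
mapped = apply_context(payload, PLANET_CONTEXT)
```

The payload itself never contains a colon in a key; only the context values do.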

Meanwhile the FIBOA examples (and the format you'd use if you were creating a file from scratch) could look like:

{
 "@context": "https://fiboa.org/schema/context.json",
 "id": "ABCD1234",
 "determination_date": "2024-06-29",
 "myextension:image_resolution": "10m"
}

And that would map to the same RDF graph as the Planet example, plus the extension property:

{
 "http://purl.org/dc/terms/identifier": "ABCD1234",
 "https://fiboa.org/schema/core#determination_date": "2024-06-29",
 "https://fiboa.org/schema/myextension#image_resolution": "10m"
}

I've simplified the structure of all these of course. I'm typing on my phone.
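
To illustrate how two differently named payloads can end up with the same properties, here is a hedged sketch that also handles a prefix such as `myextension:`. Both context dictionaries are assumptions made for this example, and the functions only mimic a small part of what a JSON-LD processor does.

```python
# Assumed context contents for the two payload styles shown in this comment.
FIBOA_CONTEXT = {
    "id": "http://purl.org/dc/terms/identifier",
    "determination_date": "https://fiboa.org/schema/core#determination_date",
    "myextension": "https://fiboa.org/schema/myextension#",  # prefix definition
}
PLANET_CONTEXT = {
    "id": "http://purl.org/dc/terms/identifier",
    "date": "https://fiboa.org/schema/core#determination_date",
}


def expand_key(key, context):
    """Map a key to a URI, either as a whole term or via its 'prefix:' part."""
    if key in context:
        return context[key]
    prefix, sep, suffix = key.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix
    return key  # unmapped keys stay as-is in this sketch


def expand(payload, context):
    """Expand every key of the payload except "@context"."""
    return {expand_key(k, context): v for k, v in payload.items() if k != "@context"}


planet_expanded = expand({"id": "ABCD1234", "date": "2024-06-29"}, PLANET_CONTEXT)
fiboa_expanded = expand(
    {"id": "ABCD1234", "determination_date": "2024-06-29",
     "myextension:image_resolution": "10m"},
    FIBOA_CONTEXT,
)

# Every property of the Planet payload also appears, identically, in the FIBOA one.
planet_is_subset = planet_expanded.items() <= fiboa_expanded.items()
```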

The standard context file would contain all the mappings from the core and extensions. Here the only reason for the colon is the same one we have today: to allow independent development of extensions which might simultaneously use the same property name. But if this isn't important (e.g. each extension is allowed to basically claim a property name by using it first) it could be removed. Either way, each extension would have to provide a standard JSON-LD context that would translate the "nice plain JSON" to full messy URIs. The key to this is to hide the mechanics of JSON-LD as much as possible, to make creating and maintaining extensions and making compatible features as easy as possible.