cloudnativegeo / geo-embeddings-survey

A survey of use cases and current data schemas for vector embeddings in geoparquet
Apache License 2.0

Add Earth Index #1

Closed bengmstrong closed 5 months ago

bengmstrong commented 5 months ago

Adding spec for Earth Index embeddings.

Note: we're currently working on a fresh set of Earth Index embeddings that we'll publish to source.coop soon. I'll add that link to the spec once it's published!

cholmes commented 5 months ago

Awesome, looks good.

And just to be super clear, you're putting the model, dataset and embeddings data all under the 'geo' key that geoparquet uses? Like they live at the same level as columns and version / primary key? I'm 90% sure of that, just want to be 100% sure.

It might be good to make a file like https://github.com/opengeospatial/geoparquet/blob/main/examples/example_metadata.json that gives an example for each of the values, to make it even clearer.

I'm going to merge this, as there's no reason to not just do that as another PR.

bengmstrong commented 5 months ago

Yes, that metadata is being stored under the 'geo' key. We debated the right place to store it, but we see this as building on GeoParquet, so we decided to store it there. Interested in your thoughts on this too, though.
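To make the layout concrete, here is a minimal sketch of what "under the 'geo' key" could look like. The key names `model`, `dataset` and `embeddings` are placeholders taken from the question above, not the actual field names in the Earth Index spec, and their contents are left empty on purpose.

```python
import json

# Hypothetical layout: "model", "dataset" and "embeddings" are placeholder key
# names, not confirmed field names from the Earth Index spec.
geo = {
    "version": "1.1.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": ["Polygon"]}},
    "model": {},
    "dataset": {},
    "embeddings": {},
}

# Parquet file-level metadata is a flat string-to-string map, so the whole
# object is serialized to one JSON string stored under the "geo" key.
file_metadata = {"geo": json.dumps(geo)}

# File-level keys that GeoParquet itself defines as required.
GEOPARQUET_KEYS = {"version", "primary_column", "columns"}


def extra_geo_keys(file_metadata):
    """Return keys under 'geo' that are not core GeoParquet file-level keys."""
    parsed = json.loads(file_metadata["geo"])
    return sorted(set(parsed) - GEOPARQUET_KEYS)


print(extra_geo_keys(file_metadata))
```

With this layout the extra blocks sit at the same level as `columns`, `version` and `primary_column`, which is the arrangement being asked about.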

bengmstrong commented 5 months ago

Added clarification about this to the earth index embeddings spec

cholmes commented 5 months ago

> Yes, that metadata is being stored under the 'geo' key. We debated the right place to store it, but we see this as building on GeoParquet, so we decided to store it there. Interested in your thoughts on this too, though.

I think it's reasonable, but it will be interesting to see what others did and talk it through. For fiboa metadata we put it 'next to' geo:

{
    "fiboa":
    {
        "fiboa_version": "0.2.0",
        "fiboa_extensions":
        [
            "https://fiboa.github.io/inspire-extension/v0.2.0/schema.yaml"
        ],
        "id": "de_nrw",
        "title": "Field boundaries for North Rhine-Westphalia (NRW), Germany",
        "license": "dl-de/by-2-0",
        "attribution": "Land Nordrhein-Westfalen / Open.NRW - https://www.opengeodata.nrw.de/produkte/umwelt_klima/bodennutzung/landwirtschaft/"
    },
    "geo": {

    }
}
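The difference between the two placements can be sketched with a small lookup helper. This is only an illustration: `locate` is a hypothetical helper, the `embeddings` key is a placeholder name, and the fiboa values are trimmed from the example above.

```python
import json


def locate(file_metadata, name):
    """Report where a metadata block called `name` lives, if anywhere."""
    if name in file_metadata:
        return "next to geo"
    # Fall back to looking inside the JSON stored under the "geo" key.
    geo = json.loads(file_metadata.get("geo", "{}"))
    if name in geo:
        return "under geo"
    return None


# fiboa style: its own top-level key alongside "geo".
sibling_layout = {
    "fiboa": json.dumps({"fiboa_version": "0.2.0", "id": "de_nrw"}),
    "geo": json.dumps({}),
}

# Nested style: the extra block lives inside the "geo" object.
# "embeddings" is a placeholder key name for illustration.
nested_layout = {"geo": json.dumps({"embeddings": {}})}

print(locate(sibling_layout, "fiboa"))
print(locate(nested_layout, "embeddings"))
```

A reader that only understands GeoParquet can ignore a sibling key entirely, whereas the nested placement means it has to tolerate unknown keys inside `geo`.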

It does feel like it could be nice to have some generic 'dataset' metadata. Though then I wonder how much of that we should embed in the GeoParquet file versus keeping it in STAC as collection metadata. That said, I think with fiboa we also include STAC JSON in the GeoParquet file.

Will be good to talk through.

cholmes commented 5 months ago

Do you have a sample file? Even a very small one would do. If so, I can generate the JSON metadata from it to show this more easily.
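Generating that JSON could look roughly like the sketch below. It assumes the file's key-value metadata has already been read into a dict; with pyarrow, for instance, `pyarrow.parquet.read_schema(path).metadata` returns such a dict with bytes keys and values. The `raw` sample here is made up for illustration and `metadata_to_json` is a hypothetical helper.

```python
import json


def metadata_to_json(raw):
    """Decode Parquet key-value metadata into a pretty-printed JSON document."""
    decoded = {}
    for key, value in raw.items():
        key = key.decode("utf-8") if isinstance(key, bytes) else key
        text = value.decode("utf-8") if isinstance(value, bytes) else value
        try:
            # Values such as "geo" hold JSON serialized as a string.
            decoded[key] = json.loads(text)
        except json.JSONDecodeError:
            decoded[key] = text  # keep non-JSON values as plain strings
    return json.dumps(decoded, indent=4, sort_keys=True)


# Made-up sample standing in for metadata read from a real file.
raw = {
    b"geo": b'{"version": "1.1.0"}',
    b"created_by": b"example-writer",
}

print(metadata_to_json(raw))
```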