Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

Embeddings with land use land cover fields, or other attributes #84

Closed: weiji14 closed this issue 7 months ago

weiji14 commented 9 months ago

Opening a parallel thread to #35 to ask what other attributes are needed alongside the vector embeddings themselves.

Currently, we have implemented (or are about to implement):

But @Clay-foundation/ode, it seems like you would require more than just the embeddings and spatiotemporal metadata for the Web App?

> Do we have metadata or other datasets that have already been computed for this area? There's a breakdown for the land cover given up above, for example; do we have that for each of the chips?
>
> I'm interested in exploring the relationships between these embeddings and known datasets over the area.

Originally posted by @MaceGrim in https://github.com/Clay-foundation/model/issues/35#issuecomment-1839727151

To be clear, is this extra land cover metadata something that falls on @Clay-foundation/devseed's plate, or can @Clay-foundation/ode use the spatiotemporal metadata from #73 to find the landcover type statistics? Besides landcover type, what other attributes are worth adding to the embedding file?

danhammer commented 9 months ago

I defer to @brunosan on the question:

> To be clear, is this extra land cover metadata something that falls on @Clay-foundation/devseed's plate, or can @Clay-foundation/ode use the spatiotemporal metadata from https://github.com/Clay-foundation/model/pull/73 to find the landcover type statistics? Besides landcover type, what other attributes are worth adding to the embedding file?

We will need this metadata to do some of the dynamic visualizations we showed in the past. We can add this metadata, but it won't be nearly as efficient as working within the imagery pipeline that @Clay-foundation/devseed already has.

weiji14 commented 9 months ago

The neural network model we have does not output Land Use Land Cover (LULC), so there would need to be a separate pipeline for this. Note that the original sampling we did in #28 is based on WorldCover, which consists of annual grids for 2020-2021, not exact LULC statistics on the acquisition date of the satellite imagery we ran the embedding on. We could perhaps use something like DynamicWorld, which has a 1-to-1 temporal match with Sentinel-2, but there is no STAC catalog for this as far as I'm aware, so it would take a lot of time to set up. Cc @yellowcap.
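
For reference, here is a minimal sketch of what a per-chip LULC statistics step in such a separate pipeline could look like, assuming a locally downloaded ESA WorldCover GeoTIFF and chip bounding boxes given in the raster's CRS (the file name, function name, and bounds below are illustrative, not part of the repo):

```python
# Sketch only: per-chip WorldCover class fractions from a local GeoTIFF.
# The tile name and chip bounds are hypothetical examples.
import numpy as np
import rasterio
from rasterio.windows import from_bounds

def worldcover_fractions(cog_path: str, bounds: tuple) -> dict:
    """Return {class_value: fraction} of WorldCover classes inside a chip's
    (minx, miny, maxx, maxy) bounding box, given in the raster's CRS."""
    with rasterio.open(cog_path) as src:
        # Convert the chip's bounding box to a pixel window and read it.
        window = from_bounds(*bounds, transform=src.transform)
        data = src.read(1, window=window)
    # Count each land cover class value and normalize to fractions.
    values, counts = np.unique(data, return_counts=True)
    total = counts.sum()
    return {int(v): float(c) / total for v, c in zip(values, counts)}

# Hypothetical usage for one chip footprint:
stats = worldcover_fractions(
    "ESA_WorldCover_10m_2021_v200_N48E009_Map.tif",  # assumed local tile
    (9.00, 48.00, 9.05, 48.05),
)
```

These per-chip fractions could then be joined onto the embeddings table as extra columns.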

In #86, I've linked each row of embeddings to the source GeoTIFF file, so it should be possible to at least see the RGB image associated with each embedding. Looking at https://medium.com/earthrisemedia/how-we-judge-earth-observation-foundation-model-quality-part-1-intuition-building-623e527d560a, it seems that the visualization was created using https://github.com/nomic-ai/deepscatter, which expects a Parquet/Feather file with x, y and other categorical columns. The x and y columns can be derived from the current GeoParquet file's geometry column, but the categorical columns would need some work.
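
To illustrate the x/y derivation, a minimal sketch assuming the embeddings sit in a GeoParquet file with a geometry column (the file names below are assumptions, and a real "landcover" categorical column would come from the pipeline above):

```python
# Sketch only: derive the x/y point columns deepscatter expects from the
# chip centroids in a GeoParquet file, and write a flat Parquet table.
import geopandas as gpd
import pandas as pd

gdf = gpd.read_parquet("embeddings.gpq")  # hypothetical file name

# Use the chip centroid for the x/y columns. (For an embedding-space view
# you would instead project the embedding vectors to 2D with e.g. UMAP;
# centroids only give a map-space layout.)
centroids = gdf.geometry.centroid
gdf["x"] = centroids.x
gdf["y"] = centroids.y

# deepscatter wants a flat Parquet/Feather table, so drop the geometry;
# categorical columns (e.g. a future "landcover" field) would slot in here.
pd.DataFrame(gdf.drop(columns="geometry")).to_parquet("deepscatter_points.parquet")
```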

Not promising that we can do this by the end of the year, but could you send a sample Parquet/Feather file that was used for the dynamic visualization? Then we can at least see what the data inside the categorical columns should look like.

brunosan commented 7 months ago

Closing this, since the items of creating the embeddings and adding the source location, time, and file are all done.

We still need to get better at the last point, exploring the embeddings, but that probably needs a narrower scope in a separate ticket.