Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

Output column in GeoParquet table with statistics (mean, count, min/max) of an input band #185

Closed weiji14 closed 5 months ago

weiji14 commented 7 months ago

An idea that came up during our regular meetings: generalize the patch-level cloud-cover percentage info (i.e. extend #168) to other bands/channels, so that someone could filter on columns holding statistics (mean, count, min/max, percentage, etc.) derived from the input images. This would enable pre-filtering based on attributes when performing similarity search.

Example:

| embedding | cloud_cover_percentage | mean_elevation | max_temperature | bbox |
|---|---|---|---|---|
| [0.1, 0.4, ... x768] | 20% | 100m | 25°C | POLYGON(...) |
| [0.2, 0.5, ... x768] | 30% | 300m | 20°C | POLYGON(...) |
| [0.3, 0.6, ... x768] | 40% | 500m | 15°C | POLYGON(...) |
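
A rough sketch of the pre-filtering such columns would allow, using plain NumPy arrays in place of a GeoParquet table. The values, the 4-dimensional embeddings and the `prefiltered_search` helper are made up for illustration and are not part of the Clay codebase.

```python
import numpy as np

# Hypothetical rows mirroring the example columns above. Values are made up,
# and the embeddings are shortened to 4 dimensions instead of 768.
embeddings = np.array(
    [
        [0.1, 0.4, 0.0, 0.2],
        [0.2, 0.5, 0.1, 0.1],
        [0.3, 0.6, 0.9, 0.0],
    ]
)
cloud_cover_percentage = np.array([20.0, 30.0, 40.0])
mean_elevation = np.array([100.0, 300.0, 500.0])  # metres


def prefiltered_search(query, embeddings, mask, top_k=1):
    """Return indices (into the full table) of the top_k rows most similar
    to `query` by cosine similarity, considering only rows where mask is True.
    Assumes no embedding has zero norm."""
    candidates = np.flatnonzero(mask)
    if candidates.size == 0:
        return candidates
    subset = embeddings[candidates]
    sims = subset @ query / (np.linalg.norm(subset, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)[:top_k]
    return candidates[order]


# Keep only chips with at most 30% cloud cover and above 200 m mean elevation,
# then rank the remaining rows against the third embedding.
mask = (cloud_cover_percentage <= 30.0) & (mean_elevation > 200.0)
result = prefiltered_search(embeddings[2], embeddings, mask)
```

Filtering first keeps the similarity computation limited to rows that already satisfy the attribute constraints, which is the point of storing the statistics next to the embedding.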

This would involve generalizing the inference part of the code somehow, specifically the predict_step function here:

https://github.com/Clay-foundation/model/blob/0145e55bcf6bd3e9b19f5c07819a1398b6a22c35/src/model_clay.py#L855-L921

Some changes might also need to happen on the DataLoader side, so that these statistical measures are passed through. Parking this as an idea for now.
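
One way the pass-through could look, as a minimal sketch only: the actual Clay DataLoader is not shown in this issue, so `band_statistics`, `ChipsWithStatistics`, the band names and the `nodata` handling below are all assumptions. The idea is that each item carries the chip plus per-band statistics, so the inference step can copy them into output columns.

```python
import numpy as np


def band_statistics(band, nodata=None):
    """Summarise one 2D band as mean, count, min and max of its valid pixels."""
    values = np.asarray(band, dtype="float64")
    valid = np.isfinite(values)
    if nodata is not None:
        valid &= values != nodata
    count = int(valid.sum())
    if count == 0:
        nan = float("nan")
        return {"mean": nan, "count": 0, "min": nan, "max": nan}
    selected = values[valid]
    return {
        "mean": float(selected.mean()),
        "count": count,
        "min": float(selected.min()),
        "max": float(selected.max()),
    }


class ChipsWithStatistics:
    """Hypothetical map-style dataset that returns the chip together with
    statistics for the requested bands."""

    def __init__(self, chips, band_names, stat_bands, nodata=None):
        self.chips = chips            # sequence of arrays shaped (bands, height, width)
        self.band_names = band_names  # e.g. ["red", "elevation"]
        self.stat_bands = stat_bands  # subset of band_names to summarise
        self.nodata = nodata

    def __len__(self):
        return len(self.chips)

    def __getitem__(self, index):
        chip = self.chips[index]
        stats = {}
        for name in self.stat_bands:
            band = chip[self.band_names.index(name)]
            for key, value in band_statistics(band, self.nodata).items():
                stats[f"{key}_{name}"] = value  # e.g. "mean_elevation"
        return {"pixels": chip, "statistics": stats}


# Usage with one made-up 2-band chip; -9999 marks missing elevation pixels.
chip = np.stack([np.zeros((2, 2)), np.array([[100.0, 200.0], [300.0, -9999.0]])])
dataset = ChipsWithStatistics([chip], ["red", "elevation"], ["elevation"], nodata=-9999.0)
item = dataset[0]
```

Here `item["statistics"]` holds `mean_elevation`, `count_elevation`, `min_elevation` and `max_elevation`, computed over the three valid pixels.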

yellowcap commented 5 months ago

Good idea, but not sure if this is still relevant. We won't always have a way to assess cloud cover in this predict step. @srmsoumya is this code still operational?

srmsoumya commented 5 months ago

We don't have a predict_step in v1. We should add a script instead, one that takes, say, a tile as input and creates embeddings for a chip size defined by the user. This could be made scalable, and we could add AWS Batch scripts for it.
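
A minimal sketch of what such a script's core loop might look like, assuming the tile is already loaded as a NumPy array. `split_tile_into_chips` and `embed_chip` are hypothetical names, and `embed_chip` is only a stand-in for the model encoder so the example runs end to end.

```python
import numpy as np


def split_tile_into_chips(tile, chip_size):
    """Yield (row_offset, col_offset, chip) for every full chip in a tile.

    `tile` is shaped (bands, height, width). Partial chips at the right and
    bottom edges are skipped, which is one of several possible policies.
    """
    if chip_size <= 0:
        raise ValueError("chip_size must be positive")
    _, height, width = tile.shape
    for row in range(0, height - chip_size + 1, chip_size):
        for col in range(0, width - chip_size + 1, chip_size):
            yield row, col, tile[:, row:row + chip_size, col:col + chip_size]


def embed_chip(chip):
    """Placeholder for the model's encoder (an assumption, not Clay's API):
    it returns the per-band mean instead of a real embedding."""
    return chip.mean(axis=(1, 2))


# Made-up 2-band tile of 5 x 4 pixels, chipped at a user-defined size of 2.
tile = np.arange(2 * 5 * 4, dtype="float64").reshape(2, 5, 4)
rows = [
    {"row": row, "col": col, "embedding": embed_chip(chip)}
    for row, col, chip in split_tile_into_chips(tile, chip_size=2)
]
```

Because each chip is processed independently, the loop over chips (or over tiles) is the natural unit to fan out across batch jobs.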

yellowcap commented 5 months ago

OK, in that case closing here. Let's revisit when doing prediction scripts.