Output column in GeoParquet table with statistics (mean, count, min/max) of an input band

weiji14 commented 7 months ago

An idea that came up during our regular meetings: generalize the patch-level cloud-cover percentage info (i.e. extend #168) to other bands/channels, so that someone could filter on columns holding statistics (mean, count, min/max, percentage, etc.) derived from the input images. This would enable pre-filtering on attributes when performing similarity search.

Example:

| embedding | cloud_cover_percentage | mean_elevation | max_temperature | bbox |
|---|---|---|---|---|
| [0.1, 0.4, ... x768] | 20% | 100m | 25°C | POLYGON(...) |
| [0.2, 0.5, ... x768] | 30% | 300m | 20°C | POLYGON(...) |
| [0.3, 0.6, ... x768] | 40% | 500m | 15°C | POLYGON(...) |
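
To illustrate the kind of pre-filtering such columns would allow, here is a minimal sketch in plain Python. It assumes the rows have already been loaded from the table; the `id` key, the 3-element embeddings (standing in for the 768-dimensional ones) and the `filtered_search` helper are hypothetical and not part of the existing code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(rows, query, max_cloud_cover, top_k=2):
    """Drop rows above the cloud-cover threshold, then rank the rest by similarity."""
    candidates = [r for r in rows if r["cloud_cover_percentage"] <= max_cloud_cover]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["embedding"], query),
        reverse=True,
    )
    return ranked[:top_k]


# Toy rows mirroring the example table (values are illustrative only).
rows = [
    {"id": "a", "embedding": [0.1, 0.4, 0.9], "cloud_cover_percentage": 20.0},
    {"id": "b", "embedding": [0.2, 0.5, 0.1], "cloud_cover_percentage": 30.0},
    {"id": "c", "embedding": [0.3, 0.6, 0.2], "cloud_cover_percentage": 40.0},
]

# Row "c" matches the query exactly but is excluded by the cloud-cover filter.
results = filtered_search(rows, query=[0.3, 0.6, 0.2], max_cloud_cover=30.0)
print([r["id"] for r in results])
```

The same pattern would apply to any other statistics column (e.g. `mean_elevation`); only the filter predicate changes.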

This would involve generalizing the inference part of the code somehow, specifically the predict_step function here:

https://github.com/Clay-foundation/model/blob/0145e55bcf6bd3e9b19f5c07819a1398b6a22c35/src/model_clay.py#L855-L921

Some changes might also need to happen on the DataLoader side, so that these statistical measures are passed through. Parking this as an idea for now.
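
As a rough sketch of what the per-chip statistics could look like, assuming each chip arrives as a NumPy array of shape `(bands, height, width)`: the `band_statistics` helper, the band names and the `-9999` nodata value are all hypothetical, and none of this reflects how `predict_step` is currently written.

```python
import numpy as np


def band_statistics(chip, band_names, nodata=None):
    """Return {"<band>_<stat>": value} for each band of a (bands, H, W) array."""
    stats = {}
    for name, band in zip(band_names, chip):
        values = band.astype("float64").ravel()
        if nodata is not None:
            values = values[values != nodata]  # drop nodata pixels
        values = values[~np.isnan(values)]  # drop NaN pixels
        stats[f"{name}_count"] = int(values.size)
        if values.size == 0:
            # No valid pixels: leave the remaining statistics undefined.
            for stat in ("mean", "min", "max"):
                stats[f"{name}_{stat}"] = float("nan")
            continue
        stats[f"{name}_mean"] = float(values.mean())
        stats[f"{name}_min"] = float(values.min())
        stats[f"{name}_max"] = float(values.max())
    return stats


# Toy 2-band, 2x2 chip (values are illustrative only).
chip = np.array(
    [
        [[100.0, 200.0], [300.0, -9999.0]],  # elevation, one nodata pixel
        [[15.0, 20.0], [25.0, 20.0]],  # temperature
    ]
)
row = band_statistics(chip, ["elevation", "temperature"], nodata=-9999.0)
print(row)
```

Each returned dictionary could then be merged into the corresponding row of the output table, next to the embedding and the bbox geometry, before it is written out.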

yellowcap commented 5 months ago

Good idea, but not sure if this is still relevant. We won't always have a way to assess cloud cover in this predict step. @srmsoumya is this code still operational?

srmsoumya commented 5 months ago

We don't have a predict_step in v1. We should add a script instead, one that takes e.g. a tile as input and creates embeddings for a chip size defined by the user. This could be made scalable, and we could add AWS Batch scripts for it.
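
A minimal sketch of the tiling part of such a script, assuming the tile is a regular pixel grid. The `chip_windows` helper is hypothetical, and how edge chips are handled (kept partial here, but they could equally be padded or dropped) is an open design choice.

```python
def chip_windows(tile_height, tile_width, chip_size):
    """Yield (row_off, col_off, height, width) windows that cover a tile."""
    for row_off in range(0, tile_height, chip_size):
        for col_off in range(0, tile_width, chip_size):
            # Chips on the bottom/right edge may be smaller than chip_size.
            height = min(chip_size, tile_height - row_off)
            width = min(chip_size, tile_width - col_off)
            yield row_off, col_off, height, width


# A 512 x 640 pixel tile split into chips of (up to) 256 x 256 pixels.
windows = list(chip_windows(tile_height=512, tile_width=640, chip_size=256))
for window in windows:
    print(window)
```

Each window would then be read from the tile, passed through the model to get an embedding, and written out as one row; since the windows are independent, they could be distributed across batch jobs.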

yellowcap commented 5 months ago

OK, in that case closing here; let's revisit when doing prediction scripts.

Clay-foundation / model

Output column in GeoParquet table with statistics (mean, count, min/max) of an input band #185