Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

Patch embeddings are not meant for similarity search #223

Closed · brunosan closed this issue 2 months ago

brunosan commented 2 months ago

A key use case for Clay is to find similar stuff: give it a few examples of parking lots, and find more of those. Very quickly, the challenge becomes that the things we look for are much smaller than the image. E.g. the image size is 512x512 at Sentinel-2 resolution, so roughly 5km x 5km, and you might want to find dams, or airports, or aquaculture, which might be ~100m across. This is a twofold problem:

  1. You can only select whole images, so the code will have trouble understanding which of the many small things you actually wanted. The only solution here is to give it a few positive and negative examples to narrow down the intended semantics. We've been using that; it is like playing Guess Who, selecting attributes across samples (see the sketch after this list).
  2. Even if you do find the right stuff in other images, you don't know where within the image it is.
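A minimal sketch of that positive/negative filtering, assuming the image embeddings are already computed and L2-normalised; the array and function names (`embeddings`, `score_by_examples`) are illustrative, not part of the Clay code base:

```python
import numpy as np

def score_by_examples(embeddings, positive_idx, negative_idx):
    """Score every image embedding against a few labelled examples.

    embeddings:   (N, D) array of L2-normalised image embeddings
    positive_idx: indices of images that contain the target (e.g. parking lots)
    negative_idx: indices of images that do not

    Returns one score per image: high means "similar to the positives
    and dissimilar to the negatives".
    """
    pos = embeddings[positive_idx].mean(axis=0)
    neg = embeddings[negative_idx].mean(axis=0)
    # Cosine similarity reduces to a dot product on normalised vectors.
    return embeddings @ pos - embeddings @ neg

# Example: rank all images, best candidates first.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
scores = score_by_examples(embeddings, positive_idx=[3, 17, 42], negative_idx=[5, 99])
ranked = np.argsort(-scores)
```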

We've been moderately successful with patch-embedding similarity, but there is one underlying, fundamental issue: patch embeddings are literally designed to depend on their context. The whole point of self-attention is to capture not only the semantics of the patch itself, but how it relates to the patches around it. The same exact helipad image will have a different patch embedding depending on whether it sits on a ship, a hospital, or an airport.

Transformers force word embeddings to distinguish among senses given the context, and then we try to find the same word and struggle when the embeddings are different, as we forced them to be. The word "bank" is our patch: given "world bank", we struggle to find the "similar" case "river bank". In EO, it does not matter that our token (the patch) is actually an image that may carry a whole, isolated semantic (like a car); the model is still forced to distinguish the same car given its context.

It is only at the image level, not the patch level, that we get whole semantics.
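A toy demonstration of that context dependence, using a generic PyTorch transformer encoder standing in for any ViT-style backbone (not the Clay model itself); the layer sizes and the `same_patch` / `context_*` tensors are made up for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A generic self-attention encoder standing in for any ViT-style backbone.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
).eval()

# The "same exact helipad" patch token, dropped into two different contexts.
same_patch = torch.randn(1, 1, 64)
context_a = torch.randn(1, 15, 64)   # e.g. surrounded by ship-like patches
context_b = torch.randn(1, 15, 64)   # e.g. surrounded by airport-like patches

with torch.no_grad():
    out_a = encoder(torch.cat([same_patch, context_a], dim=1))[0, 0]
    out_b = encoder(torch.cat([same_patch, context_b], dim=1))[0, 0]

# The output embedding of the identical input patch differs across contexts,
# so its cosine similarity with itself-in-another-context is well below 1.
print(torch.cosine_similarity(out_a, out_b, dim=0))
```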

With v0, the image size was fixed, and large, hence we needed the patch level. For v1 we are working with several resolutions and several image sizes. This should enable us to generate embeddings for images much closer in size to the semantics we are looking for.

My questions:

  1. How do we merge all those patch embeddings into a single image embedding (analogous to how word embeddings are combined into a sentence embedding)? Is the average acceptable? (See the sketch after these questions.)
  2. Do we then need to generate embeddings at several image sizes so we can find stuff at different sizes? E.g. embeddings for objects at 10m, 100m, 1km, ...
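For question 1, the usual baselines are mean pooling over patch tokens or using a dedicated class token if the encoder has one. A sketch of mean pooling, plus a windowed variant that hints at question 2, assuming the patch embeddings come as an (H, W, D) grid; shapes and names here are illustrative only:

```python
import numpy as np

def image_embedding(patch_grid):
    """Mean-pool a (H, W, D) grid of patch embeddings into one (D,) vector."""
    pooled = patch_grid.reshape(-1, patch_grid.shape[-1]).mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def windowed_embeddings(patch_grid, window):
    """Pool non-overlapping window x window blocks of patches.

    Gives one embedding per block, i.e. embeddings at an intermediate
    scale between a single patch and the whole image.
    """
    h, w, d = patch_grid.shape
    out = []
    for i in range(0, h - window + 1, window):
        for j in range(0, w - window + 1, window):
            block = patch_grid[i:i + window, j:j + window].reshape(-1, d)
            v = block.mean(axis=0)
            out.append(v / np.linalg.norm(v))
    return np.stack(out)

# Example: a 32x32 grid of 768-d patch embeddings (e.g. 512px tile, 16px patches).
grid = np.random.default_rng(0).normal(size=(32, 32, 768))
whole = image_embedding(grid)          # one embedding for the full tile
mid = windowed_embeddings(grid, 8)     # 16 embeddings at ~1/4-tile scale
```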

@leothomas @MaceGrim @yellowcap @srmsoumya

Related: #222, #107