Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

Run v0.1 embeddings to store raw encoder output #127

Closed by yellowcap 3 months ago

yellowcap commented 6 months ago

To enable fast downstream applications, we could store the raw encoder output and not only the average embedding that we are already creating for similarity search.
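For illustration, a minimal numpy sketch of the difference between the two outputs, assuming a ViT-style encoder that emits one token per self-attention patch; the shapes and the 768-dim embedding size are assumptions, not the Clay codebase's actual values:

```python
import numpy as np

# Hypothetical shapes (not the Clay codebase's actual API): a 512x512 chip
# split into 32x32 self-attention patches gives (512 // 32) ** 2 = 256 patch
# tokens per band group; assume a 768-dim encoder embedding.
num_patch_tokens = 256
embed_dim = 768

# Raw encoder output: one embedding vector per patch token.
raw_encoder_output = np.random.randn(num_patch_tokens, embed_dim)

# What is stored today for similarity search: the average over all tokens.
average_embedding = raw_encoder_output.mean(axis=0)  # shape: (768,)

# Keeping raw_encoder_output preserves the per-patch (~300 m) detail that the
# average collapses, at the cost of num_patch_tokens times more storage.
print(raw_encoder_output.shape, average_embedding.shape)
```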

Refs

brunosan commented 6 months ago

Thanks. From the discussion, I also realize that we don't really have 1 embedding per self-attention patch; we have 1 per self-attention patch AND layer group. If that's the case, we should add it somewhere in the documentation.

I.e.

  1. Our big MGRS tile is split into patches of 512x512 pixels (at Sentinel resolution that is ~5 km x 5 km).
  2. Each patch is further split into "self-attention patches" of 32x32 pixels, or ~300 m x 300 m.
  3. For each self-attention patch, the 13 layers are grouped into 6 groups, and we create one embedding per group (token counts sketched below).
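Back-of-the-envelope token counts implied by this layout (pure arithmetic; the ~10 m Sentinel-2 pixel size is an assumption):

```python
# Token-count arithmetic for the layout above. Assumptions: ~10 m Sentinel-2
# pixels, 512x512 pixel patches, 32x32 self-attention patches, 13 bands
# grouped into 6 band groups.
pixel_size_m = 10
patch_px = 512
attn_patch_px = 32
band_groups = 6

patch_extent_km = patch_px * pixel_size_m / 1000           # ~5.1 km per side
attn_patch_extent_m = attn_patch_px * pixel_size_m         # ~320 m per side
attn_patches_per_patch = (patch_px // attn_patch_px) ** 2  # 16 * 16 = 256

# One embedding per self-attention patch AND per band group:
embeddings_per_patch = attn_patches_per_patch * band_groups  # 256 * 6 = 1536

print(patch_extent_km, attn_patch_extent_m, attn_patches_per_patch, embeddings_per_patch)
```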

Note: These groupings do NOT mean that the model has a parallel track for each group. When training, we calculate the self-attention (QKV) individually for each layer. Groups are more akin to sentences, i.e. groups of words. This means that the RGB group also has information about what SAR has, and vice versa.

@srmsoumya to confirm this. If this is the case, I don't understand the value of the grouping instead of making one embedding per self-attention patch across all bands. What's the value of grouping the embeddings this way? Would it not make sense to reduce the semantics across all bands into one?

TLDR for @MaceGrim: the semantic resolution at the self-attention patch, before the average, is ~300 m, but it is also split into dominant groups of bands.
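To make the question concrete, a minimal numpy sketch of the two layouts under hypothetical shapes (not the Clay codebase's actual API): one embedding per (self-attention patch, band group) pair versus averaging over the group axis to get a single embedding per patch across all bands.

```python
import numpy as np

# Hypothetical shapes for one 512x512 chip (assumptions, not Clay's API):
# 256 self-attention patches, 6 band groups, 768-dim embeddings.
n_patches, n_groups, embed_dim = 256, 6, 768
group_embeddings = np.random.randn(n_patches, n_groups, embed_dim)

# Current layout: one embedding per self-attention patch AND band group.
per_group = group_embeddings                # shape (256, 6, 768)

# The alternative asked about above: reduce across band groups so there is a
# single embedding per ~300 m self-attention patch, covering all bands.
per_patch = group_embeddings.mean(axis=1)   # shape (256, 768)

print(per_group.shape, per_patch.shape)
```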

yellowcap commented 6 months ago

We are already running on the previous version, which only stores average embeddings.

So creating the raw embeddings might be scheduled later in tandem with other model updates.

yellowcap commented 3 months ago

The option to output this has been implemented in https://github.com/Clay-foundation/model/pull/133. We have multiple people running patch embeddings for specific use cases, so we can close this high-level issue here.
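For reference, one sketch of the kind of downstream use such patch embeddings enable: a cosine-similarity search over stored embeddings. The array names, shapes, and random data are placeholders; how the embeddings are actually serialized is defined by the linked PR, not here.

```python
import numpy as np

# Placeholder data: N stored patch embeddings and one query embedding.
stored = np.random.randn(10_000, 768)
query = np.random.randn(768)

# Cosine similarity between the query patch and every stored patch.
stored_norm = stored / np.linalg.norm(stored, axis=1, keepdims=True)
query_norm = query / np.linalg.norm(query)
scores = stored_norm @ query_norm

# Indices of the five most similar patches.
top5 = np.argsort(scores)[::-1][:5]
print(top5, scores[top5])
```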