Thanks. From the discussion, I also realize that we don't really have one embedding per self-attention patch; we have one per self-attention patch AND layer group. If that's the case, we should add it somewhere in the documentation.
I.e. 512x512 patches (at Sentinel resolution that means ~5km x 5km) are split into 32x32 self-attention patches, or ~300m x 300m each.
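For concreteness, the footprint arithmetic as a quick sketch, assuming Sentinel-2's 10 m ground sampling distance (the numbers are just the ones quoted above):

```python
# Back-of-the-envelope ground footprints, assuming Sentinel-2's 10 m GSD.
gsd_m = 10     # metres per pixel (assumption: 10 m Sentinel-2 bands)
chip_px = 512  # input chip size in pixels
patch_px = 32  # self-attention patch size in pixels

chip_footprint_m = chip_px * gsd_m              # 5120 m ~ 5 km per side
patch_footprint_m = patch_px * gsd_m            # 320 m ~ 300 m per side
patches_per_chip = (chip_px // patch_px) ** 2   # 16 x 16 = 256 patches

print(chip_footprint_m, patch_footprint_m, patches_per_chip)
```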
Note: these groupings do NOT mean that the model has a parallel track for each group. When training, we calculate the self-attention (QKV) individually for each layer. Groups are more akin to sentences, i.e. groups of words. This means that the RGB group also has information about what SAR has, and vice versa.
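To make the "sentence" analogy concrete, here is a minimal PyTorch sketch of how I understand the token layout; all names and sizes are illustrative, not the actual model code. Every (patch, group) pair is one token, and a single self-attention runs over all tokens jointly, which is why information mixes across groups:

```python
import torch
import torch.nn as nn

B, n_patches, n_groups, dim = 2, 256, 3, 768  # illustrative sizes, not the real config

# One embedding per (self-attention patch, band group): tokens, not parallel tracks.
tokens = torch.randn(B, n_patches * n_groups, dim)

# A single attention layer sees ALL tokens at once, so the RGB tokens can
# attend to the SAR tokens and vice versa, like words in one sentence.
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
mixed, _ = attn(tokens, tokens, tokens)

print(mixed.shape)  # (B, n_patches * n_groups, dim) = torch.Size([2, 768, 768])
```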
@srmsoumya to confirm this. If this is the case, I don't understand the value of the grouping instead of making one embedding per self-attention patch across all bands. What's the value of grouping the embeddings this way? Would it not make sense to reduce the semantics across all bands into one?

TLDR for @MaceGrim: the semantic resolution at the self-attention patch level, before the average, is ~300m, but also split into dominant groups of bands.
We are already running on the previous version, which only stores average embeddings, so creating the raw embeddings might be scheduled later in tandem with other model updates.
The option to output this has been implemented in https://github.com/Clay-foundation/model/pull/133. We have multiple people running patch embeddings for specific use cases, so we can close this high-level issue here.
To enable fast downstream applications, we could store the raw encoder output and not only the average embedding that we are already creating for similarity search.
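For illustration, a minimal sketch of the difference (shapes and names hypothetical): the raw encoder output keeps one vector per (patch, group) token, while the average embedding we already store collapses that dimension:

```python
import numpy as np

# Hypothetical shapes: raw encoder output is one vector per (patch, group) token,
# e.g. 256 patches x 3 band groups = 768 tokens (illustrative, not the real config).
n_tokens, dim = 768, 768
raw_embeddings = np.random.rand(n_tokens, dim).astype("float32")

# What we currently store for similarity search: the mean over all tokens.
average_embedding = raw_embeddings.mean(axis=0)  # shape (dim,)

# Storing raw_embeddings as well would keep the ~300 m per-patch semantics
# available downstream, at n_tokens times the storage cost.
print(raw_embeddings.shape, average_embedding.shape)
```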