Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

Semantically aligned embeddings across instruments #314

Open brunosan opened 1 month ago

brunosan commented 1 month ago

We are currently using MAE, where we reconstruct the input image. We can specify any instrument, but we always train with the task of reconstructing the input image. This means that an embedding space exists for each instrument, but we have no mechanism to relate these embedding spaces to each other. The only anchors are that we give the location and time as input, and the resolution in the self-attention.
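For concreteness, here is a minimal sketch of that single-instrument setup; the module names, dimensions, and the crude masking are illustrative assumptions, not the actual Clay code:

```python
import torch
import torch.nn as nn

class ToyMAE(nn.Module):
    """Single-instrument MAE: the decoder always rebuilds the same image it was fed."""

    def __init__(self, patch_dim=768, embed_dim=512):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), num_layers=4)
        self.decoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.patch_embed = nn.Linear(patch_dim, embed_dim)  # per-instrument patchify (assumed)
        self.meta_embed = nn.Linear(4, embed_dim)           # lat, lon, time, gsd as a toy metadata vector
        self.to_pixels = nn.Linear(embed_dim, patch_dim)

    def forward(self, patches, metadata, mask):
        # patches: (B, N, patch_dim); metadata: (B, 4); mask: (B, N) bool, True = hidden patch
        tokens = self.patch_embed(patches) + self.meta_embed(metadata).unsqueeze(1)
        visible = tokens.masked_fill(mask.unsqueeze(-1), 0.0)   # crude masking, for illustration only
        embeddings = self.encoder(visible)                      # lives in a per-instrument embedding space
        recon = self.to_pixels(self.decoder(embeddings))        # always reconstructs the input instrument
        loss = ((recon - patches) ** 2)[mask].mean()            # loss on the masked patches
        return embeddings, loss
```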

This makes us think about coherent embeddings across instruments, where the same semantics land in the same region of the embedding space regardless of instrument. That is, the embeddings of e.g. "city streets" sit in the same place in the embedding space for Sentinel-1, Sentinel-2, NAIP, and so on. In cases of unresolved semantics (e.g. MODIS won't see city streets, but it will see large cities) I expect MODIS to generate an unresolved blob around the "city" cluster, within which Sentinel can resolve structure, NAIP more, and LINZ even more. If the semantic is completely invisible (e.g. underwater colors, visible in RGB but not to SAR, which cannot see below the water surface) I expect the embedding to point to the encompassing semantic (e.g. "water" in SAR). I do not know how it will behave when different sensors offer fundamentally different semantics due to the sensor itself (e.g. SAR slanting).

The main goal of this is to semantically align sensors, so one can go from one to the other to better assess local dynamics. E.g. we can track an agricultural field across RGB and SAR, but when it is cloudy I want to use SAR to get a sense of any possible semantic deviation from the RGB series. Or we could monitor a location semantically: detect a MODIS embedding shift from forest to grass, and then use a semantic database query to pull other locations in NAIP with the same values.
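As an illustration of that kind of query, a hypothetical nearest-neighbour search over an already-aligned embedding bank could look like this (arrays and names are made up):

```python
import numpy as np

def cosine_topk(query, bank, k=5):
    """Indices and scores of the k bank embeddings most similar to the query."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    scores = b @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# modis_after: embedding of a MODIS chip whose semantics shifted (e.g. forest -> grass)
# naip_bank:   (num_chips, dim) NAIP embeddings pulled from a vector store
modis_after = np.random.rand(512)
naip_bank = np.random.rand(10_000, 512)
idx, scores = cosine_topk(modis_after, naip_bank, k=5)
print("NAIP chips with the most similar semantics:", idx, scores)
```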

[attached diagram: Untitled]

In order to do this I think the "only" change is to decouple the encoder and decoder (changes would be confined to the blue box in the diagram), and replace the decoder task: instead of recreating the input image, reconstruct another instrument at the closest available time over the equivalent FoV. E.g. when doing Sentinel as input and MODIS as output, the MODIS output will be just a few pixels (and easier to MAE), and when going the reverse from MODIS to Sentinel, the Sentinel decoder will have to produce many more pixels (harder on MAE). I think we should only pair instruments whose resolutions are not too different (e.g. trying to create NAIP from MODIS would be too large a jump).
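A rough sketch of that decoupling, with illustrative names rather than the actual Clay modules: one shared encoder, and a decoder head selected by the target instrument.

```python
import torch
import torch.nn as nn

class CrossInstrumentMAE(nn.Module):
    """Shared encoder; the decoder is chosen by the *target* instrument."""

    def __init__(self, embed_dim=512, patch_dims=None):
        super().__init__()
        patch_dims = patch_dims or {"s2": 768, "modis": 48}  # illustrative patch sizes
        layer = lambda: nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), num_layers=4)
        self.patch_embeds = nn.ModuleDict(
            {name: nn.Linear(dim, embed_dim) for name, dim in patch_dims.items()}
        )
        # one lightweight decoder head per target instrument
        self.decoders = nn.ModuleDict(
            {name: nn.Sequential(nn.TransformerEncoder(layer(), num_layers=2),
                                 nn.Linear(embed_dim, dim))
             for name, dim in patch_dims.items()}
        )

    def forward(self, src_patches, src_name, tgt_patches, tgt_name):
        # src_patches: (B, N_src, patch_dims[src_name]); tgt_patches: (B, N_tgt, patch_dims[tgt_name])
        tokens = self.patch_embeds[src_name](src_patches)
        embeddings = self.encoder(tokens)            # shared, hopefully instrument-agnostic space
        recon = self.decoders[tgt_name](embeddings)  # decode toward the target instrument
        # Token counts differ across instruments; pool so the toy loss stays well defined.
        loss = ((recon.mean(dim=1) - tgt_patches.mean(dim=1)) ** 2).mean()
        return embeddings, loss
```

A real implementation would need target positional queries or cross-attention so the decoder can produce the target instrument's token grid; the pooled loss above is only to keep the toy example self-contained.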

Another benefit is that we would at least triple the training set, since we can train with the same-instrument input-output task, or with pairs of different instruments (in both directions). I would propose these pairs, in both directions:

Thoughts? cc @yellowcap @srmsoumya

srmsoumya commented 1 month ago

Interesting idea, I think we should be able to implement this. It would need a few modifications in the encoder to indicate what we want to reconstruct, and a control-flow decoder. A few things we might have to deal with:

yellowcap commented 3 weeks ago

To do this, the training data pipeline would have to change quite drastically, as we would have to pair the imagery to match the same location at the same time (and roughly the same resolution, I guess). This is doable, but a heavy lift.

Also, it would most likely reduce the number of samples, not increase them. Date matching is hard and filters out most of the imagery (5-day revisit intervals only overlap relatively rarely). Moreover, if we want to cross more than two instruments at a time, we have to match dates across 3-5 platforms, and the number of good matches goes down even more.
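For what it's worth, the filtering effect is easy to see even on a toy catalog (columns here are hypothetical; a real pipeline would pull them from a STAC search):

```python
import pandas as pd

# Hypothetical scene catalogs for two instruments over the same tiles.
s2 = pd.DataFrame({
    "tile": ["32UQD", "32UQD", "33TWN"],
    "s2_scene": ["S2A_0601", "S2A_0606", "S2B_0603"],
    "time": pd.to_datetime(["2023-06-01", "2023-06-06", "2023-06-03"]),
})
s1 = pd.DataFrame({
    "tile": ["32UQD", "33TWN"],
    "s1_scene": ["S1A_0602", "S1A_0620"],
    "time": pd.to_datetime(["2023-06-02", "2023-06-20"]),
})

pairs = pd.merge_asof(
    s2.sort_values("time"), s1.sort_values("time"),
    on="time", by="tile",
    tolerance=pd.Timedelta("2D"),   # acquisitions more than 2 days apart do not pair
    direction="nearest",
)
# Only 1 of the 3 Sentinel-2 scenes finds a Sentinel-1 match within the window.
print(pairs.dropna(subset=["s1_scene"]))
```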

Maybe we can create a smaller dataset of datacubes containing multiple instruments, and do additional training on only that kind of reconstruction.