Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

4x semantic resolution: Reduce tile size to 256 (16x semantic resolution with 128 tile size, and patch size to 16) #110

Closed by brunosan 5 months ago

brunosan commented 8 months ago

We currently use an image tile size of 512x512, which defines the area of a semantic embedding. That means a 262k:1 ratio of spatial to semantic area information.

We reduce 262k pixels (512x512) to a single semantic embedding.

It's unclear how many pixels are needed for the minimum semantic, but something like 64x64 pixels, or 4k pixels, seems more appropriate.
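As a quick sanity check on those ratios, a minimal sketch (the tile sizes are just the ones under discussion here, not values read from the model config):

```python
# Back-of-the-envelope: pixels that collapse into one semantic embedding
# for the tile sizes mentioned in this issue.
for tile_size in (512, 256, 128, 64):
    pixels = tile_size * tile_size
    print(f"tile {tile_size}x{tile_size}: {pixels:,} pixels -> 1 embedding")
# tile 512x512: 262,144 pixels -> 1 embedding
# tile 64x64:     4,096 pixels -> 1 embedding
```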

We cannot reduce the image tile too much, or the self-attention patches of the transformer will not have enough information. Nor will we have enough patches so that masking 25% still retains enough semantics for the reconstruction.

Clearly there is a minimum bound set by the self-attention patch size and the number of patches, and a maximum bound set by keeping the usable user-facing embedding as small as possible.

The actual answer is to sweep different patch sizes and tile sizes and compare performance, but I don't think we are there yet.

v0 has patch_size=32x32 and tile_size=512x512, so num_patches=16x16=256

For the moment I propose we retain the patch_size at 32x32 but decrease tile_size to 256 (num_patches=8x8). We will have 1/4 of the self-attention patches to train on, but each embedding will also cover 1/4 of the area (4x the semantic resolution).

Looking at the samples below, maybe even patch_size=16 and image_size=128.
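To make the trade-off concrete, a small sketch of the patch counts for the configurations mentioned in this thread (the (tile_size, patch_size) pairs are the ones proposed above, not values read from the code):

```python
# Patch counts per configuration discussed in this issue.
configs = {
    "v0":         (512, 32),
    "proposed":   (256, 32),
    "maybe even": (128, 16),
}
for name, (tile_size, patch_size) in configs.items():
    per_side = tile_size // patch_size
    print(f"{name}: tile {tile_size}, patch {patch_size} -> "
          f"{per_side}x{per_side} = {per_side * per_side} patches")
# v0: tile 512, patch 32 -> 16x16 = 256 patches
# proposed: tile 256, patch 32 -> 8x8 = 64 patches
# maybe even: tile 128, patch 16 -> 8x8 = 64 patches
```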

Downstream applications with coarser chip sizes or resolutions are still doable, e.g. by averaging embeddings, but we currently cannot provide semantics smaller than 512x512 pixels.
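For example, a coarser chip could be approximated by averaging the finer embeddings it contains; a minimal sketch (the embedding dimension and the 2x2 layout are assumptions for illustration, not the repo's actual API):

```python
import numpy as np

# Hypothetical: emulate one 512x512 semantic unit by averaging the
# embeddings of a 2x2 block of 256x256 tiles.
embed_dim = 768                                  # assumed embedding dimension
tile_grid = np.random.rand(2, 2, embed_dim)      # embeddings of four 256x256 tiles
coarse_embedding = tile_grid.mean(axis=(0, 1))   # one 512x512-equivalent embedding
print(coarse_embedding.shape)                    # (768,)
```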

As it currently is, a Sentinel image at 10 m resolution yields one embedding per ~5 km x 5 km area. That seems excessively coarse.
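The same arithmetic for the ground footprint of one embedding at Sentinel-2's 10 m pixel size, across the tile sizes discussed above:

```python
# Ground footprint per embedding at a 10 m pixel size.
pixel_size_m = 10
for tile_size in (512, 256, 128):
    side_km = tile_size * pixel_size_m / 1000
    print(f"tile {tile_size}px -> {side_km:.2f} km x {side_km:.2f} km per embedding")
# tile 512px -> 5.12 km x 5.12 km per embedding
# tile 256px -> 2.56 km x 2.56 km per embedding
# tile 128px -> 1.28 km x 1.28 km per embedding
```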


Rough areas: red 512x512 is the current semantic/image unit, green is the proposed 256x256, blue is the 32x32 patch unit.


Closeup of the patch_size unit of ~32 pixels to show the self-attention unit of semantics. We might even want to go with a patch size of 16.

yellowcap commented 8 months ago

I like the thought behind this, but I have a suggestion on how to approach it slightly differently:

brunosan commented 8 months ago

Let's go with 256x256 for the image. We still need to figure out the right patch size; the current 32x32 might already be right.

I agree that giving it a different resolution is a great way to deal with smaller details. My concern is that at any resolution you still end up with a tile covering 512x512 times that image resolution, i.e. Sentinel at 10 m yields a single embedding for ~5 km x 5 km right now. It is also not clear to me to what degree the semantic embedding conveys location within the image. E.g. we can cluster images with houses, but can we say what part of the image they are in?

This is another reason I wanted to push for a single embedding per image in the Unet architecture, instead of one embedding per patch. Forcing the Unet to reduce the many patch embeddings into a single image embedding will force it to encode location within the image, and prevent us from falling back on the rough approximation of averaging. (Tracked here: #107)
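To illustrate the difference, a sketch only; neither the mean pooling nor the learned reduction below is the actual implementation (that work is tracked in #107), and the names and shapes are assumptions:

```python
import torch
import torch.nn as nn

# Two ways to collapse per-patch embeddings (B, N, D) into one embedding per image (B, D).
patch_embeddings = torch.randn(4, 64, 768)   # batch of 4, 8x8 patches, dim 768 (assumed)

# Rough approximation discussed above: averaging throws away patch position.
mean_embedding = patch_embeddings.mean(dim=1)             # (4, 768)

# A learned reduction sees the patches in a fixed spatial order, so it can,
# in principle, encode where things are within the image.
learned_pool = nn.Sequential(nn.Flatten(start_dim=1), nn.Linear(64 * 768, 768))
learned_embedding = learned_pool(patch_embeddings)        # (4, 768)
```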

brunosan commented 5 months ago

I believe this is being done. @yellowcap to confirm and close.