I like the thought behind this, but I have a suggestion on how to approach it slightly differently:
Let's go with `256x256` for the image. We still need to figure out the right patch size, which right now is `32x32` and might be right.
I agree that giving it a different resolution is a great way to deal with smaller details. My concern is that, at any resolution, a tile covers 512x512 times that image resolution on the ground. I.e. Sentinel at 10 m yields a single embedding for ~5 km right now. It is not clear to me to what degree the semantic embedding conveys location within the image. E.g. we can cluster images with houses, but can we say what part of the image they are in?
This is another reason I wanted to push for a single embedding per image with the UNet architecture, instead of one embedding per patch. Forcing the UNet to reduce the many patch embeddings into a single image embedding will force the encoding of location within the image, and prevent us from making the rough approximation of taking an average. (Tracked here #107 )
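To illustrate the difference, here is a minimal sketch (hypothetical names and dimensions, not the actual Clay or #107 code) contrasting mean pooling, which discards patch positions, with a learned convolutional reduction that sees the patch grid and can therefore encode location:

```python
import torch
import torch.nn as nn

# Patch embeddings from the encoder: [batch, num_patches, dim],
# e.g. 256 patches of dim 768 for a 512x512 tile with 32x32 patches.
patches = torch.randn(4, 256, 768)

# Rough approximation: mean pooling. It is permutation-invariant,
# so any notion of *where* a feature sits inside the tile is lost.
mean_embedding = patches.mean(dim=1)  # [batch, dim]

# Learned reduction (illustrative stand-in for a UNet-style head):
# lay the patches back onto their 16x16 grid and convolve down to a
# single vector, so the network must account for patch positions.
grid = patches.transpose(1, 2).reshape(4, 768, 16, 16)  # [B, dim, H, W]
reduce = nn.Sequential(
    nn.Conv2d(768, 768, kernel_size=4, stride=4),  # 16x16 -> 4x4
    nn.GELU(),
    nn.Conv2d(768, 768, kernel_size=4),            # 4x4 -> 1x1
)
single_embedding = reduce(grid).flatten(1)  # [batch, dim]
```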
I believe this is being done. @yellowcap to confirm and close.
We currently use an image tile size of 512x512, which defines the area of a semantic embedding. That means a 262k:1 ratio of spatial to semantic area information.
We reduce 262k pixels (`512x512` pixels) to a single semantic. It's unclear how many pixels are needed for the minimum semantic, but something like `64x64` pixels, or 4k pixels, seems more appropriate.
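To make that concrete, a quick back-of-the-envelope calculation of pixels per embedding at the tile sizes under discussion:

```python
# Pixels collapsed into a single embedding at different tile sizes.
for tile in (512, 256, 128, 64):
    print(f"{tile}x{tile} tile -> {tile * tile:,} pixels per embedding")
# 512x512 tile -> 262,144 pixels per embedding
# 256x256 tile -> 65,536 pixels per embedding
# 128x128 tile -> 16,384 pixels per embedding
# 64x64 tile -> 4,096 pixels per embedding
```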
We cannot reduce the image tile too much, or the self-attention patches of the transformer will not have enough information. Nor will we have enough patches so that masking 25% retains enough semantics for the reconstruction.
Clearly there is a minimum bound, given by the self-attention patch size and the number of patches, and a maximum bound, given by wanting the usable user-facing embedding to cover as small an area as possible.
The actual answer is to sweep different patch sizes and tile sizes and measure performance, but I don't think we are there yet.
v0 has `patch_size=32x32` and `tile_size=512x512`, so `num_patches=16x16=256`.
For the moment I propose we retain the `patch_size` as `32x32` but decrease `tile_size` to `256` (`num_patches=8x8`). We will have 1/4 of the self-attention patches to train on, but each semantic will also cover 1/4 of the area. Looking at the samples below, maybe even `patch_size=16` and `image_size=128`.
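For reference, a small sketch of the trade-off for the configurations mentioned above, including the patch count left over under the 25% masking mentioned earlier, and the ground coverage assuming Sentinel-2's 10 m pixels:

```python
# (tile_size, patch_size) candidates from the discussion above.
configs = [(512, 32), (256, 32), (128, 16)]
gsd_m = 10  # assumed ground sample distance in metres (Sentinel-2)

for tile, patch in configs:
    side = tile // patch
    num_patches = side * side
    visible = num_patches - int(num_patches * 0.25)  # left after masking 25%
    coverage_km = tile * gsd_m / 1000
    print(f"tile={tile}, patch={patch}: {side}x{side}={num_patches} patches, "
          f"{visible} visible, {coverage_km:.2f} km per embedding")
# tile=512, patch=32: 16x16=256 patches, 192 visible, 5.12 km per embedding
# tile=256, patch=32: 8x8=64 patches, 48 visible, 2.56 km per embedding
# tile=128, patch=16: 8x8=64 patches, 48 visible, 1.28 km per embedding
```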
Downstream applications with coarser chip sizes or resolutions are still doable, e.g. by averaging embeddings, but we currently cannot provide semantics smaller than 512x512 pixels.
As it currently stands, a Sentinel image at 10 m resolution provides semantics at ~5x5 km. That seems excessively coarse.
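As a sketch of the downstream averaging (hypothetical code, not an existing pipeline): with 256x256 tiles, a 512x512-equivalent semantic can still be approximated by averaging the four constituent tile embeddings; the reverse, semantics finer than the tile, is not possible.

```python
import numpy as np

# Embeddings for a 2x2 block of adjacent 256x256 tiles: [2, 2, dim].
tile_embeddings = np.random.randn(2, 2, 768)

# Approximate the 512x512-equivalent semantic by averaging.
coarse_embedding = tile_embeddings.mean(axis=(0, 1))  # [dim]
```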
Rough areas: Red 512x512 is current semantic/image unit. Green is proposed 256x256. Blue is patch unit 32x32.
Closeup of the `patch_size` unit of ~32 pixels to show the self-attention unit of semantics. We might even want to go with a patch size of `16`.