Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

Switch from per-band mean/std normalization of Sentinel-2 values to surface reflectance conversion instead #94


weiji14 commented 8 months ago

Using mean and standard deviation normalization is a common procedure in standard Computer Vision, and can be applied e.g. by using torchvision's Normalize function. But this normalization can lead to incorrect band ratios when applied to optical remote sensing images.

E.g. let's take the formula for Normalized Difference Vegetation Index (NDVI):

$$ \text{NDVI} = \frac{\text{NIR} - \text{Red}}{\text{NIR} + \text{Red}} $$

Say we have an un-normalized Sentinel-2 pixel with Band 8 (NIR) = 3327 and Band 4 (Red) = 426; the NDVI value would then be:

$$ \text{NDVI} = \frac{3327-426}{3327+426} = 0.77 $$

However, if we apply a per-band mean/std normalization scheme, the value becomes:

$$ \text{NIR} = \frac{3327-2238}{1414} = 0.77 $$

$$ \text{Red} = \frac{426-583}{981} = -0.16 $$

$$ \text{NDVI} = \frac{0.77-(-0.16)}{0.77+(-0.16)} = 1.52 $$

Clearly this is wrong, since NDVI is bounded between -1 and 1, and we've distorted the band-ratio signal! A model trained on these mean/std normalized pixel values would have a harder time capturing the semantics of band indices such as NDVI.

One possible solution: instead of applying a per-band normalization, we can convert the Sentinel-2 Digital Number (DN) values to surface reflectance by dividing by the band's dynamic range, scaling each value into the 0-1 range. Sentinel-2's MSI sensor is 12-bit, but the data is stored as 16-bit. People usually divide by 10000, but DN values can exceed that over very bright white areas (e.g. clouds or snow), so I'll use $2^{14} = 16384$ below:

$$ \text{NIR} = \frac{3327}{16384} = 0.20 $$

$$ \text{Red} = \frac{426}{16384} = 0.026 $$

$$ \text{NDVI} = \frac{0.20-0.026}{0.20+0.026} = 0.77 $$

which matches the actual NDVI value.
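
A minimal sketch of this conversion (the 0-1 clip and the `dn_to_reflectance` helper are my own additions for illustration, not code from this repo):

```python
import numpy as np

# Scale Sentinel-2 digital numbers to approximate surface reflectance by
# dividing by a fixed dynamic range, so per-band ratios like NDVI are preserved.
SCALE = 2**14  # 16384, instead of the conventional 10000, to handle very bright pixels

def dn_to_reflectance(dn: np.ndarray) -> np.ndarray:
    """Convert DN values to 0-1 reflectance, clipping any overshoot."""
    return np.clip(dn.astype(np.float32) / SCALE, 0.0, 1.0)

red, nir = dn_to_reflectance(np.array([426, 3327]))
ndvi = (nir - red) / (nir + red)
print(round(float(ndvi), 2))  # 0.77, same as with the raw DN values
```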

Side note: Using a single mean and standard deviation value for all Sentinel-2 bands won't preserve the band ratios either. E.g. if we use a mean value of 1351 and a standard deviation of 1071, and apply them to the NIR/Red bands:

$$ \text{NIR} = \frac{3327-1351}{1071} = 1.845 $$

$$ \text{Red} = \frac{426-1351}{1071} = -0.8637 $$

$$ \text{NDVI} = \frac{1.845-(-0.8637)}{1.845+(-0.8637)} = 2.76 $$

The 2.76 result is still not the correct NDVI value of 0.77.


weiji14 commented 8 months ago

Another detail: after chatting with @lillythomas, we'll also need to apply a bias correction for Sentinel-2 images taken after January 2022, due to changes in the BOA_ADD_OFFSET value (see https://sentinels.copernicus.eu/web/sentinel/-/copernicus-sentinel-2-major-products-upgrade-upcoming). This ensures that the band values of Sentinel-2 images from before and after 2022 come from the same distribution. @srmsoumya, we'll probably handle this as custom logic in the transform of the datamodule here: https://github.com/Clay-foundation/model/blob/ee74c91b99cbf9a7c304c0d37335c128d1ae7566/src/datamodule.py#L152
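
A hypothetical sketch of what that custom transform logic could look like, assuming the standard -1000 offset introduced with processing baseline 04.00 (the `harmonize_dn` helper name, cutover date handling, and clamping are my assumptions, not the repo's actual implementation):

```python
import datetime
import torch

# Processing baseline 04.00 (deployed 25 Jan 2022) added BOA_ADD_OFFSET = -1000
# to L2A products, shifting the DN distribution relative to older imagery.
BOA_ADD_OFFSET = -1000
CUTOVER = datetime.date(2022, 1, 25)

def harmonize_dn(dn: torch.Tensor, acquisition_date: datetime.date) -> torch.Tensor:
    """Shift post-Jan-2022 DN values back onto the pre-2022 distribution."""
    if acquisition_date >= CUTOVER:
        dn = (dn + BOA_ADD_OFFSET).clamp(min=0)
    return dn
```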

srmsoumya commented 8 months ago

Good article on normalizing EO imagery: https://medium.com/sentinel-hub/how-to-normalize-satellite-images-for-deep-learning-d5b668c885af

Another option to consider: use BatchNorm as a way to learn normalized weights for each band in the imagery.

weiji14 commented 8 months ago

> Another option to consider: use BatchNorm as a way to learn normalized weights for each band in the imagery.

No, we should not use BatchNorm for the Foundation Model layers, since it is doing mean/std normalization! In Super Resolution models such as ESRGAN (Wang et al., 2018) and EDSR (Lim et al., 2017), BatchNorm layers have been removed and replaced with residual skip connections. Quoting from Lim et al., 2017:

> We remove the batch normalization layers from our network as Nah et al. [19] presented in their image deblurring work. Since batch normalization layers normalize the features, they get rid of range flexibility from networks by normalizing the features, it is better to remove them. We experimentally show that this simple modification increases the performance substantially as detailed in Sec. 4.

> Furthermore, GPU memory usage is also sufficiently reduced since the batch normalization layers consume the same amount of memory as the preceding convolutional layers. Our baseline model without batch normalization layer saves approximately 40% of memory usage during training, compared to SRResNet. Consequently, we can build up a larger model that has better performance than conventional ResNet structure under limited computational resources.

The point is, though, that we shouldn't be doing any mean/std normalization on the inputs to the first layer of the model, so that the band ratios are preserved. If we want to apply normalization to subsequent layers, that's fine, and for more recent models (from 2020 onwards), it seems like LayerNorm is preferred over BatchNorm (e.g. see ConvNeXt and https://stats.stackexchange.com/questions/474440/why-do-transformers-use-layer-norm-instead-of-batch-norm).
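
To illustrate the idea, a minimal sketch (shapes, names, and sizes here are illustrative, not this repo's architecture): raw 0-1 reflectance feeds straight into the first projection layer, and LayerNorm is applied only inside the network:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT-style patch embedding that takes raw reflectance, un-normalized."""

    def __init__(self, in_bands: int = 13, dim: int = 768, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(in_bands, dim, kernel_size=patch, stride=patch)
        self.norm = nn.LayerNorm(dim)  # normalization happens after projection

    def forward(self, reflectance: torch.Tensor) -> torch.Tensor:
        x = self.proj(reflectance)        # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        return self.norm(x)               # input band ratios stay intact
```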

yellowcap commented 3 months ago

I think for v1 and beyond we can use both. The model should see L1 and L2 data, and different normalization patterns, so that it hopefully generalizes better. So I am closing this for now. @weiji14 feel free to keep this open or re-open later if we get back to working on this.