developmentseed / pearl-backend

PEARL (Planetary Computer Land Cover Mapping) Platform API and Infrastructure
MIT License
55 stars, 7 forks

Data Drift - Mosaics have different data distribution across different time intervals #47

Closed srmsoumya closed 1 year ago

srmsoumya commented 1 year ago

Problem

We have observed that the mosaics generated from the Planetary Computer (PC) have different data distributions across time intervals. Specifically, we computed the mean and standard deviation of the R, G, B, and NIR bands for a subset of tiles that have a good distribution of all the classes of interest to Reforestamos, and the means for 2022 drift toward higher values. (Screenshot from 2023-03-06 14-04-02)

It is unclear whether we can attribute this to seasonal or annual variance. In addition, some mosaics are comparatively darker.
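For reference, the per-band comparison described above can be sketched roughly as follows. This is a minimal illustration, not the script that produced the screenshot: the tile arrays are synthetic stand-ins, and the `band_stats` helper and the `(n_tiles, 4, H, W)` layout are assumptions.

```python
import numpy as np

BANDS = ("R", "G", "B", "NIR")


def band_stats(tiles):
    """Return {band: (mean, std)} for tiles shaped (n_tiles, 4, H, W)."""
    tiles = np.asarray(tiles, dtype=np.float64)
    means = tiles.mean(axis=(0, 2, 3))  # pool over tiles and pixels, per band
    stds = tiles.std(axis=(0, 2, 3))
    return {b: (float(m), float(s)) for b, m, s in zip(BANDS, means, stds)}


# Synthetic stand-ins for the same tile subset in two periods (assumed values).
rng = np.random.default_rng(0)
tiles_2020 = rng.normal(1200, 300, size=(8, 4, 64, 64)).clip(0, 10000)
tiles_2022 = rng.normal(1500, 300, size=(8, 4, 64, 64)).clip(0, 10000)

stats_2020 = band_stats(tiles_2020)
stats_2022 = band_stats(tiles_2022)

for band in BANDS:
    shift = stats_2022[band][0] - stats_2020[band][0]
    print(f"{band}: mean shift 2020 -> 2022 = {shift:+.1f}")
```

Running the same function on tiles from each quarter would show whether the shift follows a seasonal pattern or is specific to one year.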

Challenges

This presents several challenges for our work. First, we have created training data labels for a single quarter (December to March 2022). If we train a model on this sample subset, it may not generalize well to different time periods if the mosaics have different data distribution. Second, our Active Learning (AL) loop depends on seed data that is again created from the training dataset. Since it will have a similar data distribution as the training dataset, the AL loop might not work as well.

Proposed Solution

To address this issue, we propose two possible solutions:

Color space augmentations: We can modify the brightness, contrast, and saturation values of the images to create augmented images whose mean/std values cover the observed shift. Training on these augmented images should teach the model to rely on the underlying features rather than on the contrast or brightness of the image. As an example, we applied random contrast to two tiles from 2022 and 2020, and we observe that the data distribution shifts accordingly (although the 2020 augmented tiles are still relatively darker).

2022 random contrast: (Screenshot from 2023-03-06 14-07-44)
2020 random contrast: (Screenshot from 2023-03-06 14-08-02)
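A minimal sketch of what a brightness/contrast augmentation could look like for 4-band tiles. The function name, the factor ranges, and the `(bands, H, W)` layout are illustrative assumptions, not the augmentation used for the screenshots above.

```python
import numpy as np


def random_brightness_contrast(image, rng, contrast=(0.7, 1.3), brightness=(0.8, 1.2)):
    """Randomly rescale contrast and brightness of a (bands, H, W) tile.

    Contrast scales each pixel's deviation from its band mean;
    brightness then applies a single multiplicative gain.
    """
    image = np.asarray(image, dtype=np.float32)
    c = rng.uniform(*contrast)    # assumed range, tune to the observed drift
    b = rng.uniform(*brightness)  # assumed range, tune to the observed drift
    mean = image.mean(axis=(1, 2), keepdims=True)
    out = ((image - mean) * c + mean) * b
    return np.clip(out, 0.0, None).astype(np.float32)  # keep values non-negative


rng = np.random.default_rng(0)
tile = rng.uniform(500, 3000, size=(4, 32, 32)).astype(np.float32)  # synthetic tile
augmented = random_brightness_contrast(tile, rng)
print(tile.mean(axis=(1, 2)), augmented.mean(axis=(1, 2)))
```

The same factors are applied to all four bands here so that band ratios are preserved; per-band factors would be a different design choice.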

Train on images from different time periods: Instead of training the model on a single time period, we can train it on multiple time periods. Since the labels are generated for a single quarter, pairing them with imagery from other periods would be a form of weak supervision.

I would like to get feedback on these proposed solutions and any other potential solutions that can help address this issue.

cc @geohacker @ingalls @Rub21 @vincentsarago

vincentsarago commented 1 year ago

@srmsoumya do you know the color formula used for the tile creation?

srmsoumya commented 1 year ago

@vincentsarago we are not using any color correction while creating the mosaics, here is the script for reference: https://github.com/developmentseed/reforestamos_data/blob/f281ef3e04ad401eec96b85d7aa2cae47b0c7608/reforestamos_data/download_tiles.py#L55

geohacker commented 1 year ago

@srmsoumya I suspect the inference applies the color formula that's stored in the database before the tiles are sent for prediction https://github.com/developmentseed/pearl-backend/pull/38/files#diff-e12f27f89936087188bc2501422778c013c05946ce9c938e078dd6c12820ce36R466-R469 — we should confirm this with @ingalls. If that's the case, I can see how some of the shifts are affected by it.

The color formula in the database is "color_formula": "Gamma+RGB+3.2+Saturation+0.8+Sigmoidal+RGB+25+0.35"

geohacker commented 1 year ago

@srmsoumya Thank you for outlining this. Your approach sounds good to me. I feel like with these things we can't quite be sure until we try? What would you say is the small experiment to test this hypothesis? I think this should improve overall in the right direction but given time constraints, it would be good to understand our approach.

vincentsarago commented 1 year ago

@srmsoumya looking at the script, you are doing a linear rescale (rescale=0,10000), but this does not guarantee a constant histogram because the min/max pixel values will vary for each (Sentinel) observation.
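To illustrate the point, here is a rough sketch of a fixed linear rescale. It assumes the `rescale=0,10000` parameter maps that input range linearly onto 0-255 and clips values outside it; the `linear_rescale` helper and the two synthetic scenes are hypothetical.

```python
import numpy as np


def linear_rescale(arr, in_range=(0, 10000), out_range=(0, 255)):
    """Map in_range linearly onto out_range, clipping values outside in_range."""
    lo, hi = in_range
    out_lo, out_hi = out_range
    scaled = (np.clip(arr, lo, hi) - lo) / (hi - lo)
    return (scaled * (out_hi - out_lo) + out_lo).astype(np.uint8)


# Two synthetic observations of the same area with different overall brightness.
rng = np.random.default_rng(0)
dark_scene = rng.normal(800, 150, size=(64, 64)).clip(0, None)
bright_scene = rng.normal(1600, 150, size=(64, 64)).clip(0, None)

dark_u8 = linear_rescale(dark_scene)
bright_u8 = linear_rescale(bright_scene)

# The mapping is the same fixed function for every scene, so the brightness
# difference between observations is still there after rescaling.
print(dark_u8.mean(), bright_u8.mean())
```

Because the mapping is identical for every observation, it compresses the values (roughly 39 input levels per output level under these assumptions) without aligning the histograms of different scenes.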

vincentsarago commented 1 year ago

note: you can't use color_formula for datasets that have more than 3 bands.

vincentsarago commented 1 year ago

What would you say is the small experiment to test this hypothesis? I think this should improve overall in the right direction but given time constraints, it would be good to understand our approach.

If we do this, I would first recommend removing the linear rescale.

srmsoumya commented 1 year ago

@geohacker yes, most of the things are experimental & we can't be sure until we have the model deployed.

The good thing is that none of these will take a lot of time; I can probably try both today/tomorrow. We can then try to deploy & see how the AL loop performs.

@vincentsarago thanks so much - this is super helpful to know. We are not using color_formula as we expect the model to look at raw data & learn from that. Apart from removing the linear scale i.e rescale(0, 10000) is there anything else we should change while creating the mosaics?

geohacker commented 1 year ago

@srmsoumya sounds good. Yeah we can deploy models as they are ready and see results quickly.

We are not using color_formula as we expect the model to look at raw data & learn from that.

Let's confirm this is also not the case for inference.

vincentsarago commented 1 year ago

We are not using color_formula as we expect the model to look at raw data & learn from that. Apart from removing the linear scale i.e rescale(0, 10000) is there anything else we should change while creating the mosaics?

👍 , so it's fair to expect your model to work with uint16 data?

Rub21 commented 1 year ago

We are not using color_formula as we expect the model to look at raw data & learn from that. Apart from removing the linear scale i.e rescale(0, 10000) is there anything else we should change while creating the mosaics?

@srmsoumya, let me remove rescale(0, 10000) and regenerate all tiles for the different dates.

Rub21 commented 1 year ago

@srmsoumya here are the images without rescaling and with no color formula, as we discussed: s3://ds-data-projects/reforestamos/reforestamos_sentinel/no_rescale_color_formula/images_dt_4B/

srmsoumya commented 1 year ago

@vincentsarago yes, I will have to do pre-processing at my end before feeding in the images to the model.

Thanks @Rub21 !
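A rough sketch of what that pre-processing step might look like for uint16 tiles. The `to_model_input` helper, the 10000 scale factor, and the per-band standardization are assumptions for illustration; the actual pipeline may differ.

```python
import numpy as np


def to_model_input(tile_uint16, band_mean, band_std, scale=10000.0):
    """Convert a (bands, H, W) uint16 tile to standardized float32.

    `scale` is an assumed reflectance scale factor; `band_mean` and `band_std`
    are assumed to be per-band statistics computed on the scaled training data.
    """
    x = tile_uint16.astype(np.float32) / scale
    mean = np.asarray(band_mean, dtype=np.float32)[:, None, None]
    std = np.asarray(band_std, dtype=np.float32)[:, None, None]
    return (x - mean) / std


# Synthetic 4-band tile; statistics are taken from the tile itself for the demo.
rng = np.random.default_rng(0)
tile = rng.integers(0, 4000, size=(4, 32, 32)).astype(np.uint16)
reflectance = tile.astype(np.float32) / 10000.0
band_mean = reflectance.mean(axis=(1, 2))
band_std = reflectance.std(axis=(1, 2))

x = to_model_input(tile, band_mean, band_std)
print(x.dtype, x.shape)
```

In practice the statistics would come from the training set rather than from each tile, so that any drift in a new mosaic stays visible to the model instead of being normalized away per tile.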

srmsoumya commented 1 year ago

Adding contrast & clouds as data augmentation to the input pipeline handles this issue to an extent.

We can look at generating mosaics manually with histogram matching to better handle this problem at our end - will explore this option in Phase 2.
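For context, histogram matching can be sketched with plain NumPy as below. This is a generic per-band CDF-matching illustration on synthetic data, not the Phase 2 implementation; the function names and the choice of reference tile are assumptions.

```python
import numpy as np


def match_band(source, reference):
    """Remap `source` values so their empirical CDF follows that of `reference`."""
    src_values, src_inverse, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True
    )
    ref_values, ref_counts = np.unique(reference.ravel(), return_counts=True)
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size
    # For each source quantile, look up the reference value at the same quantile.
    mapped = np.interp(src_cdf, ref_cdf, ref_values)
    return mapped[src_inverse.ravel()].reshape(source.shape)


def match_histograms(source, reference):
    """Apply match_band independently to each band of (bands, H, W) arrays."""
    return np.stack([match_band(s, r) for s, r in zip(source, reference)])


# Synthetic example: a darker tile matched to a brighter reference tile.
rng = np.random.default_rng(0)
reference = rng.normal(1500, 300, size=(4, 64, 64)).clip(0, 10000).astype(np.uint16)
dark = rng.normal(900, 200, size=(4, 64, 64)).clip(0, 10000).astype(np.uint16)

matched = match_histograms(dark, reference)  # float64 output
print(dark.mean(), matched.mean(), reference.mean())
```

Which mosaic serves as the reference is a design decision; matching everything to the quarter the labels were created for would be one option.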