developmentseed / pearl-backend

PEARL (Planetary Computer Land Cover Mapping) Platform API and Infrastructure

Train a starter model for Sentinel in Mexico #34

Open geohacker opened 1 year ago

geohacker commented 1 year ago

For our Sentinel release, we'll create a starter model based on priority AOIs for Reforestamos.

srmsoumya commented 1 year ago

Model

Training Strategy

We are labeling the segmentation masks from scratch, and given the complexity of differentiating between the classes of interest to RM, it is taking quite some time to generate the chips.

In the allocated budget of ~125 hours, we can generate approximately 1,000 chips of size 256x256 as ground truth for our model. This is not sufficient to build a decent segmentation model for all eight corridors.

As a workaround, we are trying a weakly supervised training approach.

Once we have a model pre-trained with weakly supervised labels, we can then fine-tune it on chips generated by our data team, i.e. more precise & designed for Sentinel imagery.

Data Distribution

0: "other",
1: "Bosque",
2: "Selvas",
3: "Pastos",
4: "Agricultura",
5: "Urbano",
6: "Sin vegetación aparente",
7: "Agua",
8: "Matorral",
9: "Suelo desnudo",
10: "Plantaciones",
11: "Otras coberturas",
12: "Vegetación caducifolia",

(Figure: per-corridor pixel counts for each LULC class)

The numbers in the diagram represent the number of pixels for each LULC class in that particular corridor. As the figure shows, there is severe class imbalance across all the corridors, with Bosque, Selvas & Agricultura dominating in most cases.

This class imbalance is one of the main things to keep in mind while training.

Initial PEARL Model for Reforestamos

PEARL models for NAIP imagery were built on top of PyTorch & used segmentation architectures like UNet, FCN & DeepLab.

I am building the baseline model using PyTorch & PyTorch Lightning, which takes care of both the science & engineering sides of things. We have to write less boilerplate code, and things like storing model checkpoints, logging loss curves, metrics, etc. come for free. We can also easily scale the model to run on single/multiple CPUs/GPUs/TPUs without any additional effort.

Update as of 30 Jan 2023

We have a segmentation model that is trained on a single corridor with weakly supervised labels coming from the RM team.

- Architecture: UNet
- Backbone: EfficientNet-B0, pre-trained on ImageNet
- Epochs: 10
- Dataset: 1,700 chips for training & ~400 chips for testing (with LULC labels from RM)
- Loss: Dice loss, 0.47
- Score: Jaccard index, 0.6
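For context, here is a minimal sketch of what this setup could look like with PyTorch Lightning and the segmentation-models-pytorch package. This is not the actual training code: the class count, learning rate, and 3-band input are assumptions.

```python
import pytorch_lightning as pl
import segmentation_models_pytorch as smp
import torch

class LULCSegmenter(pl.LightningModule):
    """Sketch of the baseline described above: UNet with an EfficientNet-B0
    encoder (ImageNet weights), trained with Dice loss."""

    def __init__(self, num_classes: int = 13, lr: float = 1e-3):
        super().__init__()
        self.model = smp.Unet(
            encoder_name="efficientnet-b0",
            encoder_weights="imagenet",
            in_channels=3,  # assumption: 3-band (RGB) chips
            classes=num_classes,
        )
        self.loss = smp.losses.DiceLoss(mode="multiclass")
        self.lr = lr

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        images, masks = batch
        loss = self.loss(self(images), masks)
        self.log("train_dice_loss", loss)  # checkpoints & logging come free with Lightning
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```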

Here are some sample results:

(Figure: for each sample, the color-corrected image, ground-truth mask, predicted mask, and image overlaid with the mask)

srmsoumya commented 1 year ago

Model Update - 13 Mar 2023

We have a baseline model, a DeepLabv3+ with a timm-efficientnet-b5 backbone, which has a weighted F1 score of 0.78 and is currently deployed as Mexico LULC pre alpha in the PEARL backend. This model also handles the issues mentioned in #47 by using color-based augmentations.
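For illustration, a minimal sketch of what such color-based augmentations might look like, assuming the albumentations library (the issue doesn't name the library used) and illustrative parameter values:

```python
import albumentations as A

# Hypothetical color-based augmentation pipeline; photometric transforms
# perturb the imagery while leaving the segmentation mask untouched.
color_augs = A.Compose(
    [
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
        A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=20, val_shift_limit=10, p=0.5),
        A.RGBShift(r_shift_limit=10, g_shift_limit=10, b_shift_limit=10, p=0.3),
    ]
)

# Usage on a chip and its mask:
# out = color_augs(image=chip, mask=mask)
# chip_aug, mask_aug = out["image"], out["mask"]
```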

Issues with the current baseline model

  1. Clouds are creating confusion for the model. This is understandable: we filtered for mosaics with no cloud cover & trained our model on those. This can be fixed by:

    • Using a different search id to extract mosaic tiles with less restrictive cloud cover
    • Using a custom augmentation that adds clouds, fog & snowflakes to the image (see the sketch after this list)
  2. Edge effects are getting introduced because the model is looking at a very small patch of the imagery. Our ground truths are representations of what an area looks like & not an exact pixel match for the classes; the model learns from the surrounding pixels & infers the results. When we constrain that to just 256x256 tiles, it sometimes doesn't have enough information & thus creates the edge effects; look at the red pixels at the bottom left of the model prediction mask.
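On the augmentation idea above, a sketch of a weather-style stage, again assuming albumentations; its built-in RandomFog/RandomSnow only approximate clouds, which would need a custom transform:

```python
import albumentations as A

# Weather-style augmentations to make the model more robust to cloudy scenes.
# These are photometric only, so segmentation masks pass through unchanged.
weather_augs = A.Compose(
    [
        A.RandomFog(p=0.3),   # hazy / thin-cloud appearance
        A.RandomSnow(p=0.3),  # snow flecks over the chip
    ]
)

# augmented = weather_augs(image=chip, mask=mask)
```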

A few ways to handle this:

  1. Model retraining takes a few iterations. I tried it for a few classes & this works fine; I just had to iterate twice for the model to learn. This is mainly because we have 16,000 embeddings per model & are adding just 100 new pixels for new or modified classes, so we can either pass more pixels or reduce the seed data size (a sketch of the trade-off is below).
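To make the imbalance concrete, a toy numpy sketch of the two knobs mentioned above; the array shapes and sample counts are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
seed = rng.random((16_000, 64))  # seed embeddings (64-d is a made-up size)
new = rng.random((100, 64))      # newly labeled pixels from retraining

# Option A: oversample the new pixels so they carry more weight.
new_upsampled = new[rng.integers(0, len(new), size=2_000)]

# Option B: subsample the seed data so it dominates less.
seed_subsampled = seed[rng.choice(len(seed), size=4_000, replace=False)]

# Either way the retraining set is far less lopsided than 16,000 : 100.
train_X = np.concatenate([seed_subsampled, new_upsampled])
```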

Next steps in order of priority

  1. Improve model retraining workflow

    • [ ] Increase the number of points used in the retraining workflow @ingalls
    • [ ] Reduce the number of embedding pixels inside the seed dataset (check which of the two works best)
  2. Infer on larger tiles (should be easy to implement)

    • [ ] Pass larger tiles to the model for inference. Try tiles of size 2048x2048, 1024x1024, 512x512, and find the sweet spot between speed & model accuracy to prevent edge effects (see the sketch after this list) @ingalls
    • [ ] SAHI approach for model inference (try later)
    • [ ] Model sharding using accelerate (try later)
  3. Retrain model to improve accuracy

    • [ ] Add artificial clouds, fog & snowflakes to the augmentation pipeline
    • [x] Add more ground-truth data @Rub21 Can we have the data team label more chips for Reforestamos?
    • [ ] Curate the dataset to cover different seasonality & cloud cover for model training
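On the larger-tiles item above, a sketch of overlapping sliding-window inference that crops the margins of each tile so edge pixels never reach the output; the tile and overlap sizes are assumptions to tune:

```python
import torch

def predict_tiled(model, image, tile=1024, overlap=128):
    """Predict (H, W) class labels for a (C, H, W) image in overlapping tiles.

    Each tile is predicted with `overlap` pixels of context on every side,
    and only the interior region is written to the output, which suppresses
    the edge effects seen with small, non-overlapping tiles.
    """
    _, H, W = image.shape
    out = torch.zeros(H, W, dtype=torch.long)
    step = tile - 2 * overlap  # interior stride
    for y in range(0, H, step):
        for x in range(0, W, step):
            y0, x0 = max(y - overlap, 0), max(x - overlap, 0)
            y1, x1 = min(y0 + tile, H), min(x0 + tile, W)
            with torch.no_grad():
                pred = model(image[:, y0:y1, x0:x1].unsqueeze(0)).argmax(dim=1)[0]
            iy0, ix0 = y - y0, x - x0  # interior offset inside the tile
            iy1 = min(iy0 + step, pred.shape[0])
            ix1 = min(ix0 + step, pred.shape[1])
            out[y:y + (iy1 - iy0), x:x + (ix1 - ix0)] = pred[iy0:iy1, ix0:ix1]
    return out
```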

@developmentseed/pearl

geohacker commented 1 year ago

@srmsoumya What are your thoughts about closing this ticket? I think we managed to achieve most of what you outlined as improvements. We can revise/reopen based on feedback.