Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

Benchmark v0.0 - Microsoft Flood and Clouds Segmentation #117

Status: Closed (lillythomas closed this 1 month ago)

lillythomas commented 6 months ago

We've completed our first benchmarking exercise for Clay v0 using the Microsoft Flood detection dataset (see https://github.com/Clay-foundation/model/issues/83 for details).

The exercise compared a model trained from scratch with a model finetuned from Clay. Both are designed to segment the input into binary maps of flood and background. The data module ingests datacubes like those used to train Clay, with the addition of water masks and cloud/cloud shadow masks; it applies the cloud/cloud shadow masking and adds the water masks (labels) to the batch dictionary.
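
For reference, the masking step amounts to something like the sketch below. The key names, shapes, and the exact masking strategy are illustrative assumptions, not the actual datamodule code.

```python
import torch

def build_batch_item(pixels, water_mask, cloud_mask):
    """Illustrative sketch of one sample as described above: mask out
    cloud/cloud-shadow pixels and carry the water mask as the label.
    Key names and shapes are assumptions, not the real datamodule.

    pixels:     (bands, H, W) float tensor from the datacube
    water_mask: (H, W) {0,1} tensor, 1 = water
    cloud_mask: (H, W) {0,1} tensor, 1 = cloud or cloud shadow
    """
    valid = (cloud_mask == 0)
    return {
        "pixels": pixels * valid,     # zero out cloud/cloud-shadow pixels
        "label": water_mask.float(),  # binary flood/background target
        "valid_mask": valid,          # optionally used to mask the loss
    }
```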

The from-scratch model is a simple encoder/decoder architecture with He initialization, dropout regularization, learning rate scheduling (starting at lr=1e-3), and early stopping. The loss function is binary cross-entropy.
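
A rough sketch of that setup is below. The layer widths, input band count, optimizer, and scheduler are illustrative assumptions; only the He initialization, dropout, binary cross-entropy loss, and starting lr=1e-3 come from the description above (early stopping is handled by the training loop).

```python
import torch
import torch.nn as nn

class SmallSegmenter(nn.Module):
    """Toy encoder/decoder of the kind described; sizes are illustrative."""

    def __init__(self, in_bands=13, dropout=0.2):  # in_bands is an assumption
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_bands, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Dropout2d(dropout),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 2, stride=2),  # single flood logit channel
        )
        # He (Kaiming) initialization, as described above
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = SmallSegmenter()
criterion = nn.BCEWithLogitsLoss()                         # binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer choice assumed
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
```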

The finetuned model uses Clay as the encoder with a simple multi-layer convolutional decoder. It also uses learning rate scheduling and early stopping, and minimizes a binary cross-entropy loss.
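
A minimal sketch of that finetuning setup is below, assuming a pretrained `clay_encoder` that returns patch embeddings. The encoder interface, embedding shape, patch grid, and decoder widths are assumptions; the frozen encoder matches the parameter summaries further down, where ~95.3 M encoder parameters are non-trainable.

```python
import torch.nn as nn

class ClayFinetuneSegmenter(nn.Module):
    """Frozen pretrained encoder + simple multi-layer conv decoder head.
    The encoder interface and shapes here are assumptions."""

    def __init__(self, clay_encoder, embed_dim=768, patch_grid=16):
        super().__init__()
        self.encoder = clay_encoder
        for p in self.encoder.parameters():   # keep the pretrained weights fixed
            p.requires_grad = False
        self.grid = patch_grid
        self.decoder = nn.Sequential(         # multi-layer convolutional decoder
            nn.Conv2d(embed_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="bilinear"),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="bilinear"),
            nn.Conv2d(64, 1, 1),              # binary flood logit
        )

    def forward(self, batch):
        tokens = self.encoder(batch)          # assumed (B, N, embed_dim), N = grid**2
        b, n, d = tokens.shape
        feats = tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        return self.decoder(feats)
```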

Results

The from-scratch model converged (based on validation loss) in 23 epochs, whereas the model finetuned from Clay converged in 19 epochs.

Loss curves

From scratch

Train loss

(training loss curve screenshot)

Val loss

(validation loss curve screenshot)

Finetuned

Train loss

(training loss curve screenshot)

Val loss

(validation loss curve screenshot)

Evaluation metric

Our evaluation entailed calculating the mean IoU on the validation dataset.
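
For reference, the metric is the IoU of the flood class averaged over validation samples, along the lines of the sketch below (the 0.5 threshold and per-image averaging are assumptions).

```python
import torch

def mean_iou(logits, labels, threshold=0.5, eps=1e-6):
    """Per-image IoU of the flood class, averaged over the batch.
    `logits` are raw model outputs of shape (B, 1, H, W); `labels` are
    binary masks of the same shape. The threshold is an assumption."""
    preds = (torch.sigmoid(logits) > threshold).float()
    inter = (preds * labels).sum(dim=(1, 2, 3))
    union = ((preds + labels) > 0).float().sum(dim=(1, 2, 3))
    return ((inter + eps) / (union + eps)).mean()
```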

From scratch

mIoU = 0.27

Finetuned

mIoU = 0.31

Visual comparison

Blue band on the left, water mask label in the middle, prediction on the right.
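
The panels can be reproduced with a simple side-by-side plot along these lines (the blue band index and grayscale rendering are assumptions):

```python
import matplotlib.pyplot as plt
import torch

def show_triplet(pixels, label, logits, blue_band=1):
    """Plot blue band | water mask label | prediction, matching the layout
    described above. The blue band index is an assumption."""
    pred = (torch.sigmoid(logits).squeeze() > 0.5).float()
    panels = [pixels[blue_band], label, pred]
    titles = ["Blue band", "Water mask label", "Prediction"]
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, img, title in zip(axes, panels, titles):
        ax.imshow(img.detach().cpu(), cmap="gray")
        ax.set_title(title)
        ax.axis("off")
    plt.show()
```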

From scratch

(five example screenshots)

Finetuned

(five example screenshots)

Notebook: https://github.com/Clay-foundation/model/blob/benchmark_segmentation_task/notebooks/flood_benchmark_segmentation.ipynb

lillythomas commented 6 months ago

I've done a few more experiments to stress-test these results. The main change was shuffling the samples before partitioning, using a fixed seed for consistency across model variants and for reproducibility. The metrics in the first comment did not shuffle the data before partitioning. These experiments demonstrate how much the results vary with different data distributions.
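
The shuffle-then-partition step is essentially the following (file layout and split fraction are illustrative):

```python
import glob
import random

def split_chips(chip_paths, seed, val_fraction=0.2):
    """Shuffle the sample list with a fixed seed before partitioning, so
    every model variant sees the same train/val split for a given seed."""
    rng = random.Random(seed)
    paths = sorted(chip_paths)            # deterministic starting order
    rng.shuffle(paths)
    n_val = int(len(paths) * val_fraction)
    return paths[n_val:], paths[:n_val]   # train, val

# hypothetical data layout
all_chips = glob.glob("data/datacubes/*.npz")
train_paths, val_paths = split_chips(all_chips, seed=42)
```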

Varying data distributions

| Experiment | Variant | Number of epochs | Mean IoU | Seed |
|---|---|---|---|---|
| 1 | From scratch | 46 | 0.57 | 42 |
| 1 | Finetuned | 23 | 0.42 | 42 |
| 2 | From scratch | 36 | 0.44 | 33 |
| 2 | Finetuned | 8 | 0.31 | 33 |
| 3 | From scratch | 30 | 0.38 | 21 |
| 3 | Finetuned | 11 | 0.33 | 21 |
| 4 | From scratch | 19 | 0.48 | 9 |
| 4 | Finetuned | 26 | 0.38 | 9 |

In the table above, "Finetuned" refers to finetuning with a multi-layer convolutional decoder.

Using the clay decoder

An experiment in which the multi-layer convolutional decoder was replaced with the Clay decoder.

| Experiment | Variant | Number of epochs | Mean IoU | Seed |
|---|---|---|---|---|
| 5 | From scratch | 36 | 0.44 | 33 |
| 5 | Finetuned (conv decoder) | 8 | 0.31 | 33 |
| 5 | Finetuned (clay decoder) | 9 | 0.33 | 33 |

Controlling the number of epochs

An experiment where the number of epochs was standardized instead of determined by an early stopping callback.
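
Concretely, the only difference from the early-stopping runs is the trainer configuration, roughly along these lines (the monitored key, patience, and max-epoch ceiling are assumptions, not the project's actual config):

```python
import lightning as L
from lightning.pytorch.callbacks import EarlyStopping

# Early-stopping runs (experiments 1-5): the number of epochs varies per run.
trainer_early_stop = L.Trainer(
    max_epochs=100,
    callbacks=[EarlyStopping(monitor="val_loss", mode="min", patience=5)],
)

# Epoch-controlled runs (experiments 6 onward): fix the training budget so
# every model variant trains for the same number of epochs.
trainer_fixed = L.Trainer(max_epochs=10)  # 30 for part ii below
```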

For a set number of 10 epochs, we achieved:

| Experiment | Variant | Number of epochs | Mean IoU | Seed |
|---|---|---|---|---|
| 6 | From scratch | 10 | 0.14 | 42 |
| 6 | Finetuned (conv decoder) | 10 | 0.51 | 42 |
| 6 | Finetuned (clay decoder) | 10 | 0.44 | 42 |

lillythomas commented 6 months ago

Controlling the number of epochs (part ii)

An experiment where the number of epochs was standardized instead of determined by an early stopping callback.

For a set number of 30 epochs, we achieved:

| Experiment | Variant | Number of epochs | Mean IoU | Seed |
|---|---|---|---|---|
| 7 | From scratch | 30 | 0.55 | 42 |
| 7 | Finetuned (conv decoder) | 30 | 0.52 | 42 |
| 7 | Finetuned (clay decoder) | 30 | 0.53 | 42 |

Loss curves

From scratch

(training and validation loss curve screenshots)

Finetuned (conv and clay decoders)

(training and validation loss curve screenshots)


Visual examples

From scratch

(five example screenshots)

Finetuned (conv decoder)

(five example screenshots)

Finetuned (clay decoder)

(five example screenshots)

lillythomas commented 6 months ago

Same part ii epoch-controlled experiment with two more seeds

| Experiment | Variant | Number of epochs | Mean IoU | Seed |
|---|---|---|---|---|
| 8 | From scratch | 30 | 0.41 | 21 |
| 8 | Finetuned (conv decoder) | 30 | 0.40 | 21 |
| 8 | Finetuned (clay decoder) | 30 | 0.44 | 21 |

Loss curves

From scratch

(training and validation loss curve screenshots)

Finetuned (conv and clay decoders)

(training and validation loss curve screenshots)

| Experiment | Variant | Number of epochs | Mean IoU | Seed |
|---|---|---|---|---|
| 9 | From scratch | 30 | 0.44 | 33 |
| 9 | Finetuned (conv decoder) | 30 | 0.42 | 33 |
| 9 | Finetuned (clay decoder) | 30 | 0.47 | 33 |

Loss curves

From scratch

(training and validation loss curve screenshots)

Finetuned (conv and clay decoders)

(training and validation loss curve screenshots)

Parameter counts per model

| Variant | Model class | Trainable params | Non-trainable params | Total params | Estimated size (MB) |
|---|---|---|---|---|---|
| From scratch | FloodDetector_fromscratch | 713 K | 0 | 713 K | 2.853 |
| Finetuned (conv decoder) | FloodDetector_clay_convdecoder | 3.0 M | 95.3 M | 98.3 M | 393.032 |
| Finetuned (clay decoder) | FloodDetector_clay_decoder | 32.4 M | 95.3 M | 127 M | 510.809 |

(The model summaries also list zero-parameter modules: a Dice metric for the from-scratch model and a BCEWithLogitsLoss criterion for the finetuned models.)
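
These counts come from the model summaries printed during training; the trainable/frozen split can be reproduced with a small helper like this:

```python
def param_counts(model):
    """Return (trainable, frozen) parameter counts for a torch.nn.Module,
    mirroring the trainable/non-trainable split in the summaries above."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen
```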

yellowcap commented 5 months ago

Thanks for the detailed summary and thorough work @lillythomas 🚀 This sets the example for fine-tuning, which is great!

We can iterate on this later if needed; feel free to close this if you agree @lillythomas.