SkyTruth / cerulean-ml

Repo for Training ML assets for Cerulean
Apache License 2.0

apply coco dataset creation to validation, train, and test set #17

Open rbavery opened 2 years ago

rbavery commented 2 years ago

Now that the dataset is almost finalized, we can start breaking it apart into train, validation, and test sets. I think the best way to do this is in the data-loading step, by defining a splitter function that works for both the icevision and fastai2 trainers. This would involve defining lists of scene ids for our train, validation, and test sets. From the Phase 2 Doc, these are the guidelines for how we should split them:

For the validation set, we suggest having, at a minimum, 5 whole-image samples of each class; for the test set, we suggest having at least 2 of each class.

More samples in the validation set will let us be more confident in the robustness of the metrics we use to make decisions that affect model performance, and more samples in the test set will give scientific users of Cerulean more confidence in the robustness of the results. We recommend that Jona have oversight over all annotations, with particular oversight over annotations in the validation and test sets. Ideally, we would split the validation and test sets into different folders based on 1) hard negatives (oil slick look-alikes that are especially difficult to annotate and detect) and 2) positive samples (background pixels in both of these categories will comprise the easy negative category). This will allow us to calculate metrics on each of these categories separately and gain greater insight into where the model has difficulty.

So I'm thinking the sets of scene ids we need to define as lists cover the following classes:

`{1: "Infrastructure", 2: "Natural Seep", 3: "Coincident Vessel", 4: "Recent Vessel", 5: "Old Vessel"}`

@jonaraphael, do you want to select the scenes we will use for evaluation yourself? I recall that we wanted to select scenes that were annotated with particular detail and attention. If not, just let me know and we can do this.
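The splitter idea above can be sketched as a function that partitions records by scene id. The scene ids and the `"scene_id"` record key here are hypothetical placeholders, not the repo's actual API; icevision/fastai2 splitters typically return index lists like this:

```python
# Hypothetical scene-id lists; the real ones would be curated per class.
VAL_SCENES = {"S1A_scene_001", "S1A_scene_002"}
TEST_SCENES = {"S1A_scene_003"}

def split_by_scene(records):
    """Partition records into (train, valid) index lists by scene id.

    Assumes each record is a mapping that carries its source scene id
    under "scene_id". Test scenes are held out of both lists entirely.
    """
    train, valid = [], []
    for i, rec in enumerate(records):
        scene_id = rec["scene_id"]
        if scene_id in VAL_SCENES:
            valid.append(i)
        elif scene_id in TEST_SCENES:
            continue  # reserved for the final test set
        else:
            train.append(i)
    return train, valid
```

Splitting on whole scenes (rather than individual chips) avoids leaking near-duplicate pixels from one scene across the train/validation boundary.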

rbavery commented 2 years ago

We'll instead select a random sample on a per-scene basis to ensure the required number of samples of each class is represented.
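A minimal sketch of that random, per-scene sampling, assuming we already have a mapping from class name to the list of scene ids containing it (the function name, mapping, and default counts of 10 validation / 3 test scenes per class are taken from the folder layout described in this thread, not from the repo):

```python
import random


def sample_split(scenes_by_class, n_val=10, n_test=3, seed=0):
    """Randomly assign whole scenes per class to validation/test; the rest train.

    scenes_by_class: dict mapping class name -> list of scene ids.
    Returns a dict with "train", "validation", and "test" scene-id lists.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    split = {"train": [], "validation": [], "test": []}
    for cls, scenes in scenes_by_class.items():
        scenes = list(scenes)
        rng.shuffle(scenes)
        split["validation"] += scenes[:n_val]
        split["test"] += scenes[n_val:n_val + n_test]
        split["train"] += scenes[n_val + n_test:]
    return split
```

Note that a scene containing multiple classes would need deduplication on top of this sketch; here each class's scene list is assumed disjoint.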

I think we can then sort these scenes into folders like so:

```
train/
  class_folder_1/
    N folders for scene samples
    ...
validation/
  class_folder_1/
    10 folders for scene samples
    ...
test/
  class_folder_1/
    3 folders for scene samples
    ...
```
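The folder layout above could be materialized with a small copy loop. Everything here is a sketch: the `split` and `class_of` mappings and the path arguments are hypothetical inputs, not names from the repo.

```python
import shutil
from pathlib import Path


def sort_into_partitions(split, class_of, src_root, dst_root):
    """Copy each scene folder into <partition>/<class_folder>/<scene_id>/.

    split: dict mapping partition name ("train"/"validation"/"test")
           to a list of scene ids.
    class_of: dict mapping scene id -> class folder name.
    """
    for partition, scene_ids in split.items():
        for sid in scene_ids:
            dst = Path(dst_root) / partition / class_of[sid] / sid
            # dirs_exist_ok lets the script be re-run without failing
            shutil.copytree(Path(src_root) / sid, dst, dirs_exist_ok=True)
```

Copying (rather than moving) keeps the original flat scene directory intact, so the partitioning can be redone with a different random seed if needed.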

lillythomas commented 2 years ago

Completed in PR #80. The partitioned data is visible in gs://ceruleanml/partitions/.