StephenChan closed this 6 months ago
Should be in a better working state now. Things left to do:

- `accepted_classes` param before doing the split (see TODO comment on 758ee86)

(I realize I forgot to make this a draft PR and am not sure how to fix that now. Oh well.)
Just added another to-do item:

- Add an option to ensure each feature vector is entirely in one set or another, instead of having its point-features separated between train, ref, and val; this addresses potential concerns about the sets being too similar to one another.
Prior to this PR, `preprocess_labels()` did have the property that each feature vector (containing a single image's point-features) was contained entirely within either the train set, the ref set, or the val set. The current state of this PR throws out that property so that sklearn's `train_test_split()` can be used for precise class-stratification. As a simple example, if somehow all of your Porites annotations are in a single image 123.jpg, then the only way to stratify the Porites annotations at all is to allow 123.jpg's annotations to be separated between train/ref/val.
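A toy illustration of that constraint (image names and annotation counts are made up for the example):

```python
from collections import Counter

# Hypothetical annotations: every Porites point lives in one image.
annotations = {
    '123.jpg': ['Porites'] * 10,
    '456.jpg': ['Acropora'] * 10,
    '789.jpg': ['Acropora'] * 10,
}

# An image-level split must assign 123.jpg wholly to one set, so the
# other sets end up with zero Porites annotations; the only way to
# stratify Porites here is to split 123.jpg's points across sets.
train_images = ['123.jpg', '456.jpg']
val_images = ['789.jpg']
val_counts = Counter(
    label for img in val_images for label in annotations[img])
assert val_counts['Porites'] == 0  # val never sees Porites
```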
However, I brought this up with David and he did say this could raise concerns about the sets being too similar to one another, and thus the accuracy results may poorly represent how the classifier would perform on entirely new images. In fact, when coralnet 1.0's feature extractor was being evaluated, the bar was set higher, with the train and validation sets using entirely different coralnet sources.
I suspect the degree of this concern would depend on the project though. If your points are densely distributed, such that you can easily identify 3 or so points which are very close in an image (and thus the cropped patches would have a lot of overlap), then splitting up images may matter more for set-similarity. If your points are sparsely distributed, and/or a lot of your photos were taken right next to each other anyway, then splitting up images may matter less.
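To make the "a lot of overlap" intuition concrete, here's a rough sketch of how much patch area two nearby points share; the square-crop geometry and the `crop_size` parameter are assumptions for illustration, not pyspacer's actual cropping code:

```python
def patch_overlap_fraction(point_a, point_b, crop_size):
    """Fraction of one square crop's area shared with another crop,
    for axis-aligned crops of side crop_size centered on each point."""
    dx = abs(point_a[0] - point_b[0])
    dy = abs(point_a[1] - point_b[1])
    overlap_area = max(0, crop_size - dx) * max(0, crop_size - dy)
    return overlap_area / (crop_size ** 2)

# Two points 20px apart with 100px crops share 80% of their patch area:
print(patch_overlap_fraction((50, 50), (70, 50), crop_size=100))  # 0.8
```

So with densely distributed points, putting one point's patch in train and a near-duplicate patch in val really does leak information between the sets.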
So it may be good to add at least a boolean option, giving callers the ability to decide if they're more concerned about splitting up images or about precise stratification.
Two main implementation concerns:

- It may not be possible to rely on `train_test_split()` for class-stratification anymore, if we're giving callers the option to not split up images/feature vectors. OK, so if we don't split up images/feature vectors, then we can't stratify classes as precisely - but can we get some half-decent stratification? A few ideas:
  1. Shuffle the images and rely completely on the randomness to even things out.
  2. Some kind of greedy algorithm: as we loop over each image, check the resulting counts of annotations-per-class if we were to add the image to train, or to ref, or to val; and use some kind of scoring system to see which option gets us closer to the ideal stratification proportions.
  3. Start with idea 1, then use some kind of algorithm (not sure what) to identify a series of 'image swaps' between sets that would get us closer to the ideal stratification proportions.
- Not sure if there is a clean way to also support defining subsets of images (e.g. coralnet sources) to keep together as all train / all ref / all val. I guess the support for that doesn't have to be "can do in a single function call". I could see if some bits of `preprocess_labels()` can be split up into smaller functions, and maybe something workable could come out of that.
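For what it's worth, the greedy-scoring idea could be sketched roughly like this; this is a standalone toy function, and the names, the squared-deviation score, and the largest-image-first ordering are all assumptions, not anything in the PR:

```python
from collections import Counter

def greedy_image_split(image_labels, proportions):
    """Greedily assign whole images to named sets, trying to keep each
    set's per-class annotation counts close to the ideal proportions.

    image_labels: {image_name: [class label for each point]}
    proportions:  {set_name: fraction}, e.g. train/ref/val fractions.
    """
    totals = Counter(l for labels in image_labels.values() for l in labels)
    set_counts = {s: Counter() for s in proportions}
    assignment = {}

    def score(counts, fraction):
        # Sum of squared deviations from the ideal per-class counts.
        return sum((counts[c] - fraction * totals[c]) ** 2 for c in totals)

    # Place the largest images first, so small images can fine-tune
    # the balance at the end.
    for img in sorted(image_labels, key=lambda i: -len(image_labels[i])):
        labels = Counter(image_labels[img])
        best = min(
            proportions,
            key=lambda s: score(set_counts[s] + labels, proportions[s]),
        )
        set_counts[best] += labels
        assignment[img] = best
    return assignment
```

No image is ever split, so stratification is only as fine-grained as the images allow, which is exactly the trade-off discussed above.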
Replying here as I think you've brought up some good points!
It might really depend on the project / source and their goals! If they're capturing a lot of different classes, they may care more about class stratification; likewise if there's a 'rare' class that's of particular importance.
The idea of spatial stratification based on the proximity of nearby points is interesting too.
As for stratification implementation, there are so many different ways to slice the pie here:

- We could do a pure numpy/pandas type of sampling approach that would be completely custom by the user, giving 'ultimate control'.
- Depending on the total number of labels, you could also lower the proportion of the ref/val sets and be more 'greedy' with the training set (not recommended for smaller data sets, but could work for mermaid).
- Create some general sampling as part of pyspacer (see other PRs by Stephen!).
- Try out a library like https://imbalanced-learn.org/stable/combine.html
Late here, but thanks for the comment - that imbalanced-learn library seems particularly promising, but also a lot to digest!
Going to put off further enhancing the stratification until it's really needed. For now, focusing on merging the changes this PR offers so far. To resolve the conflicts with the other recently merged branches, I've opened a replacement PR, #97.
Still trying to confirm the exact semantics of sklearn's train_test_split().
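For what it's worth, a quick sanity check of the `stratify` behavior (assuming scikit-learn is installed; the class names and counts here are made up):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# 80/20 class mix over 100 hypothetical point labels.
labels = ['Porites'] * 80 + ['Acropora'] * 20
points = list(range(100))

x_train, x_val, y_train, y_val = train_test_split(
    points, labels, test_size=0.25, stratify=labels, random_state=0)

# stratify=labels keeps the class ratio in both splits: the 25-point
# val set gets exactly 20 Porites and 5 Acropora labels.
assert Counter(y_val) == Counter(Porites=20, Acropora=5)
```

One caveat when the ratios don't divide evenly: the per-class counts get rounded, so tiny classes can still end up slightly over- or under-represented in a split.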