coralnet / pyspacer

Python based tools for spatial image analysis
MIT License

WIP: improve annotation sampling for training #84

Closed StephenChan closed 6 months ago

StephenChan commented 8 months ago

Still trying to confirm the exact semantics of sklearn's train_test_split().
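A quick way to confirm those semantics is to run `train_test_split()` on toy data (the labels below are hypothetical, just for illustration): with `stratify=`, each class's proportion is preserved in both output sets as closely as the counts allow.

```python
from sklearn.model_selection import train_test_split

labels = ['Porites'] * 8 + ['CCA'] * 4  # 2:1 class ratio, 12 items total
indices = list(range(len(labels)))

train_idx, val_idx, train_y, val_y = train_test_split(
    indices, labels, test_size=0.25, stratify=labels, random_state=0)

# test_size=0.25 of 12 items -> 9 train / 3 val, and stratification
# keeps the 2:1 ratio on both sides: 6+3 in train, 2+1 in val.
print(train_y.count('Porites'), val_y.count('Porites'))  # 6 2
```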

StephenChan commented 8 months ago

Should be in a better working state now. Things left to do:

(I realize I forgot to make this a draft PR and am not sure how to fix that now. Oh well.)

StephenChan commented 8 months ago

Just added another to-do item:

Add option to ensure each feature vector is entirely in one set or another, instead of having its point-features separated between train, ref, and val; this addresses potential concerns about the sets being too similar to one another

Prior to this PR, preprocess_labels() did have the property that each feature vector (containing a single image's point-features) was contained entirely within either the train set, the ref set, or the val set. The current state of this PR throws out that property so that sklearn's train_test_split() can be used for precise class-stratification. As a simple example, if somehow all of your Porites annotations are in a single image 123.jpg, then the only way to stratify the Porites annotations at all is to allow 123.jpg's annotations to be split between train/ref/val.
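To illustrate the trade-off on toy data (the image names and labels here are hypothetical): when the split is done per point-feature rather than per image, points from the same image can land on both sides of the split.

```python
from sklearn.model_selection import train_test_split

# Each entry is one point-feature, tagged with its source image.
points = [('123.jpg', 'Porites')] * 4 + [('456.jpg', 'CCA')] * 4
labels = [label for _, label in points]

train_pts, val_pts = train_test_split(
    points, test_size=0.25, stratify=labels, random_state=0)

train_imgs = {img for img, _ in train_pts}
val_imgs = {img for img, _ in val_pts}
# Because the split ignores image boundaries, both sets end up
# containing points from the same images:
print(train_imgs & val_imgs)  # both images appear on both sides
```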

However, I brought this up with David and he did say this could raise concerns about the sets being too similar to one another, and thus the accuracy results may poorly represent how the classifier would perform on entirely new images. In fact, when coralnet 1.0's feature extractor was being evaluated, the bar was set higher, with the train and validation sets using entirely different coralnet sources.

I suspect the degree of this concern would depend on the project though. If your points are densely distributed, such that you can easily identify 3 or so points which are very close in an image (and thus the cropped patches would have a lot of overlap), then splitting up images may matter more for set-similarity. If your points are sparsely distributed, and/or a lot of your photos were taken right next to each other anyway, then splitting up images may matter less.

So it may be good to add at least a boolean option, giving callers the ability to decide if they're more concerned about splitting up images or about precise stratification.

Two main implementation concerns:

I'm not sure there is a clean way to also support defining subsets of images (e.g. coralnet sources) to keep together as all-train / all-ref / all-val. That support doesn't necessarily have to be achievable in a single function call, though. I could see whether parts of preprocess_labels() can be split into smaller functions, and maybe something workable would come out of that.
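For the "keep each image whole" side of the boolean option, one possible approach (a sketch, not what this PR implements) is sklearn's GroupShuffleSplit, which assigns every member of a group (here, an image) to the same side of the split. The data below is hypothetical.

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical point-features: (image, class label)
points = [('123.jpg', 'Porites')] * 3 + [('456.jpg', 'CCA')] * 3 \
         + [('789.jpg', 'Porites')] * 3
groups = [img for img, _ in points]

# An int test_size means "this many groups in the held-out set".
splitter = GroupShuffleSplit(n_splits=1, test_size=1, random_state=0)
train_idx, val_idx = next(splitter.split(points, groups=groups))

train_imgs = {points[i][0] for i in train_idx}
val_imgs = {points[i][0] for i in val_idx}
print(train_imgs & val_imgs)  # empty: no image straddles the split
```

The catch is that GroupShuffleSplit does no class-stratification; newer sklearn versions also have StratifiedGroupKFold, which tries to approximately stratify while keeping groups intact, which might be a middle ground worth evaluating.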

yeelauren commented 7 months ago

Replying here as I think you've brought up some good points!
It might really depend on the project / source and its goals! If a project captures many different classes, it may care more about class stratification; likewise if there's a 'rare' class of particular importance. The idea of spatial stratification based on proximity between points is interesting too.

As for stratification implementation:

saanobhaai commented 7 months ago

Stratified Sampling PR

StephenChan commented 6 months ago

Late here, but thanks for the comment - that imbalanced-learn library seems particularly promising, but also a lot to digest!

Going to put off further enhancing the stratification until it's really needed. For now, focusing on merging the changes this PR offers so far. To resolve the conflicts with the other recently merged branches, I've opened a replacement PR, #97.