coralnet / pyspacer

Python-based tools for spatial image analysis
MIT License

Enhance annotation sampling for training, v2 #97

Closed - StephenChan closed this 6 months ago

StephenChan commented 6 months ago

Newer version of PR #84. The difference is that this PR is rebased on top of the merged #95 and #96, with any conflicts resolved. I opened a new PR because I wanted to leave the old training-annotation-sampling branch intact, in case it's still being used for some tests at the moment.

This PR is ready for review, and is the next thing I'm looking to finally merge.

Per the updated CHANGELOG:

task_utils.preprocess_labels() now has three available modes for splitting training annotations between the train, ref, and val sets. The differences between the three modes - VECTORS, POINTS, and POINTS_STRATIFIED - are explained in the SplitMode Enum's comments. Additionally, all three modes now ensure that the ordering of the given training data has no effect on which data goes into train, ref, and val.
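
The order-invariance can be achieved, for example, by sorting the annotations on a stable key before a fixed-seed shuffle. A minimal sketch of that general idea (illustrative only; not necessarily this PR's exact implementation, and the tuple layout is hypothetical):

import random

def order_invariant_shuffle(annotations):
    # 'annotations' is a hypothetical list of
    # (image_key, row, col, label) tuples. Sorting first erases any
    # effect of the caller's ordering; the fixed-seed shuffle then
    # assigns data to train/ref/val reproducibly.
    ordered = sorted(annotations)
    random.Random(0).shuffle(ordered)
    return ordered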

And here are said Enum's comments:

class SplitMode(Enum):
    """
    How to split annotations between train, ref, and val sets.
    """
    # Each feature vector's points all go into train, or all go into ref, or
    # all go into val.
    #
    # The rationale behind this mode is to have greater separation between
    # training data and evaluation data, whether it's during the calibration
    # process (train vs. ref) or during evaluation of the final classifier
    # (train vs. val).
    # Here we assume the imagery is 'more different' when going across feature
    # vectors, as opposed to staying within the same feature vector. When
    # training and evaluation data are 'more different', the result is more
    # useful.
    # Thus, this mode can improve the usefulness of calibration, and the
    # rigor of the evaluation results.
    # However, the annotation count may not end up as precisely balanced
    # between train/ref/val as desired, particularly when the feature-vector
    # size is comparable to the set size. For example, if each feature vector
    # has 100 points, and the target ref-set size is 450, then the best we can
    # do is give the ref set either 400 or 500 points.
    VECTORS = 'vectors'
    # The split is done on an individual point basis, so a single
    # feature vector may be split across train/ref/val.
    #
    # This allows the annotation count to be more precisely balanced
    # between train/ref/val.
    # However, there may be concerns that the imagery going into each set is
    # too similar, particularly when points are densely distributed within
    # each image.
    POINTS = 'points'
    # Stratified sampling by class: an A%/B%/C% train/ref/val split means
    # an A%/B%/C% split of each class.
    # The split is done on an individual point basis.
    #
    # The POINTS mode's results should already be approximately stratified,
    # since the annotations are shuffled. However, POINTS_STRATIFIED enforces
    # the stratification. This can be useful because it makes the final
    # number of unique classes more consistent.
    #
    # Stratification checks that the number of annotations in each
    # set isn't less than the number of unique classes.
    # However, each set is NOT guaranteed to have at least 1 of each class.
    # If stratification is calculated such that a set would get <0.5
    # annotations of a class, then that set gets 0 of that class.
    POINTS_STRATIFIED = 'points_stratified'
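
To make the stratified split and its <0.5 rounding rule concrete, here's a small self-contained sketch of a per-class point split (the name and signature are hypothetical; this is not the PR's actual code):

import random
from collections import defaultdict

def stratified_point_split(annotations, ratios=(0.8, 0.1, 0.1), seed=0):
    # 'annotations' is a list of (point_id, class_label) pairs.
    by_class = defaultdict(list)
    for point_id, label in annotations:
        by_class[label].append((point_id, label))
    train, ref, val = [], [], []
    rng = random.Random(seed)
    for label in sorted(by_class):
        anns = by_class[label]
        rng.shuffle(anns)
        # round() gives the <0.5 behavior: a set that would receive
        # less than half an annotation of a class gets 0 of it.
        n_ref = round(len(anns) * ratios[1])
        n_val = round(len(anns) * ratios[2])
        ref.extend(anns[:n_ref])
        val.extend(anns[n_ref:n_ref + n_val])
        train.extend(anns[n_ref + n_val:])
    return train, ref, val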

The mode that's notably 'missing' is VECTORS_STRATIFIED, because it would be more complicated to stratify accurately when splitting at the vector level. As @yeelauren pointed out in the old PR's thread, there should be ways to implement that if desired, such as with the imbalanced-learn library. But it would be more complex to implement than the other modes, so it's deferred until someone really needs it.

There may be other methods/restrictions that one might want for the data split. For example, perhaps you have a hierarchy of CoralNet data that can be divided into several sources, where each source has a set of feature vectors and each feature vector has a set of point features, and you want each source to go entirely into train, ref, or val (not split between the three). However, at this point, I think that potential need is covered by the ability to instantiate your own TrainingTaskLabels and thus define your own arbitrary split.
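
For that source-level example, a split along these lines would work; a rough sketch (split_by_source and the data layout are hypothetical, and the three resulting groups would then be handed to your own TrainingTaskLabels instance):

import random

def split_by_source(vectors_by_source, ratios=(0.8, 0.1, 0.1), seed=0):
    # 'vectors_by_source' maps a source ID to its list of
    # feature-vector labels. Each source is assigned wholly to
    # train, ref, or val, never split between them.
    sources = sorted(vectors_by_source)
    random.Random(seed).shuffle(sources)
    n_train = round(len(sources) * ratios[0])
    n_ref = round(len(sources) * ratios[1])
    groups = (sources[:n_train],
              sources[n_train:n_train + n_ref],
              sources[n_train + n_ref:])
    return tuple(
        [vec for src in group for vec in vectors_by_source[src]]
        for group in groups)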

Results of experiments using this code:

| Source | Images | Mode | Annotations | Train | Ref | Val | Classes | Accuracy | CN accuracy | Train time |
|---|---|---|---|---|---|---|---|---|---|---|
| 3342 | 1204 | VECTORS | 1202766 | 1031819 | 50000 | 120947 | 12 | 95.1% | 93.0% | 1267.2s |
| 3342 | 1204 | POINTS_STRATIFIED | 1203965 | 1033567 | 50000 | 120398 | 16 | 95.7% | 93.0% | 1484.2s |
| 372 | 37955 | VECTORS | 379383 | 303504 | 37958 | 37921 | 53 | 78.5% | 78.0% | 3441.3s |
| 372 | 37955 | POINTS_STRATIFIED | 379538 | 303628 | 37955 | 37955 | 60 | 78.8% | 78.0% | 3323.5s |
| 2112 | 4649 | VECTORS | 232263 | 185792 | 23242 | 23229 | 46 | 86.6% | 82.0% | 763.9s |
| 2112 | 4649 | POINTS_STRATIFIED | 232416 | 185930 | 23243 | 23243 | 56 | 86.8% | 82.0% | 755.3s |
| 3401 | 7195 | VECTORS | 243342 | 194598 | 24399 | 24345 | 60 | 80.0% | 80.0% | 1020.7s |
| 3401 | 7195 | POINTS_STRATIFIED | 243568 | 194851 | 24359 | 24358 | 68 | 80.3% | 80.0% | 1123.4s |
| 3411 | 14948 | VECTORS | 227448 | 181933 | 22767 | 22748 | 82 | 74.1% | 74.0% | 1349.8s |
| 3411 | 14948 | POINTS_STRATIFIED | 227554 | 182038 | 22758 | 22758 | 88 | 74.7% | 74.0% | 1389.1s |
| 3577 | 3696 | VECTORS | 184623 | 147635 | 18500 | 18488 | 29 | 89.7% | 89.0% | 536.0s |
| 3577 | 3696 | POINTS_STRATIFIED | 184786 | 147826 | 18480 | 18480 | 36 | 89.3% | 89.0% | 554.9s |
| 1579 | 16438 | VECTORS | 164334 | 131460 | 16439 | 16435 | 50 | 77.7% | 77.0% | 1343.7s |
| 1579 | 16438 | POINTS_STRATIFIED | 164341 | 131468 | 16437 | 16436 | 48 | 77.3% | 77.0% | 1374.4s |
| 3697 | 1049 | VECTORS | 52279 | 41810 | 5250 | 5219 | 18 | 78.8% | 82.0% | 149.0s |
| 3697 | 1049 | POINTS_STRATIFIED | 52434 | 41947 | 5244 | 5243 | 25 | 81.5% | 82.0% | 159.3s |
| 3606 | 969 | VECTORS | 24096 | 19269 | 2425 | 2402 | 42 | 59.4% | 69.0% | 138.2s |
| 3606 | 969 | POINTS_STRATIFIED | 24218 | 19374 | 2422 | 2422 | 49 | 68.4% | 69.0% | 127.9s |
| 3357 | 564 | VECTORS | 16790 | 13376 | 1710 | 1704 | 14 | 89.4% | 86.0% | 74.3s |
| 3357 | 564 | POINTS_STRATIFIED | 16911 | 13527 | 1692 | 1692 | 15 | 87.3% | 86.0% | 77.8s |
| 3583 | 200 | VECTORS | 6512 | 5174 | 669 | 669 | 24 | 74.6% | 77.0% | 25.1s |
| 3583 | 200 | POINTS | 6608 | 5281 | 665 | 662 | 31 | 82.5% | 77.0% | 26.3s |
| 3583 | 200 | POINTS_STRATIFIED | 6637 | 5308 | 666 | 663 | 33 | 84.0% | 77.0% | 30.5s |
| 3362 | 44 | VECTORS | 1715 | 1327 | 200 | 188 | 5 | 95.2% | 95.0% | 7.5s |
| 3362 | 44 | POINTS_STRATIFIED | 1755 | 1403 | 176 | 176 | 6 | 93.2% | 95.0% | 6.9s |
| 3489 | 86 | VECTORS | 860 | 680 | 90 | 90 | 10 | 54.4% | 80.0% | 7.1s |
| 3489 | 86 | POINTS_STRATIFIED | 857 | 685 | 86 | 86 | 9 | 59.3% | 80.0% | 5.5s |
| 3685 | 21 | VECTORS | 580 | 410 | 90 | 80 | 10 | 57.5% | 46.0% | 4.0s |
| 3685 | 21 | POINTS_STRATIFIED | 607 | 484 | 62 | 61 | 13 | 65.6% | 46.0% | 3.5s |

CSV version for potentially easier viewing: 2024-03 - single source runs with new sampling code.csv

(To be exact, the experiments used the training-cache-features-3 branch, which places PR #80's feature-caching commits on top of this PR's branch.)

My takeaways from the experiments:

StephenChan commented 6 months ago

Stratified Sampling PR

StephenChan commented 6 months ago

Oh yeah, one other note: there was a previous version of this code where I made POINTS_STRATIFIED (or equivalent) the default mode. However, I've since changed the default mode to VECTORS.

yeelauren commented 6 months ago

Not sure where the best place for this commentary is - maybe an issue? I did some digging today into other methods for our class-imbalance problem. One other option is weighting the classes, which you can do with other methods like SVM. However, sklearn has an issue, opened in 2017, that is one of its most highly upvoted: class weighting for the MLP. See the discussion: https://github.com/scikit-learn/scikit-learn/issues/9113 Most notably:


Hi all,

Thank you all for your comments.

The maintainers of scikit-learn have limited time and resources to improve the project and already are focusing on other aspects of the project they find valuable.

MLPs were introduced in scikit-learn but aren't currently a priority to the maintainers (the maintainers of scikit-learn aren't thinking of extending scikit-learn's implementations of MLPs anymore).

Now, this does not stop anyone from extending those implementations but we (or at least I) do not guarantee those contributions will be accepted.

Note that if someone is interested in co-maintaining those implementations, we highly welcome them!

Alternatively, specialized libraries like Keras and PyTorch should provide reference implementations.
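
For reference, here's what class weighting looks like in an sklearn estimator that does support it; a minimal sketch with SVC (illustrative only - not something pyspacer currently uses, as far as I know):

from sklearn.svm import SVC

# 'balanced' weights each class inversely to its frequency in the
# training data. MLPClassifier has no equivalent class_weight
# parameter, which is what the linked issue asks for.
clf = SVC(class_weight='balanced')
# clf.fit(features, labels)
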
StephenChan commented 6 months ago

Yeah, another issue for it - just created issue #98.

Issue #74 also has notes about sklearn's MLP being a bit rudimentary compared to other libraries' implementations. Super robust deep-learning implementations seem to be out of sklearn's scope, basically.

How does this PR look otherwise?

StephenChan commented 6 months ago

@yeelauren Thanks for the review! Tried making some edits accordingly.

yeelauren commented 6 months ago

Great! Thanks @StephenChan. One statistic we're missing here is per-class accuracy. Overall accuracy can hide some of the nuance between classes - I opened issue #99, which should help with this.
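
For what it's worth, per-class accuracy (recall per class) can be read off a confusion matrix; a minimal sketch with sklearn (illustrative, not tied to #99's eventual implementation):

from sklearn.metrics import classification_report, confusion_matrix

def per_class_accuracy(y_true, y_pred):
    # Diagonal = correct predictions per class; row sums = true
    # sample counts per class. Their ratio is per-class recall.
    cm = confusion_matrix(y_true, y_pred)
    return cm.diagonal() / cm.sum(axis=1)

# classification_report(y_true, y_pred) prints per-class
# precision/recall/F1 in one call.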