coralnet / pyspacer

Python-based tools for spatial image analysis
MIT License

Enhance annotation sampling for training, v2 #97

Closed - StephenChan closed this 6 months ago

StephenChan commented 6 months ago

Newer version of PR #84. The difference is that this PR is rebased on top of the merged #95 and #96, with any conflicts resolved. I opened a new PR because I wanted to leave the old training-annotation-sampling branch intact, in case it's still being used for some tests at the moment.

This PR is ready for review, and is the next thing I'm looking to finally merge.

Per the updated CHANGELOG:

task_utils.preprocess_labels() now has three available modes for splitting training annotations between the train, ref, and val sets. The differences between the three modes - VECTORS, POINTS, and POINTS_STRATIFIED - are explained in the SplitMode Enum's comments. Additionally, all three modes now ensure that the ordering of the given training data has no effect on which data goes into train, ref, and val.
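
The order-invariance can be achieved, for example, by sorting the annotations on a stable key before a fixed-seed shuffle. A minimal sketch of that general idea (illustrative only; not necessarily this PR's exact implementation, and the tuple layout is hypothetical):

import random

def order_invariant_shuffle(annotations):
    # 'annotations' is a hypothetical list of
    # (image_key, row, col, label) tuples. Sorting first erases any
    # effect of the caller's ordering; the fixed-seed shuffle then
    # assigns data to train/ref/val reproducibly.
    ordered = sorted(annotations)
    random.Random(0).shuffle(ordered)
    return ordered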

And here are said Enum's comments:

class SplitMode(Enum):
    """
    How to split annotations between train, ref, and val sets.
    """
    # Each feature vector's points all go into train, or all go into ref, or
    # all go into val.
    #
    # The rationale behind this mode is to have greater separation between
    # training data and evaluation data, whether it's during the calibration
    # process (train vs. ref) or during evaluation of the final classifier
    # (train vs. val).
    # Here we assume the imagery is 'more different' when going across feature
    # vectors, as opposed to staying within the same feature vector. When
    # training and evaluation data are 'more different', the result is more
    # useful.
    # Thus, this mode can improve the usefulness of calibration, and the
    # rigor of the evaluation results.
    # However, the annotation count may not end up as precisely balanced
    # between train/ref/val as desired, particularly when the feature-vector
    # size is comparable to the set size. For example, if each feature vector
    # has 100 points, and the target ref-set size is 450, then the best we can
    # do is give the ref set either 400 or 500 points.
    VECTORS = 'vectors'
    # The split is done on an individual point basis, so a single
    # feature vector may be split across train/ref/val.
    #
    # This allows the annotation count to be more precisely balanced
    # between train/ref/val.
    # However, there may be concerns that the imagery going into each set is
    # too similar, particularly when points are densely distributed within
    # each image.
    POINTS = 'points'
    # Stratified sampling by class: an A%/B%/C% train/ref/val split means
    # an A%/B%/C% split of each class.
    # The split is done on an individual point basis.
    #
    # The POINTS mode's results should already be approximately stratified,
    # since the annotations are shuffled. However, POINTS_STRATIFIED enforces
    # the stratification. This can be useful because it makes the final
    # number of unique classes more consistent.
    #
    # Stratification checks that the number of annotations in each
    # set isn't less than the number of unique classes.
    # However, each set is NOT guaranteed to have at least 1 of each class.
    # If stratification is calculated such that a set would get <0.5
    # annotations of a class, then that set gets 0 of that class.
    POINTS_STRATIFIED = 'points_stratified'
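
To make the stratified split and its <0.5 rounding rule concrete, here's a small self-contained sketch of a per-class point split (the name and signature are hypothetical; this is not the PR's actual code):

import random
from collections import defaultdict

def stratified_point_split(annotations, ratios=(0.8, 0.1, 0.1), seed=0):
    # 'annotations' is a list of (point_id, class_label) pairs.
    by_class = defaultdict(list)
    for point_id, label in annotations:
        by_class[label].append((point_id, label))
    train, ref, val = [], [], []
    rng = random.Random(seed)
    for label in sorted(by_class):
        anns = by_class[label]
        rng.shuffle(anns)
        # round() gives the <0.5 behavior: a set that would receive
        # less than half an annotation of a class gets 0 of it.
        n_ref = round(len(anns) * ratios[1])
        n_val = round(len(anns) * ratios[2])
        ref.extend(anns[:n_ref])
        val.extend(anns[n_ref:n_ref + n_val])
        train.extend(anns[n_ref + n_val:])
    return train, ref, val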

The mode that's notably 'missing' is VECTORS_STRATIFIED, because it would be more complicated to stratify accurately when splitting at the vector level. As @yeelauren pointed out in the old PR's thread, there should be ways to implement that if desired, such as with the imbalanced-learn library. But it would be more complex to implement than the other modes, so it's deferred until someone really needs it.

There may be other methods/restrictions that one might want for the data split. For example, perhaps you have a hierarchy of CoralNet data that can be divided into several sources, where each source has a set of feature vectors and each feature vector has a set of point features, and you want each source to go entirely into train, ref, or val (not split between the three). However, at this point, I think that potential need is covered by the ability to instantiate your own TrainingTaskLabels and thus define your own arbitrary split.
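
For that source-level example, a split along these lines would work; a rough sketch (split_by_source and the data layout are hypothetical, and the three resulting groups would then be handed to your own TrainingTaskLabels instance):

import random

def split_by_source(vectors_by_source, ratios=(0.8, 0.1, 0.1), seed=0):
    # 'vectors_by_source' maps a source ID to its list of
    # feature-vector labels. Each source is assigned wholly to
    # train, ref, or val, never split between them.
    sources = sorted(vectors_by_source)
    random.Random(seed).shuffle(sources)
    n_train = round(len(sources) * ratios[0])
    n_ref = round(len(sources) * ratios[1])
    groups = (sources[:n_train],
              sources[n_train:n_train + n_ref],
              sources[n_train + n_ref:])
    return tuple(
        [vec for src in group for vec in vectors_by_source[src]]
        for group in groups)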

Results of experiments using this code:

| Source | Images | Mode | Annotations | Train | Ref | Val | Classes | Accuracy | CN accuracy | Train time |
|---|---|---|---|---|---|---|---|---|---|---|
| 3342 | 1204 | VECTORS | 1202766 | 1031819 | 50000 | 120947 | 12 | 95.1% | 93.0% | 1267.2s |
| 3342 | 1204 | POINTS_STRATIFIED | 1203965 | 1033567 | 50000 | 120398 | 16 | 95.7% | 93.0% | 1484.2s |
| 372 | 37955 | VECTORS | 379383 | 303504 | 37958 | 37921 | 53 | 78.5% | 78.0% | 3441.3s |
| 372 | 37955 | POINTS_STRATIFIED | 379538 | 303628 | 37955 | 37955 | 60 | 78.8% | 78.0% | 3323.5s |
| 2112 | 4649 | VECTORS | 232263 | 185792 | 23242 | 23229 | 46 | 86.6% | 82.0% | 763.9s |
| 2112 | 4649 | POINTS_STRATIFIED | 232416 | 185930 | 23243 | 23243 | 56 | 86.8% | 82.0% | 755.3s |
| 3401 | 7195 | VECTORS | 243342 | 194598 | 24399 | 24345 | 60 | 80.0% | 80.0% | 1020.7s |
| 3401 | 7195 | POINTS_STRATIFIED | 243568 | 194851 | 24359 | 24358 | 68 | 80.3% | 80.0% | 1123.4s |
| 3411 | 14948 | VECTORS | 227448 | 181933 | 22767 | 22748 | 82 | 74.1% | 74.0% | 1349.8s |
| 3411 | 14948 | POINTS_STRATIFIED | 227554 | 182038 | 22758 | 22758 | 88 | 74.7% | 74.0% | 1389.1s |
| 3577 | 3696 | VECTORS | 184623 | 147635 | 18500 | 18488 | 29 | 89.7% | 89.0% | 536.0s |
| 3577 | 3696 | POINTS_STRATIFIED | 184786 | 147826 | 18480 | 18480 | 36 | 89.3% | 89.0% | 554.9s |
| 1579 | 16438 | VECTORS | 164334 | 131460 | 16439 | 16435 | 50 | 77.7% | 77.0% | 1343.7s |
| 1579 | 16438 | POINTS_STRATIFIED | 164341 | 131468 | 16437 | 16436 | 48 | 77.3% | 77.0% | 1374.4s |
| 3697 | 1049 | VECTORS | 52279 | 41810 | 5250 | 5219 | 18 | 78.8% | 82.0% | 149.0s |
| 3697 | 1049 | POINTS_STRATIFIED | 52434 | 41947 | 5244 | 5243 | 25 | 81.5% | 82.0% | 159.3s |
| 3606 | 969 | VECTORS | 24096 | 19269 | 2425 | 2402 | 42 | 59.4% | 69.0% | 138.2s |
| 3606 | 969 | POINTS_STRATIFIED | 24218 | 19374 | 2422 | 2422 | 49 | 68.4% | 69.0% | 127.9s |
| 3357 | 564 | VECTORS | 16790 | 13376 | 1710 | 1704 | 14 | 89.4% | 86.0% | 74.3s |
| 3357 | 564 | POINTS_STRATIFIED | 16911 | 13527 | 1692 | 1692 | 15 | 87.3% | 86.0% | 77.8s |
| 3583 | 200 | VECTORS | 6512 | 5174 | 669 | 669 | 24 | 74.6% | 77.0% | 25.1s |
| 3583 | 200 | POINTS | 6608 | 5281 | 665 | 662 | 31 | 82.5% | 77.0% | 26.3s |
| 3583 | 200 | POINTS_STRATIFIED | 6637 | 5308 | 666 | 663 | 33 | 84.0% | 77.0% | 30.5s |
| 3362 | 44 | VECTORS | 1715 | 1327 | 200 | 188 | 5 | 95.2% | 95.0% | 7.5s |
| 3362 | 44 | POINTS_STRATIFIED | 1755 | 1403 | 176 | 176 | 6 | 93.2% | 95.0% | 6.9s |
| 3489 | 86 | VECTORS | 860 | 680 | 90 | 90 | 10 | 54.4% | 80.0% | 7.1s |
| 3489 | 86 | POINTS_STRATIFIED | 857 | 685 | 86 | 86 | 9 | 59.3% | 80.0% | 5.5s |
| 3685 | 21 | VECTORS | 580 | 410 | 90 | 80 | 10 | 57.5% | 46.0% | 4.0s |
| 3685 | 21 | POINTS_STRATIFIED | 607 | 484 | 62 | 61 | 13 | 65.6% | 46.0% | 3.5s |

CSV version for potentially easier viewing: 2024-03 - single source runs with new sampling code.csv

(To be exact, the experiments used the training-cache-features-3 branch, which places PR #80's feature-caching commits on top of this PR's branch.)

My takeaways from the experiments:

StephenChan commented 6 months ago

Stratified Sampling PR

StephenChan commented 6 months ago

Oh yeah, one other note: there was a previous version of this code where I made POINTS_STRATIFIED (or equivalent) the default mode. However, I've since changed the default mode to VECTORS.

yeelauren commented 6 months ago

Not sure where the best place for this commentary is - maybe an issue? I did some digging today into other methods for our class-imbalance problem. One other option is weighting the classes, which you can do with other methods like SVM. However, sklearn has an issue, opened in 2017, that is one of its most highly upvoted: class weighting for the MLP. See the discussion: https://github.com/scikit-learn/scikit-learn/issues/9113 Most notably:


Hi all,

Thank you all for your comments.

The maintainers of scikit-learn have limited time and resources to improve the project and already are focusing on other aspects of the project they find valuable.

MLPs were introduced in scikit-learn but aren't currently a priority to the maintainers (the maintainers of scikit-learn aren't thinking of extending scikit-learn's implementations of MLPs anymore).

Now, this does not stop anyone from extending those implementations but we (or at least I) do not guarantee those contributions will be accepted.

Note that if someone is interested in co-maintaining those implementations, we highly welcome them!

Alternatively, specialized libraries like Keras and PyTorch should provide reference implementations.
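
For reference, here's what class weighting looks like in an sklearn estimator that does support it; a minimal sketch with SVC (illustrative only - not something pyspacer currently uses, as far as I know):

from sklearn.svm import SVC

# 'balanced' weights each class inversely to its frequency in the
# training data. MLPClassifier has no equivalent class_weight
# parameter, which is what the linked issue asks for.
clf = SVC(class_weight='balanced')
# clf.fit(features, labels)
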
StephenChan commented 6 months ago

Yeah, another issue for it - just created issue #98.

Issue #74 also has notes about sklearn's MLP being a bit rudimentary compared to other libraries' implementations. Super robust deep-learning implementations seem to be out of sklearn's scope, basically.

How does this PR look otherwise?

StephenChan commented 6 months ago

@yeelauren Thanks for the review! Tried making some edits accordingly.

yeelauren commented 6 months ago

Great! Thanks @StephenChan. One statistic we're missing here is per-class accuracy. Overall accuracy can hide some of the nuance between classes - I opened issue #99, which should help with this.
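
For what it's worth, per-class accuracy (recall per class) can be read off a confusion matrix; a minimal sketch with sklearn (illustrative, not tied to #99's eventual implementation):

from sklearn.metrics import classification_report, confusion_matrix

def per_class_accuracy(y_true, y_pred):
    # Diagonal = correct predictions per class; row sums = true
    # sample counts per class. Their ratio is per-class recall.
    cm = confusion_matrix(y_true, y_pred)
    return cm.diagonal() / cm.sum(axis=1)

# classification_report(y_true, y_pred) prints per-class
# precision/recall/F1 in one call.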