Closed StephenChan closed 9 months ago
Interesting, and good explanation. My take is that solution #3 is preferable; we should be able to call pyspacer with a reliable set of all possible labels, whether they are in the val labels or training labels or neither.
Sounds good - yeah, based on CoralNet's needs as well, I should be implementing solution 3 in the short term.
Done in PR #71.
In terms of the 3 solution avenues in the OP, 1 and 3 were done. 2 did not seem possible, at least with scikit-learn.
As seen in train_utils.py's `train()` function (link): the `labels` (which were the `train_labels` of the TrainClassifierMsg) are split up into a train set and a reference set. The reference set is determined by taking every 10th image of `labels`, up to 5000 annotations (assuming the number of samples per image is constant throughout `labels`). The 5000 cap ensures the reference set fits in memory. The reference set is kept separate from the train set because it's used to calibrate the classifier (according to the code; I can't provide a deeper explanation myself). The classifier's set of recognized classes (i.e. its labelset) ends up being the set of classes present in both the train set and the reference set.
As a result, from the point of view of the program calling pyspacer, it's difficult to control what the classifier's recognized labelset ends up being.
Also, if a class's rarity is on the order of 1 in 5000 points, there's a good chance that class ends up being excluded from the classifier's recognized labelset. Further, if the labelset contains on the order of 5000 labels, most of those labels have a good chance of being excluded. That's a significant obstacle to any vision of, say, a generalized CoralNet-wide classifier (CoralNet has 7000+ labels).
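A back-of-envelope check of that rarity claim, assuming (for simplicity) that the 5000 reference annotations are drawn independently: a class with frequency 1 in 5000 is absent from all 5000 draws with probability roughly 1/e, i.e. it misses the reference set about 37% of the time.

```python
import math

ref_size = 5000
class_freq = 1 / 5000

# Probability the class appears in none of the ref_size samples.
p_absent = (1 - class_freq) ** ref_size
print(f'P(class missing from reference set) ~= {p_absent:.3f}')  # ~= 0.368
assert abs(p_absent - math.exp(-1)) < 0.01
```

The real split is every-10th-image rather than i.i.d. sampling, so this is only an approximation, but it supports the intuition that classes rarer than ~1 in 5000 are frequently dropped from the labelset.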
Possible solution avenues; more than one might be desired:
Make the number 5000 customizable, so that a system with higher-memory instances than CoralNet's can raise it to a potentially much higher number.
Rework training so that the reference set doesn't have to fit in memory and thus doesn't have to be size-capped at all. Not sure how doable this is. If there are any caveats to this training implementation compared to the old one, then perhaps the new implementation would become a new Trainer class (alongside the existing MiniBatchTrainer).
Allow TrainClassifierMsg to specify train labels, val labels, and ref labels - not just train and val. This makes the reference set much more clearly definable (no need to set it up as every 10th element of your train labels). To reduce entry-level complexity, this granularity could be optional: allow specifying train & val & ref, and also allow specifying just 'labels' and letting pyspacer figure out how to split it into train/val/ref. I think this solution should be implemented, because CoralNet still has trouble handling the corner case where the classifier's recognized labelset has size 0, and being able to reliably detect that situation before calling pyspacer (by defining the ref set on CoralNet's side) would help a lot.
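To illustrate why caller-defined ref labels help with the size-0 corner case: once the caller controls both the train and ref sets, it can compute the recognized labelset (classes present in both sets) itself and bail out before ever calling pyspacer. This is a hypothetical sketch; the helper name and the `{image: [(point, class), ...]}` label shape are assumptions, not pyspacer's API.

```python
def recognized_labelset(train_labels, ref_labels):
    """Classes the trained classifier would recognize: those present
    in both the train set and the reference set."""
    def classes(label_dict):
        return {cls for anns in label_dict.values() for _, cls in anns}
    return classes(train_labels) & classes(ref_labels)

# Illustrative data: 'Porites' appears in both sets, 'Sand' only in train.
train_labels = {'img1.jpg': [((10, 20), 'Porites')],
                'img2.jpg': [((30, 40), 'Sand')]}
ref_labels = {'img3.jpg': [((50, 60), 'Porites')]}

labelset = recognized_labelset(train_labels, ref_labels)
if not labelset:
    # Caught before training, rather than after a failed pyspacer run.
    raise ValueError('No classes shared between train and ref sets; '
                     'training would produce an empty labelset.')
```

With solution 3, this check becomes reliable because the ref set is exactly what the caller passed in, rather than an every-10th-image slice whose contents the caller would have to reverse-engineer.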