Training: specifying ref set + filtering by class + better batching

StephenChan commented 7 months ago

For issue #60.

Training labels are now specified either as just one set, or as train + reference + val sets.
Training mini-batch size is now configurable as TRAINING_BATCH_LABEL_COUNT.
Training mini-batch size is now followed more accurately, rather than approximating based on number of labels in the first image.
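
The batching change above can be illustrated with a minimal sketch. This is not pyspacer's implementation; the function name iter_label_batches and the plain-dict input are assumptions made for illustration. The idea is to accumulate labels across images and emit a batch once exactly batch_label_count labels are collected, instead of estimating images per batch from the first image's label count.

```python
from typing import Dict, Iterator, List, Tuple

Annotation = Tuple[int, int, int]  # (row, column, class ID)


def iter_label_batches(
    labels_by_image: Dict[str, List[Annotation]],
    batch_label_count: int,
) -> Iterator[List[Tuple[str, Annotation]]]:
    """Yield batches of exactly batch_label_count labels each.

    Hypothetical helper, not part of spacer. The final batch may be
    smaller if the total label count is not a multiple of the size.
    """
    if batch_label_count < 1:
        raise ValueError("batch_label_count must be at least 1")
    batch = []
    for image_key, annotations in labels_by_image.items():
        for annotation in annotations:
            batch.append((image_key, annotation))
            if len(batch) == batch_label_count:
                yield batch
                batch = []
    if batch:
        # Emit whatever is left over as a final, smaller batch.
        yield batch


example = {
    'image1.featurevector': [(10, 20, 1), (30, 40, 2), (50, 60, 1)],
    'image2.featurevector': [(15, 25, 3), (35, 45, 1)],
}
batch_sizes = [len(b) for b in iter_label_batches(example, 2)]
```

Here a batch can span two images, which is what keeps the batch size exact when images carry different numbers of labels.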

Note: I'll probably end up amending commit 813a17b to update tests / fix any bugs.

StephenChan commented 6 months ago

pyspacer training: Make reference set larger & specifiable to ensure rare labels are recognized

StephenChan commented 6 months ago

code to filter annotations by desired set of labels

StephenChan commented 6 months ago

There were a few training issues whose code and logic overlapped, so I ended up addressing them together in this PR: #60, #59, #78, and a feature to filter annotations by a desired set of labels (indicated by the Trello link above, and I think self-explanatory from the title).

The main changes since this issue's OP involve how to build the labels argument of TrainClassifierMsg. I've pasted the relevant part of the updated README here:

The labels must be split into training, reference, and validation sets:

The training set (train) is the data that actually goes into the classifier training algorithm during each training epoch. This is generally much larger than the other two sets.

The reference set (ref) is used to evaluate and calibrate the classifier between epochs.

The validation set (val) is used to evaluate the final classifier after training is finished.

This three-set split is known by other names elsewhere, such as training, validation, and test sets respectively, or training, development, and test sets respectively.

There are a few ways to create the labels structure. Each way involves creating one or more instances of data_classes.ImageLabels:
from spacer.data_classes import ImageLabels
image_labels = ImageLabels({
    # Labels for one feature vector's points.
    '/path/to/image1.featurevector': [
        # Point location at row 1000, column 2000, labeled as class 1.
        (1000, 2000, 1),
        # Point location at row 3000, column 2000, labeled as class 2.
        (3000, 2000, 2),
    ],
    # Labels for another feature vector's points.
    '/path/to/image2.featurevector': [
        (1500, 2500, 3),
        (2500, 500, 1),
    ],
})
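
To make the structure concrete, here is a small sketch that tallies labels per class, which is one way to spot rare classes before training. It uses a plain dict in the same shape as the ImageLabels argument above and deliberately avoids the spacer API, so nothing here reflects ImageLabels's own methods.

```python
from collections import Counter

# Same shape as the dict passed to ImageLabels above:
# feature vector path -> list of (row, column, class ID) tuples.
labels_by_image = {
    '/path/to/image1.featurevector': [(1000, 2000, 1), (3000, 2000, 2)],
    '/path/to/image2.featurevector': [(1500, 2500, 3), (2500, 500, 1)],
}

# Count how many point labels each class has across all images.
class_counts = Counter(
    class_id
    for annotations in labels_by_image.values()
    for _row, _column, class_id in annotations
)
```
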
The labels argument of TrainClassifierMsg expects an instance of data_classes.TrainingTaskLabels. There are a few ways to create this:

1. Pass a single ImageLabels instance to the task_utils.preprocess_labels() function. preprocess_labels() then decides how to split up your labels into train, ref, and val (while doing error checks in the meantime), and creates a TrainingTaskLabels instance from there.

2. Pass three ImageLabels instances to the TrainingTaskLabels constructor: one instance for each of train, ref, and val.

3. Do method 1, but also specify the accepted_classes argument to preprocess_labels(); this makes the function filter out any labels that aren't in the desired set of classes.

4. Do method 2, but also pass the TrainingTaskLabels through preprocess_labels(). This allows you to use the error-checking and accepted_classes parts of preprocess_labels(), and the train/ref/val split you defined will remain intact.
from spacer.data_classes import ImageLabels
from spacer.messages import TrainingTaskLabels
from spacer.task_utils import preprocess_labels

# 1
labels = preprocess_labels(ImageLabels(...))
# 2
labels = TrainingTaskLabels(
    train=ImageLabels(...), ref=ImageLabels(...), val=ImageLabels(...))
# 3
labels = preprocess_labels(ImageLabels(...), accepted_classes={...})
# 4
labels = preprocess_labels(TrainingTaskLabels(...), accepted_classes={...})
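
As a rough illustration of what filtering by accepted classes means, here is a sketch on a plain dict. The helper filter_by_classes is hypothetical and is not the logic inside preprocess_labels(); in particular, dropping images that end up with no labels is a choice made for this sketch, not confirmed behavior of spacer.

```python
from typing import Dict, List, Set, Tuple

Annotation = Tuple[int, int, int]  # (row, column, class ID)


def filter_by_classes(
    labels_by_image: Dict[str, List[Annotation]],
    accepted_classes: Set[int],
) -> Dict[str, List[Annotation]]:
    """Keep only annotations whose class ID is in accepted_classes.

    Hypothetical helper for illustration. Images left with no
    annotations are dropped (an assumption of this sketch).
    """
    filtered = {}
    for image_key, annotations in labels_by_image.items():
        kept = [a for a in annotations if a[2] in accepted_classes]
        if kept:
            filtered[image_key] = kept
    return filtered


labels_by_image = {
    '/path/to/image1.featurevector': [(1000, 2000, 1), (3000, 2000, 2)],
    '/path/to/image2.featurevector': [(1500, 2500, 3), (2500, 500, 1)],
}
filtered = filter_by_classes(labels_by_image, accepted_classes={1, 2})
```
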

Other comments after working on this decently involved PR:

I was wondering if an overall move in terminology from 'labels' to 'annotations' (to mean 'the application of a label/class to a point') would make things less confusing. Potentially also 'classes' to 'labels' to match CoralNet, but not sure. The terminology shift would involve a lot of little changes throughout spacer, so it should be its own PR if we do it.
Maybe some function names could be clearer, like load_train_data_for_image() instead of load_image_data().
I was tempted to remove trainer_name and trainer_factory because there's still only one available trainer name, but I ended up saving that for a future PR. Relatedly, some of the task messages' arguments could be optional or have reasonable defaults. For example, previous_model_locs could be optional, and clf_type could default to 'MLP'.
I'm increasingly wishing for a way to override settings (config) values for certain unit tests, the way Django has it, so that warrants an issue at some point. Right now, the tests which depend on particular settings simply assume that the setting has been left at its default value in the current environment, which potentially makes test runs a hassle for anyone using non-default values. Also, being able to set certain settings to a bare minimum value can speed up tests; for example, a test which depends on TRAINING_BATCH_LABEL_COUNT would likely run faster if the setting were 50 instead of 5000.
The scripts dir is getting increasingly out of date, as we don't have tests for the scripts and I haven't used them in a while... so that needs cleaning up one way or the other sometime.
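
For the settings-override idea above, one possible approach is sketched below using the standard library's unittest.mock.patch.object. The config object and batches_needed() function are hypothetical stand-ins, not spacer's actual config module or code, and this is only one option among several.

```python
import unittest
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for a settings module; not spacer's config.
config = SimpleNamespace(TRAINING_BATCH_LABEL_COUNT=5000)


def batches_needed(label_count: int) -> int:
    """Number of mini-batches needed, reading the setting at call time."""
    size = config.TRAINING_BATCH_LABEL_COUNT
    return -(-label_count // size)  # ceiling division


class BatchCountTest(unittest.TestCase):

    def test_small_batch_size(self):
        # Override the setting only inside this block.
        with mock.patch.object(config, 'TRAINING_BATCH_LABEL_COUNT', 50):
            self.assertEqual(batches_needed(120), 3)
        # The original value is restored when the block exits.
        self.assertEqual(config.TRAINING_BATCH_LABEL_COUNT, 5000)
```

This only works when the code under test reads the setting from the config object at call time; a value copied into another module at import time would not see the override.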

yeelauren commented 6 months ago

Nice work! :D I successfully tested with multi-source after re-jigging to use the new labels parameter. I haven't adjusted any of the default settings for batches yet, though I may need to for the large all-source test.

coralnet / pyspacer

Training: specifying ref set + filtering by class + better batching #71