coralnet / pyspacer

Python based tools for spatial image analysis
MIT License

Training: specifying ref set + filtering by class + better batching #71

Closed by StephenChan 6 months ago

StephenChan commented 7 months ago

For issue #60.

Note: I'll probably end up amending commit 813a17b to update tests / fix any bugs.

StephenChan commented 6 months ago

pyspacer training: Make reference set larger & specifiable to ensure rare labels are recognized

StephenChan commented 6 months ago

code to filter annotations by desired set of labels

StephenChan commented 6 months ago

There were a few training issues whose code and logic overlapped, so I ended up addressing them together in this PR. Those issues are #60, #59, #78, and a feature to filter annotations by a desired set of labels (indicated by the Trello link above, and self-explanatory from the title I think).

The main changes since this issue's OP involve how to build the labels argument of TrainClassifierMsg. I've pasted the relevant part of the updated README here:

The labels must be split into training, reference, and validation sets:

  • The training set (train) is the data that actually goes into the classifier training algorithm during each training epoch. This is generally much larger than the other two sets.
  • The reference set (ref) is used to evaluate and calibrate the classifier between epochs.
  • The validation set (val) is used to evaluate the final classifier after training is finished.

This three-set split is known by other names elsewhere, such as training, validation, and test sets respectively, or training, development, and test sets respectively.
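
As a rough illustration of the three-set idea, here is a minimal sketch that splits a plain dict of per-image labels into train, ref, and val by image. The helper name `split_by_image`, the fractions, and the seed are all hypothetical; this is not how pyspacer itself decides the split.

```python
import random


def split_by_image(labels, ref_fraction=0.1, val_fraction=0.1, seed=0):
    """Split {feature_path: [(row, col, label), ...]} into three dicts.

    Hypothetical helper for illustration only; not part of pyspacer.
    """
    paths = sorted(labels)
    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(paths)

    n = len(paths)
    # Give ref and val at least one image each.
    n_ref = max(1, round(n * ref_fraction))
    n_val = max(1, round(n * val_fraction))
    if n_ref + n_val >= n:
        raise ValueError("Not enough images to leave any for training.")

    ref_paths = paths[:n_ref]
    val_paths = paths[n_ref:n_ref + n_val]
    train_paths = paths[n_ref + n_val:]
    return {
        'train': {p: labels[p] for p in train_paths},
        'ref': {p: labels[p] for p in ref_paths},
        'val': {p: labels[p] for p in val_paths},
    }
```

Splitting by image rather than by point keeps all points of one image in the same set, which is an assumption of this sketch rather than something stated above.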

There are a few ways to create the labels structure. Each way involves creating one or more instances of data_classes.ImageLabels:

from spacer.data_classes import ImageLabels
image_labels = ImageLabels({
    # Labels for one feature vector's points.
    '/path/to/image1.featurevector': [
        # Point location at row 1000, column 2000, labeled as class 1.
        (1000, 2000, 1),
        # Point location at row 3000, column 2000, labeled as class 2.
        (3000, 2000, 2),
    ],
    # Labels for another feature vector's points.
    '/path/to/image2.featurevector': [
        (1500, 2500, 3),
        (2500, 500, 1),
    ],
})
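
Since rare labels are a concern here, it can help to count how many points each class has before building the sets. The sketch below works on a plain dict with the same shape as the mapping passed to ImageLabels above; `class_counts` is a hypothetical helper, not a pyspacer function.

```python
from collections import Counter


def class_counts(labels):
    """Count points per class in {feature_path: [(row, col, label), ...]}.

    Hypothetical helper for illustration only; not part of pyspacer.
    """
    return Counter(
        label
        for points in labels.values()
        for (_row, _col, label) in points
    )


# Same data as the ImageLabels example, as a plain dict.
example = {
    '/path/to/image1.featurevector': [(1000, 2000, 1), (3000, 2000, 2)],
    '/path/to/image2.featurevector': [(1500, 2500, 3), (2500, 500, 1)],
}
counts = class_counts(example)
# Classes with very few points are the ones most likely to be missing
# from a small reference set.
rare = sorted(c for c, n in counts.items() if n < 2)
```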

The labels argument of TrainClassifierMsg expects an instance of data_classes.TrainingTaskLabels. There are a few ways to create this:

  1. Pass a single ImageLabels instance to the task_utils.preprocess_labels() function. preprocess_labels() then decides how to split your labels into train, ref, and val (running error checks along the way), and creates a TrainingTaskLabels instance from the result.
  2. Pass three ImageLabels instances to the TrainingTaskLabels constructor: one instance for each of train, ref, and val.
  3. Do method 1, but also specify the accepted_classes argument to preprocess_labels(); this makes the function filter out any labels that aren't in the desired set of classes.
  4. Do method 2, but also pass the TrainingTaskLabels through preprocess_labels(). This allows you to use the error-checking and accepted_classes parts of preprocess_labels(), and the train/ref/val split you defined will remain intact.

from spacer.data_classes import ImageLabels
from spacer.messages import TrainingTaskLabels
from spacer.task_utils import preprocess_labels

# 1
labels = preprocess_labels(ImageLabels(...))
# 2
labels = TrainingTaskLabels(
    train=ImageLabels(...), ref=ImageLabels(...), val=ImageLabels(...))
# 3
labels = preprocess_labels(ImageLabels(...), accepted_classes={...})
# 4
labels = preprocess_labels(TrainingTaskLabels(...), accepted_classes={...})
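
To make the accepted_classes filtering concrete, here is a minimal sketch of the idea on a plain dict. `filter_by_classes` is a hypothetical stand-in, not the actual preprocess_labels() implementation, and dropping images that end up with no points is an assumption of this sketch.

```python
def filter_by_classes(labels, accepted_classes):
    """Keep only points whose class is in accepted_classes.

    Works on {feature_path: [(row, col, label), ...]}.
    Hypothetical helper for illustration only; not part of pyspacer.
    """
    accepted = set(accepted_classes)
    filtered = {}
    for path, points in labels.items():
        kept = [point for point in points if point[2] in accepted]
        # Assumption: an image left with no points is dropped entirely.
        if kept:
            filtered[path] = kept
    return filtered


example = {
    '/path/to/image1.featurevector': [(1000, 2000, 1), (3000, 2000, 2)],
    '/path/to/image2.featurevector': [(1500, 2500, 3), (2500, 500, 1)],
}
only_1_and_2 = filter_by_classes(example, {1, 2})
only_3 = filter_by_classes(example, {3})
```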

Other comments after working on this decently involved PR:

yeelauren commented 6 months ago

Nice work! :D I successfully tested with multi-source after re-jigging to use the new labels parameter. I haven't changed any of the default batch settings yet, though I may need to for the large all-source test.