Closed StephenChan closed 6 months ago
A few training-related issues all involved closely related code and logic, so I ended up addressing them together in this PR: #60, #59, #78, and a feature to filter annotations by a desired set of labels (indicated by the Trello link above, and self-explanatory from the title I think).
The main changes since this issue's OP involve how to build the `labels` argument of `TrainClassifierMsg`. I've pasted the relevant part of the updated README here:
The labels must be split into training, reference, and validation sets:
- The training set (train) is the data that actually goes into the classifier training algorithm during each training epoch. This is generally much larger than the other two sets.
- The reference set (ref) is used to evaluate and calibrate the classifier between epochs.
- The validation set (val) is used to evaluate the final classifier after training is finished.
This three-set split is known by other names elsewhere, such as training, validation, and test sets respectively, or training, development, and test sets respectively.
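As a rough illustration of the three-set idea, here is a minimal sketch that assigns whole images to train, ref, or val. It is not pyspacer's actual splitting logic: `split_by_image`, the 10% fractions, and the fixed seed are all assumptions made up for this example, and it works on a plain dict rather than on `ImageLabels`.

```python
import random


def split_by_image(labels, ref_fraction=0.1, val_fraction=0.1, seed=0):
    """Split a {feature_path: [(row, col, class), ...]} dict into
    train, ref, and val dicts, assigning whole images to one set.

    Hypothetical helper for illustration only; not part of pyspacer.
    """
    paths = sorted(labels)
    random.Random(seed).shuffle(paths)

    # Reserve at least one image each for ref and val.
    n_ref = max(1, int(len(paths) * ref_fraction))
    n_val = max(1, int(len(paths) * val_fraction))
    if n_ref + n_val >= len(paths):
        raise ValueError("Not enough images for a train/ref/val split.")

    ref_paths = paths[:n_ref]
    val_paths = paths[n_ref:n_ref + n_val]
    train_paths = paths[n_ref + n_val:]

    def pick(selected):
        return {path: labels[path] for path in selected}

    return pick(train_paths), pick(ref_paths), pick(val_paths)


# Made-up data: 20 images with one labeled point each.
example = {
    f'/path/to/image{i}.featurevector': [(100 * i, 200 * i, i % 3)]
    for i in range(1, 21)
}
train, ref, val = split_by_image(example)
print(len(train), len(ref), len(val))
```

Splitting by image rather than by point keeps all points of one image in the same set; whether that is the right choice for a given dataset is a separate question.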
There are a few ways to create the `labels` structure. Each way involves creating one or more instances of `data_classes.ImageLabels`:

```python
from spacer.data_classes import ImageLabels

image_labels = ImageLabels({
    # Labels for one feature vector's points.
    '/path/to/image1.featurevector': [
        # Point location at row 1000, column 2000, labeled as class 1.
        (1000, 2000, 1),
        # Point location at row 3000, column 2000, labeled as class 2.
        (3000, 2000, 2),
    ],
    # Labels for another feature vector's points.
    '/path/to/image2.featurevector': [
        (1500, 2500, 3),
        (2500, 500, 1),
    ],
})
```
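Before wrapping data in `ImageLabels`, it can help to check how many points each class has. The sketch below does that on a plain dict of the same shape as the one above; `count_labels_per_class` is a hypothetical helper written for this example, not a pyspacer function.

```python
from collections import Counter

# Same shape as the dict passed to ImageLabels above:
# feature vector path -> list of (row, column, class) tuples.
label_data = {
    '/path/to/image1.featurevector': [(1000, 2000, 1), (3000, 2000, 2)],
    '/path/to/image2.featurevector': [(1500, 2500, 3), (2500, 500, 1)],
}


def count_labels_per_class(data):
    """Count labeled points per class across all feature vectors."""
    counts = Counter()
    for annotations in data.values():
        for _row, _column, label in annotations:
            counts[label] += 1
    return counts


class_counts = count_labels_per_class(label_data)
print(dict(class_counts))  # {1: 2, 2: 1, 3: 1}
```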
The `labels` argument of `TrainClassifierMsg` expects an instance of `data_classes.TrainingTaskLabels`. There are a few ways to create this:
1. Pass a single `ImageLabels` instance to the `task_utils.preprocess_labels()` function. `preprocess_labels()` then decides how to split up your labels into train, ref, and val (while doing error checks in the meantime), and creates a `TrainingTaskLabels` instance from there.
2. Pass three `ImageLabels` instances to the `TrainingTaskLabels` constructor: one instance for each of train, ref, and val.
3. Do method 1, but also specify the `accepted_classes` argument to `preprocess_labels()`; this makes the function filter out any labels that aren't in the desired set of classes.
4. Do method 2, but also pass the `TrainingTaskLabels` through `preprocess_labels()`. This allows you to use the error-checking and `accepted_classes` parts of `preprocess_labels()`, and the train/ref/val split you defined will remain intact.
```python
from spacer.data_classes import ImageLabels
from spacer.messages import TrainingTaskLabels
from spacer.task_utils import preprocess_labels

# 1
labels = preprocess_labels(ImageLabels(...))

# 2
labels = TrainingTaskLabels(
    train=ImageLabels(...),
    ref=ImageLabels(...),
    val=ImageLabels(...))

# 3
labels = preprocess_labels(ImageLabels(...), accepted_classes={...})

# 4
labels = preprocess_labels(TrainingTaskLabels(...), accepted_classes={...})
```
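To make the effect of `accepted_classes` concrete, here is a minimal sketch of that kind of filtering on a plain dict. It is only a stand-in, not the real `preprocess_labels()`: `filter_by_classes` is a hypothetical name, and dropping feature vectors that end up with no labels is an assumption of this sketch rather than documented behavior.

```python
def filter_by_classes(data, accepted_classes):
    """Keep only the points whose class is in accepted_classes.

    Hypothetical stand-in for illustration; not pyspacer code.
    Assumption: feature vectors left with no points are dropped.
    """
    filtered = {}
    for path, annotations in data.items():
        kept = [
            (row, column, label)
            for row, column, label in annotations
            if label in accepted_classes
        ]
        if kept:
            filtered[path] = kept
    return filtered


label_data = {
    '/path/to/image1.featurevector': [(1000, 2000, 1), (3000, 2000, 2)],
    '/path/to/image2.featurevector': [(1500, 2500, 3), (2500, 500, 1)],
}

# Class 3 is filtered out; both feature vectors keep at least one point.
kept_1_and_2 = filter_by_classes(label_data, accepted_classes={1, 2})

# Only class 2 is kept, so image2 has nothing left and is dropped.
kept_2_only = filter_by_classes(label_data, accepted_classes={2})
```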
Other comments after working on this decently involved PR:
- `load_train_data_for_image()` instead of `load_image_data()`.
- `trainer_name` and `trainer_factory` because there's still only one available trainer name, but I ended up saving it for a future PR. Relatedly, some of the task messages' arguments could be optional or have reasonable defaults. For example, `previous_model_locs` could be optional, and `clf_type` could default to `'MLP'`.
- The `scripts` dir is getting increasingly out of date, as we don't have tests for them and I still haven't used them in a while... so that needs cleaning up one way or the other sometime.

Nice work! :D I successfully tested with multi-source after re-jigging to use the new `labels` parameter. I haven't adjusted any of the default settings for batches yet, which I may need to do for the large all-source test.
For issue #60.
`TRAINING_BATCH_LABEL_COUNT`.

Note: I'll probably end up amending commit 813a17b to update tests / fix any bugs.