coralnet / pyspacer

Python-based tools for spatial image analysis
MIT License

No-rowcol ImageFeatures are problematic when filtering classes before training #82

Closed: StephenChan closed this issue 8 months ago

StephenChan commented 8 months ago

As of pyspacer 0.7.0, specifically PR #71, the train task's input annotations are no longer class-filtered at ImageFeatures load time. They are now class-filtered with preprocess_labels() at or before the start of the task. By class filtering, I mean that the training class-set is determined (either as the intersection of the train and ref class-sets, or as a custom set), and then any train/ref/val annotations whose classes aren't in that set are filtered out.
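To make the filtering concrete, here's a minimal sketch of the class-set intersection and filtering just described. The function names and the dict-of-tuples data layout are assumptions for illustration, not pyspacer's actual API:

```python
# Hypothetical sketch of class filtering; labels are modeled as
# {image_key: [(row, col, label), ...]}, mirroring the (_, _, label)
# tuples seen elsewhere in this issue. Not pyspacer's real code.

def compute_class_set(train_labels, ref_labels):
    """Training class-set = intersection of classes in train and ref."""
    train_classes = {label for anns in train_labels.values()
                     for (_, _, label) in anns}
    ref_classes = {label for anns in ref_labels.values()
                   for (_, _, label) in anns}
    return train_classes & ref_classes

def filter_classes(labels, class_set):
    """Drop any annotation whose label isn't in class_set."""
    return {
        image_key: [ann for ann in anns if ann[2] in class_set]
        for image_key, anns in labels.items()
    }

train = {'a.jpg': [(10, 10, 'coral'), (20, 20, 'sand')]}
ref = {'b.jpg': [(5, 5, 'coral'), (7, 7, 'algae')]}
class_set = compute_class_set(train, ref)   # only 'coral' is in both
filtered = filter_classes(train, class_set)
```

Note that filtering shortens the per-image annotation lists, which is exactly what breaks the positional matching described below for legacy features.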

This timing change has introduced a problem for ImageFeatures extracted before roughly PR #5 (which propagated to coralnet's build by January 2021). These legacy features didn't save row and column values, so to match up an image's features with the input annotations, we rely on each image's features being stored in the same order as the annotations that were passed in:

        # For legacy features, we didn't store the row, col information.
        # Instead rely on ordering.
        for (_, _, label), point_feature in zip(labels_data,
                                                features.point_features):
            yield point_feature.data, label

But if we filter out one or more annotations prior to passing them in, we no longer know how to match up the features and annotations. Right now, this just makes training crash with an error:

spacer.exceptions.RowColumnMismatchError: somefilepath.jpeg.featurevector: The number of labels (49) doesn't match the number of extracted features (50).

and there isn't an obvious fix. To match up 49 annotations with 50 legacy features, we'd need to know whether it was the 1st annotation that was filtered out, or the 2nd, or the 35th. But that info isn't saved in the ImageLabels data-class or anywhere else.
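For reference, a simplified sketch of the positional matching and the length check that produces the error above (types are simplified and this is not pyspacer's actual implementation):

```python
# Hypothetical reconstruction of the legacy-feature matching path.
# Legacy features carry no row/col, so pairing is purely positional:
# the i-th stored feature must correspond to the i-th input annotation.

class RowColumnMismatchError(Exception):
    pass

def match_legacy(labels_data, point_features, key):
    """Pair legacy (row/col-less) features with annotations by position."""
    if len(labels_data) != len(point_features):
        # Once class filtering has removed annotations, the counts differ
        # and there's no way to recover which positions were dropped.
        raise RowColumnMismatchError(
            f"{key}: The number of labels ({len(labels_data)}) doesn't"
            f" match the number of extracted features"
            f" ({len(point_features)}).")
    for (_, _, label), point_feature in zip(labels_data, point_features):
        yield point_feature, label

pairs = list(match_legacy([(0, 0, 'coral')], ['f0'], 'img.featurevector'))
```

Without the length check, zip() would silently pair the surviving labels with the wrong features, which is arguably worse than the crash.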

(Note: in this issue's context, "legacy features" has nothing to do with VGG16 vs. EfficientNet. It's just the presence vs. absence of row/column info. However, absent rows/columns are an old enough thing that it's not seen in any EfficientNet features.)

Potential solution options:

  1. Each time ImageLabels.filter_classes() is called, have the ImageLabels instance save info on which annotations (in terms of ordering: 1st, 15th, etc.) were filtered out.

    • One option is to list the annotation numbers that were filtered out: for example, [1, 2] to say the first two were filtered out. Subsequent updates can be a bit tricky: for example, if filter_classes() is called twice, and each call filters out the 4th remaining annotation, then in terms of the original ordering, it's actually the 4th and 5th annotations that were filtered out.

    • Another option is to not remove elements from the annotations list, but instead null out the row, column, and class values for any annotations that are being filtered out. The label_count attribute then needs to account for this instead of just taking the list length.

  2. Have a boolean flag that says: the features for this image (or, more coarse-grained, this training job) may be legacy, so do not filter classes early; instead, filter them at ImageFeatures load time like we did originally. This comes at the cost of potentially uneven batch sizing and less upfront error checking (both motivations for PR #71), plus more if-else cases here and there. It may also require coralnet to return to trickier error-checking logic before and after training.

  3. Do not support legacy features as training input anymore. We could still support legacy features for classification, but for training they're becoming a software-engineering pain. For coralnet, this means that whenever we're about to queue training, we make sure all confirmed images in the source have non-legacy features, and if any legacy ones are left, we redo those feature extractions. No action is needed for inactive sources.

    • Other reasons to prefer non-legacy features: row/column checking means there's less concern of giving nonsense inputs to training. Also, features extracted after PR #33 have about 8x smaller filesizes, which means faster training, lower RAM requirements for a given training batch/refset size, and less S3 storage cost (this is also not a VGG16 vs. EfficientNet thing).

    • Newly extracted VGG16 features still use the old feature-extractor weights, so if I'm understanding correctly, there will be practically no difference between legacy VGG16 features and newly extracted VGG16 features in terms of the "ML information" they convey. Compared to phasing out VGG16 entirely, this seems like a baby step in removing backwards compatibility.
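For illustration, option 1's index bookkeeping (tracking which original annotation positions have been filtered out across repeated filter_classes() calls) might look like the following. The TrackedLabels class and its attributes are hypothetical, not pyspacer's ImageLabels:

```python
# Hypothetical sketch of option 1. Each surviving annotation remembers
# its original (pre-filtering) index, so repeated filter_classes() calls
# record removals in terms of the original ordering.

class TrackedLabels:
    def __init__(self, annotations):
        self.annotations = list(annotations)       # (row, col, label) tuples
        self._orig_indices = list(range(len(annotations)))
        self.removed = []                          # original indices removed

    def filter_classes(self, class_set):
        kept_anns, kept_idx = [], []
        for ann, orig_i in zip(self.annotations, self._orig_indices):
            if ann[2] in class_set:
                kept_anns.append(ann)
                kept_idx.append(orig_i)
            else:
                self.removed.append(orig_i)
        self.annotations, self._orig_indices = kept_anns, kept_idx

# The tricky case from the first sub-option: two successive filters,
# each removing the "4th remaining" annotation. In the original ordering,
# those are the 4th and 5th annotations (0-based indices 3 and 4).
anns = [(0, 0, 'a'), (1, 1, 'a'), (2, 2, 'a'),
        (3, 3, 'x'), (4, 4, 'y'), (5, 5, 'a')]
labels = TrackedLabels(anns)
labels.filter_classes({'a', 'y'})   # removes 'x', the 4th remaining
labels.filter_classes({'a'})        # removes 'y', again the 4th remaining
```

With the removed-index list available, the legacy-feature matching code could skip the corresponding positions in point_features instead of crashing.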

I'm thinking option 3 is probably the way.

Option 3 and probably also option 2 would involve detecting legacy features before submitting training. The only sure-fire detection method right now is to load them from S3, which takes a while (feature loading is still the bottleneck for training time). coralnet may want to add a boolean field to the Features model saying whether it's legacy or not, and run a one-time data migration to populate the field. As an optimization, anything extracted before, say, Feb 2020 can be marked as legacy, and anything after, say, Feb 2021 can be marked as non-legacy, without even loading the features. (Update: I scanned all the features in that range; without any exceptions, 2020-12-31 00:00 UTC is a cutoff we can use between legacy and non-legacy.)
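A sketch of what that migration-time classification could look like, using the cutoff found above. The function and parameter names are hypothetical, and the date heuristic assumes the scan result holds for all stored features:

```python
# Hypothetical legacy-detection helper for the one-time data migration.
# Per the scan described above, 2020-12-31 00:00 UTC cleanly separates
# legacy (no row/col) from non-legacy features.
from datetime import datetime, timezone

LEGACY_CUTOFF = datetime(2020, 12, 31, 0, 0, tzinfo=timezone.utc)

def is_legacy(extracted_date, has_rowcols=None):
    """
    Decide whether a feature vector is legacy (lacks row/col info).
    If the features were actually loaded, has_rowcols overrides the
    date heuristic; otherwise, fall back to the extraction-date cutoff.
    """
    if has_rowcols is not None:
        return not has_rowcols
    return extracted_date < LEGACY_CUTOFF
```

This keeps the expensive S3 load as an optional confirmation path rather than a requirement for every image.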

StephenChan commented 8 months ago

Resolved (by option 3) in PR #83 and https://github.com/coralnet/coralnet/pull/525