coralnet / pyspacer

Python based tools for spatial image analysis
MIT License

Accept int or str as label ID, not just int #95

Closed StephenChan closed 6 months ago

StephenChan commented 6 months ago

(Take 3, after deciding this is the next thing to merge after all)

From Mermaid Trello card:

Mermaid has so far used UUIDs, not integer IDs, for BAGFs.

CoralNet has always used integer IDs for everything in its database, so I believe that design assumption just carried over to the way that pyspacer identifies labels (classes). I couldn’t think of any reason why this would need to be an integer. The values do get passed into scikit-learn’s MLPClassifier and SGDClassifier, but the relevant methods/fields predict(), predict_proba(), and classes_ seem to take an array of any type.

This would involve relaxing the parameter type annotations in various functions/methods such as ImageLabels.__init__(), ImageLabels.unique_classes(), load_image_data(), load_batch_data(), evaluate_classifier(), and make_random_data(). I think make_random_data() does actually use the parameter as an int (or at least something that’s addable to a float), but that doesn’t seem to be a necessary implementation detail. Of note, ValResults actually uses indices into a list of labels, rather than using the label representations themselves, so some stuff related to ValResults won't need changing.

I’d probably define a custom type LabelId as an alias of Hashable, so that if we ever decide Hashable isn’t quite what we want for label IDs, it’ll be easier to change.
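A minimal sketch of what such an alias could look like. The function name unique_label_ids() is made up for illustration and is not pyspacer's actual code; only the LabelId name comes from the plan above.

```python
from collections.abc import Hashable
from typing import Iterable, Set

# Alias proposed above: one place to change if Hashable turns out to be
# the wrong constraint for label IDs.
LabelId = Hashable


def unique_label_ids(label_ids: Iterable[LabelId]) -> Set[LabelId]:
    """Hypothetical helper: collect the distinct label IDs of any hashable type."""
    return set(label_ids)


# Integer IDs (CoralNet style) and string IDs both satisfy the alias.
print(unique_label_ids([3, 7, 3]))
print(unique_label_ids(["coral", "algae", "coral"]))
```

Annotating with the alias rather than with int directly means a later tightening (say, to int or str only) would be a one-line change.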

This plan ran into a hitch though: a uuid.UUID, which is Hashable, produced this error when used as a label during training:

Traceback (most recent call last):
  File "D:\CVCE\Site\pyspacer\spacer\tests\test_tasks.py", line 190, in test_default
    clf, _ = train(
  File "D:\CVCE\Site\pyspacer\spacer\train_utils.py", line 73, in train
    clf.partial_fit(x, y, classes=classes_list)
  File "D:\Code_environments\venv_coralnet_py310\lib\site-packages\sklearn\linear_model\_stochastic_gradient.py", line 848, in partial_fit
    return self._partial_fit(
  File "D:\Code_environments\venv_coralnet_py310\lib\site-packages\sklearn\linear_model\_stochastic_gradient.py", line 593, in _partial_fit
    _check_partial_fit_first_call(self, classes)
  File "D:\Code_environments\venv_coralnet_py310\lib\site-packages\sklearn\utils\multiclass.py", line 368, in _check_partial_fit_first_call
    clf.classes_ = unique_labels(classes)
  File "D:\Code_environments\venv_coralnet_py310\lib\site-packages\sklearn\utils\multiclass.py", line 103, in unique_labels
    raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: ([UUID('115ec141-332b-4ac6-b94c-2ce178402c5f'), UUID('2acacc19-ce7d-4b9d-9b1e-cbf761c9e8c7')],)

However, the string representation of a UUID worked just fine. In other words, passing the result of uuid.uuid4() directly won't work, but passing str(uuid.uuid4()) will. Hopefully requiring the string form is acceptable (or was the intent to begin with)?
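On the caller's side, the conversion could be done once before labels are handed to pyspacer. This is only a sketch: to_label_id() is a hypothetical helper, not part of pyspacer or Mermaid.

```python
import uuid
from typing import Union


def to_label_id(value: object) -> Union[int, str]:
    """Hypothetical helper: coerce a caller-side ID into an int or str label ID."""
    if isinstance(value, (int, str)):
        return value
    if isinstance(value, uuid.UUID):
        # str() yields the canonical hyphenated form, e.g.
        # '115ec141-332b-4ac6-b94c-2ce178402c5f'.
        return str(value)
    raise TypeError(f"Unsupported label ID type: {type(value).__name__}")


raw_ids = [uuid.uuid4(), uuid.uuid4()]
label_ids = [to_label_id(v) for v in raw_ids]  # now plain strings
```

Since str(uuid.UUID(...)) round-trips through uuid.UUID(), the original UUID can be recovered from a predicted label if needed.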

As for why passing a raw UUID got an error, I tried to look through the sklearn documentation for any constraints on label types. Two relevant pages/sections that I can point out:

  1. https://scikit-learn.org/stable/modules/generated/sklearn.utils.multiclass.unique_labels.html

    We don’t allow:

    • mix of multilabel and multiclass (single label) targets
    • mix of label indicator matrix and anything else (because there are no explicit labels)
    • mix of label indicator matrices of different sizes
    • mix of string and integer labels

    Note that unique_labels() is the last function call in the above traceback. This passage from the function's documentation suggests that strings and integers are allowed (as long as they aren't mixed), but that types quite different from str/int may not be.

  2. https://scikit-learn.org/stable/glossary.html#term-target

    target, targets

    The dependent variable in supervised (and semisupervised) learning, passed as y to an estimator’s fit method. Also known as dependent variable, outcome variable, response variable, ground truth or label. Scikit-learn works with targets that have minimal structure: a class from a finite set, a finite real-valued number, multiple classes, or multiple numbers. See Target Types.

    The key phrase here is "targets that have minimal structure", which seems to justify str/int over some arbitrary class like UUID.

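Given the restrictions quoted above (str or int labels, but not a mix of the two), a pre-check along these lines could surface a clearer error before the labels reach scikit-learn. This is a sketch under that assumption; check_label_id_types() is a hypothetical name, not an existing pyspacer function.

```python
from typing import Iterable, Union


def check_label_id_types(label_ids: Iterable[Union[int, str]]) -> None:
    """Hypothetical pre-check: reject non-int/str IDs and int/str mixtures."""
    ids = list(label_ids)
    for label_id in ids:
        if not isinstance(label_id, (int, str)):
            raise TypeError(
                f"Label ID {label_id!r} has unsupported type "
                f"{type(label_id).__name__}; expected int or str"
            )
    has_int = any(isinstance(i, int) for i in ids)
    has_str = any(isinstance(i, str) for i in ids)
    if has_int and has_str:
        # Mirrors the "mix of string and integer labels" restriction.
        raise ValueError("Label IDs mix strings and integers")


check_label_id_types([1, 2, 3])          # all ints: accepted
check_label_id_types(["a1b2", "c3d4"])   # all strs: accepted
```

Failing early this way would report the offending ID directly, rather than the whole label list as in the "Unknown label type" message.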
StephenChan commented 6 months ago

pyspacer: accept any hashable data type (unique) as label id, not just integer