My dataset contains x,y coordinates and a label indicating "Daphnia+", "Culex", "unidentified", or "?".
I need a classifier that also returns x,y coordinates and a label. In the simplest case, this classifier filters the labels and names all objects "Daphnia".
This prediction set can then be compared with the test set. Metrics can be:
N Daphnia
How many tags were found and correctly labelled within a margin of error. This could work with a loop like this:
import numpy as np

# ground truth: make sure all labels can be matched,
# i.e. map all relevant "Daphnia+" labels --> "Daphnia"
truth = groundtruth_points                      # annotated tags with .x, .y, .label
truth_xy = np.array([(p.x, p.y) for p in truth])

true_positive_detections = 0
false_positive_detections = 0
true_positive_classifications = 0
true_negative_classifications = 0
false_positive_classifications = 0
false_negative_classifications = 0

for point in prediction_points:                 # each point has x,y and a label
    # L1 offset of the predicted point to every annotated tag
    offset = np.abs(truth_xy - (point.x, point.y)).sum(axis=1)
    # index of the closest annotated tag
    candidate = np.argmin(offset)
    # test if the offset falls within the margin of detection, should be very close
    if offset[candidate] < 2:
        match = truth[candidate]
        true_positive_detections += 1
        if point.label == match.label:
            if point.label == "Daphnia":
                true_positive_classifications += 1
            else:
                true_negative_classifications += 1
        else:
            if point.label == "Daphnia":
                false_positive_classifications += 1
            else:
                false_negative_classifications += 1
    else:
        false_positive_detections += 1
This function iterates over each point in the prediction set and tries to find a corresponding annotated tag. Success is measured as detection accuracy. If a match is found, it is additionally measured whether the label was correct.
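For larger point sets, the nearest-neighbour search in the loop above could also be done in a single call with a k-d tree. A minimal sketch, assuming the same 2-pixel margin and L1 offset as in the loop; the coordinate arrays here are illustrative stand-ins for the annotated and predicted tags:

```python
import numpy as np
from scipy.spatial import cKDTree

# stand-in coordinates for annotated tags and predicted points
truth_xy = np.array([[10.0, 10.0], [50.0, 40.0], [80.0, 20.0]])
pred_xy = np.array([[10.5, 9.8], [79.2, 20.5], [200.0, 200.0]])

tree = cKDTree(truth_xy)
# p=1 gives the same |dx| + |dy| offset as the loop; points farther than
# the margin come back with an infinite distance
dist, idx = tree.query(pred_xy, k=1, p=1, distance_upper_bound=2)

matched = np.isfinite(dist)          # True where a tag was found within the margin
true_positive_detections = int(matched.sum())
false_positive_detections = int((~matched).sum())
print(true_positive_detections, false_positive_detections)
```

The label comparison would then run only over the matched pairs, exactly as in the loop.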
Alternative: ML approach
I could use a logistic regression classification scheme, where I feed several predictors into the regression, such as:
number of clusters
size central cluster
average size of non central clusters
xcenter
ycenter
color of central cluster
length major axis
length minor axis
angle of major axis
...
and then for training and testing I can probably use a standard ML approach.
The benefit of logistic regression is that I get a probability of detection.
In a second step I could manually label the ones with a low probability.
Also, for this approach I already have some scripts in peek.
If I'm not mistaken, I can just take the tag database (or combine the databases from the tagging) for predictors and results.
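The logistic-regression idea could look like the sketch below. The feature matrix X stands in for the predictors listed above (number of clusters, sizes, axis lengths, angle, ...); the data is randomly generated here, and the 0.2/0.8 probability band for manual review is an illustrative assumption, not a tuned value:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                  # 200 objects, 6 predictors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 1 = "Daphnia", 0 = other

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# probability of detection for each test object
proba = clf.predict_proba(X_test)[:, 1]
# flag uncertain objects for manual labelling in a second step
to_review = np.where((proba > 0.2) & (proba < 0.8))[0]
print(f"test accuracy: {clf.score(X_test, y_test):.2f}, to review: {len(to_review)}")
```

In practice, X and y would come straight from the tag database mentioned above.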
Optimization Problem
What can the target function look like?
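One option is to build the target from the counts the matching loop already collects, e.g. a weighted mix of detection F1 and classification F1. Both the F1 choice and the weight w are my own assumptions, sketched here with made-up counts:

```python
def f1(tp, fp, fn):
    """F1 score from raw counts; 0.0 when there are no true positives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def target(det_tp, det_fp, det_fn, cls_tp, cls_fp, cls_fn, w=0.5):
    """Weighted mix of detection F1 and classification F1 (w is a free choice)."""
    return w * f1(det_tp, det_fp, det_fn) + (1 - w) * f1(cls_tp, cls_fp, cls_fn)

# made-up counts for illustration
print(target(det_tp=90, det_fp=10, det_fn=10, cls_tp=80, cls_fp=10, cls_fn=10))
```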
Regression
Resources:
https://scikit-learn.org/stable/auto_examples/calibration/plot_compare_calibration.html#sphx-glr-auto-examples-calibration-plot-compare-calibration-py
https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py
https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html#sphx-glr-auto-examples-model-selection-plot-underfitting-overfitting-py
Classifier options
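Following the linked scikit-learn comparison example, a few candidate classifiers could be tried side by side on the same predictors. The synthetic data below is a placeholder for the real tag database:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

for name, clf in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(random_state=0)),
    ("SVC", SVC(probability=True)),  # probability=True enables predict_proba
]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```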
Steps: