My dataset contains x,y coordinates and a label indicating "Daphnia+", "Culex", "unidentified", or "?".
I need a classifier that also returns x,y coordinates and a label. In the simplest case, this classifier filters the labels and names all objects "Daphnia".
This prediction set can then be compared with the test set. Metrics can be:
N Daphnia
How many tags were found and correctly labelled within a margin of error. This could work with a loop like this:
import numpy as np

# ground truth: make sure all labels can be matched,
# i.e. map all relevant "Daphnia+" labels --> "Daphnia"
truth = groundtruth_points                      # annotated tags with .x, .y, .label
truth_xy = np.array([(p.x, p.y) for p in truth])

true_positive_detections = 0
false_positive_detections = 0
true_positive_classifications = 0
true_negative_classifications = 0
false_positive_classifications = 0
false_negative_classifications = 0

for point in prediction_points:                 # each point has x,y and a label
    # L1 offset of the predicted point to every annotated tag
    offset = np.abs(truth_xy - (point.x, point.y)).sum(axis=1)
    # index of the closest annotated tag
    candidate = np.argmin(offset)
    # test if the offset falls within the margin of detection, should be very close
    if offset[candidate] < 2:
        match = truth[candidate]
        true_positive_detections += 1
        if point.label == match.label:
            if point.label == "Daphnia":
                true_positive_classifications += 1
            else:
                true_negative_classifications += 1
        else:
            if point.label == "Daphnia":
                false_positive_classifications += 1
            else:
                false_negative_classifications += 1
    else:
        false_positive_detections += 1
This function iterates over each point in the prediction set and tries to find a corresponding annotated tag. Success is measured as detection accuracy. If a match is found, it is additionally measured whether the label was correct.
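For larger point sets, the nearest-neighbour search in the loop above could also be done in a single call with a k-d tree. A minimal sketch, assuming the same 2-pixel margin and L1 offset as in the loop; the coordinate arrays here are illustrative stand-ins for the annotated and predicted tags:

```python
import numpy as np
from scipy.spatial import cKDTree

# stand-in coordinates for annotated tags and predicted points
truth_xy = np.array([[10.0, 10.0], [50.0, 40.0], [80.0, 20.0]])
pred_xy = np.array([[10.5, 9.8], [79.2, 20.5], [200.0, 200.0]])

tree = cKDTree(truth_xy)
# p=1 gives the same |dx| + |dy| offset as the loop; points farther than
# the margin come back with an infinite distance
dist, idx = tree.query(pred_xy, k=1, p=1, distance_upper_bound=2)

matched = np.isfinite(dist)          # True where a tag was found within the margin
true_positive_detections = int(matched.sum())
false_positive_detections = int((~matched).sum())
print(true_positive_detections, false_positive_detections)
```

The label comparison would then run only over the matched pairs, exactly as in the loop.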
Alternative: ML approach
I could use a logistic regression classification scheme, where I feed several predictors into the regression, such as:
number of clusters
size central cluster
average size of non central clusters
xcenter
ycenter
color of central cluster
length major axis
length minor axis
angle of major axis
...
and then for training and testing I can probably use a standard ML approach.
The benefit of logistic regression is that I get a probability of detection.
In a second step I could manually label the ones with a low probability.
Also, for this approach I already have some scripts in peek.
If I'm not mistaken, I can just take the tag database (or combine the databases from the tagging) for predictors and results.
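The logistic-regression idea could look like the sketch below. The feature matrix X stands in for the predictors listed above (number of clusters, sizes, axis lengths, angle, ...); the data is randomly generated here, and the 0.2/0.8 probability band for manual review is an illustrative assumption, not a tuned value:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                  # 200 objects, 6 predictors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 1 = "Daphnia", 0 = other

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# probability of detection for each test object
proba = clf.predict_proba(X_test)[:, 1]
# flag uncertain objects for manual labelling in a second step
to_review = np.where((proba > 0.2) & (proba < 0.8))[0]
print(f"test accuracy: {clf.score(X_test, y_test):.2f}, to review: {len(to_review)}")
```

In practice, X and y would come straight from the tag database mentioned above.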
Optimization Problem
What can the target function look like?
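One option is to build the target from the counts the matching loop already collects, e.g. a weighted mix of detection F1 and classification F1. Both the F1 choice and the weight w are my own assumptions, sketched here with made-up counts:

```python
def f1(tp, fp, fn):
    """F1 score from raw counts; 0.0 when there are no true positives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def target(det_tp, det_fp, det_fn, cls_tp, cls_fp, cls_fn, w=0.5):
    """Weighted mix of detection F1 and classification F1 (w is a free choice)."""
    return w * f1(det_tp, det_fp, det_fn) + (1 - w) * f1(cls_tp, cls_fp, cls_fn)

# made-up counts for illustration
print(target(det_tp=90, det_fp=10, det_fn=10, cls_tp=80, cls_fp=10, cls_fn=10))
```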
Regression
Resources:
https://scikit-learn.org/stable/auto_examples/calibration/plot_compare_calibration.html#sphx-glr-auto-examples-calibration-plot-compare-calibration-py
https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py
https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html#sphx-glr-auto-examples-model-selection-plot-underfitting-overfitting-py
Classifier options
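Following the linked scikit-learn comparison example, a few candidate classifiers could be tried side by side on the same predictors. The synthetic data below is a placeholder for the real tag database:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

for name, clf in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(random_state=0)),
    ("SVC", SVC(probability=True)),  # probability=True enables predict_proba
]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```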
Steps: