Open KilianB opened 5 years ago
A quick implementation will be added shortly. Which metric do we want to optimize? True positives? Gini impurity does not work in its bare form because of the way test cases are generated from labeled images: we end up with highly unbalanced classes.
F1 looks promising at the moment.
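To illustrate why F1 may be a better target than a bare impurity or accuracy measure on unbalanced classes, here is a minimal sketch. The class name `SplitMetrics` and the counts (10 duplicate pairs among 1000 labeled pairs) are made up for illustration and are not taken from the project.

```java
public final class SplitMetrics {

    /** F1 score from confusion-matrix counts; defined as 0 when there are no true positives. */
    public static double f1(int tp, int fp, int fn) {
        if (tp == 0) {
            return 0.0;
        }
        double precision = tp / (double) (tp + fp);
        double recall = tp / (double) (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }

    /** Binary Gini impurity 1 - p^2 - (1 - p)^2 = 2p(1 - p) for a node with the given class counts. */
    public static double gini(int positives, int negatives) {
        int total = positives + negatives;
        if (total == 0) {
            return 0.0;
        }
        double p = positives / (double) total;
        return 2 * p * (1 - p);
    }

    public static void main(String[] args) {
        // Hypothetical, heavily unbalanced label counts (assumption for illustration).
        int positives = 10;
        int negatives = 990;

        // The unsplit node already looks almost "pure" to Gini.
        System.out.println("Gini of unsplit node: " + gini(positives, negatives));

        // A classifier that always answers "not a duplicate":
        // high accuracy, but it finds no duplicates, so F1 is 0.
        double accuracy = negatives / (double) (positives + negatives);
        System.out.println("accuracy = " + accuracy + ", F1 = " + f1(0, 0, positives));
    }
}
```

With these counts the unsplit node has a Gini impurity of roughly 0.02, so there is little impurity left for a split to reduce, and the trivial "never a duplicate" classifier reaches 99% accuracy while scoring an F1 of 0.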
Are there any slim random forest implementations available (preferably supporting the C4.5 algorithm)? Everything I have found so far will lead to an explosion of the dependency tree. ...
8097890cc7ea448baf2031225f6e31996f3c78bd & 98ce751d85d01c35a11b9280ca90832280d25ab6 & 401fdd07dc3d796271a41911358bc25bf006e950
If we have labeled test data, we can do better than directly comparing distances to guess whether the images are duplicates.
With different hashing algorithms focusing on different criteria such as color, gradient, and frequency, we might get better results using a simple technique like a random forest.
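As a rough sketch of the idea, each labeled image pair can be turned into a feature vector holding one distance per hashing algorithm, and a tree node then picks the feature and threshold that best separate duplicates from non-duplicates. The example below is a single decision stump scored by F1, not a random forest, and it does not use the library's API: `DistanceStump`, the feature order (color, gradient, frequency) and all distance values are hypothetical.

```java
public final class DistanceStump {

    /** Best single-feature threshold found by the search. */
    public static final class Split {
        public final int feature;
        public final double threshold;
        public final double f1;

        Split(int feature, double threshold, double f1) {
            this.feature = feature;
            this.threshold = threshold;
            this.f1 = f1;
        }
    }

    /**
     * Exhaustively tries every observed value of every feature as a threshold.
     * A pair is predicted "duplicate" when its distance on the feature is <= threshold.
     * Returns the split with the highest F1 on the labeled pairs.
     */
    public static Split bestSplit(double[][] distances, boolean[] isDuplicate) {
        Split best = new Split(-1, Double.NaN, -1.0);
        int featureCount = distances[0].length;
        for (int f = 0; f < featureCount; f++) {
            for (double[] candidate : distances) {
                double threshold = candidate[f];
                int tp = 0, fp = 0, fn = 0;
                for (int i = 0; i < distances.length; i++) {
                    boolean predicted = distances[i][f] <= threshold;
                    if (predicted && isDuplicate[i]) tp++;
                    else if (predicted) fp++;
                    else if (isDuplicate[i]) fn++;
                }
                // F1 = 2TP / (2TP + FP + FN); 0 when nothing was correctly found.
                double f1 = tp == 0 ? 0.0 : 2.0 * tp / (2.0 * tp + fp + fn);
                if (f1 > best.f1) {
                    best = new Split(f, threshold, f1);
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical normalized distances per pair: {color, gradient, frequency}.
        double[][] distances = {
            {0.10, 0.05, 0.30},
            {0.40, 0.08, 0.10},
            {0.15, 0.35, 0.20},
            {0.50, 0.45, 0.35},
            {0.05, 0.50, 0.45},
        };
        boolean[] isDuplicate = {true, true, false, false, false};

        Split split = bestSplit(distances, isDuplicate);
        System.out.println("feature=" + split.feature
                + " threshold=" + split.threshold + " f1=" + split.f1);
    }
}
```

In this toy data no single hash separates the classes except the second one, which is the kind of situation where combining several algorithms could pay off. A forest would train many such trees on bootstrap samples with random feature subsets and let them vote.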
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm