KilianB / JImageHash

Perceptual image hashing library used to match similar images
MIT License

Add random forest image matcher to utilize different image features #17

Open KilianB opened 5 years ago

KilianB commented 5 years ago

If we have labeled test data, we can do better than directly comparing distances to guess whether two images are duplicates.

Since the different hashing algorithms focus on different criteria such as color, gradient, and frequency, we might get better results by combining them with a simple technique like a random forest.

https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
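
To make the idea concrete, here is a minimal sketch of treating the per-algorithm distances as a feature vector and letting several simple rules vote, instead of thresholding a single distance. It is not JImageHash code: the class `DistanceVoteSketch`, the `Stump` helper, the hash values, and the thresholds are all hypothetical, and the depth-one "trees" are a simplified stand-in for a trained random forest.

```java
import java.util.Arrays;
import java.util.List;

class DistanceVoteSketch {

    /** Normalized Hamming distance between two 64-bit hashes, in [0, 1]. */
    static double normalizedHamming(long a, long b) {
        return Long.bitCount(a ^ b) / 64.0;
    }

    /** One feature per hashing algorithm: the distance between both images' hashes. */
    static double[] features(long[] hashesA, long[] hashesB) {
        double[] f = new double[hashesA.length];
        for (int i = 0; i < f.length; i++) {
            f[i] = normalizedHamming(hashesA[i], hashesB[i]);
        }
        return f;
    }

    /** Depth-one tree: votes "match" if a single feature is below its threshold. */
    static class Stump {
        final int featureIndex;
        final double threshold;

        Stump(int featureIndex, double threshold) {
            this.featureIndex = featureIndex;
            this.threshold = threshold;
        }

        boolean votesMatch(double[] f) {
            return f[featureIndex] < threshold;
        }
    }

    /** Strict majority vote over all stumps. */
    static boolean isMatch(List<Stump> stumps, double[] f) {
        int votes = 0;
        for (Stump s : stumps) {
            if (s.votesMatch(f)) {
                votes++;
            }
        }
        return votes * 2 > stumps.size();
    }

    public static void main(String[] args) {
        // Hypothetical hashes from three algorithms (say color, gradient, frequency).
        long[] imageA = {0xFFFF0000FFFF0000L, 0x0F0F0F0F0F0F0F0FL, 0x123456789ABCDEF0L};
        long[] imageB = {0xFFFF0000FFFF0001L, 0x0F0F0F0F0F0F0F0FL, 0xEDCBA9876543210FL};

        double[] f = features(imageA, imageB);

        // Hand-picked thresholds; a real forest would learn these from labeled pairs.
        List<Stump> stumps = Arrays.asList(
                new Stump(0, 0.2), new Stump(1, 0.2), new Stump(2, 0.2));

        System.out.println("features = " + Arrays.toString(f));
        System.out.println("match    = " + isMatch(stumps, f));
    }
}
```

In this made-up pair the third hash disagrees completely while the other two agree, so a single-distance threshold on the third algorithm would reject the pair but the vote accepts it. A trained forest would pick the features and split points itself from the labeled data.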

KilianB commented 5 years ago

A quick implementation will be added shortly. Which metric do we want to optimize? True positives? Gini impurity does not work in its bare form because of the way test cases are generated from labeled images: we end up with highly unbalanced classes.

F1 looks promising at the moment.
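
A rough illustration of the imbalance, assuming test cases are built by pairing up labeled images (my reading of the comment above, not confirmed by the code). The data set size and the class `ImbalanceSketch` are hypothetical. The sketch counts matching versus distinct pairs, then compares Gini impurity, accuracy, and F1 for a classifier that always answers "distinct".

```java
class ImbalanceSketch {

    /** Number of unordered pairs that can be formed from n items. */
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    /** Gini impurity of a binary node: 1 - p^2 - (1 - p)^2. */
    static double gini(long positives, long negatives) {
        double p = positives / (double) (positives + negatives);
        return 1.0 - p * p - (1.0 - p) * (1.0 - p);
    }

    /** F1 score; taken to be 0 when there are no true positives. */
    static double f1(long tp, long fp, long fn) {
        if (tp == 0) {
            return 0.0;
        }
        double precision = tp / (double) (tp + fp);
        double recall = tp / (double) (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Hypothetical labeled set: 20 groups of 5 similar images each.
        long groups = 20;
        long perGroup = 5;

        long allPairs = pairCount(groups * perGroup);       // 4950
        long matchingPairs = groups * pairCount(perGroup);  // 200
        long distinctPairs = allPairs - matchingPairs;      // 4750

        // Impurity before any split; the maximum for two classes is 0.5.
        System.out.println("root gini = " + gini(matchingPairs, distinctPairs));

        // A classifier that always predicts "distinct":
        double accuracy = distinctPairs / (double) allPairs;
        double f1Score = f1(0, 0, matchingPairs);
        System.out.println("accuracy  = " + accuracy);
        System.out.println("f1        = " + f1Score);
    }
}
```

With only about 4% of pairs matching, the root node already looks fairly pure and the trivial classifier scores high on accuracy, while its F1 is 0 because it never finds a match. That is one way to see why a metric built on precision and recall is attractive here.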

Are there any slim random forest implementations available (preferably supporting the C4.5 algorithm)? Everything I have found so far will lead to an explosion of the dependency tree. ...

KilianB commented 5 years ago

8097890cc7ea448baf2031225f6e31996f3c78bd & 98ce751d85d01c35a11b9280ca90832280d25ab6 & 401fdd07dc3d796271a41911358bc25bf006e950