KilianB / JImageHash

Perceptual image hashing library used to match similar images
MIT License

Using CategoricalMatcher on massive amounts of Hashes #40

Open tr7zw opened 4 years ago

tr7zw commented 4 years ago

Hey, first of all, awesome lib. I'm currently tinkering on a database to collect Minecraft skins (64x64 images). Before adding them I clean them up (upgrade older 64x32 skins to 64x64 and remove data from unseen/unused areas), then save them keyed by a SHA-256 hash to deduplicate them. Now I'm trying to use the CategoricalMatcher to group visually similar skins together.

To speed things up I precompute the JImageHash (PerceptiveHash(128) seems to work really well for this use case) and save it as JSON (using Gson for the conversion) in the database; see the first sketch below. I also created a tiny fork of JImageHash that adds a "categorizeImageAndAdd(Hash[] hash, String id)" method, which just skips the BufferedImage -> Hash[] conversion. It's also noteworthy that all of this data is unlabeled, but normalized.

Now to my problem: the first ~20k hashes can be added relatively fast, but the further along I get, the slower everything becomes. The current test database holds 111,000 skins = 111,000 hashes, and together with recomputeCategories() the whole process takes about 3 hours. The goal is to have millions of skins in that database, so this approach won't scale.

Is there a better way of doing this? Testing a newly added hash against all other hashes comes to mind, which doesn't sound too practical, or using chunks of the database and matching against them. Maybe the database could be split into 10-100k clusters that are computed once, where only each category's fuzzy hash gets stored; new hashes could then be compared against these fuzzy hashes (second sketch below).
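For reference, this is roughly what my precompute/serialize step looks like. It's a simplified sketch: StoredHash is just an illustrative DTO I made up so the Gson output stays stable, and I'm assuming the public Hash(BigInteger, int, int) constructor plus the getHashValue()/getBitResolution()/getAlgorithmId() accessors from the current API:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.math.BigInteger;

import javax.imageio.ImageIO;

import com.github.kilianB.hash.Hash;
import com.github.kilianB.hashAlgorithms.PerceptiveHash;
import com.google.gson.Gson;

public class HashSerialization {

    /** Plain DTO so the JSON shape doesn't depend on Hash internals. */
    static class StoredHash {
        String hexValue;   // hash bits as a hex string
        int bitResolution; // key length of the hash
        int algorithmId;   // guards against mixing incompatible algorithms
    }

    private static final Gson GSON = new Gson();

    static String toJson(Hash hash) {
        StoredHash dto = new StoredHash();
        dto.hexValue = hash.getHashValue().toString(16);
        dto.bitResolution = hash.getBitResolution();
        dto.algorithmId = hash.getAlgorithmId();
        return GSON.toJson(dto);
    }

    static Hash fromJson(String json) {
        StoredHash dto = GSON.fromJson(json, StoredHash.class);
        return new Hash(new BigInteger(dto.hexValue, 16), dto.bitResolution, dto.algorithmId);
    }

    public static void main(String[] args) throws Exception {
        PerceptiveHash hasher = new PerceptiveHash(128);
        BufferedImage skin = ImageIO.read(new File("skin.png")); // path is just an example
        String json = toJson(hasher.hash(skin));
        Hash restored = fromJson(json); // goes into the database alongside the SHA-256 key
    }
}
```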
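And here's a rough sketch of the two-level idea: keep one fuzzy hash per cluster and only compare incoming hashes against those representatives, so adding becomes O(#clusters) instead of O(#hashes). ClusterIndex and its methods are made-up names, the threshold would need tuning, and I'm assuming FuzzyHash exposes mergeFast(...) and inherits normalizedHammingDistance(...) from Hash the way the CategoricalMatcher seems to use them internally; this ignores the matcher's weighted-distance logic entirely:

```java
import java.util.ArrayList;
import java.util.List;

import com.github.kilianB.hash.FuzzyHash;
import com.github.kilianB.hash.Hash;

public class ClusterIndex {

    static class Cluster {
        final FuzzyHash representative = new FuzzyHash();
        final List<String> memberIds = new ArrayList<>();
    }

    private final List<Cluster> clusters = new ArrayList<>();
    private final double threshold; // max normalized hamming distance to join a cluster

    ClusterIndex(double threshold) {
        this.threshold = threshold;
    }

    /** Compare against cluster representatives only, never against individual hashes. */
    void add(Hash hash, String id) {
        Cluster best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Cluster c : clusters) {
            double d = c.representative.normalizedHammingDistance(hash);
            if (d < bestDistance) {
                bestDistance = d;
                best = c;
            }
        }
        // No cluster close enough: open a new one for this hash.
        if (best == null || bestDistance > threshold) {
            best = new Cluster();
            clusters.add(best);
        }
        best.representative.mergeFast(hash); // fold the new hash into the cluster's fuzzy hash
        best.memberIds.add(id);
    }
}
```

Would something along these lines be a sensible direction, or is there a smarter way already built into the library?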