KilianB / JImageHash

Perceptual image hashing library used to match similar images
MIT License
401 stars 81 forks source link

Create hashes using string builder instead of big integers #22

Closed KilianB closed 5 years ago

KilianB commented 5 years ago

During hash creation a bigInteger object will be created for each bit present. Currently no fancy operation take place during it's creation process that justify this overhead. Afterwards the xor operation on arbitrary hashes is required for the hamming distance calculation.

Benchmark if it's more performant to use a stringbuilder.

BigInteger hash = BigInteger.ZERO;
for (int x = 0; x < width; x++) {
    for (int y = 0; y < height; y++) {
        if (pixelValue[x][y] < compareAgainst) {
            hash = hash.shiftLeft(1);
        } else {
            hash = hash.shiftLeft(1).add(BigInteger.ONE);
        }
    }
}
return hash;

vs new

StringBuilder sb = new StringBuilder(hashResolution);
for (int x = 0; x < width; x++) {
    for (int y = 0; y < height; y++) {
        if (pixelValue[x][y] < compareAgainst) {
            sb.append(0);
        } else {
            sb.append(1);
        }
    }
}
return new BigInteger(sb.toString(),2);

In cooperation with #18

KilianB commented 5 years ago

for smaller than 64 bit hashes we could also fall back to long variables and do primitive bit shifting.

KilianB commented 5 years ago

jmh benchmarks. Big Integer is the current implementation. Performance decreases with increased hash length

BenchmarkModeCntScoreErrorUnits
kilianB.HashCreationBenchmark.benchmarkAverageHash32BigIntegerthrpt25152626,545± 2741,163ops/s
kilianB.HashCreationBenchmark.benchmarkAverageHash32BigIntegerStringBuilderthrpt25158690,519± 2111,147ops/s
kilianB.HashCreationBenchmark.benchmarkAverageHash32MutableBigIntegerthrpt25157546,217± 2026,677ops/s
kilianB.HashCreationBenchmark.benchmarkAverageHash128BigIntegerthrpt25104932,391± 1387,338ops/s
kilianB.HashCreationBenchmark.benchmarkAverageHash128BigIntegerStringBuilderthrpt25117392,216± 1035,051ops/s
kilianB.HashCreationBenchmark.benchmarkAverageHash128MutableBigIntegerthrpt25116409,836± 2436,276ops/s
kilianB.HashCreationBenchmark.benchmarkAverageHash5000BigIntegerthrpt251387,469± 6,534ops/s
kilianB.HashCreationBenchmark.benchmarkAverageHash5000BigIntegerStringBuilderthrpt257690,280± 96,870 ops/s
kilianB.HashCreationBenchmark.benchmarkAverageHash5000MutableBigIntegerthrpt257715,949± 49,220ops/s

Mutable big integer requires method handles and reflection access which is not warranted for the performance difference compare to the stringbuilder approach (as well as unit tests to guarantee that the implementation is working as expected).

A new StringBuilder is created for every hash creation. Caching a stringbuilder would result again in a performance gain at the cost of thread safety. In relation file IO is much more heavy and the treadoff isn't worth it.

Migration to either stringbuilder or mutable big int allows to cut the first pass needed to estimate the hash length on an algorithm basis

KilianB commented 5 years ago

fixed with afc026dd79ae1e7ba721f566896a5a1f56385a39

KilianB commented 5 years ago

Using a custom hash builder which manipulates a byte array directly we can gain even more performance d9ef1f86984d008b1f23fbef1efe03c9b8cae4df Now even the 5000 bit hash runs with 13.000 hashes created / second up from 1.3 k

KilianB commented 5 years ago

fixed with v 3.0.0