JohannesBuchner / imagehash

A Python Perceptual Image Hashing Module
BSD 2-Clause "Simplified" License
3.28k stars 331 forks source link

Is tests in this repository correct throw logical point of view? #140

Closed bigcat88 closed 3 years ago

bigcat88 commented 3 years ago

Here is link to my closed pull request: https://github.com/JohannesBuchner/imagehash/pull/139

I closed it cause it didnt past tests, but I new that that conversion is correct. I look at tests and see: binary_to_hexadecimal_values = [ ['1', '1'], ['11', '3'], ['111', '7'], ['1111', 'f'],

From my point of view 1 bit must be represented as '01' in hex, and not '1'. And test must be started from 8 bits, then from 16 and so on, with step 8 bit. If you need I can explain why... From Python official docs: Every byte of data is converted into the corresponding 2-digit hex representation. No one will have hashes with total length < 8 bits(1 byte). No one who know how CPU works will store and operate with hashes not rounded by 8 bit(1 byte), it will not grant any memory or speed benefit, even worse in most situation it will slow the operations.

I can change that code from pull request and add round up of input array to size of one byte(after thet it will pass all tests) - but do you need that? Even with rounding it will be much faster that proposed code here, but does anyone use hashes <8 bit or not rounded by 8?

bigcat88 commented 3 years ago

And another interesting thing, when you pass hash in bits that is not rounded to 4, it values will be not recoverable to original. Impossible to know, was hash with value 0001 or with value 001 or with 01, after in converted to hex('1' can mean 1/01/0001/0001). Anyway i pull a version that works to pass this tests. for i in range(10000): for case in hexadecimal_to_bool_array: Execution time(new):0.02864799 Execution time(orig):0.55697636

Edited: never mind, in python 2 there is no hex func, so my code is fast, but useless(it is Python 3.5+ only). Next time I will be more careful, and see first which version project must support.

JohannesBuchner commented 3 years ago

The string representation is a matter of taste.

Different backends will need different conversions (binary blob, string, something that one can do a hamming distance query on) that the user needs to implement.

It is true that for recovering the hash from the string, one also needs to know the size of the hash (ideally the width and height).

My understanding is that 0x7 is a commonly acceptable hex value, so I think the 2-digit rule is not universal. In base 10, the representations "1" and "01" also refer to the same number.

I didn't have a detailed look at your PR, but pull requests are always welcome.