eddieantonio / perfection

Simple perfect hashing in Python
https://pypi.python.org/pypi/perfection
MIT License
13 stars 6 forks source link

make_hash stalls for long numbers #7

Open za3k opened 1 year ago

za3k commented 1 year ago

perfection.make_hash([3934533475148980154, -3342959649445288775, 654684863317306829, 29321895429837616, 2356289430496076560]) stalls indefinitely on my computer.

Note, these numbers are the output of Python's built-in hash, which I think is a reasonable use case.

eddieantonio commented 1 year ago

tl;dr: your example internally creates a very large array!

The reason this stalls is due to the naïve and literal interpretation of this algorithm. I implemented the algorithm described by Thomas Gettys here, and the first step is to generate a $t \times t$ square, where $t \times t \ge \max(A')$, $A$ are the integers you want to hash, and $A'$ is every element in $A$ subtracted by $min(A)$. In effect, this means that $max(A')$ is the range of $A$, that is $\max(A) - \min(A)$.

In this particular example, $\min(A)$ = -3342959649445288775 which means $\max(A')$ = 7277493124594268929 therefore the smallest value of $t$ is 2697682918, because this is the smallest $t$ such that $t \times t$ is greater than 7277493124594268929.

Because it's making a square of $t$, this means the algorithm then tries to make room for at least 7277493124594268929 integers. Your computer cannot do this. For reference, this number is between $2^{62}$ and $2^{63}$. There are a few other housekeeping data structures that are made, so 7277493124594268929 times the size of a Python integer is the minimum amount of bytes needed for this algorithm to terminate.

Long story short, this perfect hashing algorithm is not suitable for numbers with a very large range, such as in this example.

Maybe I will look into other perfect hashing algorithms that are more appropriate, optimize the current algorithm so it doesn't allocate, or just document that it's not appropriate for large ranges.

za3k commented 1 year ago

Throwing an exception if you try to do this might be a "quick workaround"

On 2022-11-05 9:30 am, Eddie Antonio Santos wrote:

tl;dr: your example internally creates a very large array!

The reason this stalls is due to the naïve and literal interpretation of this algorithm. I implemented the algorithm described by Thomas Gettys here [1], and the first step is to generate a $t \times t$ square, where $t \times t \ge \max(A')$, $A$ are the integers you want to hash, and $A'$ is every element in $A$ subtracted by $min(A)$. In effect, this means that $max(A')$ is the range of $A$, that is $\max(A) - \min(A)$.

In this particular example, $\min(A)$ = -3342959649445288775 which means $\max(A')$ = 7277493124594268929 therefore the smallest value of $t$ is 2697682918, because this is the smallest $t$ such that $t \times t$ is greater than 7277493124594268929.

Because it's making a square of $t$, this means the algorithm then tries to make room for at least 7277493124594268929 integers. Your computer cannot do this. For reference, this number is between $2^{62}$ and $2^{63}$. There are a few other housekeeping data structures that are made, so 7277493124594268929 times the size of a Python integer is the minimum amount of bytes needed for this algorithm to terminate.

Long story short, THIS PERFECT HASHING ALGORITHM IS NOT SUITABLE FOR NUMBERS WITH A VERY LARGE RANGE, such as in this example.

Maybe I will look into other perfect hashing algorithms that are more appropriate, optimize the current algorithm so it doesn't allocate, or just document that it's not appropriate for large ranges.

-- Reply to this email directly, view it on GitHub [2], or unsubscribe [3]. You are receiving this because you authored the thread.Message ID: @.***>

Links:

[1] https://www.drdobbs.com/architecture-and-design/generating-perfect-hash-functions/184404506 [2] https://github.com/eddieantonio/perfection/issues/7#issuecomment-1304581727 [3] https://github.com/notifications/unsubscribe-auth/AAFRLUUXNZXBXENEZMET7GLWG2DTTANCNFSM6AAAAAARPUJ45Y