farhadi / cuckoo_filter

High-performance, concurrent, and mutable Cuckoo Filter for Erlang and Elixir
Apache License 2.0

AltIndex sends Integer to HashFunction #3

Closed PJUllrich closed 1 week ago

PJUllrich commented 1 week ago

Hello there 👋

when I use a custom hash function and add many items in a loop, I receive the following error:

** (FunctionClauseError) no function clause matching in String.length/1

    The following arguments were given to String.length/1:

        # 1
        22727

    Attempted function clauses (showing 1 out of 1):

        def length(string) when is_binary(string)

    (elixir 1.17.3) lib/string.ex:2201: String.length/1
    (xxhash 0.3.1) lib/xxhash.ex:60: XXHash.xxh32/1
    (cuckoo_filter 1.0.0) /deps/cuckoo_filter/src/cuckoo_filter.erl:409: :cuckoo_filter.alt_index/4
    (cuckoo_filter 1.0.0) /my_app/deps/cuckoo_filter/src/cuckoo_filter.erl:198: :cuckoo_filter.add_hash/3
    (elixir 1.17.3) lib/enum.ex:992: anonymous fn/3 in Enum.each/2
    (elixir 1.17.3) lib/enum.ex:4423: anonymous fn/3 in Enum.each/2

My Setup

This is my filter definition and how I add a long list (~50k) of items to the filter:

    filter =
      :cuckoo_filter.new(65_536,
        bucket_size: 32,
        hash_function: &XXHash.xxh32/1,
        name: :tmp_blocklist_cache
      )

    Enum.each(list_of_string_ips, fn ip ->
      :cuckoo_filter.add(filter, ip)
    end)

The error occurs after adding exactly 36404 items. Is this maybe a capacity issue?

Note

I'm using the xxHash library here because I want to use the filter inside my own library, which will have to run on many different OSs, and I already received a compilation error for the recommended xxh3 library on my macOS system. That's why I opted for the native xxHash implementation.

PJUllrich commented 1 week ago

Funny observation: if I reduce the bucket_size, the error occurs after fewer items. With bucket_size: 4 (default) it occurs after 3725 items, with bucket_size: 16 after 24531 items, and with bucket_size: 64 after 40869 items.

PJUllrich commented 1 week ago

I built a workaround by converting the integer to a string first. I was just surprised by this because the docs said that I could use any hash function as long as it "can convert a string to a hash". But this works too now :)

    filter =
      :cuckoo_filter.new(65_536,
        bucket_size: 32,
        hash_function: &hash/1,
        name: :tmp_blocklist_cache
      )

    def hash(input) when is_binary(input), do: XXHash.xxh32(input)

    def hash(input) when is_number(input) do
      input |> to_string() |> hash()
    end

farhadi commented 1 week ago

Hi Peter, this is happening because I made a breaking change in version 1.0.0 and forgot to update the documentation. Since version 1.0.0, hash_function must be a function that accepts any term and returns an integer. So in your case, the solution is to create the filter like this:

    filter =
      :cuckoo_filter.new(65_536,
        bucket_size: 32,
        hash_function: fn input -> XXHash.xxh32(:erlang.term_to_binary(input)) end,
        name: :tmp_blocklist_cache
      )
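
To illustrate the contract described above, here is a minimal, dependency-free sketch of a hash function that accepts any term and returns an integer. The stack trace in the report shows `:cuckoo_filter.alt_index/4` passing the integer 22727 to the hash function, so the function has to cope with non-binary input. `AnyTermHash` is a hypothetical module name, and `:erlang.phash2/2` is used only as a stand-in for `XXHash.xxh32/1` so the example runs without extra packages:

```elixir
defmodule AnyTermHash do
  # Hypothetical helper, not part of cuckoo_filter or xxhash.
  # Assumption: :erlang.phash2/2 stands in for XXHash.xxh32/1 here.
  @range 4_294_967_296

  # Accepts any Erlang term (binary, integer, tuple, ...) and
  # returns an integer in the range 0..(2^32 - 1).
  def hash(term), do: :erlang.phash2(term, @range)
end

# Strings and integers are both accepted and yield integers.
ip_hash = AnyTermHash.hash("192.168.0.1")
int_hash = AnyTermHash.hash(22727)

true = is_integer(ip_hash) and ip_hash >= 0
true = is_integer(int_hash) and int_hash >= 0
```

Such a function could then be passed as `hash_function: &AnyTermHash.hash/1`. Whether its distribution suits a given workload is an assumption worth checking before relying on it.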

PJUllrich commented 1 week ago

Okay, no worries. Thank you :) That worked!