Yiling-J / theine-go

high performance in-memory cache
MIT License
278 stars 17 forks source link

Validator cause hasher return different hash for the same key #31

Open zhenzou opened 11 months ago

zhenzou commented 11 months ago

Golang: 1.21.4

And it will be ok if remove the validate step.

It maybe not the problem of the theine, but would you like give some suggestions to solve this problem, Validator is a wild used library, many other uses maybe face the same issues.

package main

import (
    "slices"
    "unsafe"

    "github.com/go-playground/validator/v10"
    "github.com/zeebo/xxh3"
)

type Hasher[K comparable] struct {
    ksize int
    kstr  bool
}

func NewHasher[K comparable]() *Hasher[K] {
    h := &Hasher[K]{}
    var k K
    switch ((interface{})(k)).(type) {
    case string:
        h.kstr = true
    default:
        h.ksize = int(unsafe.Sizeof(k))
    }
    return h
}

func (h *Hasher[K]) Hash(key K) uint64 {
    var strKey string
    if h.kstr {
        strKey = *(*string)(unsafe.Pointer(&key))
    } else {
        strKey = *(*string)(unsafe.Pointer(&struct {
            data unsafe.Pointer
            len  int
        }{unsafe.Pointer(&key), h.ksize}))
    }
    return xxh3.HashString(strKey)
}

// For simulate real case, instance a new string every time
func newArgs(id string) Args {
    return Args{
        Id: string(slices.Clone([]byte(id))),
        // Id: id,
    }
}

type Args struct {
    Id string `url:"id" validate:"required"`
}

func main() {
    hasher := NewHasher[Args]()

    validate := validator.New()
    for i := 0; i < 100; i++ {
        args := newArgs("123")
        err := validate.Struct(args)
        if err != nil {
            panic(err)
        }
                // expect to same for all 100 times, but not
        println(hasher.Hash(args))
    }
}
maypok86 commented 11 months ago

Spoiler: it's actually a problem for theine as well. Since a key containing strings will select a random shard and hit ratio will be very small.

maypok86 commented 11 months ago

If you just need a hasher for comparable types, I can recommend https://github.com/dolthub/maphash that uses a set of dirty tricks to pull a hash function from a standard map

Yiling-J commented 11 months ago

@zhenzou as @maypok86 pointed out, this is not related to validator. This is because current hash function of Theine calculate hash based on object's memory bytes. String is pointer to bytes so two strings have different hash even their value is same . I copy this from https://github.com/tidwall/hashmap because first version of Theine use that as internal map.

I think the best way to handle this is adding an optional Hash interface, instead of using the maphash package which hack Go's runtime a lot. Let me do some investigations first and decide how to handle this properly

zhenzou commented 11 months ago

@Yiling-J Yes, I don't believe it is issue of validator either. The strange thing is the issue will be disappears if remove the validate step. In my opinion, I create a new string every time, it should be different every time for current version even removed the validate step. I tried some escape demos, but none of them can reproduce the issue. Would you help to explain the detail for this. Thanks a lot.

func escape(obj any) {
    if obj == nil {
        panic("nil")
    }
    val := reflect.ValueOf(obj)

    if val.Kind() != reflect.Struct {
        panic("not struct")
    }
}
zhenzou commented 11 months ago

@Yiling-J Hi, We are using theine in our new project, and this issue block us, would you provide a workaround for this. e.g. the add the interface to build a key to return a string, the hasher recognize if the key is a CacheKey.

type CacheKey interface{
    BuildCacheKey() string
}
Yiling-J commented 11 months ago

@zhenzou I do a quick fix on master branch so you are not blocked. I still need some investigations before release a new version. About your validator question, I think "123" is interning automatically by Go(in string() method), and validator somehow break this and create a new string each time string() is called.

new interface:

type StringKey interface {
    StringKey() string
}
Yiling-J commented 11 months ago

@zhenzou Actually there might be the case your key struct already has a StringKey method and do something else. So what i think is adding an extra method to builder, so the options are:

builder.StringKey(func(key Foo) string { return key.name })


which one you think is better?
zhenzou commented 11 months ago

@Yiling-J You are correct, add a custom StringKey builder instead of StringKey interface is a better solution.

maypok86 commented 11 months ago

@Yiling-J I noted maphash because all other solutions will raise the same question, "What will you do if the key contains multiple strings?".

Yiling-J commented 11 months ago

@zhenzou main branch updated, please use the new builder method:

builder := theine.NewBuilder[Foo, int](10000)
builder.StringKey(func(k Foo) string { return k.Bar })
Yiling-J commented 11 months ago

@maypok86 seems the only special type as map key is string? because bytes is not allowed in comparable and pointers are always not equal. if so it's also possible to do a recursive check with reflect when hasher initialized, if there is string field and user doesn't provide a custom StringKey func(or maybe hash function), just panic.

maypok86 commented 11 months ago

@Yiling-J Eh, if only it were that simple...

Since Go 1.20 interfaces are comparable and this example will not pass:

package kek

import (
    "fmt"
    "github.com/dolthub/maphash"
    "github.com/zeebo/xxh3"
    "testing"
    "unsafe"
)

type Hasher[K comparable] struct {
    ksize int
    kstr  bool
}

func NewHasher[K comparable]() *Hasher[K] {
    h := &Hasher[K]{}
    var k K
    switch ((interface{})(k)).(type) {
    case string:
        h.kstr = true
    default:
        h.ksize = int(unsafe.Sizeof(k))
    }
    return h
}

func (h *Hasher[K]) Hash(key K) uint64 {
    var strKey string
    if h.kstr {
        strKey = *(*string)(unsafe.Pointer(&key))
    } else {
        strKey = *(*string)(unsafe.Pointer(&struct {
            data unsafe.Pointer
            len  int
        }{unsafe.Pointer(&key), h.ksize}))
    }
    return xxh3.HashString(strKey)
}

type LolKek struct {
    lol any
    kek any
}

func TestHasher(t *testing.T) {
    lol := 236
    kek := [34]byte{1, 2, 3}
    hasher := NewHasher[LolKek]()
    h1 := hasher.Hash(LolKek{
        lol: lol,
        kek: kek,
    })
    fmt.Println(h1)
    h2 := hasher.Hash(LolKek{
        lol: lol,
        kek: kek,
    })
    fmt.Println(h2)
    if h1 != h2 {
        t.Fatal("hashes should be the same")
    }
}

func TestMapHash(t *testing.T) {
    lol := 236
    kek := [34]byte{1, 2, 3}
    hasher := maphash.NewHasher[LolKek]()
    h1 := hasher.Hash(LolKek{
        lol: lol,
        kek: kek,
    })
    fmt.Println(h1)
    h2 := hasher.Hash(LolKek{
        lol: lol,
        kek: kek,
    })
    fmt.Println(h2)
    if h1 != h2 {
        t.Fatal("hashes should be the same")
    }
}

But it's more rare than this one (LolKek is just a composite key):

type LolKek struct {
    LolID string
    KekID string
}
tv42 commented 10 months ago

If you're heading toward the route of needing StringKey functions, you should probably stop insisting on comparable. If I can make my StringKey function return a random number, that comparable didn't actually enforce the keys to be comparable.

I'd really prefer to see an API that doesn't force me to combine multiple []byte into a newly-allocated single string -- especially multiple times! And my loading function wants to access the fields separately, so encoding them into a string would just mean I have to decode the string to fetch the value.

I haven't looked at the internals of theine to know if this is feasible, but one way around this would be to say that a cache key must implement Equals(other: K) bool and Hash(sum *maphash.Hash) (https://pkg.go.dev/hash/maphash). That way you can do e.g.

package main

import (
    "bytes"
    "encoding/binary"
    "hash/maphash"
)

type cacheKey struct {
    foo  uint64
    bar  []byte
    quux []byte
}

func (c cacheKey) Equals(other cacheKey) bool {
    return c.foo == other.foo && bytes.Equal(c.bar, other.bar) && bytes.Equal(c.quux, other.quux)
}

func (c cacheKey) Hash(sum *maphash.Hash) {
    var tmp [8]byte
    binary.LittleEndian.PutUint64(tmp[:], c.foo)
    sum.Write(tmp[:])
    sum.Write(c.bar)
    sum.Write(c.quux)
}
Yiling-J commented 10 months ago

@tv42 I think comparable is the right key type, and if we consider cache a hashmap, it's nature to use comparable as key. Go knows exactly how to handle comparable, it just didn't export interface or useful functions. The best 2 options to me is: Go provide a generic hash function or Go make hashmap key type implementable, like Rust.

Currently I'm still thinking do a recursive reflect check when cache initilaized, if there is string and StringKey method is not implemented, just panic(any is valid in this check though, but when saving to real map, Go will panic). Another option is rewrite and avoid sharded map, because the hash value is only used to choose a shard.