Different `wcxf.WC` instances with same hash

peterstangl commented 5 years ago

There is an issue with the hash of WilsonCoefficient as implemented in https://github.com/flav-io/flavio/commit/48347a13598ba510a2782bf815c8d78315e62206:

https://github.com/flav-io/flavio/blob/48347a13598ba510a2782bf815c8d78315e62206/flavio/physics/eft.py#L136-L141

The problem is that different wcxf.WC instances based on different Wilson coefficients can have the same hash. This can be demonstrated as follows:

from wcxf import WC
for i in range(10):
    x = [0,0.1*i]
    wc_dict = {'C9_bsmumu': x[0], 'C10_bsmumu': x[1]}
    wc =WC(eft='WET', basis='flavio', scale=4.8,
       values=WC.dict2values(wc_dict))
    print(hash(wc))

e.g. returns

8768634703673
8768634703806
8768634703673
8768634703806
8768634703673
8768634703806
8768634703673
8768634703806
8768634703673
8768634703806

I don't know where is behaviour is coming from, but to use the caching as implemented in https://github.com/flav-io/flavio/commit/48347a13598ba510a2782bf815c8d78315e62206, it will be necessary to make sure that different wcxf.WC instances will have different hashes.

This also applies to wilson, from which, using

from wilson import Wilson
for i in range(10):
    x = [0,0.1*i]
    wc_dict = {'C9_bsmumu': x[0], 'C10_bsmumu': x[1]}
    w = Wilson(wc_dict, 4.8, eft='WET', basis='flavio')
    print(hash(w))

i get

8768634700692
8768634700790
8768634700692
8768634700790
8768634700692
8768634700790
8768634700692
8768634700790
8768634700692
8768634700790

DavidMStraub commented 5 years ago

Oh, shoot. This must have something to do with garbage collection: if I store the instance in a list, all hashs are unique. So apparently when an object is destroyed, its hash can be reused.

What can we do about this? Data-based hashing will be way too slow...

DavidMStraub commented 5 years ago

... actually, we could just set the hash on instantiation to be a UUID. This will lead to the behaviour we had expected (even if it will give different hashes for instances with the same values, but this was the case in the original solution as well).

DavidMStraub commented 5 years ago

def __hash__(self):
    return uuid.uuid4().hex

peterstangl commented 5 years ago

Yes, this was also what I was thinking about. Unique hashes for different instances is enough to solve the current problem. And as you say, data-based hashing is too slow and this would have been the only option to have same hashes for instances with same values.

peterstangl commented 5 years ago

def __hash__(self):
    return uuid.uuid4().hex

But the hash should be saved as private attribute, right? otherwise on each call of __hash__(), a new id is generated.

DavidMStraub commented 5 years ago

Sure, stupid me.

peterstangl commented 5 years ago

And all hashes in python I have seen so far were 64 bit integers. So perhaps we should also make sure that the uuid is turned into one.

DavidMStraub commented 5 years ago

Oh OK. In that case I guess using secrets.randbits(64) is much easier.

DavidMStraub commented 5 years ago

Belay that. secrets is not available in Python 3.5. random.randint(1, 2**64)?

peterstangl commented 5 years ago

Or simply hash(uuid.uuid4()) ?

DavidMStraub commented 5 years ago

that's 3 times slower on my system

peterstangl commented 5 years ago

It's actually faster than uuid.uuid4().hex on my system, but yes, random.randint(1, 2**64) is 3 times faster

DavidMStraub commented 5 years ago

But wait a second... the problem is that the original solution assumed that only wcxf.WC is practically immutable, but this does not apply to wilson.Wilson or flavio.WilsonCoefficients; so if we implement this random thing, it would have to be in the wcxf package (which would require releases of all 3 codes :unamused: )

peterstangl commented 5 years ago

If you don't want to make a new release now, the modification could be put in the development versions of wcxf and wilson (and will not affect anything). The caching in flavio should then be moved to a separate branch again and only merged after new versions of wcxf and wilson have been released.

Concerning the hash, it is possible that 32bit python uses 32 bit hashes. To be save and always get a proper hash, one could use hash(random.randint(1, 2**64)) which is as fast as random.randint(1, 2**64) alone.

peterstangl commented 5 years ago

btw, even faster (by almost a factor of 10 on my system) is hash(random.random())

DavidMStraub commented 5 years ago

This can be closed, no?

peterstangl commented 5 years ago

Yes

flav-io / flavio

Different `wcxf.WC` instances with same hash #82