bazelbuild / starlark

Starlark Language
Apache License 2.0
2.44k stars 160 forks source link

spec: hash and freeze #33

Open alandonovan opened 5 years ago

alandonovan commented 5 years ago

The spec is self-contradictory on the interaction of hashing and freezing. Using language from the documentation of the Go implementation, it says this:

Lists are not hashable, so may not be used in the keys of a dictionary.

but differing from the Go implementation it also says:

Most mutable values, such as lists, and dictionaries, are not hashable, unless they are frozen.

It would be easy to change first line to match the second, but I think this would be a mistake.

There are two reasons Python disallows hashing of lists and dicts.

The first is that they are mutable, so any non-trivial hash function consistent with == would have to reflect mutations. If a mutable key is inserted into a dict then the key is mutated, the dict would no longer appear to contain the key (even given the identical reference) because its hash no longer matches the one saved in the table. Starlark finesses this problem by allowing hashing only of frozen values.

The second reason is that mutable data structures may be cyclic. A list may contain itself, for example, which means the hash computation must detect cycles to avoid getting stuck in an endless loop:

$ Skylark # in Java
>> x = [None]; x[0] = x; {x: None}
Exception in thread "main" java.lang.StackOverflowError
        at java.util.ArrayList$Itr.<init>(ArrayList.java:852)
        at java.util.ArrayList.iterator(ArrayList.java:841)
        at java.util.AbstractList.hashCode(AbstractList.java:540)
        ...etc ad infinitum...

In order to implement cycle detection, the hash method of a Starlark value must pass a parameter which contains all references to mutable objects in the hash operation's current depth-first visitation stack. The implementation of the hash method for a List x must first inspect this stack to ensure that it does not contain x, then push x on the stack, compute the hash of the elements, and then pop x. If the stack does contain x, the hash method should return an error, or alternatively return a constant.

This is feasible. It's reasonably efficient, in that hashing lists is far from the norm and the stack depth is unlikely to be large. But it cannot be implemented in Java using the existing hashCode method without Java thread-local storage. Go has no thread-local storage, so it would require a major incompatible API change to the Hash method to add a new parameter. (An alternative implementation is to limit the visitation depth to some constant k, but it doesn't materially change the problem.)

It seems like an ugly feature. How important is it?

brandjon commented 5 years ago

Regarding the cyclic problem, we already have to worry about cycles for str()/repr() and equality, and probably other methods. One more operation shouldn't be a deal-breaker.

If Starlark allows hashing of frozen previously-mutable objects, you end up in situations where you want to do

load(..., "frozen_key")
my_dict[frozen_key] = ...      # Allowed
if frozen_key in my_dict: ...  # Allowed
if [1,2,3] in my_dict: ...     # Not allowed?
my_dict[[1,2,3]] = ...         # Not allowed!

For the second to last operation, the question is should we allow hashing of mutable values (and thereby allow hashes to change) and then just disallow actually entering mutables into hashed data structures?

alandonovan commented 5 years ago

How would you implement that in terms of Java hashCode? You are suggesting that the operations "k in dict" and "dict[k]=v" would compute the hash of k (recursively, potentially) but would also somehow remember that part of k is mutable and thus allow "k in dict" and "v=dict[k]" but disallow "dict[k]=v". Or are you proposing that all values would need an additional isMutable method that does its own recursion over the object graph?

Is it important that the last two statements in your example actually work? I have yet to see a program that requires it. The feature that users do seem to rely upon today is using a frozen struct(x=[...], ...) as a dict key.

brandjon commented 5 years ago

Or are you proposing that all values would need an additional isMutable method

That one, yes.

Is it important that the last two statements in your example actually work? I have yet to see a program that requires it.

It's not important to my knowledge, just an example of an asymmetry in the language. It's counterintuitive that you can't use the literal [1, 2, 3] to check if [1, 2, 3] is in the dict.

The feature that users do seem to rely upon today is using a frozen struct(x=[...], ...) as a dict key.

Ug. For Bazel, that's problematic because it's hard to even give sane semantics for equality of provider instances, thanks to depset-valued fields.

alandonovan commented 5 years ago

For Bazel, that's problematic because it's hard to even give sane semantics for equality of provider instances, thanks to depset-valued fields.

I thought depset equality used identity (even though they are immutable), so sane hash/eq should be straightforward for them, even if it does imply struct(x=depset()) != struct(x = depset()).

brandjon commented 5 years ago

Correct, by sane I meant that I think it's strange to use a struct with (possibly) identity equality semantics as a dict key. It sets up different expectations for duplicate behavior depending on whether or not it happens to contain an identity-semantics component. But I guess that's user-beware.

alandonovan commented 5 years ago

Or are you proposing that all values would need an additional isMutable method That one, yes.

Currently the spec has no need for "mutable" as a property of an object graph. It would be nice not to have to add it.

Each list/dict/whatever needs to locally know whether it is frozen, but there is no need to traverse the graph to compute mutability. The one exception is depset: it asserts that its elements are hashable, both because it needs to hash elements during iteration, and because it wants to be deeply immutable, and so far hashable has served as a proxy for deeply immutable. Allowing unfrozen mutable values to be hashable would undo that.

brandjon commented 5 years ago

So it's a choice between "be forced to track independent mutability data for each value" and "disallow using some structs as dict keys"? I'm ok with keeping dict keys simple, and thereby avoiding the [1,2,3] literal example I gave above.

alandonovan commented 5 years ago

I'm ok with keeping dict keys simple, and thereby avoiding the [1,2,3] literal example I gave above.

OK. So that leaves us here: frozen once-mutable values become hashable, but may contain cycles, which currently crash the Java implementation (and would require TLS to fix) and crash the Go implementation (and would require a breaking change to the Hash method to fix). Hashable would continue to imply immutable.

brandjon commented 5 years ago

And the alternative, which is just flat-out disallowing hashing of mutable types regardless of freezing. Right?

alandonovan commented 5 years ago

And the alternative, which is just flat-out disallowing hashing of mutable types regardless of freezing. Right?

Right. Python gets away without it, though of course you can define a class with its own notion of equivalence. Go also gets away without it; its maps allow only simple keys (scalars and tuples, in effect).