spec: convergence with Go

alandonovan commented 5 years ago

The Go implementation has a list of remaining differences from the Java implementation (https://github.com/google/starlark-go/blob/master/doc/spec.md#dialect-differences). I'd like us to finish the wording of a spec that we can all be happy with, even if that spec allows for some differences among implementations. I'll go through the list of differences point by point:

multiprecision integers: the spec should require that integer precision be sufficient to represent uint64 and int64 values without loss, as these are required for correct handling of protocol buffers, among other things. Obviously Bazel has no need for larger integers, so it would be fine not to implement them for now, but that should be described as a limitation of the implementation.
floating point: for the same reason, lossless handling and arithmetic on float64 values must also be supported. (On this and the above point I think we were all agreed based on a meeting in NYC about 18 months ago.) Bazel has no need of floating-point at all, so again, we can state that this is a limitation of the Java implementation.
bitwise operators should be supported. They are fundamental operations on integers in every machine and programming language. Bazel may not need them, but many other uses do (anything that uses protocol buffers, for example).
strings: we cannot realistically require a particular string encoding (UTF-8 or UTF-16) without imposing intolerable costs on implementations whose host language uses the opposite encoding. I propose we specify strings in terms of code units without specifying the encoding; UTF-8 and UTF-16 are only quantitatively different in that sense. However, this does leave the Java implementation without a data type capable of representing binary data.
strings should have methods elem_ords, codepoint_ords, and codepoints. I think there was agreement on this point but the Java implementation was lagging.
A language needs some way to encode a Unicode code point as a string (and vice versa). One way to do this is the Go impl's chr and ord built-in functions. (Related: the "%c" formatting operator, which is like "%s" % chr(x).)
The Go impl permits 'x += y' rebindings at top level. I think it should probably match the Bazel implementation (which rejects them), but the whole no-global-reassign feature should be specified as a dialect option, since no client other than Bazel wants it.
The Go implementation treats assert as a valid identifier. Indeed, it uses it widely throughout its own tests. The cost of specifying this would be that tools (such as Bazel tests) that use the Python parser will not be able to parse Starlark files that use 'assert' as an identifier. Given that using Python in this way is a hack, and that files containing assert will be vanishingly rare in the Bazel test suite, that doesn't seem like a problem.
The Go impl's parser accepts unary + expressions for parity with every other ALGOL-like language. A + operator forces a check that its operand is numeric, and occasionally makes code more readable. I think the spec should include it.
In the Go impl, a method call x.f() may be separated into two steps: y = x.f; y(). I think work is underway to support this in the Java impl too. I recall we were at least agreed it was the right thing.
In the Go impl, dot expressions may appear on the left side of an assignment: x.f = 1. This is a parser issue---in Bazel, there are no mutable struct-like data types for which this operation would succeed, but other applications may need it (esp. if they use protocol buffers), so the grammar should support it nonetheless.
In the Go impl, the hash function accepts operands besides strings, as in Python. It should be an easy fix to the Java implementation to do so too.
The Go impl's sorted function accepts the additional parameters key and reverse. These make it easier to define an alternative ordering without the effort and unnecessary allocation of the decorate/sort/undecorate trick and a separate call to reverse.
The Go impl's type(x) returns "builtin_function_or_method" for built-in functions. This is the string Python uses. I don't have a strong feeling about the particular string, but the crucial thing is that builtin- and Starlark-defined functions must have distinct types because they support different operations. For example, in Bazel, the rule.outputs mechanism requires that its operand be a Starlark function so that its parameter names can be retrieved; this is impossible with a built-in function.
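
To illustrate the sorted point above, here is a minimal sketch in Python, whose sorted already takes key and reverse. The word list and the by_length helper are made up for illustration. The first half is the decorate/sort/undecorate trick plus a separate reversal; the second half is the single call that the extra parameters allow.

```python
# Hypothetical input data, chosen so that every key (length) is distinct.
words = ["pear", "fig", "banana", "apple"]


def by_length(word):
    """Key function: order words by their length."""
    return len(word)


# Without key/reverse: decorate, sort, undecorate, then reverse separately.
decorated = [(len(w), w) for w in words]           # extra list of pairs
undecorated = [w for _, w in sorted(decorated)]    # shortest first
longest_first_dsu = list(reversed(undecorated))    # separate reversal step

# With key/reverse: one call, no intermediate lists of pairs.
longest_first = sorted(words, key=by_length, reverse=True)
```

The two results agree here because the keys are distinct. With ties they can differ, since the decorated pairs fall back to comparing the words themselves and the separate reversal then flips that order.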

kastiglione commented 5 years ago

> Given that using Python in this way is a hack

Is there more context on this? I've found it exceedingly convenient to use Python's ast module to query over and do transformations of Bazel files, for small development tasks.

alandonovan commented 5 years ago

> Is there more context on this? I've found it exceedingly convenient to use Python's ast module to query over and do transformations of Bazel files, for small development tasks.

The key word in this sentence is "convenient". :)

If you want to transform a Starlark program from Python, the right thing to do is write a Starlark parser in Python. It should be easy because you can just fork the Python parser and delete the parts you don't need.

The syntax of Starlark is, for now, a subset of Python, but longer term it could be improved by breaking compatibility. The most glaring problem is the syntax for load, which must use strings where identifiers are wanted. Tools that assume a Python parser is sufficient are taking an expedient short-cut at the expense of long term maintainability, which is the definition of a hack. The Bazel test tools I was alluding to go one step further and actually execute the Skylark program in a Python interpreter, which is very fragile indeed.
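
As a rough illustration of both the convenience and the limitation discussed above, the following Python sketch runs the standard ast module over two made-up snippets. The helper names (parses_as_python, call_names) and the snippet contents are hypothetical. The second snippet uses assert as an identifier, in the style of the Go implementation's tests, which Python's parser rejects because assert is a keyword there.

```python
import ast

# Hypothetical BUILD-style snippet that stays inside the Python subset.
BUILD_SNIPPET = '''
load("//tools:defs.bzl", "my_rule")

my_rule(
    name = "demo",
    srcs = ["a.txt"],
)
'''

# Hypothetical Starlark snippet that uses `assert` as an ordinary identifier.
ASSERT_SNIPPET = "assert.eq(1 + 1, 2)\n"


def parses_as_python(source):
    """Return True if Python's own parser accepts the source text."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def call_names(source):
    """Return the sorted names of functions called by plain name."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            names.add(node.func.id)
    return sorted(names)
```

A tool built this way works only for as long as every file it sees stays within the Python subset; a dedicated Starlark parser would not have that dependency.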

adonovan commented 5 years ago

I met with Laurent and Damien today and we agreed on the following spec changes:

Floating point literals, values, arithmetic, and the float built-in should be an optional feature behind a dialect flag. Implementations that support it should use float64 semantics. Laurent was concerned that float-to-string conversion is hard to specify and may vary by implementation language; I agree but don't see that as a particular problem.
Integer bitwise operations (int&int, int|int, int^int, ~int, int<<int, and int>>int) should be supported in all implementations, without a dialect flag.
The unary +int operation should be defined in all implementations.
String operations should be specified in terms of code units (UTF-8 bytes in Go, UTF-16 chars in Java).
The Go implementation should reject x+=y at top-level if it would reject x=x+y, as the Java implementation does.
It should be possible to call methods in two steps y = x.f; y(). The Java impl doesn't yet support it; that's a bug.
The parser should accept x.f = y. Currently the Java parser rejects it. (There are no datatypes in Bazel for which this statement can execute without error, but that is not a reason for the parser to reject it.)
hash(x) should be defined only for strings, with the same algorithm across implementations, to ensure predictable ordering of execution across tools that, say, process Bazel BUILD files. (Although many types of values are hashable, the dict data type doesn't expose the hash values or their ordering.) We need to agree on a cheap, simple hash function. AD proposes FNV32 (https://golang.org/src/hash/fnv/fnv.go?s=1100:1124#L32). Glenn (the F in FNV) was present and concurred. :)
The sorted function should support the key and reverse parameters as these increase flexibility and efficiency. Although lambdas are syntactically convenient for key, the key function does not typically close over variables.
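
For reference, a minimal Python sketch of 32-bit FNV over a sequence of bytes, which also exercises the integer bitwise operators from the list above. The notes above say only "FNV32" and do not state whether FNV-1 or FNV-1a is meant, so both are shown. The function names are hypothetical, and hashing a string this way assumes some agreed encoding into bytes, which ties into the code-unit question.

```python
# Standard 32-bit FNV parameters.
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193
MASK32 = 0xFFFFFFFF  # keep results within 32 bits


def fnv1_32(data):
    """FNV-1: multiply by the prime, then XOR in each byte."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h = (h * FNV32_PRIME) & MASK32
        h ^= byte
    return h


def fnv1a_32(data):
    """FNV-1a: XOR in each byte, then multiply by the prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & MASK32
    return h


# Assumption for illustration: the string is hashed as its UTF-8 bytes.
example = fnv1a_32("a".encode("utf-8"))
```

Whichever variant is chosen, implementations would also have to agree on the bytes being hashed, since UTF-8 and UTF-16 code units of the same string generally differ.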

alandonovan commented 4 years ago

Update: These are all done, except:

~~the spec does not require multiprecision integers, though both the Go and Java impls use them. See https://github.com/bazelbuild/starlark/issues/120#issuecomment-725629941~~
string encodings: see https://github.com/bazelbuild/starlark/issues/112
x += y rebindings at top level
assert as a valid identifier.

spec: convergence with Go #20