bazelbuild / starlark


spec: new 'bytes' data type #112

Open. alandonovan opened this issue 4 years ago

alandonovan commented 4 years ago

This proposal argues for the addition of a new 'bytes' data type to Starlark, defined as an immutable sequence of byte (uint8) values.

Starlark has implementations in Go, Java, and Rust. It is crucial for performance that Starlark use the same string data type as the host language, but this poses a problem, because Java strings are sequences of uint16 values, conventionally holding UTF-16 encoded text, whereas Go's strings are sequences of uint8 values, conventionally holding UTF-8 encoded text.

The solution to this problem is to specify the behavior of Starlark strings in terms of an implementation-defined parameter K, which is 8 for Go and 16 for Java. Nearly all string operations over text can be usefully described this way, enabling portable text handling. (The sole obvious exception is str.strip(chars='abc'), as it treats the chars string as a set rather than an ordered sequence.) [This needs its own proposal. TODO(adonovan)]
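
To make the K parameterization concrete, here is a hypothetical session; the results follow from the definition above (len counts string elements), not from any spec text:

len("hello")   # 5 on every implementation: ASCII characters are one element each
len("🌿")      # 4 on a K=8 (Go) implementation, 2 on a K=16 (Java) implementation
"🌿"[0:1]      # a single element: one UTF-8 byte, or one UTF-16 code unit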

This leaves the Java implementation without a way to represent arbitrary byte strings, such as the contents of arbitrary binary files, or fields of protocol buffers. This need would be met by adding a new data type, 'bytes', to the core language.

brandjon commented 4 years ago

byte string literals

The Python spec calls them "bytes literals", conspicuously omitting the term "string" when referring to them. I suggest we do the same.

byte string literals would permit hex escapes \xXX and octal escapes \377 over the range [0-255].

Bizarrely, Python 3 also permits \400 to \777 but truncates to 8 bits. Agreed that we should prohibit those values.

Ordinary strings would permit hex and octal escapes only for the ASCII range [0-127], but would additionally allow \uXXXX and \UXXXXXXXX escapes to denote the UTF-K encoding of a unicode code point specified in hex.

That way, escape sequences other than \u and \U yield behaviorally equivalent string content across implementations. Sounds good.
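
A sketch of the resulting literal rules (hypothetical examples, following the proposal as quoted):

b"\x00\xff\377"   # ok: bytes literals allow hex/octal escapes over [0-255]
b"\400"           # error: unlike Python 3, values above \377 are prohibited
"\x41\101"        # ok: string literals allow hex/octal escapes in the ASCII range [0-127]
"\xff"            # error: non-ASCII numeric escape in a string literal
"\u00ff"          # ok: denotes the UTF-K encoding of code point U+00FF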

indexing bytes[int] would yield a 1-element byte string.

This differs from Python 3.

>>> b'\x00'[0]
0
>>> type(b'\x00'[0])
<class 'int'>

the following string methods would have byte-string counterparts: [count endswith find index join lstrip partition replace rfind rindex rpartition rsplit rstrip split startswith strip]

Note that many of the excluded methods exist on the bytes type in Python 3. Presumably this is to give access to Python 2-style string handling, so perhaps we don't need it.

the elems method would iterate over the 1-byte substrings, and elem_ords would iterate over the numeric byte values.

See above about indexing. If you want the items of elems to be of the same type as the values returned by indexing, then elems would be an iterable of integers. You could conceivably have another method to return the 1-byte subranges. (That could also be achieved through slicing, though you can say the same about string's elems method. I presume the reason we want an iterable is to avoid allocating for each item.)
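
For concreteness, the two flavors under discussion (hypothetical API, not settled at this point in the thread):

b = b"hi"
list(b.elems())       # as 1-byte substrings: [b"h", b"i"]
list(b.elem_ords())   # as numeric byte values: [104, 105]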

str(bytes) would decode bytes to UTF-K, replacing invalid sequences by U+FFFD.

bytes(str) (that is, a new predeclared function called "bytes") would encode a text string to UTF-8, replacing invalid code units with the encoding of U+FFFD. (If K=8 and str is a valid encoding, the result is the same as the input.)
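
A sketch of those conversions as proposed at this point in the thread (the constructor spellings are revisited later in the discussion):

bytes("hello")               # b"hello" -- the UTF-8 encoding of the text
str(b"hello")                # "hello" -- decoded from UTF-8
str(b"\xff")                 # "\ufffd" -- invalid sequence replaced by U+FFFD
bytes("hello") == b"hello"   # True on both K=8 and K=16 implementations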

Should bytes values also have a decode method, and strings an encode method, as in Python?

Should we support the ability to specify an encoding besides UTF-8? What if Starlark code wants to write a file in an odd encoding? If we don't support that now, will we be able to add it seamlessly in the future simply by extending this design with an encoding param as in Python?

Should we support customizable (or at least multiple) behaviors for invalid text, e.g. "strict" mode (fail fast)? Should strict be the default as in Python 3? (Note that we might support user-defined behaviors by passing callback functions, as opposed to how Python passes names of callbacks registered with codecs.register_error().)

I'm fine with deferring all of these in implementation, so long as they don't conflict with anything we're doing today.

Additional points:

brandjon commented 4 years ago

The solution to this problem is to specify the behavior of Starlark strings in terms of an implementation-defined parameter K, which is 8 for Go and 16 for Java. Nearly all string operations over text can be usefully described this way, enabling portable text handling.

To expand on this, every string operation besides strip() expects strings (or substrings) as arguments, not individual unicode code points or code units. Therefore, they are agnostic to whether those arguments represent one code point or many. It doesn't matter that 🌿 encodes as 4 elements in the Go implementation and 2 elements in the Java one, because mystring.count('\U0001f33f') will return how many of these code points appear in mystring either way.

For strip, a plausible solution is to specify that it interprets its argument as a set of code points, not string elements. That is, it will decode the given string from UTF-k into unicode code points, re-encode each of those code points separately back to UTF-k, then proceed to strip the input text of any occurrences of those separate substrings. Note that this implies strip's processing of its argument could fail with unicode errors and is subject to the same decisions we're now making about U+FFFD vs strictness in other transcoding methods.
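
For example, under this interpretation (hypothetical behavior):

# The cutset "ab🌿" is decoded into three code points; each is re-encoded
# separately and stripped from both ends as a unit, so the result is the
# same whether 🌿 occupies 4 elements (k=8) or 2 elements (k=16):
"🌿abcba🌿".strip("ab🌿")   # "c"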

Yet more thoughts:

brandjon commented 4 years ago

Another thing: Apparently in Python you can pass non-ASCII numeric code points to the int and float functions:

>>> int('\u0c67')
1

It seems niche to support this, but maybe it's worse not to support it and then have to make a backwards-incompatible change if we regret that?

The int and float functions also accept bytes objects.

brandjon commented 4 years ago

Would byte literals support the format operator % as in Python 3?

alandonovan commented 3 years ago

The Python spec calls them "bytes literals"... I suggest we do the same.

Agreed.

indexing bytes[int] would yield a 1-element byte string. This differs from Python 3.

Good point. Let's follow Python. But we will need an operation for converting a byte value to a 1-byte string (analogous to what chr(x) and "%c" % x do for code points). What is the Python 3 approach? bytes([x])? The list allocation makes me sad.
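
For reference, Python 3's existing spellings (real Python, checkable in a REPL):

>>> bytes([65])   # a list allocation just to build a 1-byte value
b'A'
>>> chr(65)
'A'
>>> '%c' % 9731
'☃'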

If you want the items of elems to be of the same type as the values returned by indexing, then elems would be an iterable of integers.

Good point. Ordinary (UTF-K) strings need methods for iterating over two sequences---elements, and Unicode code points---and both of these usefully come in two flavors: singleton substrings, and numeric codes. Byte strings don't need the code point iterator, but they would want both substring and numeric flavors unless there's a convenient and cheap alternative conversion. [Update: ord is that conversion, from 1-element strings to numbers.]

Should bytes values also have a decode method, and strings an encode method, as in Python?

No. It's redundant w.r.t. bytes(string) and str(bytes).

Should we support the ability to specify an encoding besides UTF-8?

No, I don't think it is important to support odd encodings. Users can add encoding functions to their applications if they want, but the world is moving to UTF-8. (The utf8everywhere.org manifesto is a good read.)

Should we support customizable (or at least multiple) behaviors for invalid text, e.g. "strict" mode (fail fast)?

No. Such things could in principle be added later, but I don't see a compelling need, and I doubt one will arise. I really don't want to add callback functions to these simple interfaces.

Probably best to prohibit implicit iteration for consistency with Starlark strings.

Agreed.

a unicode code point appearing within a bytes literal is treated as its UTF-8 encoding, whereas the same code point appearing within a string literal is transcoded (if necessary) to the implementation's native string encoding.

Exactly. The important thing is that for any "abc" that is legal in both "abc" and b"abc" literals, we have the invariant bytes("abc") == b"abc".

For strip, a plausible solution is to specify that it interprets its [cutset] as a set of code points, not string elements.

That seems like the best approach; it's the one taken by Go's strings.Trim (https://golang.org/pkg/strings/#Trim), for example. The implementation could be simpler than what you suggest: instead of materializing the sequence of code points, you can decode the input string and test each code point as you go. The most common cutset is a single ASCII byte, followed by a sequence of ASCII bytes, both of which make the decoding and the testing trivial. Rarely is the cutset non-ASCII or large, so the quadratic implementation is fine for those cases.
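
A minimal sketch of that approach, written in Python for brevity (the real implementations are in Go and Java; a UTF-8 implementation would decode incrementally rather than index code points directly):

def strip_cutset(s, cutset):
    # Decode the cutset once into a set of code points.
    # (Iterating a Python str yields code points; in Go one would decode
    # UTF-8 here, and the U+FFFD-vs-strict question applies at this step.)
    points = set(cutset)
    start, end = 0, len(s)
    # Advance past leading code points that appear in the set...
    while start < end and s[start] in points:
        start += 1
    # ...and trim trailing ones the same way.
    while end > start and s[end - 1] in points:
        end -= 1
    return s[start:end]

strip_cutset("xxhixx", "x")   # "hi"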

Python appears to be case insensitive for literal prefixes like b and r.

I'd vote for lowercase only.

Python has raw bytes literals, spelled rb or br (again case insensitive).

Starlark should too.

Python has \N to allow specifying unicode characters by name. This seems like extra complexity we don't need.

Agreed.

Triple-quoted bytes literals are permitted by Python so we should probably allow them as well.

Agreed.

Python disallows non-ASCII code points from appearing inside bytes literals, contrary to what was discussed above. b'🌿' is a syntax error.

That's no doubt because Python doesn't mandate a source file encoding, but Starlark does (though Bazel screws things up here).

We should disallow implicit bytes literal concatenation (b'abc' b'def'), as we do for strings. We should allow explicit concatenation (b'abc' + b'def').

Agreed.

In Python, ord() also works on bytes values of length 1.

That's good to know. This provides a means of efficiently mapping from 1-element substrings to numbers, and means we could do away with the numeric flavor of the 1-element substring iterator for both strings and for bytes.
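
For reference (real Python 3):

>>> ord(b'A')   # length-1 bytes
65
>>> ord('A')    # length-1 str
65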

Python also has hex() and fromhex() methods on bytes.

That seems reasonable to add. It can be defined in the language (or rather, it could be if we had "%x" % x, and we should), but not without an enormous amount of wasteful allocation.
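
Python's methods, for reference:

>>> bytes.fromhex('68690a')
b'hi\n'
>>> b'hi\n'.hex()
'68690a'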

For strings, the various alphanumeric / whitespace methods will need to be made unicode-aware to match Python 3's behavior. Last I looked, this entailed embedding a database of unicode character traits in the binary.

Yes. I recently reviewed a change to RE2/J which generated these tables from ICU, instead of from Go's unicode package: https://github.com/google/re2j/commit/eedfe4bfdfdf6ebf7055e641a9398a81b6f77f6c We should learn what we can from there before coding.

Apparently in Python you can pass non-ASCII numeric code points to the int and float functions:

int('\u0c67')   # U+0C67 is TELUGU DIGIT ONE; returns 1

It seems niche to support this

It sure does seem niche, not to mention an intriguing avenue for security attacks. We should definitely not support it.

The int and float functions also accept bytes objects.

Seems like a relic of Python's old string handling, and not something we should support. bytes are not text.

Would byte literals support the format operator % as in Python 3?

No. Again, bytes are not text.

brandjon commented 3 years ago

we will need an operation for converting a byte value to a 1-byte string (analogous to what chr(x) and "%c" % x do for code points). What is the Python 3 approach? bytes([x])? The list allocation makes me sad.

1-element slices would work. Of course that still allocates, but that's unavoidable in Java if you're creating a (non-primitive) value, no?

Byte strings don't need the code point iterator, but they would want both substring and numeric flavors unless there's a convenient and cheap alternative conversion. [Update: ord is that conversion, from 1-element strings to numbers.]

Then is the proposal to have bytes.elems() give you an iterable over the 1-element slices, and have no other new methods, relying on ord() for the numeric conversion on an element-by-element basis? (Note that if e is an item in b.elems(), then ord(e) is just e[0]. But ord allows for an ad hoc polymorphism between strings and bytes that wouldn't otherwise exist.)

No. It's redundant wrt bytes(string) and str(bytes).

I imagine other constructs in Starlark could be considered redundant. list += vs. extend(), for instance. If we are to leave out a redundant construct, do we have a sense of which spelling -- the constructor-like notation, or the encode/decode methods -- is more commonly used by Python programmers?

No, I don't think it is important to support odd encodings. Users can add encoding functions to their applications if they want, but the world is moving to UTF-8. (The utf8everywhere.org manifesto is a good read.)

That does look like a good read. I can understand having a bias towards UTF-8, but I wonder how much of it is aspirational and how much of it reflects today's reality. Will lack of built-in support for other encodings be problematic to users?

I wonder: What do we do in Bazel when writing things like command line args of an action? Do we just never serialize any non-ASCII text in practice? Probably. But if we handle it correctly, the transcoding differs between Windows and Unix.

No. Such things could in principle be added later, but I don't see a compelling need, and I doubt one will arise. I really don't want to add callback functions to these simple interfaces.

If we reach a point where callbacks are being requested, I'd agree that we can just tell people to write a library function. But strictness seems like a basic enough feature that we can make it an option (agreed that it doesn't have to be done right now).

As for whether U+FFFD replacement is a sensible default: the argument against strictness is that you don't want to see failures when somebody throws bad data at you; you'd rather just process it and give garbage out for their garbage in? Ordinarily, as a rule, we prefer to fail fast. I'm undecided here.

Python is strict by default. Apparently strict means erroring out on surrogate code points, but not on U+FFFE and U+FFFF.
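
For reference, real Python 3 behavior:

>>> '\ud800'.encode('utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
>>> '\ufffe\uffff'.encode('utf-8')   # noncharacters pass through
b'\xef\xbf\xbe\xef\xbf\xbf'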

What was the root of Python 2's mess with unicode, where the moment the code handled non-ASCII data you'd suddenly see failures that your tests didn't catch? Was that related to the encoding strictness, or only to the loose typing between bytes and strings?

I'd vote for lowercase only.

Yeah, I'd kinda like to ban uppercase as non-conventional and weird.

That's no doubt because Python doesn't mandate a source file encoding, but Starlark does (though Bazel screws things up here).

Hm. According to SO, Java doesn't specify the source code format below the level of unicode code points. Is it in any way problematic to require UTF-8, beyond e.g. breaking the workflows of the masses of people clamoring to write Starlark code in Windows Notepad?

This provides a means of efficiently mapping from 1-element substrings to numbers, and means we could do away with the numeric flavor of the 1-element substring iterator for both strings and for bytes.

So for str, it'd just be s.elems(), s.codepoints(), and s.codepoint_ords()? Sounds like codepoint_ords could also be replaced by a function that operates on a single UTF-8-encoded code point, but in that case we'd be introducing a new predeclared symbol, so we may as well keep the method. How confident are we that this set of primitives -- iterators over elements and code points and str/bytes conversion via constructors -- provides the right set of tools to users?

Sticking to Python's treatment of unicode character metadata, and abandoning Python's loose byte typing w.r.t. int()/float() and its % formatting, sounds good to me.

alandonovan commented 3 years ago

If you're interested in this proposal, please take a look at the attached PR for the reference implementation in Go. I have a Java implementation too but it's considerably more complicated due to Bazel's long-standing mis-decoding of UTF-8 files as Latin1.

Suggestions welcome.

brandjon commented 3 years ago

I aggregated the issues raised in this discussion thread so far.

Resolved

Open issues

Iteration API

Exactly what elems(), elem_ords(), etc. methods do we support on strings and bytes?

Future extensions to encoding/decoding

If we support more encodings or more modes (strictness, etc.) in the future, what would that look like and would it interfere with any decisions we’re making now? I think I'm ok with the type names being the principal conversion operator (as opposed to encode()/decode() methods), so that means possibly adding keyword args to the str and bytes functions.

What was the root of Python 2's mess with unicode, where the moment the code handled non-ASCII data you'd suddenly see failures that your tests didn't catch? Was that related to the encoding strictness, or only to the loose typing between bytes and strings?

FWIW, this summed up what I was probably thinking of.

Surrogates

Under what circumstances does unpaired surrogate replacement by U+FFFD occur?

Does it happen when decoding the source text as UTF-8 (which technically shouldn't allow unpaired surrogates as content)? If so, that means unpaired surrogates would be transformed even when appearing inside bytes literals, though the transformation would occur at the source decoding level, not the lexer or parser. Is that within scope of the spec?

Does it happen when reading a \u or \U escape sequence representing a surrogate code point in a string literal? (Of course it wouldn't for a bytes literal.) With k=16, can you write "\uXXXX\uXXXX" to form a two-element string containing a valid surrogate pair, and have it be the exact same string value as if you wrote "\UXXXXXXXX" for the corresponding code point that the pair represents?

Does it happen when you convert between strings and bytes using str() and bytes()? In particular, how does this square with the property you cited:

The important thing is that for any "abc" that is legal in both "abc" and b"abc" literals, we have the invariant bytes("abc") == b"abc".

Note that U+FFFD replacement is the only thing standing in the way of having bytes <-> str conversion be lossless when k = 8.

1-byte strings

we will need an operation for converting a byte value to a 1-byte string (analogous to what chr(x) and "%c" % x do for code points). What is the Python 3 approach? bytes([x])? The list allocation makes me sad.

1-element slices would work. Of course that still allocates, but that's unavoidable in Java if you're creating a (non-primitive) value, no?

Actually, I suppose we can simply preallocate the 256 length-1 bytes values. But I see now we're talking about going from an integer to a bytes value.

Python apparently has bytes(size), yielding a bytes value consisting of size zero (0x00) bytes. I presume we don't want to carry over that constructor, but its presence in Python blocks us from doing bytes(octet) -> bytes([octet]). Maybe we could use a keyword parameter, e.g. bytes(octet=x) -> bytes([x])?
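
For reference, Python's constructor, plus the hypothetical keyword spelling floated here:

>>> bytes(3)   # Python: size-many zero bytes
b'\x00\x00\x00'

bytes(octet = 65)   # hypothetical Starlark-only spelling: same as bytes([65]), i.e. b"A"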

Strange values of k

In the spec, do we support k = 7 (UTF-7) or k = 32 (UTF-32)? Or are we explicitly mandating that k is either 8 or 16?

brandjon commented 3 years ago

See my comments on the starlark-go PR. I think the unifying theme of the problem with elems/ord is that we need a convenient, orthogonal, and obvious way to construct and deconstruct arbitrary sequences of k-bit elements. The restriction on non-ASCII numeric escapes in strings doesn't do that. The unicode-centric interpretation of ord() doesn't do that.

Instead, what if we make ord() always give the integer corresponding to the element (for both bytes and strings), and allow numeric escapes up to the maximum value representable in k bits? I.e., you could do "\xFF" in the Go implementation, and possibly even "\XFFFF" (a new escape for the new width) in Java. Then it's easy to make any kind of string, including malformed unicode strings, which were already possible but just edge cases. For both strings and bytes we do away with .elem_ords() unless it turns out to be performance critical for something. ord(a[i]) gives the number of the ith element for both strings and bytes.

For strings, we also provide a top-level codepoint_ord() and a string.codepoints() method (with string.codepoint_ords() similarly being omitted as redundant). codepoint_ord() gets the number of the first codepoint in the arg.

Going from integers to string or byte elements is less uniform. We have bytes([x]) and chr(x), though the latter may produce more than one element depending on x and k.

Note that the possible \XNNNN escape is redundant with \uNNNN on k=16 platforms, and illegal on k=8 platforms. Its main purpose is to signal that we're constructing an element, not a UTF-8 sequence. I'm open to other ideas though. (The escape-based approach also falls apart entirely if we are allowed to conceive of k=32 implementations.)

Edit: You may argue that this makes strings too similar to bytes. But if we really wanted them to be so dissimilar, wouldn't we define them as sequences of code points like Python does?

brandjon commented 3 years ago

Notes from my discussion with Alan follow.

Corrections from my summary above:

Construction

The constructor syntax we were going to use differs from Python's.

We should aim for consistency with Python. At the very least, this means disallowing bytes(string_value) and having repr-like behavior for str(bytes_value). I suggest that we support both the constructor-style and encode()/decode()-style conversions, but the encode/decode one is more important since it permits omitting the encoding param in the common case of utf-8.

Unlike Python, we should require that encoding is passed by keyword only, not positionally. This improves the clarity of user code and helps avoid the situation where we can't introduce subsequent parameters (like strict) until this one is implemented.

We omit Python's bytes(size) constructor. (The nullary bytes() constructor should probably still be allowed and produce an empty bytes object, analogous to str() and to other nullary constructors producing "zero" values.)

The bytes.join() method allows for linear-time concatenation of bytes, similar to strings. E.g. b''.join(bytes_builder) where bytes_builder is a list of bytes objects.

We may support bytes(octet=x) as a slightly more efficient synonym for bytes([x]), though we don't have to decide that right away. chr() and codepoint_chr() (see below) already serve that purpose for strings. For bytes, you can construct an arbitrary sequence dynamically (i.e., outside literals) by using bytes([n1, n2, ...]). For strings, you'd have to do ''.join([chr(n) for n in [n1, n2, ...]]).
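
Putting the Construction decisions together, a hypothetical sketch (spellings per the discussion above, not a final spec):

"hello".encode()                            # b"hello"; encoding defaults to utf-8
b"hello".decode()                           # "hello"
b"caf\xc3\xa9".decode(encoding = "utf-8")   # "café"; encoding is keyword-only
str(b"hi")                                  # repr-like output, not a decode
b"".join([b"ab", b"cd"])                    # b"abcd" -- linear-time concatenation
bytes()                                     # b"" -- nullary constructor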

Strictness

Similarly, we should side with Python's default of failing fast when an encoding error is detected. This behavior can be overridden by the strict parameter, which behaves as in Python (but perhaps omitting some of its possible values to start, e.g. "ignore"). Unlike Python, we make strict keyword-only.
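
A sketch of the proposed default and override (hypothetical spelling; the parameter's exact values aren't pinned down here):

b"\xff".decode()                 # error: fails fast on the invalid UTF-8 sequence
b"\xff".decode(strict = False)   # "\ufffd" -- replacement instead of failure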

While decoding (utf-8-encoded) Starlark source code, an invalid surrogate code point should result in an error.

Value of k

We specify that k = 8 and k = 16 are the only valid implementations. No k = 32 in particular.

Elements and encodings

Manipulating string elements is a first-class concept regardless of unicode encodings.

We add a \Xnnnn escape (not present in Python) for specifying element values from 0x0000 to 0xFFFF (k=16) or to 0x00FF (k=8). It's a lexical error to go out of range on a k=8 implementation. On a k=16 implementation, \X has the same effect as \u for non-surrogate code points.

It is illegal to use \u or \U to specify surrogate code points. This differs from Python. The rationale is that in Python, unicode escapes create an element whose code point matches the given number, and a surrogate is a valid Python string element (and a valid Starlark string element for k = 16). But in this proposal, unicode escapes simply append the utf-k encoding of the given code point to the string, and a surrogate's code point cannot be properly encoded. If you really want this encoded value anyway, you can construct it with one \X (for k=16 implementations) or several \x (for k=8, or for bytes objects) escapes.

The string.elems() and bytes.elems() methods give you an iterable of the elements -- individual length-1 strings for strings, and octet integers for bytes. In both cases this matches what you'd get from directly iterating over the object in Python (which we don't support). The ord() top-level function takes in a length-1 string or bytes value and returns its ordinal. The chr() top-level function does the inverse operation for strings, taking an integer (in range 2^k) to a 1-element string. Note that all of this is independent of unicode encodings.
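
Collecting the element-oriented API as proposed (hypothetical; values shown for k = 8):

list("hi".elems())    # ["h", "i"] -- length-1 strings
list(b"hi".elems())   # [104, 105] -- octet integers
ord("h")              # 104 -- length-1 string to its element value
ord(b"h")             # 104 -- length-1 bytes value likewise
chr(104)              # "h" -- integer (in range 2**k) to a 1-element string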

For strings we also provide a string.codepoints() method and two toplevel functions, codepoint_ord() and codepoint_chr().

All three of these could take a strict parameter to control their tolerance for accepting and yielding surrogates. They never accept or produce values outside the unicode range (U+0000 to U+10FFFF).
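
And the unicode-oriented counterpart for strings (hypothetical):

list("a🌿".codepoints())   # ["a", "🌿"] -- 1-code-point substrings
codepoint_ord("🌿")        # 127807 (0x1F33F) -- ordinal of the first code point
codepoint_chr(127807)      # "🌿" -- produces the UTF-k encoding of the code point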

This gives us a complete elements-oriented API for dealing with strings and bytes, as well as a separate unicode-oriented API for strings. Note that whereas the format of a bytes object can be customized by the encoding parameter of encode()/decode(), the API always assumes utf-k for strings.

We do need to give some thought to how to effectively handle strict failure, given that Starlark does not have exceptions. It seems undesirable to add look-before-you-leap variants of these methods, or even a keyword param that changes a result-producing call into a safety check. Maybe we can have "strict" mean "returns None on failure".