bazelbuild / starlark

Starlark Language
Apache License 2.0
2.44k stars 161 forks source link

Restore codepoints? #191

Open ndmitchell opened 3 years ago

ndmitchell commented 3 years ago

The spec at https://github.com/google/skylark/blob/3705afa472e466b8b061cce44b47c9ddc6db696d/doc/spec.md#string%C2%B7codepoints has a function codepoints, which gives the string as a series of points. That seems pretty reasonable to provide. Why did it disappear? The Rust Starlark implementation provides it, based on the old spec.

adonovan commented 3 years ago

It's present in the Go implementation and its documentation too, and I think it's a reasonable function. But when the spec in this repo was derived from the spec I wrote for the Go implementation, tricky parts were sadly just thrown out rather than flagged as differences.

In this particular case, the feature is tricky to implement in the Java implementation: it's technically easy to expose java.lang.String's codepoint iterator, but doing so would reveal that all Starlark strings in Bazel are actually mis-decoded (UTF-8 files and directories are decoded as Latin1). Hiding the codepoints iterator (and also the chr function and "%c" format) makes it impossible for Bazel users to notice.

This is an unfortunate case where an old mistake in Bazel (mine, BTW) makes it hard to "do the right thing" in Starlark.

ndmitchell commented 3 years ago

Would the right thing to be to add it to Starlark, but make a caveat that the Java implementation isn't compliant here? If historical mistakes in Bazel prevent progress on Starlark there might be a number of issues where we are unable to make forward progress.

adonovan commented 3 years ago

Yes, I think that's appropriate. We could even add the method to Java, but make it fail dynamically subject to a flag that is disabled in Bazel as long as Bazel suffers from https://github.com/bazelbuild/bazel/issues/374 and related bugs.