[GLSL 4.6 Specification] Clarify the translation from UTF-8 scalar values to the corresponding character set tokens

ContingencyOfTautologicalContradictions commented 1 year ago

In the GLSL 4.6 specification, add the following paragraph to section 3.1:

The given files for compilation must be in the form of a well-formed UTF-8 code unit sequence. These files are decoded to produce their corresponding sequence of Unicode scalar values. A sequence of character set tokens is then formed by mapping each Unicode scalar value to the corresponding character set token. In the resulting sequence, each pair of characters in the input sequence consisting of U+000D CARRIAGE RETURN followed by U+000A LINE FEED, as well as each U+000D CARRIAGE RETURN not immediately followed by a U+000A LINE FEED, is replaced by a single new-line character.
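As a rough illustration of the steps in the proposed paragraph, here is a minimal Python sketch. It is not taken from the spec or from any compiler; the function name `to_character_tokens` is hypothetical, and it assumes the mapping from Unicode scalar values to character set tokens is the identity mapping, which the proposed text does not actually state.

```python
def to_character_tokens(data: bytes) -> str:
    """Decode well-formed UTF-8, then fold CR LF and bare CR into one new-line.

    Assumption: each Unicode scalar value maps to itself as a character
    set token; the proposed spec text does not define the mapping.
    """
    # Strict decoding raises UnicodeDecodeError on ill-formed UTF-8.
    text = data.decode("utf-8", errors="strict")
    # Replace CR LF pairs first so a pair does not become two new-lines,
    # then replace any remaining bare CR.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def is_well_formed(data: bytes) -> bool:
    """Report whether the input would be accepted by the sketch above."""
    try:
        to_character_tokens(data)
        return True
    except UnicodeDecodeError:
        return False
```

Under this reading, `to_character_tokens(b"a\r\nb\rc\nd")` yields four lines separated by single new-line characters, and a stray byte such as `0xFF` is rejected before any token is formed.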

arcady-lunarg commented 1 year ago

This sounds like an issue with the spec rather than with the glslang compiler, so I transferred it to the appropriate repository for that sort of issue.

gnl21 commented 1 year ago

I'm not sure what ambiguity you're aiming to clear up here, perhaps because I'm not sufficiently knowledgeable about UTF-8. Is there an alternative way of interpreting a UTF-8 sequence other than what you describe? I'm fine with spelling things out clearly, but this seems to be straying into territory that should be covered by the UTF-8 spec, rather than GLSL.

One specific concern that I have, for example, is that the proposed text talks about mapping the UTF-8 characters into the character set but doesn't say what the mapping is. I think that the UTF-8 codepoints actually already represent the characters, so don't need mapping, which is why the correct mapping is obvious, but if they're different enough to require mapping then we should say what the mapping is.

I'm not convinced that the handling of new lines in the proposed text is correct according to the current spec. GLSL currently says that any of "\r", "\n" or "\r\n" are a valid line break, which isn't the same as in your comment. I'm not sure what glslang implements for this.

arcady-lunarg commented 1 year ago

It looks like glslang currently treats "\n" or "\r\n" as line terminators. The situation with bare "\r" is more complicated: I think it will not produce syntax errors, but it also will not give the right line numbers. Note that the spec actually limits the valid characters in GLSL tokens to (a subset of) ASCII, and the core language does not have strings. The GLSL_EXT_debug_printf extension does add string literals, but the extension spec language still does not allow the use of codepoints above 126 in tokens, so the only place where non-ASCII characters can occur is in comments, where the current spec allows any byte values and doesn't require well-formed UTF-8. In practice, glslang doesn't enforce this and just accepts any sequence of bytes in a string literal (or in a header name in a #include, another place where arbitrary strings are allowed).
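To illustrate why a bare "\r" can throw off reported line numbers without causing a syntax error, here is a small Python sketch comparing two ways of counting lines. It is not glslang's implementation; both function names are hypothetical and exist only for the comparison.

```python
import re


def count_lines_lf_based(src: str) -> int:
    """Count lines when only "\n" and "\r\n" end a line.

    Both terminators end in "\n", so counting "\n" is enough;
    a bare "\r" is treated as ordinary whitespace.
    """
    return src.count("\n") + 1


def count_lines_any_terminator(src: str) -> int:
    """Count lines when "\r\n", bare "\r", and "\n" each end a line."""
    # "\r\n" is listed first so the pair is matched as one terminator.
    return len(re.findall(r"\r\n|\r|\n", src)) + 1


# A source that mixes all three terminators.
source = "void main()\r{\r\n}\n"
lf_based = count_lines_lf_based(source)
any_terminator = count_lines_any_terminator(source)
```

For this input the two counters disagree by one line (3 versus 4), so a diagnostic on the closing brace would be reported on a different line depending on which rule the compiler follows.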

KhronosGroup / GLSL

[GLSL 4.6 Specification] Clarify the translation from UTF-8 scalar values to the corresponding character set tokens #220