add string literal syntax using paired Unicode delimiters

clarkevans commented 3 years ago

Executive Summary

This is a request to add a string literal syntax using paired Unicode delimiters, perhaps ⟪ and ⟫, for use in non-standard string literal macros. It is proposed as a complement to, not a replacement for, single or triple double-quoted raw strings.

Description of Requested Syntax

Paired delimiters '⟪': U+27EA (Ps: Punctuation, open) and '⟫' (U+27EB, Pe: Punctuation, close) are employed.
Following a string macro name, such as htl for @htl_str, the open delimiter, ⟪, begins a string using this syntax.
The parser knows the extent of the string when the corresponding closing delimiter, ⟫, is encountered.
Nested pairs of these delimiters are seen as content. This could be done by tracking depth: the open delimiter increases the depth and the close delimiter decreases it. When the depth reaches zero, scanning is done.
The entire extent of the scanned buffer, less the very first opening and the very last closing delimiters, becomes the string value that is passed along to the string macro.
There are no further complications with regard to scanning or processing of the string done by Julia. In particular, from Julia's perspective, there is no mechanism to escape content, interpolate content, or enter arbitrary Unicode code points.
The interactive Julia environment could add \<< and \>> as a way to enter these paired delimiters.

Critically, this non-standard string literal syntax provides no mechanism to escape either of the delimiters, except that nested pairings are permitted within content. In particular, unbalanced use of the given delimiters is simply not valid syntax. Julia provides no mechanism to enter unbalanced delimiters within this syntax.
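The scanning rule described above can be sketched in ordinary Julia. This is only an illustration of the depth-counting idea, not the parser change itself: the function name scan_notation and the constants OPEN_DELIM and CLOSE_DELIM are hypothetical, and a real implementation would live in Julia's tokenizer.

```julia
# Hypothetical names, chosen for this sketch only.
const OPEN_DELIM  = '⟪'   # U+27EA
const CLOSE_DELIM = '⟫'   # U+27EB

# Scan `src` starting at an opening delimiter. Returns the raw content
# (outermost delimiters removed, nested pairs kept verbatim) and the index
# at which ordinary parsing would resume.
function scan_notation(src::AbstractString, start::Integer = firstindex(src))
    src[start] == OPEN_DELIM ||
        throw(ArgumentError("expected $OPEN_DELIM at index $start"))
    depth = 0
    i = start
    while i <= lastindex(src)
        c = src[i]
        if c == OPEN_DELIM
            depth += 1
        elseif c == CLOSE_DELIM
            depth -= 1
            if depth == 0
                # No escaping and no interpolation: the span is passed on as-is.
                content = src[nextind(src, start):prevind(src, i)]
                return content, nextind(src, i)
            end
        end
        i = nextind(src, i)
    end
    # Reaching the end of input with depth > 0 means the delimiters are unbalanced.
    throw(ArgumentError("unbalanced $OPEN_DELIM: no matching $CLOSE_DELIM"))
end

content, next = scan_notation("⟪a⟪b⟫c⟫ rest")
# content == "a⟪b⟫c"; the text from `next` onward is " rest"
```

Note that the only state is a single integer, which is what makes the rule easy to explain: there is nothing to escape, only pairs to count.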

Motivation

Let's define the term notation to mean what is currently in the documentation as "non-standard string literal". The word notation is used by SGML and other standards for this concept.

For those doing data munging to interoperate with other systems, there is an opportunity for the Julia language to better utilize notations, enhancing developer experience and improving code readability. While developing HypertextLiteral (providing Julia-style string interpolation to HTML construction), I ran into 3 challenges with existing "non-standard string literals" (notations).

1) They are not succinct. Since a great many subordinate syntaxes include the double quote character, use of the triple double-quoted form is the norm. The double quote character is already loud, tripling it on both ends... becomes a distraction. Note that this deficiency applies also to the use of @macros().

2) They can be surprising. For cases where someone tries to use the single double-quoted form, novice users can be caught off guard with the raw_str escaping semantics and how it interacts with the backslash. As noted on the discourse forums, this escaping mechanism is not a "homomorphism over string concatenation", e.g. raw(a) raw(b) != raw(ab).

3) They can't be used recursively. If one would like to embed one notation inside another, a round of character escaping is required. This is unlike, for example, @macros() which nest perfectly well.
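To make the second point concrete, here is a small check that runs in current Julia. It relies only on the documented raw string rule that backslashes immediately before a double quote are treated specially; the variable names are arbitrary.

```julia
# Source text `a\\` followed by the closing quote: the two backslashes
# collapse to one because they sit directly before a quote.
left  = raw"a\\"
right = raw"b"

# Source text `a\\b`: the same two backslashes are no longer before a quote,
# so both are kept.
whole = raw"a\\b"

length(left)   # 2 characters: a, \
length(whole)  # 4 characters: a, \, \, b

left * right == whole   # false: raw(a) raw(b) != raw(ab)
```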

A promising option emerged in the discussion forums: the use of paired Unicode delimiters together with a matching parsing algorithm in place of traditional character escaping. You could think of this approach as bringing to string construction what we already know about function calling and data structures -- that they are seldom flat structures.

Specifically, we could employ '⟪': U+27EA (Ps: Punctuation, open) and '⟫' (U+27EB, Pe: Punctuation, close) as paired delimiters. This particular glyph combines a doubling (reminiscent of double quotes) with the shape of parentheses (implying nestability). It's not perfect, but it is visually distinct in most fonts and in mono-space fonts appears to take the space of one regular character.

When Julia encounters a name token, say htl, followed by ⟪, it would enter a "notation" parsing state. Here it would keep track of the nesting depth, increasing depth when additional ⟪ are encountered, and decreasing depth when ⟫ is encountered. When the depth reaches zero, the entire span (less the outermost tokens) of the string is sent unprocessed to @htl_str, and Julia parsing resumes. The REPL could add \<< and \>> shorthand to permit these two characters to be easily entered.

This addresses the three deficiencies noted above. This paired delimiter is much more succinct and visually attractive as compared to tripled double-quotes. The rule is unsurprising since there is no escaping, only the counting of depth, as one would find with parenthesized expressions. The rule naturally supports nesting: any construction using this method could be directly embedded as a subordinate notation. Moreover, if Unicode is used, these delimiters are unlikely to collide with those used in traditional systems, and if they do, so long as those systems use only paired form, there is no difficulty.

What about content having a non-paired delimiter?

This is a two part answer. Primarily, how to avoid the chosen delimiter pair becomes the notation's concern, not Julia's. For example, HTML has ampersand escaping, so the opening delimiter could be written as &#x27EA;. URLs use percent-encoding. Traditional double-quoted syntax (e.g. "\u27EA" for the opening delimiter) could be used by a Python notation. For example, to encode a non-paired opening delimiter, a use of this feature might look like...

htm⟪<html><body>We start these string literals with <code>&#x27EA;</code></body></html>⟫

Asking a notation to provide its own delimiter escaping is not without precedent. In web pages, embedded Javascript begins with <script> and ends when the HTML parser encounters </script> -- with no escape mechanism. Javascript developers who need to represent this sequence within their logic use regular double quoted strings, with the delimiter encoded as "<\/script>".

As a fallback, for notations such as @raw_str which lack such features, if the user must include a non-paired delimiter, they could use the existing raw string syntax, which would not go away. Alternatively, they could be creative and build their string in chunks, using this syntax for most of the content and concatenating with regular double quoted strings for the non-paired delimiter. This proposed syntax aims to be complementary to existing approaches and represents a different set of sensibilities.

Increased Usability

With this feature, a regular expression to detect quoted strings might be written as r⟪(["'])(?:\\?+.)*?\1⟫ with no need to triple double-quote or worry about backslashes. Moreover, other notations could embed regular expression notation without having to worry about a round of additional escaping.
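For comparison, the same pattern can be written in current Julia using the triple double-quoted form. This is a small sketch; the variable names and the sample input are made up for illustration.

```julia
# Today the pattern needs tripled quotes because it contains a double quote.
quoted = r"""(["'])(?:\\?+.)*?\1"""

# A sample input containing a single-quoted string with an escaped quote inside.
sample = raw"say 'it\'s' now"

m = match(quoted, sample)
m.match   # the quoted span, including its surrounding quotes
```

The pattern body is identical in both spellings; only the delimiters around it change.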

I believe these rules would permit developers integrating with foreign data producers and consumers to create their own succinct, unsurprising and nested function-like transformations that mix native languages within a Julian data processing context. Here is an example.

render(books) = htl⟪
  <table><caption><h3>Selected Books</h3></caption>
  <thead><tr><th>Book<th>Authors<tbody>$(htl⟪
    <tr><td>$(book.name)<td>$(join(book.authors, " & "))
    ⟫ for book in books)</tbody></table>⟫

In HypertextLiteral, the functionality above is currently written as...

render(books) = @htl("""
  <table><caption><h3>Selected Books</h3></caption>
  <thead><tr><th>Book<th>Authors<tbody>$(@htl("""
    <tr><td>$(book.name)<td>$(join(book.authors, " & "))
    """)) for b in books)</tbody></table>""")

While one might argue that the latter form is perfectly fine, this example works because HTL uses Julia's syntax and excellent parser. Notations defined outside of Julia's ecosystem won't have this luxury.

In conclusion, a succinct, unsurprising, and nestable way to incorporate foreign notations as Julia expressions will open up opportunities for innovative uses of Julia's excellent macro system and dynamic programming environment. What are the costs? A relatively simple parser rule and integration with existing string macros and... the assignment of a Unicode pair.

heetbeet commented 3 years ago

I got something working. It plays nicely with the current system and macros and has its own escape semantics: https://github.com/heetbeet/julia/tree/b55322073313b7af0b6fa849c28fad95cc66ca79

The following pairs are chosen: «» French-style, »« Danish-style and ⟪⟫ Tai Lue-style
Current semantics for x"foobar"y are left alone and remain exactly as they are (no unforeseen changes to the language).
The new syntax is only available as a macro call like raw«foobar» and not as some standalone «foobar» string syntax.
Block macros are easily made available by defining a 3 or 4 argument macro ending in _block; i.e. macro x_block(str, ldelim, rdelim, suffix) end
Furthermore, if x_block is undefined but x_str is defined, x«foobar» will dispatch to x_str by discarding the delimiter information.
If no suitable macro is defined you get an UndefVarError: @x_block not defined error.
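The lookup order in the last three points can be sketched as a plain function. This is a guess at the logic for illustration, not the code in the linked branch, and resolve_notation_macro is a hypothetical name.

```julia
# Given a module and a notation prefix such as :raw, pick the macro that
# x«...» would dispatch to: prefer @x_block, fall back to @x_str,
# otherwise report @x_block as undefined.
function resolve_notation_macro(mod::Module, name::Symbol)
    block = Symbol("@", name, "_block")
    str   = Symbol("@", name, "_str")
    isdefined(mod, block) && return block
    isdefined(mod, str)   && return str   # delimiter information is discarded
    throw(UndefVarError(block))
end

resolve_notation_macro(Base, :raw)   # Base defines @raw_str but no @raw_block
```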

Example from a Julia session

julia> raw»hello«
"hello"

julia> bla»hello«
ERROR: LoadError: UndefVarError: @bla_block not defined
in expression starting at REPL[3]:1

julia> println(raw»I can repre\\se\\\\\\"\\\"nt these"""«)
I can repre\\se\\\\\\"\\\"nt these"""

You can make it work alongside str macros that employ their own escape semantics. I.e. you can have b"\n" transform \n to a newline, and have b»\n« keep the text as \n:

julia> b"\n" == b»\n«
true

julia> #Oopsie...

julia> macro b_block(str, args...); codeunits(str) end
@b_block (macro with 1 method)

julia> b"\n" == b»\n«
false

julia> b»\n«
2-element Base.CodeUnits{UInt8, String}:
 0x5c
 0x6e

adkabo commented 2 years ago

Here is some more prior art.

In D, string literals use q"…" to allow:

any of several paired delimiters:

q"(...)"
q"[...]"
q"<...>"
q"{...}"

multi-line heredocs:

q"EOS
This
is a multi-line
heredoc string
EOS"

user-specified single-character delimiters:

q"/foo]/"          // "foo]"
// q"/abc/def/"    // error

In C++, raw literals let the user specify an outfix pair around parentheses like R"foo(...)foo".

// with no identifier, only (
R"(hello)" // in
hello // printout

// nested, with no identifier, only (
R"(hello R"x(world)x")" // in
hello R"x(world)x" // printout

// with no identifier, it is not possible to include )" in the string, because there is no escaping.
R"(hello)")" // in
error: Unexpected `)` after hello)". // printout

// using bar()bar to identify the delimiter level solves that problem
R"bar(world)")bar" // in
world)" // printout

// using foo()foo and bar()bar for nested strings
R"foo(hello R"bar(world)bar")foo" // in
hello R"bar(world)bar" // printout

c42f commented 1 year ago

I had one idea over at https://github.com/JuliaLang/julia/issues/39092#issuecomment-1238788154 which should belong here in this thread instead, I guess. To copy some of it:

One possibility could be that either ` or "" begins a string when it's followed by the opposite quote type, with the (reversed/same?) delimiter at the other end of the string. It's currently a syntax error to juxtapose string literals so this syntax is probably available unless I've forgotten something. For example:
julia> :(x``"hi"``)
ERROR: syntax: cannot juxtapose string literal
Stacktrace:
 [1] top-level scope
   @ none:1
I'm imagining this mixed delimiter parsing as @x_cmd "hi", given that the quote starts with a backtick and it could be @x_str for ".

The rule might be that mixed delimiters can be an arbitrarily long sequence of length at least 3, and the user can always arrange for those to not be present in the string they're trying to quote.
