JuliaLang / julia

The Julia Programming Language
https://julialang.org/

add string literal syntax using paired Unicode delimiters #38948

Open clarkevans opened 3 years ago

clarkevans commented 3 years ago

Executive Summary

This is a request to add a string literal syntax using paired Unicode delimiters, perhaps ⟪ and ⟫, for use in non-standard string literal macros. It is proposed as a complement to, not a replacement for, single or triple double-quoted raw strings.

Description of Requested Syntax

Critically, this non-standard string literal syntax provides no mechanism to escape either of the delimiters, except that nested pairings are permitted within content. In particular, unbalanced use of the given delimiters is simply not valid syntax. Julia provides no mechanism to enter unbalanced delimiters within this syntax.

Motivation

Let's define the term notation to mean what is currently in the documentation as "non-standard string literal". The word notation is used by SGML and other standards for this concept.

For those doing data munging to interoperate with other systems, there is an opportunity for the Julia language to better utilize notations, enhancing developer experience and improving code readability. While developing HypertextLiteral (providing Julia-style string interpolation to HTML construction), I ran into 3 challenges with existing "non-standard string literals" (notations).

1) They are not succinct. Since a great many subordinate syntaxes include the double quote character, use of the triple double-quoted form is the norm. The double quote character is already loud, tripling it on both ends... becomes a distraction. Note that this deficiency applies also to the use of @macros().

2) They can be surprising. For cases where someone tries to use the single double-quoted form, novice users can be caught off guard with the raw_str escaping semantics and how it interacts with the backslash. As noted on the discourse forums, this escaping mechanism is not a "homomorphism over string concatenation", e.g. raw(a) raw(b) != raw(ab).
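A small illustration of this surprise, using only the existing raw string syntax (the variable names and the path-like content are just for the example): the same two backslashes mean one thing immediately before the closing quote and another in the middle of the literal.

```julia
# Two backslashes directly before the closing quote collapse to one...
a = raw"C:\\"
b = raw"dir"

# ...but the same two backslashes in the middle of a literal are both kept.
ab = raw"C:\\dir"

println(a * b)   # concatenation of the two pieces
println(ab)      # the "same" source text written as a single literal
println(a * b == ab)
```

So splitting or joining the source text of raw strings does not, in general, split or join their values.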

3) They can't be used recursively. If one would like to embed one notation inside another, a round of character escaping is required. This is unlike, for example, @macros() which nest perfectly well.

A promising option emerged in the discussion forums: the use of paired Unicode delimiters together with a matching parsing algorithm in place of traditional character escaping. You could think of this approach as bringing to string construction what we already know about function calling and data structures -- that they are seldom flat structures.

Specifically, we could employ '⟪' (U+27EA, Ps: Punctuation, open) and '⟫' (U+27EB, Pe: Punctuation, close) as paired delimiters. This particular glyph combines a doubling (reminiscent of double quotes) with the shape of a parenthesis (implying nestability). It's not perfect, but it is visually distinct in most fonts and in monospace fonts appears to take the space of one regular character.

When Julia encounters a name token, say htl, followed by ⟪, it would enter a "notation" parsing state. Here it would keep track of the nesting depth, increasing the depth when an additional ⟪ is encountered and decreasing it when a ⟫ is encountered. When the depth reaches zero, the entire span (less the outermost delimiters) is sent unprocessed to @htl_str, and Julia parsing resumes. The REPL could add \<< and \>> shorthand to permit these two characters to be easily entered.
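The depth-counting rule can be sketched in a few lines of ordinary Julia. This is only an illustration of the idea, not the actual parser change; the function name `scan_notation` and its keyword arguments `opener` and `closer` are hypothetical.

```julia
# Hypothetical helper (not part of Julia): starting at an opening delimiter,
# return the unprocessed content up to the matching closing delimiter and the
# index at which normal parsing would resume.
function scan_notation(src::AbstractString, start::Integer;
                       opener::Char='⟪', closer::Char='⟫')
    src[start] == opener || throw(ArgumentError("expected $opener at index $start"))
    depth = 0
    i = start
    while i <= lastindex(src)
        c = src[i]
        if c == opener
            depth += 1                      # entering a (possibly nested) pair
        elseif c == closer
            depth -= 1                      # leaving a pair
            if depth == 0
                # Strip only the outermost delimiters; nested ones stay in the content.
                content = src[nextind(src, start):prevind(src, i)]
                return content, nextind(src, i)
            end
        end
        i = nextind(src, i)                 # step by character, not by byte
    end
    throw(ArgumentError("unbalanced delimiters: missing $closer"))
end

src = "htl⟪a ⟪b⟫ c⟫ rest"
content, next = scan_notation(src, findfirst('⟪', src))
println(content)
println(src[next:end])
```

No escape handling is needed anywhere in the loop: the only state is the depth counter, and unbalanced input is reported as an error rather than worked around.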

This addresses the three deficiencies noted above. This paired delimiter is much more succinct and visually attractive than tripled double-quotes. The rule is unsurprising since there is no escaping, only the counting of depth, as one would find with parenthesized expressions. The rule naturally supports nesting: any construction using this method could be directly embedded as a subordinate notation. Moreover, if Unicode is used, these delimiters are unlikely to collide with those used in traditional systems, and if they do, so long as those systems use only the paired form, there is no difficulty.

What about content having a non-paired delimiter?

This is a two-part answer. Primarily, how to avoid the chosen delimiter pair becomes the notation's concern, not Julia's. For example, HTML has ampersand escaping, so the opening delimiter could be written as &#10218;. URLs use percent-encoding. Traditional double-quoted syntax (e.g. "\u27EA" for the opening delimiter) could be used by a Python notation. For example, to encode a non-paired opening delimiter, a use of this feature might look like...

htm⟪<html><body>We start these string literals with <code>&#10218;</code></body></html>⟫

Asking a notation to provide its own delimiter escaping is not without precedent. In web pages, embedded Javascript begins with <script> and ends when the HTML parser encounters </script> -- with no escape mechanism. Javascript developers who need to represent this sequence within their logic use regular double quoted strings, with the delimiter encoded as "<\/script>".

As a fallback, for notations such as @raw_str which lack such features, if the user must include a non-paired delimiter, they could use the existing raw string syntax, which would not go away. Alternatively, they could be creative and build their string in chunks, using this syntax for most of the content and concatenating with regular double quoted strings for the non-paired delimiter. This proposed syntax aims to be complementary to existing approaches and represents a different set of sensibilities.

Increased Usability

With this feature, a regular expression to detect quoted strings might be written as r⟪(["'])(?:\\?+.)*?\1⟫ with no need to triple double-quote or worry about slashes. Moreover, other notations could embed regular expression notation without having to worry about a round of additional escaping.

I believe these rules would permit developers integrating with foreign data producers and consumers to create their own succinct, unsurprising and nested function-like transformations that mix native languages within a Julian data processing context. Here is an example.

render(books) = htl⟪
  <table><caption><h3>Selected Books</h3></caption>
  <thead><tr><th>Book<th>Authors<tbody>$(htl⟪
    <tr><td>$(book.name)<td>$(join(book.authors, " & "))
    ⟫ for book in books)</tbody></table>⟫

In HypertextLiteral, the functionality above is currently written as...

render(books) = @htl("""
  <table><caption><h3>Selected Books</h3></caption>
  <thead><tr><th>Book<th>Authors<tbody>$(@htl("""
    <tr><td>$(book.name)<td>$(join(book.authors, " & "))
    """) for book in books)</tbody></table>""")

While one might argue that the latter form is perfectly fine, this example works because HTL uses Julia's syntax and excellent parser. Notations defined outside of Julia's ecosystem won't have this luxury.

In conclusion, a succinct, unsurprising, and nestable way to incorporate foreign notations as Julia expressions will open up opportunities for innovative uses of Julia's excellent macro system and dynamic programming environment. What are the costs? A relatively simple parser rule and integration with existing string macros and... the assignment of a Unicode pair.

heetbeet commented 3 years ago

I got something working. It plays nicely with the current system and macros and has its own escape semantics: https://github.com/heetbeet/julia/tree/b55322073313b7af0b6fa849c28fad95cc66ca79

Example from a Julia session

julia> raw»hello«
"hello"

julia> bla»hello«
ERROR: LoadError: UndefVarError: @bla_block not defined
in expression starting at REPL[3]:1

julia> println(raw»I can repre\\se\\\\\\"\\\"nt these"""«)
I can repre\\se\\\\\\"\\\"nt these"""

You can make it work alongside str macros that employ their own escape semantics. I.e. you can have b"\n" transform \n to a newline, and have b»\n« keep the text as \n:

julia> b"\n" == b»\n«
true

julia> #Oopsie...

julia> macro b_block(str, args...); codeunits(str) end
@b_block (macro with 1 method)

julia> b"\n" == b»\n«
false

julia> b»\n«
2-element Base.CodeUnits{UInt8, String}:
 0x5c
 0x6e

adkabo commented 2 years ago

Here is some more prior art.

  1. In D, string literals use q"…" to allow
q"(...)"
q"[...]"
q"<...>"
q"{...}"
q"EOS
This
is a multi-line
heredoc string
EOS"
q"/foo]/"          // "foo]"
// q"/abc/def/"    // error
  2. In C++, raw literals let the user specify an outfix pair around parentheses like R"foo(...)foo".
// with no identifier, only (
R"(hello)" // in
hello // printout

// nested, with no identifier, only (
R"(hello R"x(world)x")" // in
hello R"x(world)x" // printout

// with no identifier, it is not possible to include )" in the string, because there is no escaping.
R"(hello)")" // in
error: Unexpected `)` after hello)". // printout

// using bar()bar to identify the delimiter level solves that problem
R"bar(world)")bar" // in
world)" // printout

// using foo()foo and bar()bar for nested strings
R"foo(hello R"bar(world)bar")foo" // in
hello R"bar(world)bar" // printout

c42f commented 1 year ago

I had one idea over at https://github.com/JuliaLang/julia/issues/39092#issuecomment-1238788154 which should belong here in this thread instead, I guess. To copy some of it:

One possibility could be that either ` or "" begins a string when it's followed by the opposite quote type, with the (reversed/same?) delimiter at the other end of the string. It's currently a syntax error to juxtapose string literals so this syntax is probably available unless I've forgotten something. For example, the string"hi"`

julia> :(x``"hi"``)
ERROR: syntax: cannot juxtapose string literal
Stacktrace:
 [1] top-level scope
   @ none:1

I'm imagining this mixed delimiter parsing as @x_cmd "hi", given that the quote starts with a backtick and it could be @x_str for ".

The rule might be that mixed delimiters can be an arbitrarily long sequence of length at least 3, and the user can always arrange for those to not be present in the string they're trying to quote.