Open clarkevans opened 3 years ago
I got something working. It plays nicely with the current system and macros, and has its own escape semantics: https://github.com/heetbeet/julia/tree/b55322073313b7af0b6fa849c28fad95cc66ca79
«» French-style, »« Danish-style and ⟪⟫ Tai Lue-style
x"foobar"y are left alone and remain exactly as they are (no unforeseen changes to the language).
raw«foobar» and not as some standalone «foobar» string syntax.
_block; i.e. macro x_block(str, ldelim, rdelim, suffix) end
If x_block is undefined, but x_str is defined, x«foobar» will dispatch to x_str by discarding the delimiter information.
UndefVarError: @x_block not defined error.
Example from a Julia session
julia> raw»hello«
"hello"
julia> bla»hello«
ERROR: LoadError: UndefVarError: @bla_block not defined
in expression starting at REPL[3]:1
julia> println(raw»I can repre\\se\\\\\\"\\\"nt these"""«)
I can repre\\se\\\\\\"\\\"nt these"""
You can make it work alongside str macros that employ their own escape semantics. I.e. you can have b"\n" transform \n to a newline, and have b»\n« keep the text as \n:
julia> b"\n" == b»\n«
true
julia> #Oopsie...
julia> macro b_block(str, args...); codeunits(str) end
@b_block (macro with 1 method)
julia> b"\n" == b»\n«
false
julia> b»\n«
2-element Base.CodeUnits{UInt8, String}:
0x5c
0x6e
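The fallback rule described earlier in this comment (use x_block when it exists, otherwise x_str, otherwise raise an UndefVarError) can be sketched in stock Julia. This is only an illustration: `resolve_literal_macro` is a hypothetical helper name, and the comment does not show where the linked branch actually makes this decision.

```julia
# Hypothetical helper for illustration only; not taken from the linked branch.
function resolve_literal_macro(mod::Module, prefix::AbstractString)
    block = Symbol("@", prefix, "_block")  # e.g. Symbol("@x_block")
    str = Symbol("@", prefix, "_str")      # e.g. Symbol("@x_str")
    isdefined(mod, block) && return block  # delimiter-aware macro wins
    isdefined(mod, str) && return str      # fall back, dropping delimiter info
    throw(UndefVarError(block))            # neither macro exists
end

# Stock Julia defines @raw_str but no @raw_block, so this takes the fallback path.
resolve_literal_macro(Base, "raw")
```

Under this sketch, defining a macro named b_block in the module would switch the result for the prefix "b" from the _str macro to the _block macro, which mirrors the change in behaviour shown in the session above.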
Here is some more prior art.
q"…"
to allow q"(...)"
q"[...]"
q"<...>"
q"{...}"
q"EOS
This
is a multi-line
heredoc string
EOS"
q"/foo]/" // "foo]"
// q"/abc/def/" // error
R"foo(...)foo".
// with no identifier, only (
R"(hello)" // in
hello // printout
// nested, with no identifier, only (
R"(hello R"x(world)x")" // in
hello R"x(world)x" // printout
// with no identifier, it is not possible to include )" in the string, because there is no escaping.
R"(hello)")" // in
error: Unexpected `)` after hello)". // printout
// using bar()bar to identify the delimiter level solves that problem
R"bar(world)")bar" // in
world)" // printout
// using foo()foo and bar()bar for nested strings
R"foo(hello R"bar(world)bar")foo" // in
hello R"bar(world)bar" // printout
I had one idea over at https://github.com/JuliaLang/julia/issues/39092#issuecomment-1238788154 which should belong here in this thread instead, I guess. To copy some of it:
One possibility could be that either ` or "" begins a string when it's followed by the opposite quote type, with the (reversed/same?) delimiter at the other end of the string. It's currently a syntax error to juxtapose string literals so this syntax is probably available unless I've forgotten something. For example, the string "hi"`
julia> :(x``"hi"``)
ERROR: syntax: cannot juxtapose string literal
Stacktrace:
 [1] top-level scope
   @ none:1
I'm imagining this mixed delimiter parsing as @x_cmd "hi", given that the quote starts with a backtick, and it could be @x_str for ".
The rule might be that mixed delimiters can be an arbitrarily long sequence of length at least 3, and the user can always arrange for those to not be present in the string they're trying to quote.
Executive Summary
This is a request to add a string literal syntax using paired Unicode delimiters, perhaps ⟪ and ⟫, for use in non-standard string literal macros. It is proposed as a complement to, not a replacement for, single or triple double-quoted raw strings.
Description of Requested Syntax
'⟪' (U+27EA, Ps: Punctuation, open) and '⟫' (U+27EB, Pe: Punctuation, close) are employed.
After a name token such as htl for @htl_str, the open delimiter, ⟪, begins a string using this syntax.
The string ends when the matching close delimiter, ⟫, is encountered.
The REPL could add \<< and \>> as a way to enter these paired delimiters.
Critically, this non-standard string literal syntax provides no mechanism to escape either of the delimiters, except that nested pairings are permitted within content. In particular, unbalanced use of the given delimiters is simply not valid syntax. Julia provides no mechanism to enter unbalanced delimiters within this syntax.
Motivation
Let's define the term notation to mean what is currently in the documentation as "non-standard string literal". The word notation is used by SGML and other standards for this concept.
For those doing data munging to interoperate with other systems, there is an opportunity for the Julia language to better utilize notations, enhancing developer experience and improving code readability. While developing HypertextLiteral (providing Julia-style string interpolation to HTML construction), I ran into 3 challenges with existing "non-standard string literals" (notations).
1) They are not succinct. Since a great many subordinate syntaxes include the double quote character, use of the triple double-quoted form is the norm. The double quote character is already loud; tripling it on both ends becomes a distraction. Note that this deficiency also applies to the use of @macros().
2) They can be surprising. For cases where someone tries to use the single double-quoted form, novice users can be caught off guard by the raw_str escaping semantics and how they interact with the backslash. As noted on the discourse forums, this escaping mechanism is not a "homomorphism over string concatenation", e.g. raw(a) raw(b) != raw(ab).
3) They can't be used recursively. If one would like to embed one notation inside another, a round of character escaping is required. This is unlike, for example, @macros(), which nest perfectly well.
A promising option emerged in the discussion forums: the use of paired Unicode delimiters together with a matching parsing algorithm in place of traditional character escaping. You could think of this approach as bringing to string construction what we already know about function calling and data structures -- that they are seldom flat structures.
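The second challenge in the list above can be checked in current Julia. The following is a minimal sketch using only the existing raw string syntax; the variable names are illustrative. In a raw string, backslashes are literal except when they directly precede a double quote, so the same two characters mean different things depending on what follows them.

```julia
# Standard Julia; none of the proposed syntax is involved.
a = raw"\\"    # two backslashes directly before the closing quote collapse to one
b = raw"a"
ab = raw"\\a"  # the same two backslashes, no longer before a quote, stay as two

length(a * b)  # 2 characters: one backslash and 'a'
length(ab)     # 3 characters: two backslashes and 'a'
a * b == ab    # false: quoting the pieces differs from quoting the whole
```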
Specifically, we could employ '⟪' (U+27EA, Ps: Punctuation, open) and '⟫' (U+27EB, Pe: Punctuation, close) as paired delimiters. This particular glyph combines a doubling (reminiscent of double quotes) with that of parenthesis (implying nestability). It's not perfect, but it is visually distinct in most fonts and in mono-space fonts appears to take the space of one regular character.
When Julia encounters a name token, say htl, followed by ⟪, it would enter "notation" parsing state. Here it would keep track of the nesting depth, increasing depth when additional ⟪ are encountered, and decreasing depth when ⟫ is encountered. When the depth reaches zero, the entire span of the string (less the outermost delimiters) is sent unprocessed to @htl_str, and Julia parsing resumes. The REPL could add \<< and \>> shorthand to permit these two characters to be easily entered.
This addresses the three deficiencies noted above. This paired delimiter is much more succinct and visually attractive than tripled double-quotes. The rule is unsurprising since there is no escaping, only the counting of depth, as one would find with parenthesized expressions. The rule naturally supports nesting: any construction using this method could be directly embedded as a subordinate notation. Moreover, if Unicode is used, these delimiters are unlikely to collide with those used in traditional systems, and if they do, so long as those systems use only the paired form, there is no difficulty.
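The depth-counting rule described above can be sketched as an ordinary Julia function. This is only an illustration under stated assumptions, not a patch to Julia's parser: `scan_notation` and its keyword names `ldelim` and `rdelim` are hypothetical, and the function just locates the end of one literal in a string.

```julia
# Hypothetical helper, not part of Julia: find the end of a paired-delimiter
# literal by counting nesting depth, with no escape processing at all.
function scan_notation(src::AbstractString, start::Int;
                       ldelim::Char='⟪', rdelim::Char='⟫')
    src[start] == ldelim || throw(ArgumentError("expected $ldelim at index $start"))
    depth = 0
    i = start
    while i <= lastindex(src)
        c = src[i]
        if c == ldelim
            depth += 1
        elseif c == rdelim
            depth -= 1
            if depth == 0
                # The content excludes the outermost delimiters.
                content = src[nextind(src, start):prevind(src, i)]
                return content, nextind(src, i)
            end
        end
        i = nextind(src, i)  # step by character, not by byte
    end
    throw(ArgumentError("unbalanced delimiters: input ended at depth $depth"))
end

src = "htl⟪a ⟪b⟫ c⟫ rest"
content, next = scan_notation(src, findfirst('⟪', src))
# content holds the text between the outermost delimiters, nested pair included;
# next is the index at which ordinary parsing would resume.
```

Because the scanner only counts delimiters, an unbalanced ⟪ or ⟫ in the content is an error rather than something that can be escaped, which matches the rule stated in the description.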
What about content having a non-paired delimiter?
This is a two-part answer. Primarily, how to avoid the chosen delimiter pair becomes the notation's concern, not Julia's. For example, HTML has ampersand escaping, so the opening delimiter could be written as &#x27EA;. URLs use percent-encoding. Traditional double-quoted syntax (e.g. "\u27EA" for the opening delimiter) could be used by a Python notation. For example, to encode a non-paired opening delimiter, a use of this feature might look like...
htm⟪<html><body>We start these string literals with <code>&#x27EA;</code></body></html>⟫
Asking a notation to provide its own delimiter escaping is not without precedent. In web pages, embedded Javascript begins with <script> and ends when the HTML parser encounters </script> -- with no escape mechanism. Javascript developers who need to represent this sequence within their logic use regular double quoted strings, with the delimiter encoded as "<\/script>".
As a fallback, for notations such as @raw_str which lack such features, if the user must include a non-paired delimiter, they could use the existing raw string syntax, which would not go away. Alternatively, they could be creative and build their string in chunks, using this syntax for most of the content and concatenating with regular double quoted strings for the non-paired delimiter. This proposed syntax aims to be complementary to existing approaches and represents a different set of sensibilities.
Increased Usability
With this feature, a regular expression to detect quoted strings might be written as r⟪(["'])(?:\\?+.)*?\1⟫ with no need to triple double-quote or worry about slashes. Moreover, other notations could embed regular expression notation without having to worry about a round of additional escaping.
I believe these rules would permit developers integrating with foreign data producers and consumers to create their own succinct, unsurprising and nested function-like transformations that mix native languages within a Julian data processing context. Here is an example.
In HypertextLiteral, the functionality above is currently written as...
While one might argue that the latter form is particularly fine, this example works because HTL uses Julia's syntax and excellent parser. Notations defined outside of Julia's ecosystem won't have this luxury.
In conclusion, a succinct, unsurprising, and nestable way to incorporate foreign notations as Julia expressions will open up opportunities for innovative uses of Julia's excellent macro system and dynamic programming environment. What are the costs? A relatively simple parser rule and integration with existing string macros and... the assignment of a Unicode pair.