JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

add string literal syntax using paired Unicode delimiters #38948

Open clarkevans opened 3 years ago

clarkevans commented 3 years ago

Executive Summary

This is a request to add a string literal syntax using paired Unicode delimiters, perhaps ⟪ and ⟫, for use in non-standard string literal macros. It is proposed as complementary to, not a replacement for, single or triple double-quoted raw strings.

Description of Requested Syntax

Critically, this non-standard string literal syntax provides no mechanism to escape either of the delimiters, excepting that nested pairings are permitted within content. In particular, unbalanced use of the given delimiters is simply not valid syntax. Julia provides no mechanism to enter unbalanced delimiters within this syntax.
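To make the rule concrete, here is a hypothetical sketch of how matching would behave (proposed syntax, not parseable by today's Julia; htl names an assumed non-standard literal macro):

```
htl⟪a ⟪nested⟫ b⟫    valid: the inner pair balances, "a ⟪nested⟫ b" goes to @htl_str
htl⟪a ⟫ stray ⟫      syntax error: depth reaches zero at the first ⟫, the second is stray
htl⟪a ⟪ b            syntax error: input ends while the nesting depth is still 2
```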

Motivation

Let's define the term notation to mean what is currently in the documentation as "non-standard string literal". The word notation is used by SGML and other standards for this concept.

For those doing data munging to interoperate with other systems, there is an opportunity for the Julia language to better utilize notations, enhancing developer experience and improving code readability. While developing HypertextLiteral (providing Julia-style string interpolation for HTML construction), I ran into three challenges with existing non-standard string literals (notations).

1) They are not succinct. Since a great many subordinate syntaxes include the double quote character, use of the triple double-quoted form is the norm. The double quote character is already loud; tripling it on both ends... becomes a distraction. Note that this deficiency also applies to the @macro() form.

2) They can be surprising. When trying the single double-quoted form, novice users can be caught off guard by the raw_str escaping semantics and how they interact with the backslash. As noted on the discourse forums, this escaping mechanism is not a "homomorphism over string concatenation", e.g. raw(a) raw(b) != raw(ab).

3) They can't be used recursively. If one would like to embed one notation inside another, a round of character escaping is required. This is unlike, for example, @macro() calls, which nest perfectly well.

A promising option emerged in the discussion forums: the use of paired Unicode delimiters together with a matching parsing algorithm in place of traditional character escaping. You could think of this approach as bringing to string construction what we already know from function calls and data structures -- that they are seldom flat.

Specifically, we could employ '⟪' (U+27EA, Ps: Punctuation, open) and '⟫' (U+27EB, Pe: Punctuation, close) as paired delimiters. This glyph pair combines a doubling (reminiscent of double quotes) with the shape of parentheses (implying nestability). It's not perfect, but it is visually distinct in most fonts, and in monospace fonts each glyph appears to take the space of one regular character.

When Julia encounters a name token, say htl, followed by ⟪, it would enter a "notation" parsing state. Here it would keep track of the nesting depth, increasing the depth when an additional ⟪ is encountered and decreasing it when a ⟫ is encountered. When the depth reaches zero, the entire span (less the outermost delimiters) is sent unprocessed to @htl_str, and normal Julia parsing resumes. The REPL could add \<< and \>> shorthands to permit these two characters to be entered easily.
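The depth-counting rule can be sketched as an ordinary Julia function over a String (the name scan_notation is invented here; the real feature would live inside the parser, not user code):

```julia
# Hedged sketch of the proposed depth-counting rule, written as a plain
# Julia function; `scan_notation` is an illustrative name, not a real API.
function scan_notation(s::AbstractString)
    startswith(s, "⟪") || error("expected opening ⟪")
    depth = 0
    for (i, c) in pairs(s)
        if c == '⟪'
            depth += 1
        elseif c == '⟫'
            depth -= 1
            # back at depth zero: the span, less the outermost pair, is the content
            depth == 0 && return s[nextind(s, firstindex(s)):prevind(s, i)]
        end
    end
    error("unbalanced delimiters: input ended at depth $depth")
end

scan_notation("⟪outer ⟪inner⟫ tail⟫")  # → "outer ⟪inner⟫ tail"
```

Note there is no escape handling at all: the scanner only counts, which is the whole point of the proposal.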

This addresses the three deficiencies noted above. The paired delimiter is much more succinct and visually attractive than tripled double quotes. The rule is unsurprising: there is no escaping, only the counting of depth, as one finds with parenthesized expressions. The rule naturally supports nesting: any construction using this method could be directly embedded as a subordinate notation. Moreover, since the delimiters are Unicode, they are unlikely to collide with those used in traditional systems; and if they do, so long as those systems use only the paired form, there is no difficulty.

What about content having a non-paired delimiter?

This is a two-part answer. Primarily, how to avoid the chosen delimiter pair becomes the notation's concern, not Julia's. For example, HTML has ampersand escaping, so the opening delimiter could be written as &#10218;. URLs use percent-encoding. Traditional double-quoted syntax (e.g. "\u27EA" for the opening delimiter) could be used by a Python notation. For example, to encode a non-paired opening delimiter, a use of this feature might look like...

htm⟪<html><body>We start these string literals with <code>&#10218;</code></body></html>⟫

Asking a notation to provide its own delimiter escaping is not without precedent. In web pages, embedded JavaScript begins with <script> and ends when the HTML parser encounters </script> -- with no escape mechanism. JavaScript developers who need to represent this sequence within their logic use regular double-quoted strings, with the delimiter encoded as "<\/script>".

As a fallback, for notations such as @raw_str which lack such features, if the user must include a non-paired delimiter, they could use the existing raw string syntax, which would not go away. Alternatively, they could be creative and build their string in chunks, using this syntax for most of the content and concatenating with regular double-quoted strings for the non-paired delimiter. This proposed syntax aims to be complementary to existing approaches and represents a different set of sensibilities.

Increased Usability

With this feature, a regular expression to detect quoted strings might be written as r⟪(["'])(?:\\?+.)*?\1⟫ with no need to triple double-quote or worry about slashes. Moreover, other notations could embed regular expression notation without having to worry about a round of additional escaping.
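For comparison, here is a hedged sketch of how that same pattern must be spelled with today's r"..." literal: the quote inside the character class needs an escape, since an unescaped " would terminate the literal.

```julia
# Today's spelling of the pattern above: \" for the quote in the class
# (an unescaped " would end the literal); the regex-level backslashes
# in \\?+ pass through the raw literal unchanged.
quoted = r"([\"'])(?:\\?+.)*?\1"
occursin(quoted, "he said \"hello\" to her")  # true
```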

I believe these rules would permit developers integrating with foreign data producers and consumers to create their own succinct, unsurprising, and nestable function-like transformations that mix native languages within a Julian data-processing context. Here is an example.

render(books) = htl⟪
  <table><caption><h3>Selected Books</h3></caption>
  <thead><tr><th>Book<th>Authors<tbody>$(htl⟪
    <tr><td>$(book.name)<td>$(join(book.authors, " & "))
    ⟫ for book in books)</tbody></table>⟫

In HypertextLiteral, the functionality above is currently written as...

render(books) = @htl("""
  <table><caption><h3>Selected Books</h3></caption>
  <thead><tr><th>Book<th>Authors<tbody>$(@htl("""
    <tr><td>$(book.name)<td>$(join(book.authors, " & "))
    """) for book in books)</tbody></table>""")

While one might argue that the latter form is perfectly fine, this example works because HTL uses Julia's syntax and excellent parser. Notations defined outside Julia's ecosystem won't have this luxury.

In conclusion, a succinct, unsurprising, and nestable way to incorporate foreign notations as Julia expressions will open up opportunities for innovative uses of Julia's excellent macro system and dynamic programming environment. What are the costs? A relatively simple parser rule and integration with existing string macros and... the assignment of a Unicode pair.

MasonProtter commented 3 years ago

I think a 'TLDR' at the top would be helpful. I found points 1 and 2 rather unconvincing, and they actually biased me against the proposal at first, so opening with something like

TLDR: I want paired quotation marks for string macros (perhaps ⟪ and ⟫) because having paired delimiters allows us to nest string(macro)s more clearly and it may simplify the escaping rules.

would be great to get an idea of where you're going before I read a big long opinionated post.

mgkuhn commented 3 years ago

To better appreciate that Julia currently still has quite a problem with the mental overhead required to safely use its raw string escaping mechanics (which are the basis of all non-standard string literals), please try the following very simple exercise:

Use a byte array literal (@b_str) to produce a UInt8 array containing the four bytes corresponding to the ASCII string a\"b (e.g. part of the documented binary command syntax of some USB gadget). Can't be too difficult, right?

Since

b"""a"b""" == UInt8[0x61, 0x22, 0x62]

works as expected, surely adding a backslash won't be too hard:

b"""a\"b""" == UInt8[0x61, 0x22, 0x62]

Hm. No effect. Perhaps it needs escaping?

b"""a\\"b""" == UInt8[0x61, 0x22, 0x62]

Hm. Still no effect. More backslash?

b"""a\\\"b""" == UInt8[0x61, 0x22, 0x62]

And more?

b"""a\\\\"b""" == UInt8[0x61, 0x5c, 0x22, 0x62]

Finally! But why do I need to escape the escape character twice here? The string apparently goes through three escaping parsers. Human brains usually already struggle with anticipating the outcome of applying just two nested parsers.

And that was only the “simpler” triple-quoted string literal, where " allegedly isn't a metacharacter that needs escaping. In the normal notation with single double quotes, we also get

b"a\\\\\"b" == UInt8[0x61, 0x5c, 0x22, 0x62]

and even

b"a\\\\\\\"b" == UInt8[0x61, 0x5c, 0x22, 0x62]

Yes, that's seven backslashes.
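The claims above can be pasted into a REPL and checked directly; collecting them as assertions (same literals, same expected bytes as reported above):

```julia
# The examples above, as assertions:
@assert b"""a"b"""     == UInt8[0x61, 0x22, 0x62]        # plain quote: three bytes
@assert b"""a\\\\"b""" == UInt8[0x61, 0x5c, 0x22, 0x62]  # four backslashes yield one
@assert b"a\\\\\"b"    == UInt8[0x61, 0x5c, 0x22, 0x62]  # single-quoted: five needed
```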

This isn't even a very contrived or pathological example. When you deal a lot with protocols and file formats, these are often specified as a mix of text and binary data, and these formats themselves often make use of similar string delimiters and escaping syntax, and \ and " (following the precedent of C and the Unix shell) are particularly popular meta characters for that.

(Now try the same exercise without the " and/or without the b in the ASCII string ...)

clarkevans commented 3 years ago

If the goal were only to provide alternative escaping for raw strings, one might suggest heredocs, where the user specifies the terminator. A close cousin, as mentioned by aplavin in the discourse thread, is a sed-like mechanism where a single character is specified as the delimiter.

In contrast, this proposal enables notations (non-standard string literals) to cleanly nest, so that one might be used inside another. For example, this could enable regular expressions to be used within an HTML template system, and so on. As we work in industries with all sorts of interacting services, each with its own micro-format, an effective way to integrate them into our Julia programs is quite beneficial. As opposed to libraries using Julia syntax, native use of the notation's syntax is more familiar to the domain expert and enables re-use of already written and tested code fragments.

Seelengrab commented 3 years ago

Since the discourse thread in question has not been linked yet: For Reference.

Having participated in that thread, to be honest, I still don't quite get what the proposal aims to bring to the table. It feels like layering yet another string specialty on top.

Under this proposal, how would I input a ⟪⟫-string that actually contains ⟪ and ⟫ as data? Or would this be disallowed?

Additionally, making it easier to interpolate into SQL/HTML/similar is, from my point of view, a very dangerous thing. I don't think we want to end up like PHP with 3-4 different SQL escaping functions to disallow SQL injection, none of which work 100% (whereas prepared statements do, are secure and are standard & best practice).

clarkevans commented 3 years ago

It feels like layering yet another string specialty on top.

The proposal is to add a complementary syntax for notations (non-standard string literal macros) that cannot be layered as a library because it involves parser-level semantics. If this could be done in user land, I would not propose it as a feature request.

Under this proposal, how would I input a ⟪⟫-string that actually contains ⟪ and ⟫ as data? Or would this be disallowed?

There are two answers. In this specific example, if the content of a notation happens to use ⟪⟫ as a balanced pair, then absolutely no escaping is needed at all. Hence, <html><body>⟪...⟫</body></html> or print("⟪...⟫") would also work just fine.

Broadly, with this proposal, the representation of these characters is simply not Julia's concern -- it is delegated to the sensibilities of the notation. Hence, if the notation is HTML data, the user could represent these characters as entities: <html><body>&#10218;...&#10219;</body></html>. Alternatively, if the notation is code for a "C"-like language, one could represent them using traditional double-quoted strings: print("\u27EA...\u27EB"). This is no different, by the way, from the representation of non-printable characters.

That this feature doesn't provide escaping (the representation of arbitrary Unicode code points) is not a defect or compromise -- it's a feature. As such, it complements and does not compete with the double-quoted form, which has this representation requirement. This behavior is exactly why this cannot be implemented in user space: it has semantics that could only be implemented by the Julia parser.

Additionally, making it easier to interpolate into SQL/HTML/similar is, from my point of view, a very dangerous thing. I don't think we want to end up like PHP with 3-4 different SQL escaping functions to disallow SQL injection, none of which work 100% (whereas prepared statements do, are secure and are standard & best practice).

Programmers are going to do what programmers are going to do -- making it less convenient to use Julia (or a well constructed library in Julia) shouldn't be a goal.

The entire reason for HypertextLiteral.jl is to provide performant, contextual handling of Julia structures within HTML, especially handling the complexities of escaping. Indeed, what I want Julia to do when writing a notation... is get out of the way without imposing its idea of what escaping should look like. As someone authoring a notation, it's my job to make sure this works well, so that others are not tempted to use ad-hoc approaches that fail at doing escaping properly.

More specifically, SQL construction is notoriously difficult, and proper escaping is only one of many concerns. Prepared statements handle only a small fraction of the kinds of construction that are necessary; hence the need for libraries like SQLAlchemy. The primary challenge of SQL construction techniques is enabling sensible composability, so that query fragments can be independently developed and mixed together in a way that is syntactically and semantically correct. Escaping is an important task, but it's a technical detail of this broader application-developer requirement.

At this time, I'm not sure if SQL construction could benefit from this feature or not. However, I am sure there are many applications that need to work with foreign syntax notations that will strongly benefit.

Seelengrab commented 3 years ago

This behavior is exactly why this is cannot be implemented in user-space, it has semantics that could only be implemented by the Julia parser.

I'm well aware. I'm advocating for not having this at all because it's a dangerous feature that has way too much potential for misuse compared to its perceived utility. Having a lot of raw string literals with endless escaping in one's code is, to me, a big red flag for bad API design that, sooner or later, will blow up in your face. An alternative to trying to compose everything in strings via interpolation is native Julia code (i.e. functions and structs) that composes given input (data) and the defined structure (code) while making sure the two don't ever mix.

That this feature doesn't provide escaping (the representation of arbitrary Unicode code points)

I think there's a fundamental misunderstanding here about what escaping actually is. When escaping, the goal is not to be able to enter arbitrary Unicode into a literal; the goal is to make sure certain sequences are not interpreted as code when in truth they are data (or vice versa, when talking about \n and the like, though that doesn't exist in raw strings, so I'll ignore it). This is just as applicable to Julia raw string literals as it is to SQL. In Julia, you have to escape " to make sure those " are not interpreted as code (the end of the string literal), but as data (just another character in the literal). The same goes for \ -- you have to be able to escape the escape character if you want to be able to represent it in your data (if you don't want to do that, that's fine though).

For an SQL example (copying my example from the discourse thread):

julia> greeting = "Hello\"; select * from secret_table where \"\" = \""

julia> "select * from conversation_table where greeting = \"$greeting\""

Naive interpolation of greeting into the string, without escaping, results in two SQL statements being executed rather than one, namely:

select * from conversation_table where greeting = "Hello";
select * from secret_table where "" = ""

This happens because the data (greeting) is interpreted by the SQL engine as code, not data. How should it know the difference? The engine only ever sees the already interpolated string, it doesn't know that there was interpolation going on in the first place. More problematic, even if you were to try to escape greeting, you'd almost certainly miss some edge case, as has happened countless times in basically all SQL engines/frontends to date.

The solution here is to use prepared statements, where the SQL engine is made aware that there are slots explicitly meant for data that aren't going to be interpreted as SQL. This solves all escaping problems when interpolating data into code, because there can't be a mismatch between what's data and what's code -- they're clearly separated.
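A hedged sketch of that separation in Julia, assuming the SQLite.jl package is available; the table and column names here are invented for illustration:

```julia
using SQLite, DBInterface  # assumes the SQLite.jl package is installed

db = SQLite.DB()  # in-memory database
DBInterface.execute(db, "CREATE TABLE conversation (greeting TEXT)")

# The hostile input from the example above is bound as *data*; it is
# never spliced into the SQL text, so it cannot terminate the statement.
greeting = "Hello\"; select * from secret_table where \"\" = \""
DBInterface.execute(db, "INSERT INTO conversation VALUES (?)", (greeting,))

rows = DBInterface.execute(db,
    "SELECT greeting FROM conversation WHERE greeting = ?", (greeting,))
first(rows).greeting == greeting  # the value round-trips byte-for-byte
```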

You might ask how you should build larger SQL expressions, if not by interpolating and concatenating. The answer is to use subqueries, SQL views and SQL functions, since SQL expressions don't (and can't, if you don't want to risk injection) compose via a homomorphism because they are not strings. This is also why I disagree that "the representation of these characters just not Julia's concern -- it is delegated to the sensibilities of the notation". The world by and large is nowadays stuck with HTML, SQL and similar and has, over the last 30+ years, suffered through their insensibilities. We can't just stick our head in the sand and pretend they don't exist, so we have to deal with them in our frontends now. I'd be much happier if the SQL/HTML/etc engines themselves would disallow this behaviour, but alas, that ship has sailed long ago.

Exactly the same problems have plagued HTML for 20 years and are the reason every other site has XSS vulnerabilities to boot.

Programmers are going to do what programmers are going to do -- making it less convenient to use Julia (or a well constructed library in Julia) shouldn't be a goal.

This argument feels very weak to me. I'm reminded of a similar argument when GOTO was the hot thing and people started advocating for not using it anymore. In that sense, yes, it absolutely should be a goal to make a language slightly less expressive for a big gain in safety, security, performance, tooling and a plethora of other things that are only possible because we've agreed to not do everything we could possibly do. Moreover, we're not making it "less convenient to use Julia" - it already is a little inconvenient and I'd like it to stay that way in this part of the language (though sometimes I'd wish some things were made less convenient, to discourage people from doing them; not necessarily julia related).

--

In conclusion: Maintaining a strict line between code (or what will be code in another layer of the stack) and data is a good thing. Mixing the two is bound to lead to problems down the road, making it even easier to do so seems like a bad idea to me.

MasonProtter commented 3 years ago

Whether paired delimiters are a good idea for an SQL or HTML library is kinda irrelevant to the part of this proposal that matters to me.

I do agree though with @clarkevans that trying to design the parser in such a way to stop library writers from making libraries vulnerable to SQL injection is flat out silly. That is not the place of the parser. The Julia parser is used by everyone, not just people writing SQL or HTML. Perhaps bringing up SQL and HTML was a strategic mistake on Clark's part (I don't really know or care), but it's still an illustrative example.

To me, the important part is that regular macros are straightforwardly nestable (assuming the outer macro knows how to handle the inner macro), but string macros are not. That makes representing data structures with string macros awkward. Having a notion of an 'opening' delimiter and a 'closing' delimiter solves the issue of nesting, and IMO that's a good enough reason to support this.

Seelengrab commented 3 years ago

Whether paired delimiters are a good idea for an SQL or HTML library is kinda irrelevant to the part of this proposal that matters to me.

That's fair, but in my opinion, without a motivating example this just collapses into syntax for syntax's sake. Like a "We could do it, so let's do it" argument. To me that's just not very convincing. I do realise that I'm in no position to decide anything about this, but I do want to present a different view 🤷‍♂️

That makes representing data structures with string macros awkward.

I agree that it's awkward, but I don't follow why that should be a reason to make it easier via the mechanism of string macros/additional parser logic, when the same goal can be achieved via existing mechanisms. They compose with the rest of the ecosystem (functions, structs) and facilitate the separation of code and data. In other words, I don't yet see why this is desirable when it seems to lead you down the hole of manually parsing strings (even with this proposal, string macros still just hand the string literal to the macro, no matter the delimiters) and doing extra work the parser can already do for you.

I guess I can only repeat myself - I don't believe mixing code & data in such a way via nesting string macros is necessarily a desirable thing to do.

tpapp commented 3 years ago

Notations defined outside of Julia's ecosystem won't have this luxury.

I don't understand why the Julia parser should need to deal with notations used/defined outside Julia.

My understanding is that it is meant to parse Julia source code. Please elaborate if possible.

noted on the discourse forums, this escaping mechanism is not a "homomorphism over string concatenation", e.g. raw(a) raw(b) != raw(ab)

I am not sure why this is relevant though. Can you please explain?

clarkevans commented 3 years ago

Sukera, I must apologize for my (deleted) outburst yesterday. Not only was it inappropriate, it came from a place of frustration rather than a place of kindness, and for that I'm truly sorry. Hopefully you'll allow me the opportunity to respond more thoughtfully today.

I find the metaphysical distinction of code vs data unhelpful. Every second, millions of web pages are rendered. To the web browser, HTML represents code for how the page might be formatted. To programs that transform and dynamically manipulate HTML, it's data. HTML is both code and data, simultaneously and without contradiction, as inseparable as yin and yang. Harmonious systems that flow with its nature can be secure and produce great value. By contrast, poorly designed systems are often insecure, fragmentary, and produce great headache. In either case, HTML has become an assembler language; higher-level languages that dynamically generate and manipulate it are the norm. I can reserve my comments about SQL, but they are of the same vein. Both HTML and SQL struggle with challenges of composability, where smaller components are assembled into larger units of work.

I think your comments may have locked onto a straw-man implementation of an SQL generation macro on the discussion forum that didn't even attempt to escape its inputs. First, I would say the author of that comment was probably trying to demonstrate an idea, and that to them, the parts requiring escaping were an obvious technical requirement of a real-world system. Second, there's really no fundamental difference between a well-designed native function and a macro for a reusable library that does intelligent parameter substitution. Both approaches work with the SQL as text, with substitution parameters. A robust implementation of either approach will form a model of the constructed query to ensure that the substitution is appropriate and well-formed. Critically, the functional form vs the macro form comes down to a technical choice: is the bulk of the work done at compile time (for the macro) or at runtime (for the function)? With library-based approaches, you have an explicit compilation step, which returns a handle, followed by an execution step that uses the handle. Macros make this same division of work, except that the compilation step happens as the Julia program is compiled, providing an opportunity for the programmer to hook into Julia's amazing type specialization and code optimizer.

To explain why data/code are simultaneous and complementary views of the same phenomenon, let me describe the design of HypertextLiteral.jl. For starters, the idea of a hypertext literal as articulated by Mike Bostock of ObservableHQ is quite clever -- it includes a minimal tokenizer to validate the HTML construct so that it knows the context for each variable's substitution. On the surface it looks like simple string manipulation, but the system provides a seamless integration of the host programming language and HTML. You could think of the parameters in the text as placeholders in a prepared query. The Julia implementation involves two stages. In the macro stage, the HTML is tokenized and the Julia expressions are parsed, interwoven to produce a closure that merges the two streams intelligently. This closure is then given to the Julia compiler, where it's reduced to machine code. In this regard, it is fast. By fast, I mean that it is currently several times faster than plain string interpolation, and an order of magnitude (20x+) faster than a more traditional object-based approach that creates a server-side DOM and then serializes it (such as the excellent Hyperscript.jl). Speaking of which, using closures like this wasn't my discovery; it's how Julia's documentation system works.
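The two-stage "macro compiles to a closure" pattern can be sketched with a toy string macro (the name shout_str and its behavior are invented here; HypertextLiteral's real internals are far more involved):

```julia
# Toy illustration of the pattern: the expensive work (here, just
# uppercasing) happens once at macro expansion time, and the runtime
# cost is only invoking the resulting closure.
macro shout_str(s)
    rendered = uppercase(s)             # done while the program is compiled
    return :( io -> print(io, $rendered) )
end

render = shout"hello, world"
sprint(render)  # → "HELLO, WORLD"
```

The closure returned by the macro is then specialized and optimized by the Julia compiler like any other function, which is where the speed comes from.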

Let's take a more detailed look at these two approaches with a code fragment copied from a benchmark.

hs_employee(e) = tr(td(e.last_name), td(e.first_name), td(e.title),
                    td(href="mailto:$(e.email)", e.email),
                    td(e.main_number), td(e.cell_phone),
                    td([span(c) for c in e.comments]...))
htl_employee(e) = @htl("""
      <tr><td>$(e.last_name)<td>$(e.first_name)<td>$(e.title)
          <td><a href='mailto:$(e.email)'>$(e.email)</a>
          <td>$(e.main_number)<td>$(e.cell_phone)
          <td>$([htl"<span>$c</span>" for c in e.comments]...)
""")

As you can see, the @htl equivalent is a macro and it looks like string substitution with HTML content. However, the main reason for using a macro is to have parsimonious two-stage construction that is transparent to the user (no separate code for a prepare/execute pair). The macro is compiled to a closure that is specialized and optimized by the Julia compiler, and then for each execution of the htl_employee function, that compiled code is called -- and it's fast. However, there is absolutely nothing magical about the macro use here. The same pattern could be used with a Hyperscript equivalent, giving up on the idea of producing a modifiable server-side DOM. In fact, one could probably write a thin wrapper that lets either syntax work. Some may like the former syntax over the latter, and vice versa. The nice thing is that these two approaches can even be woven together... see the documentation.

Why in the world would you use the HTML-based notation then? There are several reasons. First, there are many things you may not need or want to model in an equivalent Julia object system. The example shown above leaves out much of the messiness of HTML construction and CSS interaction; once you go down the path of making Julia objects (even logically), it can become quite a translation process... for very little benefit. Second, users are probably getting the HTML fragments to work by using them in an actual browser; it's much easier to test a fragment in its natural habitat. Note that when the fragment is viewed in a web browser, the Julia variables just show up literally -- letting them be inspected as such. This productivity boon is one reason why working directly with HTML can be a huge advantage. Finally, for someone familiar with web development, the meaning of the latter form (while a bit verbose) is more or less transparent. We should assume that users of Julia are first and foremost domain experts; they may think of themselves as "accidental programmers".

So, this isn't an either/or choice. Either approach sees HTML construction as both data and code at the same time, respecting composition semantics in the target language as well as making it convenient inside a Julia program. To address your specific concern about escaping, both HypertextLiteral and Hyperscript implement escaping strategies that their authors think are fit for purpose. In short, there's no single right or wrong way to do it; it's a matter of matching the implementation approach with the requirements. Do you want to construct HTML in native Julia or as a notation? Do you want a DOM (slower and flexible) or a closure (fast and inflexible)? Anyway, I hope this discussion alleviates at least some of your concerns. Thanks for listening.

clarkevans commented 3 years ago

Hi Tamas. Thank you for asking thoughtful questions. Let me preface my answers with a statement about Julia. I see it as having two superpowers -- multiple dispatch and its macros. They work together in ways that really open new opportunities in computing. It's for this reason that I don't see Julia as yet another programming language. To me Julia is special: it is a programming environment that lets us effectively work with domain-specific challenges. This feature request is about improving the ergonomics of macros that use a foreign, third-party syntax, such as regular expressions.

I don't understand why the Julia parser should need to deal with notations used/defined outside Julia.

The point of non-standard string literals is that the Julia parser doesn't have to deal with these notations and can delegate program construction to user land.

Your question seems to be a broader one: besides regular expressions (which likely drove the requirements for raw_str), why does Julia even need notations? Let me speculate that this will become more apparent as Julia is adopted into various domain-specific scientific environments. Many users of Julia use it as a glue language, connecting systems with disparate protocols. I'm using HTML and SQL as examples here because they are relevant to me and also because they are commonplace. However, there are countless data sources of various formats, data processing services, and devices that Julia programmers have to work with. In many cases, it's much clearer to embed these native formats inside Julia programs without translating them into the Julia object model -- see the Hyperscript vs HypertextLiteral example above. As an expert Julia programmer you may prefer the former, representing HTML concepts using Julia objects; as a domain expert, one may strongly prefer the latter -- keeping HTML in its native notation when embedding it inside Julia programs.

My understanding is that it is meant to parse Julia source code. Please elaborate if possible.

Julia has two ways to make macros. Standard macros are transformations of the Julia expression tree, essentially providing different semantics/interpretation to Julia's native syntax. These can be used to great effect, especially to handle redundancy in an otherwise complex Julia program. Kyrylo Simonov uses them within our DataKnots code base for pattern matching within our optimizer. In this case, Julia syntax is used, but the interpretation of the expressions are non-standard, they provide a succinct domain specific language suitable for authoring our rewrite rules.

The @htl macro also uses Julia standard macros. It takes a string expression containing HTML and provides interpolation semantics that are appropriate for the HTML context. Although it looks like regular string interpolation, it's more involved. The result of the macro execution isn't a string, it is an object that does its work when printed. Moreover, there are interpretations of Julia expressions that are sensible to HTML construction: an Array is simply concatenating outputs, within element content there is no space, within an attribute, it is space delimited. In any case, there's significant magic going on; the resulting program generated by the macro isn't at all obvious from the syntax -- although its rules are sensible to user expectations. We reinterpret Julia structures found within the $, serializing dictionary like objects as attributes, and so on. What's fantastic is that much of this serialization can happen at macro expansion time, and this greatly speeds execution.

The other way to use macros is for a translation from a foreign syntax into a Julia program, like regular expressions. This is the non-standard string literal, or as I call it, notation. A good example of this is HAML.jl. I don't see a way that HAML can be interwoven with Julia expressions because current non-standard string literal syntax doesn't nest. To interweave, each step of nesting probably has to be realized as its own Julia function. I also implemented an htl string macro; it has sensibilities similar to @htl. This syntactical form matches my mental model -- that it's really an "HTML template engine" that happens to use a Julia-like syntax. That said, when I tried to use it in this way, the raw_str escaping rules got in my way, having to triple quote everywhere is tedious, and the expressions don't nest.

As I see it, Julia has two ways to use macros -- its own syntax and foreign notations. At this time, Julia's parser rules favor use of its own syntax, and discourage macros that use 3rd party notations. In some ways, one could argue this is a bit counterintuitive. With different semantics, you might want a different syntax -- especially if you are modeling external systems with different 3rd party syntax. Furthermore, a notation might also enable syntax highlighters to know about the embedded dialect. Therefore, this ticket is about improving ergonomics that let us embed foreign notations, providing a Julia translation/glue that would integrate with the rest of the program.

noted on the discourse forums, this escaping mechanism is not a "homomorphism over string concatenation", e.g. raw(a) raw(b) != raw(ab) I am not sure why this is relevant though. Can you please explain?

It's hard to know the practical implication of this escaping mechanism deficiency. However, I think it's a symptom of the rather convoluted rules that raw_str uses. Composable systems often rely upon building blocks that are themselves composable. Hence, when making a low-level primitive, it's important to get things like this right. When things like this aren't composable, it's a good sign that they are not done correctly and will cause you issues down the road. This design property provides me with an understanding that I'm not the one at fault as I stub my toe when using single double-quoted raw strings. In the end, it's not one single issue that makes non-standard string literals inconvenient, it's a collection of them. However, they are intimately tied with the unfortunate raw_str escaping design.
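To make the non-homomorphism concrete, here is a minimal demonstration of the concatenation failure, using the documented raw-string rule that backslashes are only special immediately before a quote:

```julia
# Raw-string unescaping is not a homomorphism over concatenation:
# the same source characters mean different things depending on
# whether they sit next to the closing quote.
a  = raw"x\\"    # trailing \\ precedes the closing quote -> one backslash
b  = raw"y"
ab = raw"x\\y"   # interior \\ is literal -> two backslashes

@assert a * b != ab
@assert length(a * b) == 3   # x, \, y
@assert length(ab) == 4      # x, \, \, y
```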

Steven Johnson correctly tagged this issue as speculative. I couldn't agree more. At this time I do not have additional examples that explain how nesting of notations would unlock new value, and let developers create more ergonomic programmer interfaces that leverage Julia's excellent macro system. I might in a few more months. At this time, I'm happy if this ticket is closed; I've said my piece and expressed my logic. Thanks once again for listening.

tpapp commented 3 years ago

Personally, I consider non-standard string literals an incidental gimmick (reader macros in Common Lisp, which may have inspired the feature, were already a nightmare, thankfully Julia does not allow users to play with the read table). IMO they have little to do with what makes Julia special or powerful.

I recognize that they are OK as a convenience feature for simple stuff, but my gut feeling is that if you want something more complex with composability and nesting, you are really looking for... basic Julia syntax, perhaps within a macro. That gives you proper editor support, plain vanilla ASTs you can manipulate with common tools, etc.

Compare the rx macro in Emacs, which allows you to write

(rx bol
    (eval this-file-name)
    space
    "[" (group (one-or-more digit)) ":" (group (one-or-more digit)) "]"
    space
    (group (zero-or-more anything))
    eol)

instead of

"^blog\\.org[[:space:]]\\[\\([[:digit:]]+\\):\\([[:digit:]]+\\)][[:space:]]\\(\\(?:.\\|\\)*\\)$"

Code became so much more maintainable after we switched to rx in julia-emacs.

Consequently, I think it would be better to leave this area of the language as is and not invest in it further.

aplavin commented 3 years ago

Disregarding non-standard syntax, current escaping behaviour is really unintuitive even in simple cases. For example, escaping depends on the position within the string. This was mentioned on discourse as well:

julia> raw"a\\" |> println
a\

julia> raw"\\a" |> println
\\a

Compare to python (and many others):

>>> print(r"a\\")
a\\
>>> print(r"\\a")
\\a

Clearly, this only becomes worse when more slashes are needed.
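The rule behind this asymmetry (from the Julia manual's description of raw strings) can be checked directly: a run of backslashes is only halved when it immediately precedes a quote; anywhere else it is literal.

```julia
# In raw"a\\" the backslashes touch the closing quote: 2n backslashes
# before a quote collapse to n, so the result is "a" plus one backslash.
@assert raw"a\\" == "a" * "\\"
# In raw"\\a" the backslashes are interior, so both are kept literally.
@assert raw"\\a" == "\\" * "\\" * "a"
```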

clarkevans commented 3 years ago

Ouch. Raw string bites me again. It should be a simple answer to this person's inquiry...

I am trying to use escape_string because as per documentation, “Backslashes ( \ ) are escaped with a double-backslash ( \\ )”. This is what I tried: mypath1 = escape_string("C:\Users\User\Dropbox\codes\JULIA\stock1\") The error message I get is ERROR: LoadError: syntax: invalid escape sequence. I am not sure what I did wrong.

To which the correct answer is... mypath1 = raw"C:\Users\User\Dropbox\codes\JULIA\stock1\\"

Don't forget that extra slash at the end like I just did when trying to help a newbie (I don't know, perhaps I confused them even more?). You'd think after all of this discussion I wouldn't have made a junior mistake like that... forgetting that the trailing slash needed to be escaped since it was followed by a double quote. Note that triple double-quoting won't help here either.
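For what it's worth, the corrected literal can be sanity-checked: the doubled trailing backslash collapses to one because it touches the closing quote, while the interior single backslashes pass through untouched.

```julia
mypath1 = raw"C:\Users\User\Dropbox\codes\JULIA\stock1\\"
@assert endswith(mypath1, "stock1\\")    # ends in exactly one backslash
@assert count(==('\\'), mypath1) == 7    # one per path separator
```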

heetbeet commented 3 years ago

Skipping over the usefulness of beginning/end string characters that are nestable, I find working with strings much more cumbersome and error prone in Julia than in Python, and would love to see this added to the language. Heck, even C++ has proper raw strings, although a bit verbose: R"(...)"

I find the following Julia results just really unexpected:

julia> print(raw""" \ """) # 🙂 expecting: \ 
 \
julia> print(raw""" " """) # 🙂 expecting: " 
 "
julia> print(raw""" \" """) # 🙁 expecting: \" 
 "
julia> print(raw""" \\" """) # 🙁 expecting: \\" 
 \"
julia> print(raw""" \\\\" """) # 😢
 \\"

And for me this really hinders DSLs (which is excellently supported otherwise). For instance, the raw-ish string quirkiness gets passed onto string macros:

julia> using PyCall
julia> py"""
       print(r' \ " \" ')
       """
 \ " "
julia> # expecting: \ " \" 

Raw strings aren't nearly as raw as expected, especially compared to the Python counterpart:

>>> print(r""" \ """) # 🙂 expecting: \ 
 \
>>> print(r""" " """) # 🙂 expecting: " 
 "
>>> print(r""" \" """) # 🙂 expecting: \" 
 \"
>>> print(r""" \\" """) # 🙂 expecting: \\" 
 \\"
>>> print(r""" \\\\" """) # 🙂
 \\\\"

Python isn't perfect in this regard: you still can't end a raw string with a backslash, but that is only one thing to watch out for. Python: one single parsing exception; Julia: a whole lot of wtf parsing exceptions. This, along with Python's ability to mix ' and ", makes 99.9% of input raw-writable with Python. There is nothing close to that convenience in Julia.

This feature request is great. I would love to have it in Julia. This is so much more expected:

julia> py⟪
       print(r' \ " \" ')
       ⟫
 \ " \" 

Also, I don't agree that parsing should be made cumbersome for the benefit of suppressing SQL and HTML strings https://github.com/JuliaLang/julia/issues/38948#issuecomment-748651098 . There are so many other languages and domains to consider. The ability to embed foreign syntax directly into Julia via macros is a true game changer. I think we DSL people are allowed to be a bit greedy here.

Seelengrab commented 3 years ago

After spending far more time than appropriate for Christmas holidays looking into possible prior art, I want to share some links that OP (I think) implicitly assumed are known to other participants here. Since OP seems familiar with this, it would have been nice to have these resources from the beginning, but alas, that chance has passed.

First, the nesting. As far as I can tell (and from the usecases OP is talking about), this originally comes from JavaScript's "tagged template literals" (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals#Tagged_templates). They're basically what OP is describing here - they nest, they can be interpreted in non-standard ways. The only difference being that "tagged template literals" don't just receive the resulting string, but are interpolation aware by passing interpolation slots to the outer literal as arguments (if there are more interpolation slots than arguments in the literal, they seem to be silently dropped, which feels like a bad idea).

Ideally, litA"text1 ${1+1} text2" would be aware that there's a "slot" or "hole" and it had to make a decision on how to handle that. This could be checking the surrounding elements/strings left and right for a given state e.g. are we being interpolated into a HTML fragment, SQL clause, some other thing that needs escaping etc. This is better than having to parse the whole string because it can be done on all interpolation steps in parallel, different nestings can be made aware of each other and handled specially if necessary, invalid states can be easily made unrepresentable (e.g. only allow interpolation of HTML arguments into arguments, check/escape the to-be-interpolated value at creation and force the API to communicate that) and (especially important to me) there's no chance of mistakenly parsing some result of an inner literal for regular data in the outer literal. You could even keep the interpolation as a special construct to be passed directly to e.g. an SQL engine as prepared statements. To some degree, this can be written today as a regular macro:

@litA "text1 " (@litB (1+1)) " text2"

This of course has the disadvantage of being a little further from the embedded syntax, but has the enormous advantage of making the interpolation transparent to the code that is handling the interpolated data. To me, that's not a sufficient argument for adding this though.
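For illustration, here is a sketch (entirely hypothetical `@litA`/`@litB` macros, not an existing API) of that composition: each literal returns a typed value, so the outer macro sees the boundaries of the interpolated pieces rather than a flat string.

```julia
# Hypothetical @litA/@litB: the outer literal receives its pieces as
# distinct arguments, so interpolation stays transparent to it.
struct Fragment
    parts::Vector{Any}
end

macro litA(args...)
    :(Fragment(Any[$(map(esc, args)...)]))
end

macro litB(ex)
    :(string("<b>", $(esc(ex)), "</b>"))
end

f = @litA "text1 " (@litB (1 + 1)) " text2"
@assert f.parts == Any["text1 ", "<b>2</b>", " text2"]
```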

That being said, one blog post by mozilla themselves introducing this feature is actively discouraging this kind of nesting (https://hacks.mozilla.org/2015/05/es6-in-depth-template-strings-2/):

Template strings don’t have any built-in syntax for looping—building the rows of an HTML table from an array, for example—or even conditionals. (Yes, you could use template inception for this, but to me it seems like the sort of thing you’d do as a joke.)

There's no further reasoning given, but presumably it's because how different literals interact with each other gets complicated really quickly, especially if they're free to return whatever they want, as is the case in JavaScript. I think much the same. One thing I'd be interested in though is how much modern JS actually uses nested tagged template literals in the wild - I'd wager it's not much.


Second, and I think that's slightly tangential to the nesting request, is the escaping. Let's look back at those template literals from JS - even they don't allow arbitrary data without escaping in them:

This makes sense: making them unrepresentable is a no-go, because it handily excludes both data containing those sequences (something you can ignore if you only focus on one usecase, but can't really if you're building a language level feature) and other literals your code doesn't know about containing those sequences (which is the whole point of this feature request, if I'm not mistaken). Wanting some way to represent them anyway means you have to signal the JS/julia parser that the following character(s) is/are data, not code. There's no way around that (aside from the fact that nesting is again discouraged).

At the moment, raw strings have counting rules associated with them and changing that behaviour would be pretty breaking, which is why it could be done in 2.0 at the earliest. While introducing another string literal into the language to avoid breakage is a possibility, I really don't think rushing this via that avenue is as clean-cut and advantageous as it's made out to be.


Finally, for someone familiar with web development, the meaning of the latter (while being a bit verbose) is more or less transparent. We should assume that users of Julia are first and foremost domain experts, they may think of themselves as "accidental programmers".

I shudder at the thought of someone doing web development via composing HTML in a programming language thinking of themselves as "accidental programmers" - it has been clear for quite a long time now that web development is yet another form of programming, so ignoring the complexity by downplaying their role and responsibilities to make sure what they're building is safe in their domain feels wrong to me. If this were about researchers putting together a dashboard by sticking ready made blocks together I'd be with you, but that's not web development in the sense you're describing here.

Seelengrab commented 3 years ago

As far as filepaths go, #38415 is the best starting point for the challenges & existing ideas in that domain.

Additionally, note that julia is already smart about path-like strings, even on windows:

julia> pwd()
"C:\\Users\\<me>\\AppData\\Local\\Programs\\Julia\\Julia 1.5.2\\bin"

julia> readline("../share/julia/julia-config.jl")
"#!/usr/bin/env julia"

Even mixing these two is fine:

julia> readline("../share\\julia/julia-config.jl")
"#!/usr/bin/env julia"

So there's no technical need to input literals as \\ (though users are free to do so if they're used to doing that on Windows, as even there inputting paths is usually done with a double \ to escape one of the \).

If you want to join paths, there's always... joinpath, which is the preferred way of dealing with paths since operating on them is not synonymous with string operations:

julia> joinpath(pwd(), "../share/julia/julia-config.jl") |> normpath
"C:\\Users\\<me>\\AppData\\Local\\Programs\\Julia\\Julia 1.5.2\\share\\julia\\julia-config.jl"
clarkevans commented 3 years ago

This proposal is about enabling foreign system notations that are succinct, unsurprising and nestable.

First, the nesting. As far as I can tell (and from the usecases OP is talking about), this originally comes from JavaScripts' "tagged template literals".

Thank you for researching how a similar feature is implemented in Javascript. It's notable that so many languages are adding this sort of feature. This signifies increasing demand for notations that are not the primary syntax. Unlike Javascript's approach, this proposal does not seek to provide built-in interpolation or any processing of the notation's content; it seeks to only count matching delimiters to know where it ends.

You have observed that with verbose usage of regular macros you could achieve the expected computing for something similar to the use cases described. Indeed, I've even demonstrated this in earlier posts. I would counter that almost any new feature proposal has a more verbose implementation without the feature.

I suggest a broader view. A specific example is used to draw clarity to the proposal, drawing critiques that target the example rather than the mechanism. Readers may wish to seek examples from their own experience, looking for how the proposal might be improved. I think there's a clear case here for regular expressions and raw strings as well. There are other use cases: any sort of device or complex web service you have to work with will have its own notation.

That being said, one blog post by mozilla themselves introducing this feature is actively discouraging this kind of nesting

I think this guidance, over 5 years old now, didn't quite anticipate the success of mbostock's hypertext literal. New capabilities, when used creatively, often shift the goal posts and enable unexpected improvements to technology. In particular, I've learned that Julia often creates brand new ways of looking at the world – Javascript doesn't provide anywhere near the sort of macro capabilities that Julia has.

Second, and I think that's slightly tangential to the nesting request, is the escaping.

You seem to argue there is a requirement to permit arbitrary data. As noted above, I think this assumption is how existing raw string semantics were led astray. In this proposal it is up to the notation to provide a mechanism to enable the representation of all Unicode characters, if that is a legitimate need. For precedent, look at how the <script> tag is terminated... there is no escaping. One cannot, within the host language, represent </script> within the notation's content. To print out </script>, Javascript programmers often write "<\/script>", but I've also seen ("</"+"script>"). This is a feature, not a defect. If it's important for the notation, it can provide its own mechanism. For HTML content, you can use ampersand escaping, etc.

I shudder at the thought of someone doing web development via composing HTML in a programming language thinking of themselves as "accidental programmers" - it has been clear for quite a long time now that web development is yet another form of programming,

I use the word "accidental programmer" to mean that the developer did not set out to become a programmer by trade -- not that their very nature is to ignore complexity. Most users of Julia are researchers or data analysts of some sort, quite capable even as software engineering is not their chosen profession. Web developers often come with a visual design background. Conversely though, they are programmers, not system engineers. It's our job to construct systems and package useful libraries which they could effectively use. Systems and libraries that lack sharp edges.

One very effective way to engage accidental programmers is by providing, as a notation, syntax for those environments to which they are familiar and comfortable. For example, in medical informatics, there is the Clinical Quality Language (CQL). Providing a smooth on-ramp and conversion path for CQL experts is at the top of my mind. This is another example of a notation. I can see how one may want to embed CQL fragments in a broader Julia program, or conversely, embed Julia processing within CQL.

As far as filepaths go, #38415 is the best starting point for the challenges & existing ideas in that domain.

The thing is, file paths should work out of the box and easily. The person asking questions a few days ago points in the opposite direction. I would observe that raw⟪C:\Users\User\Dropbox\codes\JULIA\stock1\⟫ would have answered this person's inquiry. An improved direction, enabled by this proposal, might be path⟪C:\Users\User\Dropbox\codes\JULIA\stock1\⟫... which could construct the Path struct described in the ticket.

Thank you for your continued thoughts. This morning @stevengj outlined an evaluation metric for a proposed syntax change: memorable, comprehensible and readable. I think this proposal rises to that metric.

clarkevans commented 3 years ago

I recognize that they are OK as a convenience feature for simple stuff, but my gut feeling is that if you want something more complex with composability and nesting, you are really looking for... basic Julia syntax, perhaps within a macro. That gives you proper editor support, plain vanilla ASTs you can manipulate with common tools, etc.

As noted on this list by several commenters, the basic mechanics of raw string escaping are problematic. If this was an "incidental gimmick", why was it included in the first place? Why do other programming languages increasingly support these sorts of facilities?

While it's nice that one could convert representations of a domain specific problem, such as regular expressions, to a Julia object structure -- this doesn't help with compatibility with the broader knowledge ecosystem. Why should I manually transcode regular expressions from a well-known and established syntax into a variant that some developer proposes (even if it's a better one, which, I agree, the example you've shown looks more maintainable)?

Seelengrab commented 3 years ago

OK, I'm out. It's become tiring for me to continue this discussion, as I don't feel like you think any of the points I bring up are worthy of consideration, nay, of being read & understood at all - a courtesy I think I've tried to extend to you, as I've voluntarily tried to find other occurrences of the very feature you've proposed (and the downsides & tradeoffs they incur). It seems to me more like you want to amplify some imaginary power of the language instead of thinking about whether or not a feature has downsides that should be considered as well.

clarkevans commented 3 years ago

How does one integrate with a 3rd party system that has its own syntax?

There are two fundamental paths. One is to make a library native to the host programming system, such as Julia, that abstracts away the foreign system in a manner using conventions commonly used by the host programming language. These libraries conventionally mirror the 3rd party's structures in a manner sensible to the host language, let programmers configure it, compose pieces from various parts, and then serialize the result in a manner digestible by the external system. This is the traditional approach, and one promoted by @tpapp and @Seelengrab (please correct me if I'm wrong). I see absolutely no problem with this approach, by the way.

I believe an inside-out "notation" view might be suited in some circumstances. To integrate, you create a macro that keeps the 3rd party system's native syntax (such as regular expression notation) -- binding, translating to and sprinkling in Julia functionality as needed. This is an implementation choice that is possible with Julia's macro system coupled with non-standard string literals, and it is not well explored. Even so, this approach has examples and precedent (JavaScript inside HTML, for example; or with JSX, tags within Javascript). Moreover, many new experiments in programmer experience are based upon this notation view.

What are the advantages of this notational approach?

  1. To more casual users of the programming environment, a foreign syntax notation is recognizable and familiar; reducing adoption friction and training costs.
  2. In many cases only a fraction of the foreign system needs to be modeled -- only the parts that need to be integrated with Julia code.
  3. Working examples can be copied, perhaps tweaked a bit, and it can be working.
  4. Nested notations could have their own syntax highlighting by code editors.
  5. The boundary between the two systems is visibly indicated with an alternative syntax; there is minimal guesswork about how to translate between the systems.
  6. Like any other Julia code base, it could be encapsulated with regular functions, packaged and reused.

Disadvantages of the notational approach? Many. Uncertain. Besides being completely foreign to those expecting native libraries, you're now clearly in a mixed language environment with minimal encapsulation. There may be little tooling to support things. Syntax highlighting support will at least initially be worse. Developers of the notations will need to either hand-code parsers or learn to use parser generators.

Even so, this proposal isn't about agreement with the philosophical detritus above. You can hate this idea, and that's OK. It's not the proposal. The proposal is a better way to write non-standard string literals -- a more technical issue with clear requirements.

clarkevans commented 3 years ago

Simply, what is this proposal asking?

This proposal asks for a string literal alternative that uses paired Unicode delimiters, ⟪ and ⟫.

  1. The Unicode characters don't interfere with any existing usage or diminish any existing applications of Julia (it's sad there are no available visually succinct ASCII pairings).
  2. That the paired delimiter parsing algorithm permit nesting so that syntax notations can use those same delimiters in pairs without additional escaping.
  3. That all other ~escaping or nesting~ interpretation is delegated to the notation and is not Julia's responsibility.
  4. That perhaps we have some nice \<< and \>> keyboard shorthands for typing in the Unicode within the REPL.

This addresses a problem many of us have: non-standard string literals have rather surprising escaping semantics, which often leads us to the use of verbose triple double-quotes, and sometimes even this isn't sufficient. A limited subset of us also think that nesting of non-standard string literals is hindered and would like to use this opportunity to address this perceived deficiency.

vtjnash commented 3 years ago

Points 2 and 3 ~3 and 4~ are contradictory. You can either have Julia count them, or have an escape sequence, you can't have both.

There's a confusing claim on this thread that the current design was based on assumptions, without regard to practice. However, on the contrary, the current design was the replacement for the old design (which was more similar to python's), after running into issues with using it in practice.

Python's design has two usability flaws, not just one: as you mentioned, you can't end in a \, but nor can you embed a naked """ sequence. The former issue is often required when writing paths, the latter is occasionally required when embedding anything else complicated.

If it's important for the notion, the notation can provide its own mechanism

This contradicts your argument that the mechanism permits nesting arbitrary content, and instead shows why such a goal is unattainable. The choice isn't whether the content needs specialized mangling, but whether (a) the mangling is possible (b) the mangling is uniform. In python, the mangling is not possible as it can't encode some character sequences; in html, the mangling is not uniform as it depends on the consumer. In Julia, the current rules are both capable of mangling any string and doing so uniformly for all string macros, solving both problems.
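A sketch of that uniformity claim (my own illustration, not Julia's actual parser code): the forward mangling doubles each run of backslashes that precedes a quote (or the end of the string) and escapes the quote itself, so any content has a raw-literal spelling.

```julia
# Encode arbitrary content for use inside a raw"..." literal:
# double each backslash run that precedes a quote or the end,
# and escape the quote with one extra backslash.
function to_raw(s::AbstractString)
    out = IOBuffer()
    nbs = 0                       # pending backslash run
    for c in s
        if c == '\\'
            nbs += 1
        elseif c == '"'
            write(out, "\\"^(2nbs + 1), '"')
            nbs = 0
        else
            write(out, "\\"^nbs, c)
            nbs = 0
        end
    end
    write(out, "\\"^2nbs)         # trailing run sits before the closing quote
    String(take!(out))
end

@assert to_raw("a\\") == "a\\\\"  # trailing backslash must be doubled
@assert to_raw("\"") == "\\\""    # a quote gains one escaping backslash
@assert to_raw("\\a") == "\\a"    # interior backslash is left alone
```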

There's perhaps room for adding another delimiter, in addition to " and """. Currently also, those have the same escaping rules, which isn't essential, but does make it somewhat more memorable and simpler.

clarkevans commented 3 years ago

@vtjnash Point 4 is about the REPL. I've updated the comment.

This proposed delimiter pairing isn't there to replace double or triple quoted strings, it's there to complement them. In particular, I do not see the need nor do I wish for this method to include the ability to represent all Unicode code points.

1st -- if someone needs to represent all Unicode code points, they can do so using the double-quoted syntax (note that the raw string syntax can't represent some sequences such as '\0' ?)

2nd -- in most notations, there are escaping mechanisms that are included; for example, if you are encoding HTML content, you can use ampersand escape sequences.

3rd -- by using paired unicode delimiters, sub-strings using the same mechanism need not be encoded (parser increases a counter for each begin marker; decrements for each end marker; and completes the string when the depth is zero).
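The counting scan described above can be sketched as follows (my own illustration of the proposed rule, assuming the opening ⟪ has already been consumed):

```julia
# Read characters until the ⟪/⟫ nesting depth returns to zero.
# No escaping is recognized; an unterminated literal is an error.
function scan_paired(io::IO)
    buf = IOBuffer()
    depth = 1
    while depth > 0
        eof(io) && error("unterminated ⟪ ... ⟫ literal")
        c = read(io, Char)
        c == '⟪' && (depth += 1)
        c == '⟫' && (depth -= 1)
        depth > 0 && write(buf, c)   # the final ⟫ is not part of the content
    end
    String(take!(buf))
end

@assert scan_paired(IOBuffer("a⟪b⟫c⟫")) == "a⟪b⟫c"
```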

The only escaping challenge is when the Unicode begin/end delimiters occur unpaired in content. Let's suppose one needs to represent the opening ⟪ without the corresponding closing ⟫. In this case, if/how they are encoded depends upon the notation. If the notation is Python, presumably the character occurred within a string, in which case "\u27EA" could be used. If it's HTML content, you could write &#10218;; in CSS you could write \27EA.

If the notation doesn't have a way to represent this delimiter, yeah, you're out of luck. However, given the complexities of escaping, that's a perfectly fine rule. Especially when double quoted strings could be used as an alternative.
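For reference, the escapes mentioned above all denote the same codepoints, which is easy to verify in Julia (U+27EA and U+27EB are 10218 and 10219 in decimal, matching the HTML numeric references):

```julia
# ⟪ is U+27EA and ⟫ is U+27EB; &#10218;/&#10219; are the decimal forms.
@assert "\u27EA" == "⟪" && "\u27EB" == "⟫"
@assert Int('⟪') == 10218
@assert Int('⟫') == 10219
```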

vtjnash commented 3 years ago

Oops, got those numbers completely wrong, meant 2 and 3 (edited the above for posterity)

raw string syntax can't

This one it can do:

julia> a = raw""
"\0"

Though true it will have difficulty with non-UTF8 sequences.

clarkevans commented 3 years ago

@vtjnash Thanks. I don't see how 2 & 3 are contradictory. When the parser hits ⟪ it increases the depth; when it hits ⟫ it decreases the depth. When depth is 0, it packages up the string and sends it to the string macro for interpretation. How the content inside is interpreted is completely determined by the notation.

  1. Nested pairs of ⟪ and ⟫ will make it through unscathed. If the macro wishes to provide interpretation for them, so be it. If not, that's fine too.
  2. The issue is when the delimiters are used unpaired. In which case, you've got two choices: either the notation needs to provide a way to represent those characters (e.g. &#10218; for encoding within HTML), or the programmer will have to use double quoted form.

Sorry for all the repetition. One of these days I might say it in a manner that is clear. Critically, I think that double-quoting already has an escape mechanism. Therefore, as a complementary syntax, there's no need for this one to have the same semantics. If someone needs the double-quoting features, it's still there.

vtjnash commented 3 years ago

Being UTF-8 though, we could remove the validator, and permit any content. Currently though it gives errors such as:

julia> Meta.parse("raw\"\xefabc\"")
ERROR: Base.Meta.ParseError("invalid UTF-8 sequence")

julia> Meta.parse("raw\"\xef\"")
ERROR: Base.Meta.ParseError("extra token after end of expression")

Yes, but you're claiming that the content will be handled by a functionally divergent algorithm inside, but that the two results will be the same (that they'll agree on the end of stream count). That's not true even for html, which has context-sensitive parsing rules (A > or a " is only special inside of <, and this changes the escaping rules, so that it's non-trivial to determine "pairing")

clarkevans commented 3 years ago

Yes, but you're claiming that the content will be handled by a functionally divergent algorithm inside, but that the two results will be the same (that they'll agree on the end of stream count).

I'm assuming that the person working within the notation knows what they are doing.

Upstream, to demonstrate this point I've used the example of <script> that marks the beginning of a script within HTML; there is no escaping within this parsing context. HTML knows it hit the end of the script when it encounters </script>. Javascript programmers who need to represent "</script>" simply write it using "<\/script>" which, in the context of Javascript is the same as "</script>".

Is it exactly the same? No. Does the distinction matter for any practical purpose? No.

That's not true even for html, which has context-sensitive parsing rules (A > or a " is only special inside of <, and this changes the escaping rules, so that it's non-trivial to determine "pairing")

I actually don't understand your point.

The places where ampersand escaping of ⟪ would fail in HTML are also places where this character is actually invalid. Once again, the point is that Julia doesn't have to know or care what goes on inside the notation. That's between the notation and the developer.

Moreover, if the notation can't represent the delimiter ... there is always double quoting. This is a complementary approach, and not being able to represent every character is perfectly acceptable. Being a different mechanism, it can draw on a completely different compromise position: better at some things and, well, worse at others.

vtjnash commented 3 years ago

w3 says this is valid: https://validator.w3.org/nu/#textarea

<!DOCTYPE html>
<html lang="en"><title>foo</title><body>
<div class='>'>></div>
</body></html>

while your proposed rules would accidentally truncate the document at the end of the div tag—because Julia does know and care about the content (it is counting the observed <> pairs)
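A quick sketch of the miscount being described, applying the proposed counting rule hypothetically to <...> as the delimiter pair (Python, my illustration only):

```python
def naive_scan(src, open_ch="<", close_ch=">"):
    """Depth-counting scan; depth starts at 1 because the opening
    delimiter was already consumed."""
    depth = 1
    for i, ch in enumerate(src):
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth == 0:
                return src[:i]
    raise SyntaxError("unmatched delimiter")

# The > inside class='>' is counted like any other character, so the
# literal is cut short in the middle of the div element:
naive_scan("<div class='>'>x</div>")  # -> "<div class='>'"
```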

clarkevans commented 3 years ago

while your proposed rules would accidentally truncate the document at the end of the div tag—because Julia does know and care about the content (it is counting the observed <> pairs)

I actually don't know what you're talking about. Let's take this example...

html⟪
<!DOCTYPE html>
<html lang="en"><title>foo</title><body>
<div> We start non-standard string literals with <code>&#10218;</code></div>
</body></html>
⟫

Under what circumstance would you expect a problem?

From Julia's perspective, I cannot use ⟪ directly in my example because it isn't paired... the parser will end up scanning to the end of my file complaining about my missing ⟫. However, this turns out not to be a problem for this particular notation, since I could easily include ⟪ in the HTML content using ampersand escaping, &#10218;.

clarkevans commented 3 years ago

@vtjnash Oh! I think I get what you're saying. You're using < and > as delimiters in Julia land rather than the Unicode ones presented here? OK. I can run with that. So, let's say that html is the string literal macro. In this case, the person writing the content would need to escape the unbalanced > for it to survive. One could encode > using &gt; in HTML.

html<
<!DOCTYPE html>
<html lang="en"><title>foo</title><body>
<div class='&gt;'>&gt;</div>
</body></html>
>

It's not like forgetting to encode these two &gt; would go unnoticed. The Julia parser would hit the > ending the div tag and then run into a syntax error of its own -- this would happen well before mangled HTML was produced. Conversely, if there were an extra < anywhere, the Julia parser would scan to the end of the document and bark there instead. Regardless of the situation, forgetting to do balanced escaping will lead to parser errors in Julia, well before it ever mangles output in an HTML browser.

Anyway... we're not talking about letting < and > be the actual delimiters, though; the delimiters picked are Unicode and extremely unlikely to collide with existing formats. Regardless, I like your mental exercise: it presumes a time and place when the same Unicode pair might be used by the notation. In this case, there's no real issue... provided the notation has its own mechanism for escaping said delimiters. If not, there's always "raw strings".

clarkevans commented 3 years ago

@vtjnash wrote:

There's a confusing claim on this thread that the current design was based on assumptions, without regard to practice. However, on the contrary, the current design was the replacement for the old design (which was more similar to python's), after running into issues with using it in practice.

I see from reading ticket #22926 that the existing raw string syntax was a hard-won compromise, and of course it was based upon experience that came from practice. My remarks stated that the design of raw string notation included the assumption that the delimiters themselves be representable. I do apologize if my remarks seemed disparaging; I certainly don't mean to diminish the work put into the design.

Python's design has two usability flaws, not just one: as you mentioned, you can't end in a \, but nor can you embed a naked """ sequence. The former issue is often required when writing paths, the later is occasionally required when embedding anything else complicated.

Yes, and Julia's design of raw strings has other usability flaws as remarked upon in comments in this ticket. Notably, regular, non-raw double quoted strings already have the requirement that the delimiter be representable; in Python, users can still add on that trailing "\" with little irritation. Julia's experience, by contrast, seems jarring. Anyway, I should have more carefully chosen my words.

Given that this proposal is complementary and not meant to replace either double quoted form or raw string forms, I also think it's a perfect place to ask what it means to have complementary syntax rules. Rather than assuming we need an escape mechanism, this proposal relies upon paired delimiters. It also builds upon experience working with existing notations: they often have their own escape mechanisms that can be leaned upon, freeing Julia to not be burdened with more complex rules. Of course, this is possible only because there is a fallback, double quotes or the raw string syntax.

clarkevans commented 3 years ago

Since I find myself repeating things in comments, it means that the opening of this ticket was poorly written. I've tried to improve the opening post with an attempt at a clearer technical description of the proposal, without removing anything commented upon.

tpapp commented 3 years ago

If this was an "incidental gimmick", why was it included in the first place?

Dunno, one would have to ask the language creators. I would guess that it was inspired by a similar feature that various Lisps have.

Also, note that the above is just my opinion. I would compare non-standard literals to the 17th tool on a Swiss army knife: conceivably useful, but if you didn't have it you wouldn't really miss it.

Why do other programming languages increasingly support these sorts of facilities?

I don't think many languages have a facility like the one you are proposing (if that's incorrect, I would love to hear about them.) Some languages have something similar.

Don't get me wrong, I am not questioning that facilities like this can be useful under some circumstances. I just don't think they are so generally useful that it justifies adding them to a language like Julia which does not focus on string manipulation, as each feature carries a cost in the long run.

clarkevans commented 3 years ago

Tamas, I like your response. I do agree this comes down to a value judgment about whether having an additional way to represent raw strings is merited. I know there are others on this list who agree with you, and I absolutely respect that call. Even if someone chooses not to use the syntax in their own work, there is additional semantic burden created, as most people have to read and maintain others' code.

Similar sensibilities are in play for abbreviated currying forms, and whatnot. Those who see themselves as potential beneficiaries of a new syntax are enthusiastic; those who look at it as increasing mental burden would rather not. This is why I like Steven's rubric that a new syntax should be memorable, comprehensible and readable. Incidentally, by this test, I don't think the current raw string semantics pass, as the rules are not very memorable (even if they are comprehensible to some) -- they look almost exactly like double-quoted strings, but have very different processing.

To clarify my previous comments, and with the benefit of hindsight, I do think that Julia's raw string syntax could have been done differently at the time. In #22926, Keno listed as option number 4 "Disallow " in custom string literals entirely." In this case, a regular expression matching a quoted string could always be written r"([\x22'])(?:\\?+.)*?\1". The triple double-quoted form could have simply not permitted the triple double quotes or the string to end with a quote. In this case, one could represent unencodable strings through concatenation, raw"""He said: "Hello""" * '"'. I would argue this would make the syntax an actual raw string and would complement existing double quoted forms. Regardless, it's too late for the existing raw string syntax to adopt this behavior, but a new syntax could.

But as you remark, is an additional syntax for string literals essential? Absolutely not. The world will go on. That said, while Julia's initial use cases were more based upon those who needed to crunch numbers, many newer works using Julia involve integration with other text-based protocols and systems.

Thanks again for your thoughtful comments.

clarkevans commented 3 years ago

I had a lovely chat with @vtjnash on slack. I think there is not a technical conflict so much as a conflict of expectations and terminology.

This proposed mechanism differs from other string literal formats in that, from Julia's perspective, there is deliberately no universal escaping mechanism that would permit the representation of unbalanced delimiters. This may violate people's assumptions. However, this proposal relies upon three mitigating circumstances: the delimiters are Unicode, and hence unlikely to collide with existing textual formats; due to the pairing algorithm, there is no challenge when balanced pairs of the delimiters occur within the string literal; and most interesting notations have their own form of escaping by which these delimiters can be expressed in a manner native to the content.

That said, from the user's perspective, unbalanced delimiters can often be represented just fine. Within regular expressions, [\u27ea] can be used to match ⟪. Within HTML, this opening delimiter could be written &#10218;. Within a Python notation, the need for these delimiters would most likely arise inside a string literal, e.g. "\u27EA". Within a UTF-8 encoded URL, %E2%9F%AA. While the exact mechanics of escaping the unbalanced delimiter vary by notation, it is usually possible.
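These per-notation spellings are easy to check; a quick Python confirmation that each one decodes to the opening delimiter U+27EA (standard-library calls only):

```python
import html
import re
import urllib.parse

DELIM = "\u27ea"  # the opening delimiter ⟪

# HTML character reference (10218 is the decimal form of 0x27EA)
assert html.unescape("&#10218;") == DELIM

# UTF-8 percent-encoding within a URL (E2 9F AA is UTF-8 for U+27EA)
assert urllib.parse.unquote("%E2%9F%AA") == DELIM

# Regular-expression character class
assert re.fullmatch(r"[\u27ea]", DELIM) is not None

print("all spellings decode to", DELIM)
```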

Further, as this literal format is a complement to and not a replacement for existing literal syntax, there is a fallback: one can always be creative by using string concatenation with regular double-quoted strings, which can include the delimiters and, indeed, any Unicode code point without challenge.

...

If we can agree that there are no technical challenges here, but rather a conflict of expectations, and move on to other decision criteria, that'd be helpful. I think there are three primary classes of proponents.

1) Some of us have regular expressions, file paths, and other foreign syntax where we wish to simply copy known working fragments (often containing combinations of double quote and backslash characters) without having to discover how the existing raw string escaping rules would apply. Examples such as r⟪(["'])(?:\\?+.)*?\1⟫ and raw⟪C:\Users\User\Dropbox\codes\JULIA\stock1\⟫ would work out of the box. See also @heetbeet's Python example.

2) Some of us have string literal notations where we believe the ability to easily nest them would be beneficial. HypertextLiteral's non-standard string literal falls into this category. Moreover, in these cases triple double-quote form is often required since content includes a double quote, and this could be made much more succinct with matching Unicode delimiters.

3) Some of us think that the existing raw-string escaping rules are unsound, and that they may cause confusion through the re-use of the double quote character with alternative escaping semantics. For these proponents, a differentiated syntax with complementary semantics, especially one that implies an opening and a closing, becomes memorable, comprehensible and readable.

I don't want to speak for the detractors of this proposal. However, it is understandable to me that some are skeptical that the benefits explained here will be substantial or realized. Moreover, I can understand those who describe a community cost: even if one chooses not to use this format in one's own work, one will still have to encounter it in examples and maintain the code of people who do use the proposed functionality.

heetbeet commented 3 years ago

Here is a naive implementation of this feature if you want to test it out: https://github.com/heetbeet/julia/tree/add-bracketed-quote-syntax

I've immediately noticed that Windows cmd doesn't support those characters, so if you are on Windows, open julia in a git bash to ensure it renders correctly.

clarkevans commented 3 years ago

I've immediately noticed that Windows cmd doesn't support those characters, so if you are on Windows, open julia in a git bash to ensure it renders correctly.

Looks like character selection might be another process.

Anyway, it works beautifully with your patch! The sample code I was testing with worked out of the box.

build_result(d) = htl⟪
  <html>
    <head><title>$("Customers & Employees")</title></head>
    <body>
    $(htl⟪
        <dl>
          <dt>Company<dd>$(c.company)
          <dt>Phrase<dd>$(c.phrase)
          <dt>Active Since<dd>$(c.active)
          <dt>Employees<dd>
            <table>
              <tr><th>Last Name<th>First Name<th>Title
                  <th>E-Mail<th>Office Phone<th>Cell Phone
                  <th>Comments</tr>
               $(htl⟪
                <tr><td>$(e.last_name)<td>$(e.first_name)<td>$(e.title)
                    <td><a href='mailto:$(e.email)'>$(e.email)</a>
                    <td>$(e.main_number)<td>$(e.cell_phone)
                    <td>$(htl⟪<span>$c</span>⟫ for c in e.comments)
               ⟫ for e in c.employees)
            </table>
        </dl>⟫ for c in d)
    </body>
  </html>
⟫

This @htl_str notation works beautifully... it's a cousin of @htl which takes the string, splits on $, and passes each fragment to Meta.parse. The rest of the work is identical to the macro. Yesterday, I had a forked notation that would implement htl⟪...⟫ locally; however, it relied on hacks and still required the top-level triple-quote; moreover, a real implementation would trip up wherever a slash precedes a double-quote character. With the heavy lifting done at a lower level, different notations could freely intermix. Lovely.
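For readers curious what "splits on $" involves, here is a rough fragment-splitter sketch in Python (a simplification of my own; the real implementation hands each expression fragment to Meta.parse, and this sketch only handles the $( ... ) form):

```python
def split_interpolations(s):
    """Split a template into literal text and $(...)-interpolation
    fragments, respecting nested parentheses inside the expression."""
    frags, i, start = [], 0, 0
    while i < len(s):
        if s[i] == "$" and i + 1 < len(s) and s[i + 1] == "(":
            if i > start:
                frags.append(("text", s[start:i]))
            depth, j = 1, i + 2
            while j < len(s) and depth > 0:
                depth += {"(": 1, ")": -1}.get(s[j], 0)
                j += 1
            frags.append(("expr", s[i + 2:j - 1]))
            i = start = j
        else:
            i += 1
    if start < len(s):
        frags.append(("text", s[start:]))
    return frags

split_interpolations("<span>$(c.phrase)</span>")
# -> [('text', '<span>'), ('expr', 'c.phrase'), ('text', '</span>')]
```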

heetbeet commented 3 years ago

The rules for this syntax are simpler than those of the other string syntaxes, and it ended up being easier to implement than I expected (though it was the first Lisp code I wrote https://xkcd.com/297 ). Man, Julia is such a cool language! I could literally just jump in and write new parsing rules. I would not even try with another language, let alone succeed.

Yes, we would definitely need to look at character selection if Windows doesn't play nice. We could consider « and »; they are from the extended ASCII range and labelled as "quotes and parenthesis". I did like the idea of getting an obscure character from the Unicode range, though. Maybe there's a way to add a layer to the Julia REPL to play nice with the Windows command prompt. Unfortunately, you would then still see a lot of █-type output in logs and when running Julia programs outside of the REPL.

Do note, with my current implementation it is implemented as both a string syntax and as macro-str syntax. I.e. both these are valid:

julia> ⟪hello⟫
"hello"

julia> raw⟪hello⟫
"hello"

It was just easier to mimic the other string implementations fully. Maybe it would be better to disallow the naked ⟪hello⟫ string syntax and only keep the raw⟪hello⟫ macro-str syntax, to minimize the amount of change to the language. Then, if bracketed macro strings prove popular and mature enough, we could debate whether the naked ⟪hello⟫ string syntax is necessary.

clarkevans commented 3 years ago

The issue observed in the discussion forum is that « (0xAB) and » (0xBB) are in regular use in many regions of the world; worse, some regions use them in the opposite order, e.g. He said: »Hello!«, which would defeat copy/pasting literal Croatian or Danish texts without notation-level escaping. There was also resistance on the discussion forums to people casually adopting an alternative, randomly switching back and forth between syntax forms. Hence, I think one might want not to enable this for regular string syntax, to make it less competitive with double-quoted strings.

heetbeet commented 3 years ago

Okay, yes, I agree: normal language should be able to be copy-pasted into this syntax raw. Damn Windows terminal.

clarkevans commented 3 years ago

To provide an apples-to-apples comparison, and to provide a counter argument, here is the equivalent nested @htl macro form... that already works on Julia v1+.

build_result(d) = @htl("""
  <html>
    <head><title>Customers &amp; Employees</title></head>
    <body>
    $((@htl("""
        <dl>
          <dt>Company<dd>$(c.company)
          <dt>Phrase<dd>$(c.phrase)
          <dt>Active Since<dd>$(c.active)
          <dt>Employees<dd>
            <table>
              <tr><th>Last Name<th>First Name<th>Title
                  <th>E-Mail<th>Office Phone<th>Cell Phone
                  <th>Comments</tr>
               $((@htl("""
                <tr><td>$(e.last_name)<td>$(e.first_name)<td>$(e.title)
                    <td><a href='mailto:$(e.email)'>$(e.email)</a>
                    <td>$(e.main_number)<td>$(e.cell_phone)
                    <td>$((@htl("<span>$c</span>") for c in e.comments))
               """) for e in c.employees))
            </table>
        </dl>""") for c in d))
    </body>
  </html>
""")

I think the proposed notation is much cleaner, since (@htl(""" … """)… ) is significant boilerplate compared to htl⟪…⟫…. However, as seen in this comparison, the proposed notation is an incremental improvement. That said, the macro-oriented approach shown above is only possible because normal use of HTML doesn't include backslashes (unlike regex) and because the backbone of HTL already uses the Julia parser. I don't think this macro-oriented approach with standard Julia triple double-quoted strings is a generalized solution in the way the current proposal is.

clarkevans commented 3 years ago

Note that @mgkuhn reported a related escaping inconsistency separately as #39092. Thanks!

clarkevans commented 3 years ago

To complement @heetbeet's implementation, a PR should have some documentation. Here's a thought.

help?> ⟪
  raw⟪non-standard string literal⟫

  Unicode delimiters ⟪ and ⟫ are prefixed with a string macro, such as `@raw_str`.
  Rather than using regular interpolation and unescaping, this syntax is used to
  input a literal value so that the string macro can provide further interpretation,
  if any. Unmatched delimiters are not permitted.

  Also see `"` and `"""`  as alternative string literal notation.

  Examples
  ≡≡≡≡≡≡≡≡≡≡
  julia> println(raw⟪C:\my\path\⟫)
  C:\my\path\

  julia> println(b⟪\"⟫)
  UInt8[0x5c, 0x22]

  julia> println(raw⟪raw⟪nested⟫⟫)
  raw⟪nested⟫

  julia> println(raw⟪unmatched⟫⟫)
  ERROR: syntax: unmatched ⟫ delimiter
mgkuhn commented 3 years ago

If we are going to have additional delimiters (and associated escaping conventions) for non-standard string literals, then the macros being applied by these literals should have access to the information about which delimiters were used, and therefore (by implication) what escape processing has already happened. This is because some macro authors may want to revert some of the escape processing that has been triggered by some of the delimiters before the macro is applied, such that it does not duplicate subsequent escape processing by the macro, e.g. as demonstrated in #39092 for b"...".

I can think of two ways to supply such information:

Passing on to the non-standard-string-literal the information about which delimiter was used also opens up the possibility that the choice of delimiter can be used to change the semantics of the string, e.g. b“hello” and b‘hello’ might apply different forms of interpolation.
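To illustrate delimiter-driven dispatch, here is a toy sketch in Python (the handler table, function name, and the stand-in semantics are all mine; in Julia this would fall out of multiple dispatch on the delimiter arguments):

```python
# Hypothetical dispatch table: delimiter pair -> processing mode.
# The upper/lower transforms merely stand in for, e.g., different
# interpolation semantics for b“hello” versus b‘hello’.
HANDLERS = {
    ("\u201c", "\u201d"): lambda s: s.upper(),  # “...”
    ("\u2018", "\u2019"): lambda s: s.lower(),  # ‘...’
}

def apply_literal(content, ldelim, rdelim):
    """Route a literal's content to a handler chosen by its delimiters."""
    return HANDLERS[(ldelim, rdelim)](content)

apply_literal("Hello", "\u201c", "\u201d")  # -> "HELLO"
```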

mgkuhn commented 3 years ago

Regarding the choice of new delimiter pairs: rather than having just a single one hardcoded, such as ⟪...⟫ or “...”, there could be a choice of several delimiters. This will make it easier for authors to pick a delimiter pair that is

There is unlikely to be one single delimiter pair that optimizes all of these criteria. The Ps and Pe (punctuation start/end) categories in the UnicodeData.txt database are a rich source of such pairs.

Another rich source of paired delimiters are ASCII digraphs, such as <?...?> (used for SGML processing instructions), <%...%>, etc.

Regarding prior art in other languages: the Wikipedia article Delimiter gives an overview. (But keep in mind that non-standard string literals are a Julia speciality not found in most other languages, so requirements here may be a bit more complex due to chains of escape processing being more likely.)

tpapp commented 3 years ago

Once we allow a zoo of delimiters, why not just allow arbitrary markers? Eg Bash-style markers along the lines of

<<MARKER
text that does not contain marker
MARKER

are already familiar to a lot of people (you usually see them as <<EOF, but that's just a convention).
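The scanning rule for such markers is even simpler than delimiter counting: accumulate lines until one matches the marker exactly. A minimal sketch in Python (marker matching only; no interpolation or indentation handling):

```python
def read_heredoc(lines, marker="MARKER"):
    """Collect lines until one consisting solely of the marker."""
    body = []
    for line in lines:
        if line.rstrip("\n") == marker:
            return "".join(body)
        body.append(line)
    raise SyntaxError(f"unterminated here doc; expected {marker}")

read_heredoc(["text that does not contain marker\n", "MARKER\n"])
# -> "text that does not contain marker\n"
```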

heetbeet commented 3 years ago

Issue with mixing string semantics

@mgkuhn , after pondering issue #39092 I am starting to see where things might go wrong. It seems that *_str macro authors have a tradition of rectifying escape semantics to allow for Julia-like string escaping. This is done by writing macros such as this:

macro foo_str(s)
   v = foo_converter(unescape_string(Base.escape_raw_string(s)))
   QuoteNode(v)
end

If successful, one can have what #39092 seems to be aiming for (note that the following example doesn't touch on the semantics of the metacharacter $):

b"""" """ == codeunits("""" """) #true
b"""\" \\""" == codeunits("""\" \\""") #true
b"\" \\" == codeunits("\" \\") # true
b"\\\\\\" == codeunits("\\\\\\") #true
b"""\\\" \\""" == codeunits("""\\\" \\""") #true
b"""\\\\" \\""" == codeunits("""\\\\" \\""") #true

but with all this character escaping and unescaping, the behavior of ⟪⟫-strings is very unexpected:

# Few expected examples
b⟪" ⟫ == codeunits(⟪" ⟫) #true
b⟪\" \\⟫ == codeunits(⟪\" \\⟫) #true
b⟪\" \\⟫ == codeunits(⟪\" \\⟫) # true
b⟪\\\\\\⟫ == codeunits(⟪\\\\\\⟫) #true
b⟪\\\" \\⟫ == codeunits(⟪\\\" \\⟫) #true
b⟪\\\\" \\⟫ == codeunits(⟪\\\\" \\⟫) #true

# Adding character mappings like \n results in unexpected behaviour
b⟪\n⟫ == codeunits(⟪\n⟫) #false
b⟪\n⟫ == codeunits("\n") #true
b⟪\\n⟫ == codeunits(⟪\\n⟫) #false
b⟪\\n⟫ == codeunits("\\n") #true
b⟪\\⟫ == codeunits(⟪\\⟫) #true
b⟪\\⟫ == codeunits("\\") #false

(the code was run under https://github.com/heetbeet/julia/tree/add-bracketed-quote-syntax with the added b_str fix mentioned in #39092)

Possible integration

I think the aim of this thread is to process data inline in its rawest form and to enhance the flexibility of DSLs. The collective thought was to add a new string syntax like ⟪⟫ and to expand the *_str macro notation using it. But maybe it's better to just add a syntax that allows pushing raw data to macros, keeping it more in line with macros than with the *_str syntax. I propose the following parsing rule (by example):

macro raw_delimited(data, args...)
    data
end

@raw⟪a = b + c⟫  ->  @raw_delimited("a = b + c", '⟪', '⟫')  ->  "a = b + c"
@raw᚜a = b + c᚛  ->  @raw_delimited("a = b + c", '᚜', '᚛')  ->  "a = b + c"
@raw“a = b + c”  ->  @raw_delimited("a = b + c", '“', '”')  ->  "a = b + c"

You will be able to write macros that work on arbitrary data by adding a <name>_delimited macro to your global scope. Furthermore, the macro has access to the delimiter pair, in order to implement nesting semantics. You will even be able to write multi-method macros that dispatch on the value of the delimiters.

We can then write macros that are specialized for raw data and, as a consequence, also be able to get a raw string from a blob of text (e.g. \\"""\\\n$dsf from @raw⟪\\"""\\\n$dsf⟫). The parser can keep a list of allowable delimiters that are hand-picked from something like this: https://gist.github.com/claybridges/8f9d51a1dc365f2e64fa . I think there is some character real estate left for something like this.

heetbeet commented 3 years ago

An alternative might be to allow overloading the *_str macros with two additional arguments and use that as a syntactic dispatch rule. The syntax can then also be more in line with the foo"" syntax.

macro raw_str(data, ldelim, rdelim)
    data
end

raw⟪a = b + c⟫  ->  @raw_str("a = b + c", '⟪', '⟫')  ->  "a = b + c"
raw᚜a = b + c᚛  ->  @raw_str("a = b + c", '᚜', '᚛')  ->  "a = b + c"
raw“a = b + c”  ->  @raw_str("a = b + c", '“', '”')  ->  "a = b + c"

If you are not planning on doing any parsing on the input, you can even define a catch-all macro by adding default arguments to the delimiters:

macro foo(data, ldelim=nothing, rdelim=nothing)
    println(data)
end

foo⟪hello⟫ #hello
foo"hello" #hello

edit: when I tried to implement this and the previous post, I realized that I forgot to include the suffix notation x"foobar"y. My final design also ended up having to allow for 1-, 2-, 3-, and 4-argument str macros, with some introspection that needs to be done at compile time rather than parse time.

mgkuhn commented 3 years ago

@tpapp So-called here docs, which are introduced in the POSIX shell and in Perl with <<, come in multiple forms. If you write <<'EOF' you get them without interpolation and \ is no metacharacter, with <<"EOF" you get them with interpolation and backslash sequences, and if EOF is surrounded by backticks, they are executed as shell commands (see the <<EOF section in man perlop). In Perl, <<~EOF starts an indented here doc, where the whitespace before the terminator is removed from all lines. They have their own constraints, e.g. you can't have a here docs that does not end in a line feed. And you can stack multiple here docs in one line.

Something like here docs might be a useful extension to Julia, but again you can't simply copy them from other languages, because in Julia you also have the question of how they will interact with special-string literals.