JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

@b_str removes backslashes twice #39092

Open mgkuhn opened 3 years ago

mgkuhn commented 3 years ago

The byte-array literals syntax

julia> @show b"hi\n";
b"hi\n" = UInt8[0x68, 0x69, 0x0a]

is currently implemented as

"""
    @b_str

Create an immutable byte (`UInt8`) vector using string syntax.
"""
macro b_str(s)
    v = codeunits(unescape_string(s))
    QuoteNode(v)
end

This implementation hides a rather counter-intuitive and undocumented property: in certain situations, the unescaping procedure to remove backslashes is applied twice. As a result, a user needs to use no less than five (5) backslashes to obtain the byte sequence of the ASCII string \":

julia> @show b"\\\\\"";
b"\\\\\"" = UInt8[0x5c, 0x22]

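Julia's raw strings use the following escaping rule: 2n backslashes followed by a quote encode n backslashes and the end of the literal, while 2n+1 backslashes followed by a quote encode n backslashes followed by a quote character.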

(This is also the escaping mechanism that the Microsoft C runtime library uses when parsing quoted strings from the Windows command line into argv.)
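As a quick check of the 2n vs 2n+1 rule, the following sketch compares a few raw string literals. The variable names are illustrative only.

```julia
# Backslashes that do not precede a quote are kept literally.
plain = raw"a\\b"        # four characters: a \ \ b

# 2n backslashes before the closing quote: n backslashes, then the literal ends.
even = raw"\\"           # one backslash

# 2n+1 backslashes before a quote: n backslashes plus a literal quote.
odd = raw"\\\""          # a backslash followed by "

@show length(plain) codeunits(even) codeunits(odd);
```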

This removal of backslashes before " occurs not only in raw strings, but in all non-standard string literals, which are just macros ending in _str. This can be seen from the trivial implementation of the macro behind raw string literals, which is just the identity function:

macro raw_str(s); s; end

Therefore, when b"\\\\\"" is processed, backslashes are removed in the following two steps:

  1. The raw-string parser replaces 5 = 2×2+1 backslashes in front of the " with 2 backslashes
  2. The call to the unescape_string() function by macro @b_str() replaces the remaining \\ with \.
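The two steps can be reproduced by hand. In this sketch, raw"..." stands in for what the parser hands to @b_str (an assumption that rests on @raw_str being the identity macro, as noted above); the variable names are illustrative.

```julia
# Step 1: the parser turns five backslashes and a quote into two backslashes
# and a quote before any macro sees the string.
received = raw"\\\\\""

# Step 2: unescape_string, as called inside @b_str, reduces the backslashes again.
unescaped = unescape_string(received)

@show collect(codeunits(received)) collect(codeunits(unescaped));
```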

This duplicate backslash reduction is entirely unnecessary in non-standard string literals where the corresponding macro calls unescape_string(), because that function already performs the same \\ → \ and \" → " mapping that is behind the 2n+1 rule of the raw-string processing. This redundant processing is also likely to surprise users, especially since the documentation does not warn about it at all. It certainly surprised me!

There is a simple workaround in the case of @b_str(), namely to undo the backslash removal performed by the raw-string processing, using Base.escape_raw_string:

import Base.@b_str
macro b_str(s)
    v = codeunits(unescape_string(Base.escape_raw_string(s)))
    QuoteNode(v)
end

Now we get

julia> @show b"\\\"";
b"\\\"" = UInt8[0x5c, 0x22]

julia> @show b"\\\\\"";
b"\\\\\"" = UInt8[0x5c, 0x5c, 0x22]

which seems much more intuitive and unsurprising.
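To see why the workaround helps, it can be traced on the three-backslash input. This is only a sketch: it assumes that Base.escape_raw_string (an unexported function) puts back exactly the backslashes the parser removed for this input, and raw"..." again stands in for the string the macro receives.

```julia
received = raw"\\\""                          # the macro sees: \ "
restored = Base.escape_raw_string(received)   # puts the removed backslashes back
bytes = collect(codeunits(unescape_string(restored)))

@show restored bytes;
```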

But @b_str() may be just one example of a type of non-standard string literal that further processes the string received with unescape_string(), or with any other function that uses backslashes as escape symbols, and therefore performs the same \\ → \ and \" → " mapping. If this is indeed the case, then perhaps the compiler mechanics behind non-standard string literals should not remove any backslashes at all, and leave this to the author of the macro? The 2n+1 vs 2n rule would then merely be used to identify the terminating quotation mark, but all characters before that would be passed through to the macro untouched.

vtjnash commented 3 years ago

this is intentional, and, I believe, documented

clarkevans commented 3 years ago

@vtjnash The help for @b_str makes no reference to the semantics of @raw_str nor does it have examples that demonstrate these edge cases. Perhaps the only way to address this is to better document the unexpected behavior. Generally, ensure that all string macros, such as regex, provide documentation of these edge cases?

clarkevans commented 3 years ago

@mgkuhn You are attempting to make this invariant hold?
@b_str("…") == b"…" for all "…"

heetbeet commented 3 years ago

Should the invariant also hold that @b_str("""…""") == b"""…""" for all """…"""

Because I don't think you will be able to achieve both. I can't find a good example to explain my suspicion though.

mgkuhn commented 3 years ago

@clarkevans @heetbeet No, your suggested invariants are neither reasonable goals nor achievable: "..." interprets backslashes and so does @b_str, so combining both in @b_str("...") will still interpret backslashes twice:

julia> @show @b_str("\x5c\x6e");
b"\n" = UInt8[0x0a]

julia> @show b"\x5c\x6e";
b"\x5c\x6e" = UInt8[0x5c, 0x6e]

(Same with triple quotes, which make no difference here.)

heetbeet commented 3 years ago

Okay I see I made a mistake in my code, and I expect the same happened to @clarkevans. Let's try again.

Should the invariant hold that

codeunits("…") == b"…" for all "…"
codeunits("""…""") == b"""…""" for all """…"""

heetbeet commented 3 years ago

In my initial post I expected that this cannot hold for the " and """ syntax simultaneously. But after consideration I changed my position: I couldn't find any counterexample to support my claim. It seems both use the same escape semantics, except that """ allows for unescaped ". Since it still supports the escape \" -> ", I think a mapping can be built that supports both, because any raw " received can be made escape-proof by turning it into \". I'll try to add code examples.

heetbeet commented 3 years ago

@mgkuhn seems like your revised code has exactly this property for the example I tried:

Before fixing b_str

b"""" """ == codeunits("""" """) #true
b"""\" \\""" == codeunits("""\" \\""") #true
b"\" \\" == codeunits("\" \\") # true
b"\\\\\\" == codeunits("\\\\\\") #false
b"""\\\" \\""" == codeunits("""\\\" \\""") #false
b"""\\\\" \\""" == codeunits("""\\\\" \\""") #false

After fixing b_str

import Base.@b_str

macro b_str(s)
   v = codeunits(unescape_string(Base.escape_raw_string(s)))
   QuoteNode(v)
end

b"""" """ == codeunits("""" """) #true
b"""\" \\""" == codeunits("""\" \\""") #true
b"\" \\" == codeunits("\" \\") # true
b"\\\\\\" == codeunits("\\\\\\") #true
b"""\\\" \\""" == codeunits("""\\\" \\""") #true
b"""\\\\" \\""" == codeunits("""\\\\" \\""") #true

mgkuhn commented 3 years ago

@heetbeet None of your invariants can be true unless you exclude $, because "..." can also interpolate variable expressions (i.e., $ is a meta-character that splits what looks like a string literal into an array of values and wraps that with function calls that iterate over that array and join it with Base.print_to_string to a dynamically allocated string at runtime), whereas special-string literals do not interpolate (because they always are raw strings), and therefore can be processed by macros as compile-time string literals:

julia> a=1;

julia> @show b"$a";
b"$a" = UInt8[0x24, 0x61]

julia> @show codeunits("$a");
codeunits("$(a)") = UInt8[0x31]

Same for """, which again makes no difference here.
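The difference is already visible at parse time. As an illustration (using Meta.parse on source text; the variable names are made up here), an ordinary literal containing $ becomes a :string expression, while the b-prefixed literal becomes a macro call whose argument is a plain string that still contains the $:

```julia
# Source text "$a": parsed into an interpolation expression.
interp = Meta.parse("\"\$a\"")

# Source text b"$a": parsed into a call to @b_str with the literal text as argument.
literal = Meta.parse("b\"\$a\"")

@show interp.head interp.args literal.head literal.args[end];
```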

(I see how the discussion here evolves once more as evidence for widespread misunderstandings of how Julia's many different string literals work and relate to each other.)

heetbeet commented 3 years ago

I see what you mean, I forgot about those.

mgkuhn commented 3 years ago

@vtjnash What was the design rationale for the current behaviour?

Wouldn't it be cleaner to separate for special strings the following two operations:

  1. decide where the end delimiter is (done by the parser, using the 2n(+1) backslashes rule), and
  2. the interpretation and substitution of backslashes as escape characters (done by the special-string literal macro)

?

This separation could be introduced in a non-breaking way by offering a new, alternative interface for special-string literal macros, such that existing string literal macros continue to receive what they get at present (i.e., some backslashes removed).
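A rough sketch of what step 1 on its own could look like: the function below only locates the terminating quote using the 2n vs 2n+1 rule and leaves every character before it untouched. closing_quote_index is a hypothetical helper written for this illustration; it is not part of Julia or its parser.

```julia
# Hypothetical helper (not a Julia API): `src` is the text that follows an
# opening quote. Returns the index of the terminating quote, or nothing.
function closing_quote_index(src::AbstractString)
    backslashes = 0                      # length of the current backslash run
    for (i, c) in pairs(src)
        if c == '\\'
            backslashes += 1
        elseif c == '"' && iseven(backslashes)
            return i                     # 2n backslashes: this quote terminates
        else
            backslashes = 0              # escaped quote or any other character
        end
    end
    return nothing
end

src = "\\"^5 * "\"\" tail"               # the characters \\\\\"" tail
stop = closing_quote_index(src)
body = src[1:prevind(src, stop)]         # passed on with all backslashes intact
bytes = collect(codeunits(unescape_string(body)))
@show stop body bytes;
```

With the backslashes left in place, a macro that calls unescape_string on body performs the only round of backslash removal.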

c42f commented 2 years ago

Perhaps a sensible and generic fix for these kind of woes is to allow more flexibility in the string delimiters for custom string macros? (Also related #41041)

Then individual string macros wouldn't need weird heuristics to avoid double escaping - the generic answer if the user is having escaping issues would be to use another set of delimiters. Which exact delimiters are available? One possibility could be that either `` or "" begins a string when it's followed by the opposite quote type, with the (reversed/same?) delimiter at the other end of the string. It's currently a syntax error to juxtapose string literals, so this syntax is probably available unless I've forgotten something. For example, the string ``"hi"``:

julia> :(x``"hi"``)
ERROR: syntax: cannot juxtapose string literal
Stacktrace:
 [1] top-level scope
   @ none:1

I'm imagining this mixed delimiter being parsed as @x_cmd "hi", given that the quote starts with a backtick; it could be @x_str when the quote starts with ".

The rule might be that mixed delimiters can be an arbitrarily long sequence of length at least 3, and the user can always arrange for those to not be present in the string they're trying to quote.

(This is just one idea - perhaps there's other delimiters available?)

adkabo commented 2 years ago

@c42f

Perhaps a sensible and generic fix for these kind of woes is to allow more flexibility in the string delimiters for custom string macros?

See https://github.com/JuliaLang/julia/issues/38948

c42f commented 2 years ago

Ah yes thanks. I thought I'd seen a longer discussion of this somewhere but couldn't find it.