erlang / eep

Erlang Enhancement Proposals
http://www.erlang.org/erlang-enhancement-proposals/

Define verbatim sigil strings as truly verbatim #55

Closed RaimoNiskanen closed 11 months ago

josevalim commented 11 months ago

If you go ahead with the delimiter change for Erlang, I will open up a discussion to align Elixir with Erlang here and deprecate our escaping of closing delimiters. However, in Elixir, since sigils are user-defined, we probably shouldn't call them verbatim (although it would be nice to align on the naming as well).

Also, I am not sure if we should have verbatim regular expressions (~R). I would expect ~R/\d/ to literally match on the string ~B"\d" (i.e. verbatim) and not on any digit. A verbatim regex would always be equivalent to an exact string match which is why I don't see the benefit there.

Finally, I am a bit worried about introducing « as delimiter because I believe it will be the first time (as far as I know) that Erlang introduces non-ascii characters in its syntax/tokenizer. I don't know if it is a good or bad idea but I think it probably deserves a wider discussion around it.

essen commented 11 months ago

Finally, I am a bit worried about introducing « as delimiter because I believe it will be the first time (as far as I know) that Erlang introduces non-ascii characters in its syntax/tokenizer. I don't know if it is a good or bad idea but I think it probably deserves a wider discussion around it.

Probably deserves its own EEP to define best practices around using non-ascii characters in the language and documenting them.

That said, I'm all for « and → and a few others as that's what I've configured my editor to replace << and -> with when displaying files. Having « as a sigil delimiter, which will likely be fairly rare for quite some time, seems like a good way to introduce them.

zuiderkwast commented 11 months ago

Please drop the « » quotes. Not only is it non-ascii. They can be confused with similar Unicode symbols like 《 》. Similarly, if we add ‹ ›, they can easily be confused with ⟨ ⟩ and < >.

The guillemets can also be confusing given some languages use them in reversed order for quoting »...« (Danish, Hungarian, etc.) or even »...» (Finnish and Swedish). For more symbols we shouldn't add, see https://en.wikipedia.org/wiki/Quotation_mark#Summary_table.

Regarding regular expressions, if "verbatim" here means that backslashes are passed verbatim to the PCRE compiler, then that's what you always want. Nobody wants to write ~r/\\d/ to match a digit. (We do that today as "\\d" only because we have to write them in strings.) So if there are two variants ~R and ~r, only one will be useful and the existence of the other one will just be confusing.
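The double-escaping point is not specific to Erlang. As a rough analogy only (Python is used here for illustration and is not part of the proposal), the same regex needs a doubled backslash in an ordinary string literal, while a raw literal hands the backslash to the regex engine unchanged. The sketch also shows that a match on the literal two characters backslash-d needs explicit escaping, which is the "truly verbatim" reading discussed in this thread:

```python
import re

# Ordinary string literal: the string layer consumes one level of backslashes,
# so the regex engine only sees \d if we write \\d.
escaped = "\\d"

# Raw string literal: backslashes reach the regex engine as typed.
raw = r"\d"

# Both spellings produce the same two-character pattern: backslash, 'd'.
same_pattern = escaped == raw

# The pattern matches a digit, not the literal text backslash-d.
matches_digit = re.fullmatch(raw, "7") is not None
matches_literal = re.fullmatch(raw, "\\d") is not None

# An exact match of the two characters requires escaping the pattern itself.
literal_pattern = re.escape(raw)
literal_matches_text = re.fullmatch(literal_pattern, "\\d") is not None
literal_matches_digit = re.fullmatch(literal_pattern, "7") is not None
```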

essen commented 11 months ago

Please drop the « » quotes. Not only is it non-ascii. They can be confused with similar Unicode symbols like 《 》. Similarly, if we add ‹ ›, they can easily be confused with ⟨ ⟩ and < >.

They look very different to me. There are characters that can indeed be confused with others but I don't think these are in that category. I also assume the documentation would provide the Unicode character name and numbers.

The guillemets can also be confusing given some languages use them in reversed order for quoting »...« (Danish, Hungarian, etc.) or even »...» (Finnish and Swedish). For more symbols we shouldn't add, see https://en.wikipedia.org/wiki/Quotation_mark#Summary_table.

The language doesn't have to enforce that « » are in this specific order. It could have both pairs « » and » « and both characters as single start+end delimiters. But probably better to start small.

Regarding regular expressions, if "verbatim" here means that backslashes are passed verbatim to the PCRE compiler, then that's what you always want. Nobody wants to write ~r/\\d/ to match a digit. (We do that today as "\\d" only because we have to write them in strings.) So if there are two variants ~R and ~r, only one will be useful and the existence of the other one will just be confusing.

I think the difference will come when/if interpolation gets introduced. Which hopefully it won't.

josevalim commented 11 months ago

Regarding regular expressions, if "verbatim" here means that backslashes are passed verbatim to the PCRE compiler, then that's what you always want. Nobody wants to write ~r/\\d/ to match a digit.

Agreed, no one wants double escapes, but I would say that if ~R means “passed verbatim to PCRE”, one could say ~B means “passed verbatim to the character escaping”, which converts \n to new lines. In other words, saying that the contents are verbatim to some processor is confusing, because it means any implementation can behave differently. Verbatim should mean verbatim: \d should literally match \d and not a digit. So we agree on no double escaping but I am arguing it should be ~r. :)

RaimoNiskanen commented 11 months ago

Verbatim?

From the point of view of this EEP, it is trying to define what a sigil is and how it behaves, without knowing exactly about future sigil backends such as regular expressions.

From that point of view it is natural to call a sigil type "verbatim" when all characters up to the end delimiter are passed as they are through the sigil mechanism.

But as @josevalim has pointed out (a few times), that is not what the user, the programmer, wants to know. It is how the frontend+backend combination behaves that is interesting.

@josevalim: Since there are custom sigils in Elixir - remind me - how is the end of the content decided for different sigil types? I presume the customization implementation cannot affect how the content end is found as in; can the end delimiter be escaped, should the \ character be escaped, can there be a final \ character in the string content? But then the customization can decide on the escaping rules within the string content... There is a frontend/backend separation here that shines through no matter how much we would want to hide it.

As the ~S and ~B sigils are proposed in this PR, I think it is correct to call them verbatim, as seen from the user. If they would be customizable as in Elixir, when describing them to the customization implementer I think it is also valid to call them verbatim. The ~s and ~b sigils (~c and ~s in Elixir) would be harder to put such a property on. Their default implementations handle escape sequences, but to a customization implementer they are only almost verbatim in that there have to be rules for how to escape the end delimiter and possibly how to escape an escape char that is last in the string. Perhaps that is enough to call them not verbatim... So, it might be correct to call the ~S and ~B sigils "verbatim", for the end user, but the customization implementer needs a better explanation.

Regarding the regular expression sigil(s): as already said, there is no use in having the frontend+backend combination verbatim. Since the end char scanning rules are decided from the sigil name only, we can say that ~r and ~R differ in how they find the content end, that is, whether the end delimiter can be escaped, or we can skip end delimiter escaping altogether. Then the ~R name might be more appropriate, which is why I wrote that in the EEP, but we can simply decide that there is only ~r and it doesn't allow end delimiter escaping. The "verbatim" property shouldn't be on the table for regular expression sigils.

Anyway, neither ~r nor ~R will be implemented yet, as stated in the EEP, to get more time to figure these details out, and if we were to implement only one, it should be ~r.
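To make the frontend distinction above concrete, here is a minimal sketch in Python (not Erlang's actual scanner; the function names and the sample input are hypothetical). It contrasts the two ways of finding the content end: a verbatim scan stops at the first end delimiter, while an escapable scan lets a backslash protect the next character. In both cases the content is handed on unprocessed, so any interpretation of escapes is left to the backend:

```python
def scan_verbatim(source, start, close):
    """Return (content, next_index); the first `close` always ends the content."""
    end = source.find(close, start)
    if end == -1:
        raise ValueError("unterminated sigil")
    return source[start:end], end + 1


def scan_escapable(source, start, close):
    """Return (content, next_index); a backslash protects the next character."""
    i = start
    while i < len(source):
        ch = source[i]
        if ch == "\\":
            i += 2  # skip the backslash and whatever it escapes
            continue
        if ch == close:
            return source[start:i], i + 1
        i += 1
    raise ValueError("unterminated sigil")


# Hypothetical input: the text that follows an opening "/" delimiter.
text = r"a\/b/ rest"
verbatim_content, _ = scan_verbatim(text, 0, "/")
escapable_content, _ = scan_escapable(text, 0, "/")
```

With this input the verbatim scan ends at the first slash, so its content ends in a backslash, while the escapable scan continues to the second slash. This also shows why a verbatim form can end in a backslash and an escapable form cannot.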

«quote chars»

@zuiderkwast: I had no idea that they were used as »quote« in a number of languages, and certainly not »quote». It seems to be just Finnish that only uses »quote» (as an alternative to "quote"). Swedish seems to also allow »quote«. Therefore it seems to be safe to say that there is a minimal minority (Finnish) that uses »quote».

There would be no problem to add » « as an alternative to « » since it is like that in a number of languages. We want a start delimiter to decide the end delimiter to not have to search for more than one.

The "right/left-pointing double angle quotation mark"s are in latin1 (ISO 8859-1). The latin1 range is the character range that Erlang always has been defined in. The letters in latin1 (above 127) are allowed in variable names and unquoted atoms, so they are already in the syntax. But they haven't been used for keywords and such before.

I see no technical problem in using them as delimiters. They were actually considered for the binary syntax instead of << and >>, but then it was regarded too strange to force users into figuring out how to type them on all keyboards. The quirkiness of the new << and >> operators was considered acceptable.

In this case it is optional to use « », so nobody has to, until you get to modify someone else's code and then you copy and paste, until you resign and figure out how to type them.

josevalim commented 11 months ago

Since there are custom sigils in Elixir - remind me - how is the end of the content decided for different sigil types?

The choice of lowercase/uppercase decides at the tokenizer level if interpolation is enabled and Elixir only handles the escaping of the closing delimiter (which I believe we should align with Erlang, as per the previous message). Everything else is handled by the sigil implementation.

The "verbatim" property shouldn't be on the table for regular expression sigils.

Agreed. It would be nice if we could call all uppercase sigils "verbatim" though. I understand now that you used ~R to remove the need to escape the closing delimiter but, honestly, I don't think escaping the closing delimiter is a big deal for regexes. Almost all languages have a single delimiter /, which everyone escapes, and we already have several delimiters to avoid conflicts. I'd say that we are pretty well covered and ~r gets my vote, but that's for another day. :)

zuiderkwast commented 11 months ago

The "right/left-pointing double angle quotation mark"s are in latin1 (ISO 8859-1).

Fine then. :-)

@zuiderkwast: I had no idea that they were used as »quote« in a number of languages, and certainly not »quote». It seems to be just Finnish that only uses »quote» (as an alternative to "quote"). Swedish seems to also allow »quote«. Therefore it seems to be safe to say that there is a minimal minority (Finnish) that uses »quote».

I think it's safe; nobody will be confused. I have seen this style in Swedish books though. It's not that uncommon. Do you have any old printed books around?

The quotation marks article on Swedish Wikipedia has this reference: ^ Svenska skrivregler. Språkrådets skrifter 8 (3. utökade utgåvan). Stockholm: Liber. 2008. ISBN 978-91-47-08460-9. ”I svensk text är inledande och avslutande tecken av tradition oftast riktade åt samma håll, med spetsen åt höger: »...», men det förekommer också att de är riktade mot varandra: »...«.”

It's not that easy to find scanned books online but here are two screenshots that I found:

image (from http://www.eom.nu/wp-content/uploads/2018/05/sandebud-1937-del-1.pdf)

image (from https://www.hembygd.se/nassjo/gesallprovet-nassjo-tryckeriet-25-ar)

Though the more common in Swedish are ”…”, both pointing in the same direction, not “…” as in English or „…“ as in German.

RaimoNiskanen commented 11 months ago

@zuiderkwast

I svensk text är inledande och avslutande tecken av tradition oftast riktade åt samma håll, med spetsen åt höger: »...», men det förekommer också att de är riktade mot varandra: »...«.”

That probably explains why the Finnish has »that» too. Old Swedish influence.

Edit: Found one. "Illiaden | Odysén", printed 1963. Uses »...». Bummer! I argue against the traditions of my motherland :-(

michalmuskala commented 11 months ago

I think introducing into the language syntax characters that aren't easily available on most keyboards is a mistake - I don't think it would look like a serious language feature if I need to copy the characters from documentation just to use it.

RaimoNiskanen commented 11 months ago

@michalmuskala: "easily available" and "most keyboards" are grey zones, and "need to copy" might be a bit lazy. It should be a solvable problem, if even a problem. One can choose to use other delimiters.

RaimoNiskanen commented 11 months ago

The great advantage of having « » as delimiters is that they almost certainly don't collide with the content, just because they are so uncommon.

michalmuskala commented 11 months ago

At least the US keyboard layout (and all others based on it), don't have the character easily available. This already excludes large swaths of the programmer population.

RaimoNiskanen commented 11 months ago

Google says that they are at [AltGr] [[] and [AltGr] []] on a US International layout. When using X Compose they are at [Compose] [<] [<] and [Compose] [>] [>].

essen commented 11 months ago

Google says that they are at [AltGr] [[] and [AltGr] []] on a US International layout. When using X Compose they are at [Compose] [<] [<] and [Compose] [>] [>].

On French keyboards [AltGr] [z] and [AltGr] [x]. «». Easy.

RaimoNiskanen commented 11 months ago

On French keyboards [AltGr] [z] and [AltGr] [x]. «». Easy.

I hear it is the same on Swedish International.

michalmuskala commented 11 months ago

Google says that they are at [AltGr] [[] and [AltGr] []] on a US International layout.

On my Polish keyboard I get „‚, and setting to "US International" I get “‘, so doesn't seem to be working.

And this is kind of my point - if I have to google how to type in the programming language's syntax - it's already failing at providing good syntax.

essen commented 11 months ago

And this is kind of my point - if I have to google how to type in the programming language's syntax - it's already failing at providing good syntax.

But ~ isn't on all keyboards either... First Japanese keyboard I searched doesn't have it. Some do. Some don't. We can't accommodate all keyboards. We should worry about whether it is accessible enough, not about whether you have to search how to input a character the first time you're using it.

erszcz commented 11 months ago

I second @michalmuskala's doubts about « and » delimiters. @RaimoNiskanen, as you pointed out, these characters are in Latin-1 character set, but outside the ASCII range, and the "easy enough to type" votes here, at least so far, are from speakers of Latin-1 encodable languages.

To give a counterexample, these characters are not easily typable on Polish keyboards. I imagine it's similar for any other users of Latin-2-suited keyboards, i.e. all of Central and Eastern Europe.

wojtekmach commented 11 months ago

The great advantage of having « » as delimiters is that they almost certainly don't collide with the content, just because they are so uncommon.

If this is the primary use case, I'd consider ` instead. It's a very good delimiter for Markdown code spans and blocks exactly because it almost never collides with content. :)

cc @josevalim

essen commented 11 months ago

To give a counterexample, these characters are not easily typable on Polish keyboards. I imagine it's similar for any other users of Latin-2-suited keyboards, i.e. all of Central and Eastern Europe.

It's in the Latin-2 set though?

174 AE  Left-pointing double angle quotation mark
175 AF  Right-pointing double angle quotation mark

Just to be clear, these characters are not shown on my keyboard either, I simply pressed [Alt Gr] and tried every keys until I found where they are. But ultimately they ended up easy enough to type (just not obvious). How do you type these on Polish keyboards? If it's not a simple [Alt Gr] it might require [Shift] + [Alt Gr] which, while less convenient, is still within acceptable bounds IMO, for a character that will be sparsely used.

ferd commented 11 months ago

As far as I can tell, there's a kind of reason a lot of languages ended up with literal strings being declared as heredoc strings something like:

<<<PAT
unescaped content
PAT;

if only because if you're not gonna allow escaping, you're going to always have edge cases, so what they all end up doing is having a configurable delimiter with unmistakable syntax, which nobody fully likes because of how much room it takes.

Picking other fixed delimiters is always going to inherently trade off the experience of some users in some contexts. It doesn't matter if you pick ~"..."~, or `...`, or « ... », or even XML's <![CDATA[...]]> really. The moment you've decided to drink the poison of unescaped strings, you've got to deal with some weird ass syntax that's either:

RaimoNiskanen commented 11 months ago

@wojtekmach: We have had long internal discussions, which have homed in on the idea that the delimiters "should look like delimiters", as in vertical lines: | / but not \ since it is used as an escape char, or "be parentheses": () [] {} <>, or already "be quote chars": " '.

# would be too useful to not include since it is a comment in many target contents.

And we don't want chars that are easily mistaken so not ¦ (latin1), not ´ (looks like '), so not `.

But ` has the advantage that it is already used as quote char in e.g Markdown, and it is 7-bit ASCII.

Erlang has already been defined for latin-1. All latin-1 letters are allowed in variable names and unquoted atoms. Still almost nobody uses that. Probably to not exclude e.g latin-2 programmers.

I'll sleep on this, but since latin-1 is particular to western Europe, some characters exclude eastern Europe and large parts of the rest of the world. This may be the argument against allowing latin-1 for syntax that I have been missing.

But ~ is 7-bit ASCII, for the record...

And, @essen: « and » are not in latin-2.

I'll be back.

RaimoNiskanen commented 11 months ago

@ferd: Quite right.

We have landed on our Here-documents, "triple-quoted strings" that allow a number of " characters as the start delimiter and the end delimiter has to have the same number of " characters, and has to be first on a line (after white space). So the start and end delimiters can be chosen to not collide with the content.
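The closing rule described above can be sketched as follows. This is a simplified Python model, not the Erlang scanner: it assumes the opening run is at least three quotes and stands alone on its line, it does not strip indentation, and the function name is hypothetical.

```python
def scan_triple_quoted(source):
    """Return the content lines between matching runs of double-quote characters.

    Assumptions of this sketch: the opening run stands alone on the first
    line, and indentation is not stripped from the content.
    """
    lines = source.split("\n")
    opener = lines[0]
    count = len(opener)
    if count < 3 or opener != '"' * count:
        raise ValueError("expected an opening run of at least three quotes")
    for index, line in enumerate(lines[1:], start=1):
        stripped = line.lstrip()
        # The closing run must be first on its line and exactly as long
        # as the opening run; shorter or longer runs stay in the content.
        run = len(stripped) - len(stripped.lstrip('"'))
        if run == count:
            return lines[1:index]
    raise ValueError("unterminated triple-quoted string")


# Four quotes open the string, so runs of three quotes are plain content.
sample = '""""\nsay """hi"""\n  """ still content\n""""'
content = scan_triple_quoted(sample)
```

Because the writer picks the length of the quote run, a collision with the content can always be avoided by adding one more quote to both delimiters.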

We also have non-verbatim strings where all \ char and the end delimiter char have to be escaped. And the end delimiter can be chosen from this set we are discussing.

So now we are trying to find the best set of delimiters for verbatim strings where the end delimiter char cannot be escaped. So it is a corner case, but worth to try finding something "optimal".
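Since a verbatim string cannot escape its end delimiter, the practical question for the writer is which delimiter does not occur in the content. A minimal Python sketch of that choice follows. The candidate characters are the ones mentioned in this thread, not a confirmed final set, and the ordering and function name are assumptions of the sketch:

```python
# Candidate delimiters mentioned in this thread; the final set is up to the EEP.
SINGLE = ['"', "'", "/", "|", "`", "#"]
PAIRED = {"(": ")", "[": "]", "{": "}", "<": ">"}


def pick_delimiters(content):
    """Return (open, close) for the first candidate whose closer is absent."""
    # Single delimiters close with the same character that opened them.
    for opener in SINGLE:
        if opener not in content:
            return opener, opener
    # For paired delimiters the start delimiter decides the end delimiter,
    # and only the closer must be absent from the content.
    for opener, closer in PAIRED.items():
        if closer not in content:
            return opener, closer
    return None  # every closer collides; a triple-quoted string is the fallback


plain = pick_delimiters("plain text")
quoted = pick_delimiters('say "hi"')
crowded = pick_delimiters('"\'/|`#)')
```

The larger the candidate set, the less often the fallback is needed, which is the trade-off being weighed when deciding whether to include characters such as the backtick.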

erszcz commented 11 months ago

@essen

It's in the Latin-2 set though?

It's in code page 852 and Windows code page 1250, which overlap with Latin-2, but are different encodings, and ISO 8859-2 aka Latin-2 is yet a different one.

How do you type these on Polish keyboards?

The point is I can't :| At least not on a Mac, maybe it's different on Windows due to the above ISO vs MS differences.
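The encoding claims above can be checked with a short Python sketch that uses the standard library codecs (the helper name is hypothetical). It tests whether both guillemets are representable in each character set mentioned in the thread:

```python
GUILLEMETS = "\u00ab\u00bb"  # « and »


def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# ISO 8859-1 (latin_1), ISO 8859-2 (iso8859_2), and the two code pages
# that are sometimes loosely called "Latin-2".
support = {
    name: encodable(GUILLEMETS, name)
    for name in ("ascii", "latin_1", "iso8859_2", "cp852", "cp1250")
}

# The backtick, by contrast, is plain 7-bit ASCII.
backtick_is_ascii = encodable("`", "ascii")

# In code page 852 the left guillemet is byte 0xAE (decimal 174).
left_in_cp852 = "\u00ab".encode("cp852")
```

The guillemets are representable in Latin-1, CP852, and Windows-1250, but not in ASCII or ISO 8859-2, which matches the distinction drawn above.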

essen commented 11 months ago

maybe it's different on Windows due to the above ISO vs MS differences.

Ah I was looking at the wrong "Latin-2" (CP852), my bad. Sounds like those characters may not be a good fit for general use then.

RaimoNiskanen commented 11 months ago

I need to apologize - I was apparently under the delusion that latin-1 was more universal than it is, but it is just one of the latin-* siblings. It is not a common denominator despite its status as the base of Unicode. 7-bit US ASCII is the common denominator. (Not entirely, but still...)

Guillemets are out, and backtick is in, just because it is in 7-bit US ASCII, uncommon, and used in e.g Markdown for this purpose.

Sorry for the noise, and thank you for the counter noise :-)

I will also write something in the EEP about why latin-1 is a bad choice, even though it is the character set that Erlang is defined in. It is one thing to allow variables and atoms to contain characters that may be unusable in other languages, then you should be aware of not being international. But to be lured into using a quote character just because it is allowed and available on your keyboard is a worse matter. Just as I almost did in this EEP.

josevalim commented 11 months ago

Guillemets are out, and backtick is in, just because it is in 7-bit US ASCII, uncommon, and used in e.g Markdown for this purpose.

Excellent!

May I add just a tiny bit of noise? As far as I know, backticks are not used anywhere in Erlang and I think we should reserve them for future use. Imagine we need a new syntax for something in 10 years and backticks are no longer an option due to sigils. It is a bit silly and sigils are already gated with a ~ prefix... but it doesn't cost much to leave them out given we already have | / # " ' as single delimiters.

Disclaimer: backticks have no use in Elixir either, for similar reasons.

essen commented 11 months ago

I was wondering the same thing, I think only ^ & # and the backtick remains. But I don't think it's a problem with sigils because the ~ modifies the meaning of the character. Most sigil delimiters already have other meanings outside of sigils, so if that's OK, then it should be OK to use # or the backtick outside of sigils too in the future.

RaimoNiskanen commented 11 months ago

As @essen says, all suggested delimiters have other meanings in Erlang today. Only ` ^ & remain without meaning, now that we have used ~ for the sigils. Since ~ is an escape that changes the meaning of the delimiter character there should be no problem to allow characters that might have future meaning.

@josevalim: I agree that it itches a bit to add `, one of 3 remaining, but since it is used for this purpose in other languages it is nice to have, and it should not be a future problem... Right?

josevalim commented 11 months ago

and it should not be a future problem... Right?

It depends on how safe you want to be. Not adding it is 100% conflict-free. Adding it is less than 100%... but probably safe enough. :D Either way is fine, I just thought I would mention it for completeness.

RaimoNiskanen commented 11 months ago

Not adding it is 100% conflict-free. Adding it is less than 100%... but probably safe enough.

It is a valid point. We cannot be 100% certain in predicting that no future syntax suggestion will ever collide with the sigil syntax delimiter handling.

But I hope we can be sure that any future syntax suggestion can be designed to not collide with the ` character use in sigils.

And I do hope that anyone that can see such a danger will see it now, before OTP-27...