eclipse-archived / ceylon

The Ceylon compiler, language module, and command line tools
http://ceylon-lang.org
Apache License 2.0

Single-quoted strings #3500

Closed CeylonMigrationBot closed 8 years ago

CeylonMigrationBot commented 12 years ago

[@gavinking] So it's time to answer some questions about single-quoted literals. I've recently decided that there's no reason for a special Quoted type, and that single quotes should just be a different way to write a String literal, like in JavaScript and some other languages. So, how do single quoted literals differ from double quoted literals?

[Migrated from ceylon/ceylon-spec#394] [Closed at 2013-01-06 15:17:02]

CeylonMigrationBot commented 12 years ago

[@quintesse] - So how can single quoted literals contain single quotes?

    value str = '   ' '''
       some text
    ''';

CeylonMigrationBot commented 12 years ago

[@RossTate] So, just to check that I'm understanding things correctly, this is like @"string" for C#?

Honestly, sometimes I've wondered why the default is to have escape characters, and why I can't choose what my escape character is.

As for multi-line, I definitely recall times where I've wanted to be able to just copy and paste from some file to another and not worry about escaping or dealing with whitespace or anything. Many editors don't make it convenient to insert the appropriate leading whitespace.

No solutions, just relaying personal experience.

CeylonMigrationBot commented 12 years ago

[@gavinking] Tako, I think we should support """" (a literal ") and '''' (a literal ').

CeylonMigrationBot commented 12 years ago

[@quintesse] In double quoted strings as well? But there we can just use the escape, right?

CeylonMigrationBot commented 12 years ago

[@gavinking] The first example is a double quoted string containing a double quote. The second example is a single quoted string containing a single quote. \' and \" would end up being more of a legacy thing.

CeylonMigrationBot commented 12 years ago

[@quintesse] Well we don't need \' in a double quoted string and a single quoted string cannot have any escapes, so a double ' is almost the only way to insert single quote. But we don't need two ways of handling inserting a " in a double quoted string right? Especially because there might be a bunch of other escapes anyway \r, \n, \t, {#} ... which reminds me, if we can't have escapes how do we handle unicode in single quoted literals? Only by actually entering unicode characters in the file like that?

CeylonMigrationBot commented 12 years ago

[@gavinking] Let's go back to why we need single-quoted strings in the first place: we need them because in a double quoted string, \ is a special character, which means that for a couple of very common minilanguages (regexes, and windows paths) you wind up with a whole lot of disgusting \\, and even \\\\.

As pointed out by Ross, one solution to this problem might be to allow the user to specify a custom escape character, as in:

"\w+(/t\w+)*":/
"^{#00E5}ngstr^{#00F6}ms":^

or something like that. But to me this is pretty amazingly cryptic, and also seems more flexible than what we really need. It's also repetitive: we have to type :/ after every Windows path and after every regex?

Another, more radical, solution might be to eliminate the traditional C-style backslash escapes altogether, and use interpolation together with a modified character literal format instead:

"\w+(" tab "\w+)*"
"" '#00E5' "ngstr" '#00F6' "ms"

I must admit that I find this approach extremely appealing, but I assume that taking away people's precious \n and \t is going to result in howls of protest. A more technical problem is string literals that appear in annotations: these aren't expressions, and don't currently support interpolation. (The second problem is perhaps solvable, not sure about the first.)

So as some kind of compromise, Ceylon has two string literal formats, one of which supports the traditional escapes, together with the new \{unicode} escape, and a separate format which does not have escapes at all. That is, the only reason we really need single-quoted literals is to have a literal format where \ is an ordinary character.

So we can write our examples in two ways:

"\\w+(\t\\w+)*"
"\{#00E5}ngstr\{#00F6}ms"

or:

'\w+(' `\t` '\w+)*'
'' `\{#00E5}` "ngstr" `\{#00F6}` "ms"

Now, I definitely don't find this as clean as the "radical" solution above, but it does solve the marketing problem of letting people write \n\t. :-/
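
For readers who want to see the backslash problem concretely, here is a minimal sketch in Python rather than Ceylon. Python's `r"..."` raw literals are used only as a stand-in for the escape-free literal format being discussed; nothing here reflects actual Ceylon syntax or behaviour.

```python
import re

# Escaped form: every backslash meant for the regex engine must be doubled,
# while \t is expanded to a real tab by the string lexer.
escaped = "\\w+(\t\\w+)*"

# Raw form: backslashes are ordinary characters, so \t reaches the regex
# engine as two characters and the engine interprets it as a tab.
raw = r"\w+(\t\w+)*"

# The two literals are different strings but describe the same pattern.
assert escaped != raw
assert re.fullmatch(escaped, "foo\tbar")
assert re.fullmatch(raw, "foo\tbar")

# Windows paths show the same doubling.
assert "C:\\Hello\\World" == r"C:\Hello\World"
```

The point of the comparison is that only the escaped form forces the author to think about two layers of interpretation at once.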

CeylonMigrationBot commented 12 years ago

[@gavinking] Hrm, interesting, a different kind of compromise solution just occurred to me: we could make escape sequence interpretation be the responsibility of a method on String, instead of something that happens at compile time. So people could still write strings containing \t, \n and friends, but they would need to call the method explicitly:

print("Hello!\nGoodbye!".raw())
Regex("\w+(/t\w+)*".raw('/'))

This idea is growing on me. The only major downside I see is that you might quite often find yourself wondering "does this annotation/attribute accept/return a raw string or an escaped string?".
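
A rough Python sketch of what runtime escape interpretation might look like. The function name `interpret_escapes`, its optional escape-character parameter, and the set of supported escapes (n, t, r, and the escape character itself) are all assumptions made for illustration; they are not the proposed Ceylon API.

```python
def interpret_escapes(text: str, escape: str = "\\") -> str:
    """Expand a small set of C-style escapes at runtime.

    Hypothetical stand-in for the raw() method floated above.
    """
    table = {"n": "\n", "t": "\t", "r": "\r", escape: escape}
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text) and text[i + 1] in table:
            # Recognised escape: emit the character it stands for.
            out.append(table[text[i + 1]])
            i += 2
        else:
            # Anything else, such as \w, passes through untouched.
            out.append(ch)
            i += 1
    return "".join(out)
```

With this sketch, `interpret_escapes(r"\w+(/t\w+)*", "/")` leaves the regex backslashes alone and turns only `/t` into a tab, which is the effect the second example above is after.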

CeylonMigrationBot commented 12 years ago

[@RossTate] I kind of like that last idea. The problem, though, is that we need some way to escape quotes.

As for my proposal, you could always make no escaping the default, so then you don't need :\ after each string.

CeylonMigrationBot commented 12 years ago

[@gavinking] > The problem, though, is that we need some way to escape quotes.

I think quote-escaping is not a big problem. A number of languages use "" as an escaped double quote and to me that is convenient, readable, and natural. A few languages even give you the choice between:

"Gavin's dog says ""Woof""."

And:

'Gavin''s dog says "Woof".'

Which I find pretty nice, personally.

> As for my proposal, you could always make no escaping the default, so then you don't need :\ after each string.

True. Then it would be the compile-time version of the raw() method, and have some of the same advantages.

Also not a bad option, I suppose. What would be the best syntax? A : or something else?
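
To make the doubled-delimiter rule concrete, here is a small Python sketch of a decoder for such literals. The function name `unquote_doubled` and its error handling are hypothetical; it only illustrates the rule that a quote character inside the literal is written twice.

```python
def unquote_doubled(literal: str) -> str:
    """Decode a literal whose delimiter is escaped by doubling it."""
    if len(literal) < 2 or literal[0] not in "'\"" or literal[-1] != literal[0]:
        raise ValueError("not a quoted literal")
    quote = literal[0]
    body = literal[1:-1]
    # Any delimiter left after removing doubled pairs is unescaped.
    if quote in body.replace(quote * 2, ""):
        raise ValueError("unescaped delimiter inside literal")
    return body.replace(quote * 2, quote)
```

Both spellings from the comment above decode to the same text under this rule, whichever delimiter is chosen.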

CeylonMigrationBot commented 12 years ago

[@RossTate] I vote for -! It's like you're subtracting the character, heheh. If there's no escape character (i.e. whitespace), then there's no escape except we convert your {...} stuff to Unicode characters. Do we need something for escape character but no Unicode conversion?

CeylonMigrationBot commented 12 years ago

[@gavinking] P.S. Note that the raw() method approach would go along with the following simplification of the character literal format:

CeylonMigrationBot commented 12 years ago

[@gavinking] > I vote for -! It's like you're subtracting the character, heheh.

I agree that - looks nice, but it's also already an infix operator, which makes me a little uncomfortable (yes, I do realize that you can't subtract from a String, but still...).

CeylonMigrationBot commented 12 years ago

[@RossTate] I imagine it shouldn't be a problem to have the sequence "- handled specially by the parser.

CeylonMigrationBot commented 12 years ago

[@gavinking] No, it's not a real problem. Nevertheless, I'm still very doubtful that this is the sort of thing I want to see in Ceylon code:

print("Hello!\nGoodbye!"-\);

CeylonMigrationBot commented 12 years ago

[@gavinking] So I need more feedback from you guys on this one.

Frankly, I'm starting to take seriously the idea of simply eliminating \ escapes.

CeylonMigrationBot commented 12 years ago

[@chochos] I don't think I'll miss backslash escapes. However, I absolutely hate double-double quotes for escaping quotes. That path leads to madness; That double-quoting thing that's so popular in SQL for example, well it's so damn ugly that certain DBMS's have alternatives for it (PostgreSQL lets you do either 'don''t' or E'don\'t', so they basically went back to backslash escapes).

I agree that \n is pointless in Ceylon because we have multiline strings, but \t is unambiguous and I think we need something for that (I mean, when you look at a bunch of whitespace, you can't tell if it's spaces or tabs just by looking at it, so \t is very useful there).

Maybe the constants are a good solution for double-quotes and tabs and such, but they can be quite verbose.

CeylonMigrationBot commented 12 years ago

[@gavinking] > I absolutely hate double-double quotes for escaping quotes. That path leads to madness;

Eh? really? What madness? It always seemed very sane to me—it's not like 'don''t' could possibly mean anything else...

> Maybe the constants are a good solution for double-quotes and tabs and such, but they can be quite verbose.

Well, \t vs "ht" is four characters instead of two. Now, sure, you're more likely to type it as " ht " but that to me is an advantage not a disadvantage. I don't feel like the tab character is such a common thing these days that I need to be able to express it as two characters rather than four. In fact, I struggle to think what it is even used for anymore in these days of markup languages...

CeylonMigrationBot commented 12 years ago

[@chochos] When the escaped quotes are at the beginning or end of a string it gets very ugly very fast:

"SELECT * FROM ""namespace.table""" or "to escape quotes, type them twice like """"this"""""

CeylonMigrationBot commented 12 years ago

[@gavinking] @chochos We could, following JavaScript and XML, continue to accept either single quotes and double quotes, but with no difference in semantics. For example:

"Gavin's dog " + 'says "Woof".'

Or:

"Gavin's dog says " '"Woof"' "."

I'm pretty certain that it's almost vanishingly uncommon to have a string with a mix of " and ' embedded in it in an expression in regular code.

On the other hand, in markdown doc annotations, it's going to be pretty damn common to want to type code examples with a mix of double and single quotes. I don't like the option of \" at all for this purpose, and I think that "" is much better.

Note that markdown doc annotations, along with regexes and windows paths, is another place where you're going to get into the madness of \\ and \\\\—which I think is a far greater madness than "" and """".

CeylonMigrationBot commented 12 years ago

[@gavinking] > When the escaped quotes are at the beginning or end of a string it gets very ugly very fast:

Yeah but this is just the same problem that you already have with \. I suppose your argument is going to be that " occurs much more often in code and regular text than \ does. But my counter to that is that in C-style string literals you need to be careful with both \ and ", whereas in my suggestion, you only need to be careful of " in double-quoted strings, and of ' in single-quoted strings.

CeylonMigrationBot commented 12 years ago

[@tombentley] I've said it before (and if I got an answer, I'm afraid I forget, so forgive me), but: Why doesn't the language spec mandate that Ceylon source code is UTF-8 encoded? Then the need for stuff like '#212B' just disappears.

Having said that, I do like the idea of allowing a unicode character name ('ANGSTROM SIGN') as a character literal. That is a lot more readable for those of us who don't have Unicode memorized. It would of course be a compile time error if such a character name didn't exist.

I've never liked the doubled-up-quotes way of quoting. I'm sure it's the sort of thing you get used to, but we'd stand open to the accusation of changing things for the sake of it (which we're accused of a lot already).

I'd really like to be able to get the compiler to generate a single String literal for something like

"Hello!" ht "Goodbye!"

which we can only do if ht and friends live in ceylon.language, and that doesn't really scale to a large number of characters. But I guess we'd only need a handful of things because of the unicode names as character literals idea.

> \n is pointless in Ceylon because we have multiline strings

Well, thing is on windows we'd expect

"foo
bar"

to be foo\r\nbar, so we'd still need a way to produce plain \n in those cases where that's what you really meant.

CeylonMigrationBot commented 12 years ago

[@gavinking] > Well, thing is on windows we'd expect ... to be foo\r\nbar

No, for portability reasons, we always produce a \n here, if I recall correctly.

CeylonMigrationBot commented 12 years ago

[@gavinking] Here's how I see it. Let's consider the canonical case of a string literal in a code example in a doc annotation. This is going to be the most likely thing you would have to deal with.

We have:

print(\"My name is \\\"Gavin\\\".\");

vs:

print(""My name is """"Gavin""""."");

And, FTR, you would also have the choice of using single quotes in your doc annotation and typing just:

print("My name is ""Gavin"".");

To me it's very difficult to see how the double-double solution is worse...

CeylonMigrationBot commented 12 years ago

[@tombentley] > No, for portability reasons, we always produce a \n here, if I recall correctly.

Well, that's not very helpful in the common case of wanting to output a string with platform-specific line separators, which is what you want most of the time. This seems to reduce the utility of allowing newlines in String literals.

I wonder if something like the following could be made to work (borrowing the raw() idea):

"foo
bar".newlines(`\n`)

(So the default behaviour for multiline strings would be to use the platform specific line separator, transforming at runtime if necessary). Just a thought...

CeylonMigrationBot commented 12 years ago

[@gavinking] I guess one reason why I prefer ""Hello"" to \"Hello\" and """"Gavin"""" to \\\"Gavin\\\" is that the double-double option is visually symmetric, which is the right thing here, I think.

CeylonMigrationBot commented 12 years ago

[@gavinking] > Well, that's not very helpful in the common case of wanting to output a string with platform-specific line separators, which is what you want most of the time. This seems to reduce the utility of allowing newlines in String literals.

I'm not really convinced by this. You get the same behavior by writing a literal with \n in it...

> I wonder if something like the following could be made to work...

Well, sure, today you can write it like this:

"foo
 bar".replace("\n", process.newline)

But there are more efficient ways to write/implement it.
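
The same conversion, sketched in Python for comparison. The helper name `with_platform_newlines` is made up for this example, and `os.linesep` plays the role that `process.newline` plays in the Ceylon snippet above.

```python
import os


def with_platform_newlines(text: str, newline: str = os.linesep) -> str:
    # Normalise any existing \r\n first so that applying the conversion
    # twice does not produce \r\r\n.
    return text.replace("\r\n", "\n").replace("\n", newline)
```

Passing the separator explicitly, for example `with_platform_newlines("foo\nbar", "\r\n")`, keeps the behaviour independent of the machine the code runs on.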

CeylonMigrationBot commented 12 years ago

[@RossTate] For escaping quotes, could we do \"Gavin says "Hello".\", or something else along the lines of making the "end of string" marker be two characters rather than one?

CeylonMigrationBot commented 12 years ago

[@gavinking] > For escaping quotes, could we do \"Gavin says "Hello".\", or something else along the lines of making the "end of string" marker be two characters rather than one?

Sure, I suppose any of the following would be pretty easy for the lexer to deal with, and not step on anything we're likely to want to do later:

\\Gavin says "Hello".//
\"Gavin says "Hello"."/
\'Gavin says "Hello".'/

or:

\\Gavin says "Hello".\\
\"Gavin says "Hello"."\
\'Gavin says "Hello".'\

CeylonMigrationBot commented 12 years ago

[@tombentley] > I'm not really convinced by this. You get the same behavior by writing a literal with \n in it...

Well, yes you do. But you're making it even easier to do what is usually not the right thing. It just seems silly to me that the default will be incorrect for an OS which (although I despise it) has a pretty big market share.

CeylonMigrationBot commented 12 years ago

[@FroMage] Not much to say here but a few points:

CeylonMigrationBot commented 12 years ago

[@FroMage] Oh and I like the idea of allowing single and double quotes for strings to replace the delimiter which has to be quoted like in JavaScript but it does take single quoted literals away from us

CeylonMigrationBot commented 12 years ago

[@gavinking] > I always hated viscerally the SQL double quote quoting

OK, well, I don't really understand this at all, but if enough people feel so strongly in this direction, that's a reason in itself to not go down that path.

> \nrt and friends are a lot easier on the eyes than their non-visible screen effects, when you care precisely what kind of white space you are describing.

Certainly, no argument there. All I'm saying is that:

It's not that I have a particular problem with \t, \n, etc, (though FTR I don't love how they tend to run into the following word) it's just that we have these things like regexes and windows paths where having to escape the backslash is a real problem.

And I'm not proposing to take away people's \n\t entirely ... at most I'm suggesting that it would not be the thing you get by default.

> perl has this interesting feature for regexes

Yeah, I know about that one. It always seemed a little too flexible to me. I was hoping for a slightly simpler solution.

> if our regexes can be expressed in a syntax where the metacharacters don't need silly quoting like Java strings we're good, but it's likely we will want to keep the string escapes there such as \n, not replace them but extend them, like with \w.

Let's keep a clear distinction here:

For traditional regexes, I don't propose to change the syntax at all. For the language-specified pattern matching language what I'm proposing is to use built-in rules. For example: letter, digit, word, integer, float. There would be no issue with \ here.

> Oh and I like the idea of allowing single and double quotes for strings to replace the delimiter which has to be quoted like in JavaScript

Right, this is something I found really surprisingly convenient in JavaScript, which is I guess why I'm trying to push you guys on this issue of the backslash escapes. The thing is that I think I would get much more convenience from having the choice between 'string' and "string" than I get from \n and friends.

CeylonMigrationBot commented 12 years ago

[@gavinking] How about this:

So:

print(\"Hello!\n\tGoodbye!");

prints:

Hello!
    Goodbye!

but:

print("C:\Hello\World");

prints:

C:\Hello\World

This is still not quite a perfect solution as far as embedding code samples in doc annotations goes, but it's pretty good. You have your pick of:

doc 'You can print a string 
     literal like this:

          print("Hello!");'

And

doc "You can print a string 
     literal like this:

          print('Hello!');"

And even

doc \"You can print a string 
      literal like this:

           print('Hello!');

      or like this:

           print(\"Hello!\");"

CeylonMigrationBot commented 12 years ago

[@gavinking] Alternative to initial backslash is:

doc \"You can print a string 
      literal like this:

           print('Hello!');

      or like this:

           print(\"Hello!\");"\

of course. Not sure which I prefer. Perhaps the trailing \ looks a bit "unfinished".

CeylonMigrationBot commented 12 years ago

[@gavinking] OK, so here's an idea that combines Ross' idea of configgable escape characters with an idea I toyed with briefly and discarded. We already have the notion that the # and $ characters are somehow used to select between formats (that's what we're doing with binary/hex literals).

So what if I could write:

import ceylon.time { $ = Date, \ = Time }
import ceylon.regex { # = createPattern }

Then a literal prefixed with $ would be parsed by Date(), a literal prefixed with \ would be parsed by Time(), and a literal prefixed with # by createPattern():

Date date = $'2012-10-3';
Time time = \'10:30 PM CST';
Pattern pattern = #"\d*(\w+ )*\w\d"; 

Therefore, by default, I can write #FF0011, and $110001101, and \"Hello!\n\tGoodbye!". Tell me that does not kick ass.

WDYT?
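
A loose Python analogy of the dispatch idea, for anyone trying to picture it: a per-file table binds each prefix character to a parsing function, and a literal's prefix selects the parser. The table `LITERAL_PARSERS`, the function `parse_literal`, and the choice of Python standard-library parsers are all assumptions for illustration; in the proposal the binding would happen in the import statement and be resolved by the compiler.

```python
from datetime import date, time
import re

# Hypothetical per-file bindings, mirroring the import aliases above.
LITERAL_PARSERS = {
    "$": date.fromisoformat,
    "\\": time.fromisoformat,
    "#": re.compile,
}


def parse_literal(prefix: str, text: str):
    """Hand the literal text to whichever parser is bound to the prefix."""
    try:
        parser = LITERAL_PARSERS[prefix]
    except KeyError:
        raise ValueError(f"no parser bound to prefix {prefix!r}") from None
    return parser(text)
```

Note that Python's ISO parsers are stricter than the formats in the examples above, so this sketch expects `2012-10-03` and `22:30` rather than `2012-10-3` and `10:30 PM CST`.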

CeylonMigrationBot commented 12 years ago

[@gavinking] A really surprising consequence of this proposal is that I can write the following:

print(\Hello);

Another consequence is that we would need a different syntax for quoted identifiers if we adopt this idea. Probably something like i\JAVA_IDENTIFIER and I\className.

CeylonMigrationBot commented 12 years ago

[@quintesse] My first reaction is that there are only a handful of symbols available to us, so that if this takes off we will have many formats needing their own symbol. Now, within a single source file you might need only a couple of them at the same time so this should be manageable. But... this would mean that you'd often be re-using symbols for different formats, it would also mean that different people will use different symbols thereby making it almost impossible to develop "an eye" for literals. Now a Date is normally easily distinguishable from a RegExp but there might be other examples where this isn't the case and you'd have to go to the import to see what kind of literal it is.

I also don't see any advantage in letting you write print(\Hello), it's just another way of writing print("Hello") and why would you introduce 2 ways of doing the same thing? It just makes it more difficult to understand things "at a glance" IMO.

CeylonMigrationBot commented 12 years ago

[@chochos] So basically you can fake operator overloading and make your code unreadable by using unicode symbols in import aliases? I can now rename print to § and coalesce to © etc?

CeylonMigrationBot commented 12 years ago

[@gavinking] > I also don't see any advantage in letting you write print(\Hello), it's just another way of writing print("Hello") and why would you introduce 2 ways of doing the same thing?

I didn't say it was an advantage or something that is a good idea, I just noted that it is a consequence of the rules.

CeylonMigrationBot commented 12 years ago

[@gavinking] > So basically you can fake operator overloading and make your code unreadable by using unicode symbols in import aliases? I can now rename print to § and coalesce to © etc?

Actually my proposal is limited to precisely three characters: \, #, and $. And the idea is that you would only be allowed to use them as aliases for methods/constructors that parse data formats.

And anyway, in Ceylon today you can write:

import ceylon.language { ñ=print }

In fact, according to the spec, any unicode letter is a legal identifier. (Though § and © are not because they are symbols, not letters.)

CeylonMigrationBot commented 12 years ago

[@gavinking] > Now a Date is normally easily distinguishable from a RegExp but there might be other examples where this isn't the case and you'd have to go to the import to see what kind of literal it is.

I don't think that's right at all. At most you would need to hover over the literal to see what type it is.

But seriously, when I see 10:30 GMT+1, http://ceylon-lang.org, or 1974-03-25 in a file I know wtf I'm looking at!

CeylonMigrationBot commented 12 years ago

[@chochos] Letters are very different from little stars that make code look like it was written by a 2-year old girl with Strawberry Shortcake stamps.

I didn't know about the unicode id's. Now I can rename print to órale in my code!

Seriously though, how can you tell that a constructor parses a data format? Some annotation, or any constructor that receives a String? What about a method like parseInteger?

CeylonMigrationBot commented 12 years ago

[@gavinking] > Letters are very different from little stars that make code look like it was written by a 2-year old girl with Strawberry Shortcake stamps.

As I said above. The proposal is limited to exactly three characters.

> Seriously though, how can you tell that a constructor parses a data format? Some annotation,

Right. The literal annotation that specifies the pattern for syntactic validation of the literal.

CeylonMigrationBot commented 12 years ago

[@gavinking] @chochos see section 6.2.5 of the spec. According to this proposal, the examples from that section would look like:

value regex = $'^\w+@((\w+)\.)+$';

ph = #"+1 (404) 129 3456";

value duration = \1h30m;

value dt = Datetime(#'25/03/2005', $'12:00 AM PST');

Timer { schedule = \"0 0 23 ? * MON-FRI"; onTimeout=purge; }.start();

Text { color = #FF3B66; "Hello World!" }

Link { 
    url = #'http://jboss.org/ceylon'; 
    "Powered by Ceylon" 
}

Email {
    to = $'gavin@hibernate.org'; 
    subject = "Ceylon"; 
    text = "Need some help with the compiler?"; 
}

It's different from the approach in the spec because the spec approach uses left-to-right type inference to infer the type of the literal, whereas in this proposal, we use the import statement and the prefix character.

CeylonMigrationBot commented 12 years ago

[@chochos] Ok then... Those examples look really nice.

CeylonMigrationBot commented 12 years ago

[@tombentley] > * You can quote a string with either ' or ".
> * You can prefix the string literal with a \, in which case, the traditional C-style escapes will be interpreted by the compiler.

Those are both compromises I'd personally be very comfortable with.

As for the #, \ and $ idea: I can definitely see the appeal because those rules are fairly simple and allow us to do some nice things. The one part I don't like is the print(\Hello); thing. But I can't see how we could keep the rules simple and also prevent that.

CeylonMigrationBot commented 12 years ago

[@quintesse] > I know wtf I'm looking at!

Take it easy man. We're talking here about a system that is generically useful, there will possibly be hundreds of "literal formats", so without a doubt there will be many of them where you'll be wondering "what is this?". I mean, you really think many people are going to understand \"0 0 23 ? * MON-FRI" at first glance? Not everybody knows Cron (especially coming from Windows). So personally I'd definitely prefer something like cron "0 0 23 ? * MON-FRI" so you'd at least have some hint (or something to Google for).

And hovering? People do not use an IDE all the time. Also think of all the other moments that you could be looking at Ceylon code (online, a magazine article, vi).

It's like somebody saying it's okay to alias Array, List and Sequence to X, Y and Z , while using the same letters for different classes in the next source file. You won't be able to just look at a snippet of code and know what it's about, because $ was one thing on one source file and something else the next.

CeylonMigrationBot commented 12 years ago

[@chochos] I thought the thing with the 3 symbols was in addition to the other literal formats (integer,cron,date,etc let's call them named literals for now). Those are absolutely essential; the 3 symbols are a nice shorthand for literals you will use a lot in your code, but

CeylonMigrationBot commented 12 years ago

[@chochos] ...but as Tako says, they'll have different meanings in diff files and also someone might prefer to use an unambiguous named literal for some obscure format; and most importantly, whenever you need to use more than 3 formats in the same compilation unit, you need named literals.