dotnet / vblang

The home for design of the Visual Basic .NET programming language and runtime library.

[Proposal] Modify how literal strings are parsed. #301

Open rskar-git opened 6 years ago

rskar-git commented 6 years ago

Two items to this proposal:

Item 1: ANSI-quotation-marks versus Unicode-double-quotation-marks; if the string begins with " (i.e. Chr(34)) it must end only with ".

Item 2: A new Option which allows for C-style escape codes.

I believe these would address issues such as those stated in #276 and #299.

P.S. Alternative idea for Item 1 above: Introduce @"" to VB with a meaning and syntax identical with that of C# (and perhaps $@"" too). This avoids breaking existing code, and also would naturally complement Option of Item 2.

=== Item 1 ===

In VB today, a literal string can be composed with ANSI-quotation-marks (code &H22) and Unicode-double-quotation-marks (left and right, codes &H201C and &H201D). However, they are treated like exactly equivalent symbols. In other words, the literal can begin with any one of &H22, &H201C, or &H201D, but how it ends does not depend on how it begins - currently it can again be any one of them.

' &H201C to begin, then &H201C to end?!
Dim s = “This is (strangely) a valid string in today's VB“
' &H201D to begin, then &H201D to end?!
Dim s = ”This is (also strangely) a valid string in today's VB” 
' &H22 to begin, then &H201D to end - talking about mixing it up!
Dim s = "This is (too strangely) a valid string in today's VB” 

I am guessing the designers went this way as a help to those who code in a (or via a?) word-processor, such as Microsoft Word; or copy-paste from badly edited web pages. I'm not sure how many folks are out there who regularly do that - I can only guess they are the few and the proud. Considering how many languages in use today use either &H22 or ANSI-apostrophe (&H27) to form their literals, it doesn't seem like there are any real international/Unicode issues at play here.

So, I would like a change: If the literal begins with &H22, it ends with &H22. Therefore, this would then be valid:

' String begins and ends with &H22, and encloses a &H201C and a &H201D.
Dim s = "“This could someday be a valid string in VB, but sadly not today”"

Note there would no longer be a need to double-up on &H201C and &H201D in this mode.

I sincerely doubt this to be a breaking change in terms of how coding is actually done. I would leave it to others to decide on what to do about literals starting with &H201C or &H201D - I'm OK with keeping current behavior (any one of &H22, &H201C, or &H201D will do).
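To make the rule change in Item 1 concrete, here is a minimal sketch that finds where a literal ends under today's rule and under the proposed one. It is a simplified model rather than actual compiler code: `QuoteRuleSketch`, `IsTerminator`, and `FindClosingQuote` are hypothetical names, and the model assumes that two delimiter characters in a row form an escaped quote.

```vb
Imports System
Imports Microsoft.VisualBasic

Module QuoteRuleSketch
    ' The three characters VB accepts as string delimiters today.
    Private ReadOnly AsciiQuote As Char = ChrW(&H22)
    Private ReadOnly LeftQuote As Char = ChrW(&H201C)
    Private ReadOnly RightQuote As Char = ChrW(&H201D)

    ' Hypothetical helper: may c terminate a literal that was opened by opener?
    Private Function IsTerminator(opener As Char, c As Char, proposedRule As Boolean) As Boolean
        ' Proposed rule: a literal opened with &H22 can only be closed with &H22.
        If proposedRule AndAlso opener = AsciiQuote Then Return c = AsciiQuote
        ' Otherwise (today's rule): any of the three delimiters will do.
        Return c = AsciiQuote OrElse c = LeftQuote OrElse c = RightQuote
    End Function

    ' Returns the index of the closing delimiter for the literal that starts
    ' at position start, or -1 if the literal is unterminated.
    Function FindClosingQuote(source As String, start As Integer, proposedRule As Boolean) As Integer
        Dim opener As Char = source(start)
        Dim i As Integer = start + 1
        While i < source.Length
            If IsTerminator(opener, source(i), proposedRule) Then
                ' Assumption of this model: two terminators in a row are an escaped quote.
                If i + 1 < source.Length AndAlso IsTerminator(opener, source(i + 1), proposedRule) Then
                    i += 2
                    Continue While
                End If
                Return i
            End If
            i += 1
        End While
        Return -1
    End Function

    Sub Main()
        ' Source text: &H22, &H201C, h, i, &H201D, &H22 (indices 0 to 5).
        Dim text As String = AsciiQuote & LeftQuote & "hi" & RightQuote & AsciiQuote
        Console.WriteLine(FindClosingQuote(text, 0, proposedRule:=False)) ' prints 1
        Console.WriteLine(FindClosingQuote(text, 0, proposedRule:=True))  ' prints 5
    End Sub
End Module
```

In this model, today's rule ends the literal at the first curly quote (index 1), while the proposed rule scans through to the final &H22 (index 5), so the curly quotes become ordinary content.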

Alternatively, we could instead introduce C#-style @"" (which would work nicely with Item 2 below).

Dim s = @"“Maybe this instead could someday be a valid string in VB”"

=== Item 2 ===

Introduce a new Option which allows for C-style escape codes in string literals. I'll leave it for others to consider whether we need yet another literal string format - perhaps we could simply follow C# here, and allow @"" and $@"" (both redundant in today's VB). Anyway, maybe we could call it CharacterEscapes with settings of On and Off.

Option CharacterEscapes On
Dim s = "Look ma!\nSee that, new line! \u263A (smiles!)"
Dim f = @"C:\Users\MyAcct\Documents\SuperImportant.docx"

paul1956 commented 6 years ago

I am sure allowing "UnicodeOpenDoubleQuote" would break something (it would in my code), but it would not be silent, and I personally think it is worth it.

Side note: I can't even get this editor to accept a Unicode quote, so I typed it as UnicodeOpenDoubleQuote.

I prefer @"" to "Option CharacterEscapes On" unless we get local Options, because sometimes I want Unicode characters and sometimes I really want "\u" and don't want to escape it. If we use @"", you could just use the C# code unmodified to parse it, guaranteeing compatibility.

rskar-git commented 6 years ago

> If we use @"" you could just use the C# code unmodified to parse it guaranteeing compatibility.

And also add to the "fun" of having a diametrically different meaning for @"". As in, you know, it means "allow escapes" in VB and "disallow escapes" in C#. Sorry, but I see that as inviting even more h8 on VB. Much better would be for us getting local Options, as you mentioned.

Escapes can be handy, but really they're not that essential. I think Option CharacterEscapes On will suit most well enough, probably mostly for folks such as yourself who need to (manually?) translate much C# (or C/C++ or Java or Python...) into VB.

BTW, what sort of code base(s) are you translating that has all these crazy unicode double quotes in them? (which in turn are driving you crazy?)

paul1956 commented 6 years ago

@rskar-git I didn't know @"" was different, I don't know C# that well. I was asking that they be the same. As for the code base I am using Roslyn as my test vehicle. I also had no idea VB supported escapes.

pricerc commented 6 years ago

Implementing escapes would have been less painful if they'd been added along with interpolation. Then $"" could be used for either, and people would have gotten used to it as part of switching to interpolation.

Since we missed that opportunity, and since we already have a string suffix for char literals ("x"c), what about just using a different suffix for escaped strings, something like

Dim s = "Look ma!\nSee that, new line! \u263A (smiles!)"E

or one could have different variants for ASCII and UNICODE.

paul1956 commented 6 years ago

@pricerc That works for me. Hopefully what is inside the quotes is identical to C#.

rskar-git commented 6 years ago

@paul1956

> I also had no idea VB supported escapes.

Sorry about the miscommunication, but presently VB does NOT support escapes. I was simply pointing out how @"" is used in C# today (which means "do not do escapes" in C#), and that for that reason it would be a bad idea to introduce @"" into VB and have it mean something different.

C# has @"" mostly as a means to let a backslash be a backslash without the hassle of escapes; this is because @ disables escapes. The biggest pain point is a Windows OS filesystem path, which delimits with backslashes. So, without the @, you have to double-up on backslashes like so:

var filePath = "C:\\Users\\MyAcct\\Documents\\MyImportantFile.docx";

But with @, it's more like VB:

var filePath = @"C:\Users\MyAcct\Documents\MyImportantFile.docx";

So C# has these two modes as part of its syntax. But to reiterate, VB does NOT support escapes as a part of its syntax; however, via the .NET Framework, VB can decode escapes via Text.RegularExpressions.Regex.Unescape.
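As a rough illustration of that workaround, the sketch below decodes escapes at run time with `Regex.Unescape`. `UnescapeSketch` and `DecodeEscapes` are hypothetical names used only for this example. Note that `Regex.Unescape` follows regular-expression escape rules, which are not identical to C# string-literal escapes, so treat this as a stopgap rather than an equivalent of the proposed Option.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module UnescapeSketch
    ' DecodeEscapes is a hypothetical wrapper name; Regex.Unescape is the framework call.
    Function DecodeEscapes(raw As String) As String
        Return Regex.Unescape(raw)
    End Function

    Sub Main()
        ' In VB a backslash is an ordinary character, so this literal holds the
        ' two characters "\" and "n", not a line feed.
        Dim raw As String = "Look ma!\nSee that, new line! \u263A (smiles!)"
        Dim decoded As String = DecodeEscapes(raw)

        Console.WriteLine(decoded.Contains(vbLf))          ' the line feed is now present
        Console.WriteLine(decoded.Contains(ChrW(&H263A)))  ' and so is U+263A
    End Sub
End Module
```

Because the decoding happens at run time, the result cannot be used to initialize a Const.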

> I don't know C# that well.

Can you tell us a little more about what it is you need to get done? Perhaps there are better ways to get it accomplished in VB in dealing with unicode etc.

rskar-git commented 6 years ago

@pricerc

> Implementing escapes would have been less painful if they'd been added along with interpolation.

The limitation to that idea is that $"" cannot be used for defining constants (e.g. this would not work: Const Smiles As String = $"{\u263A}"; nor this: Const Smiles As String = $"{ChrW(&H263A)}").

> for escaped strings, something like Dim s = "Look ma!\nSee that, new line! \u263A (smiles!)"E

Actually, I wonder if that concept could be made to work. There are two downsides I can see: (1) the string will need to be rescanned (to be decoded), since the E is at the end; and (2) we probably cannot get around doubling up the double quotes this way, i.e. \" would be useless as an escape. Which means we haven't addressed paul1956's issue with Unicode double quotes.

I still think that Option CharacterEscapes On is the smarter way to go, and add @"" to VB syntax (and $@"") to mean the same as in C# (which means "do not do escapes").

paul1956 commented 6 years ago

I am working on a C# to VB translator that preserves comments and formatting where possible. I started with some of the Roslyn samples, but they throw away most, if not all, comments and formatting, so the resulting code is really hard to read and understand, and some lines are thousands of characters long. Also, many features that are easy to translate are just skipped in most translators. The best example is checked math, which is what VB does by default. At this point I can successfully translate and compile the first 2,000 C# files in the Roslyn src tree and preserve all the comments, but they are not all in the correct place (or sometimes even close to where they belong). In the process I am learning to read C#. Just because something compiles doesn't mean the code executes correctly; my misunderstanding of C# escaped strings is an example of that. I also have not found a general workaround for unchecked math, though I have special-cased typical uses, so it is not hopeless. What I really need is a VB comment that can be used in more places, or more flexibility around comments and blank lines in argument lists. Just looking at Roslyn, if there is a way to write a C# comment and place it, somewhere in Roslyn is an example. Except for one very small DLL for hash and unchecked math, everything is written in VB using Roslyn.

paul1956 commented 6 years ago

@pricerc Given VB's compatibility requirements, I can't think of any workaround for Unicode double quotes without an Option or a version-specific feature that breaks existing code. I think most users would be happy to accept a little pain removing the doubled Unicode quotes for the convenience of just being able to paste from a word processor or web page and not have to double all the quotes.

Echo-8-ERA commented 6 years ago

@rskar-git

> The limitation to that idea is that $"" cannot be used for defining constants (e.g. this would not work: Const Smiles As String = $"{\u263A}"; nor this: Const Smiles As String = $"{ChrW(&H263A)}").

Would it be possible to just do away with that limitation? i.e. if it's possible to evaluate an interpolated string at compile time (as in all inserted values are constants), the compiler does so and treats it just like any other string literal.

paul1956 commented 6 years ago

@rskar-git VB has many artificial limitations around constants that are not obvious and maybe something to look at fixing even if only a few at a time. Chr(W) with a constant, Nothing...

pricerc commented 6 years ago

tangential to the topic. Since Chr and ChrW both return the same UNICODE Char datatype in VB.NET, does anyone know why ChrW wasn't retired or Chr and ChrW made synonyms with the advent of VB.NET?

I get that you'd want some compatibility with VB6/VBA, but I'm not sure I see the value in the distinction.

pricerc commented 6 years ago

parts of this discussion sound a bit like #184 and #27. #184 is basically where I got my idea from.

rskar-git commented 6 years ago

@paul1956

> I am working on a C# to VB translator...

That's a big job, made all the more challenging by learning C# as you go - hats off to you for taking it on!

Remarkably, each of these mess-up on the fancy unicode double quotes!:

Are you doing this for yourself or an employer? Asking because I'm curious if you've checked this out: https://www.tangiblesoftwaresolutions.com/product_details/csharp-to-vb-converter.html

rskar-git commented 6 years ago

@Echo-8-ERA

> Would it be possible to just do away with that limitation?

I suppose, but as I'm not among the somebodies whose job it is to maintain compiler, IDE, and "tooling", I've got no informed opinion on whether that's a big job or not.

Actually why is this good enough:

Const Smiles As String = $"{ChrW(&H263A)}"

when instead adding an Option to allow for escapes lets us do:

Const Smiles As String = "\u263A"

The other idea of:

Const Smiles As String = $"{\u263A}"

doesn't strike me as doable at all. Right now any valid expression is what's expected between the curly braces. Adding in the complexity of detecting escapes sounds like a messy and painful job.

rskar-git commented 6 years ago

@pricerc

> Since Chr and ChrW both return the same UNICODE Char datatype in VB.NET, does anyone know why ChrW wasn't retired or Chr and ChrW made synonyms with the advent of VB.NET?

Chr is stuck in the days of ASCII + IBM-PC character sets and "code pages" (see https://en.wikipedia.org/wiki/Code_page). Contrast this with ChrW which is a straight conversion to Unicode. To see this in action, try:

System.Threading.Thread.CurrentThread.CurrentCulture = System.Globalization.CultureInfo.GetCultureInfo("ru-RU")
Dim s1 = Chr(170)
Dim s2 = ChrW(170)
Stop ' Now have a look at s1 and s2

So there you have it. Chr and ChrW for sake of backwards compatibility.

paul1956 commented 6 years ago

@rskar-git I'm doing it for myself. I love the Tangible Software Solutions converter and have provided feedback to them to improve it, but it doesn't do many of the things I need, and some of the code it produces converts but doesn't compile. What I have is already well beyond anything available. If I understood GitHub better I would be happy to share the source, but my past experience trying to fix issues in open source projects without being an insider has proven very frustrating. I have a UI that translates folders recursively and compiles the result. I am working on a comment comparer to make sure I don't drop anything, but in some cases the Roslyn syntax walker is the cause of the issue, especially around documentation comments, where Roslyn, VB, C# and Visual Studio all allow malformed comments that you can't create with a SyntaxFactory, even though they look perfectly correct when you look at them.

Echo-8-ERA commented 6 years ago

@rskar-git

> Actually why is this good enough: Const Smiles As String = $"{ChrW(&H263A)}" when instead adding an Option to allow for escapes lets us do: Const Smiles As String = "\u263A"

The former supports a larger set of uses than the latter for one. While the latter is strictly for escaping characters, the former would also enable stuff like:

Private Const IdentifierStartPattern = "(\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}|\p{Nl})"
Private Const IdentifierEndPattern = "(\p{Mn}|\p{Mc}|\p{Nd}|\p{Pc}|\p{Cf})"
Private Const IdentifierPattern = $"{IdentifierStartPattern}({IdentifierStartPattern}|{IdentifierEndPattern})*"
Private Const QualifiedIdentifierPattern = $"({IdentifierPattern}\.)*{IdentifierPattern}"

bandleader commented 6 years ago

I am liking the idea of doing this in combination with the new $"..." syntax, since users already know that {...} expressions are being parsed differently and interpolated -- so let's use this for escaping Unicode chars too! We could do something like the following:

Dim message = $"Hi {\uH263A} Have a great day!"

...which is readily understandable to existing users, requires almost no new syntax, and causes no breaking changes or ambiguity since expressions in VB can't begin with a \.

Another idea, less readable but requires no new syntax at all:

Dim message = $"Hi {ChrW(&H263A)} Have a great day!"

...and just have the compiler realize that this is an escape, and inline it directly into the string, instead of an actual call to ChrW. It's almost a performance optimization rather than a new feature.

(But I prefer the first way.)

Echo-8-ERA commented 6 years ago

@bandleader Both have already been suggested.

KathleenDollard commented 6 years ago

Item 1 of the proposal is a breaking change, so we are not going to do it.

Item 2 introduces a new option, which would further complicate the language.

The underlying problem of escape sequences deserves more thought, but this proposal isn't a solution we are happy with.

AdamSpeight2008 commented 6 years ago

@KathleenDollard I've been thinking about changing how strings are represented internally in the Visual Basic language, by having each kind inherit from a common base that represents textual symbols in the language.

Char Literal
String Literal
Interpolation String Literals

This textual base consists of up to three sections:

abstract Textual := Prefix? Content Postfix? ;
abstract Content := LeftQuotation Char* RightQuotation ;
abstract  Prefix := <!- to be implemented in inherited -!>;
abstract Postfix := <!- to be implemented in inherited -!>;

String_Literal := Textual with { Prefix:= Nothing, Postfix:= Nothing }
  Char_Literal := Textual with { Prefix:= Nothing, Postfix:= 'c' | 'C' }
Interpolation_String := Textual with { Prefix:= '$', Postfix:= Nothing }

The semantic validity of the char literal, is lifted out of syntax analysis to semantic analysis.

This then allows us to potentially extend or add additional forms of textual representations.

jrmoreno1 commented 5 years ago

@KathleenDollard: when this was considered by the LDM, was the “\u263A”e syntax considered? It’s not mentioned in the proposal, but it appears in one of the comments by @pricerc.

KathleenDollard commented 5 years ago

Since this was a proposal, our decision to reject it applies only to this approach.

I'd love to see one or more issues created from the underlying problems (unless they already exist, in which case, ping that issue with your thoughts)

jrmoreno1 commented 5 years ago

The underlying issue (difficulty working with Unicode strings) was brought up as a problem in #276. It doesn’t propose a solution, just asks if one could be created.