Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

highly illegal variable names are now accidentally legal #12172

Closed p5pRT closed 11 years ago

p5pRT commented 12 years ago

Migrated from rt.perl.org#113620 (status was 'resolved')

Searchable as RT113620$

p5pRT commented 12 years ago

From tchrist@perl.com

Call me a Luddite, but the following program is to my mind being very naughty -- which is the very *nicest* thing I can say about it.


  use v5.16;
  use utf8;

  $— = "EM DASH"; # gc=Punctuation
  say $—;

  $ = "APPLE LOGO"; # Private Use Area
  say $;

  $£ = "POUND STERLING"; # gc=Symbol
  say $£;

  $­ = "SOFT HYPHEN" ; # gc=Control
  say $­;

  $  = "THIN SPACE"; # whitespace, can you believe it!?!?
  say $  = "THIN SPACE";

  $� = "HYPER 0x11_1111"; # trans-Unicode
  say $�;

  $� = "SURROGATE DC00"; # this should never be possible
  say $�;

  $̈̈ = "COMBINING DIARESIS";
  say $̈̈ ;

  $⃠ = "COMBINING ENCLOSING CIRCLE BACKSLASH";
  say $⃠ ;

  say "That’s all, folks!";

Because it in fact compiles and runs. Messily, yes, but it runs. I don't understand why it even compiles.

  % ~/blead/perl -I ~/blead/lib /tmp/testu
  Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 20.
  Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 20.
  Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 20.
  Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 20.
  Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 21.
  Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 21.
  Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 21.
  Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed.
  EM DASH
  APPLE LOGO
  POUND STERLING
  SOFT HYPHEN
  THIN SPACE
  HYPER 0x11_1111
  SURROGATE DC00
  COMBINING DIARESIS
  COMBINING ENCLOSING CIRCLE BACKSLASH
  That’s all, folks!

% blead -v says that

  This is perl 5, version 17, subversion 0 (v5.17.0-352-g3630f57) built for darwin-2level

I can handle punctuation. I can handle symbols.

I can even handle private use area.

I don't know what I think about hypers. Probably yes ok.

But I see no place for control characters or combining marks of any sort, and I am really unhappy about whitespace variable names. What's next, dollar tab? Beyond that, I am *exceedingly* displeased with surrogates. That's just evil and wrong, and in so many ways.

--tom

Lest there be any question, here is a verbosely uniquoted version:

     1
     2  use v5.16;
     3  use utf8;
     4
     5  $\N{EM DASH} = "EM DASH"; # gc=Punctuation
     6  say $\N{EM DASH};
     7
     8  $\N{U+F8FF} = "APPLE LOGO"; # Private Use Area
     9  say $\N{U+F8FF};
    10
    11  $\N{POUND SIGN} = "POUND STERLING"; # gc=Symbol
    12  say $\N{POUND SIGN};
    13
    14  $\N{SOFT HYPHEN} = "SOFT HYPHEN" ; # gc=Control
    15  say $\N{SOFT HYPHEN};
    16
    17  $\N{THIN SPACE} = "THIN SPACE"; # whitespace, can you believe it!?!?
    18  say $\N{THIN SPACE} = "THIN SPACE";
    19
    20  $\N{U+111111} = "HYPER 0x11_1111"; # trans-Unicode
    21  say $\N{U+111111};
    22
    23  $\N{U+DC00} = "SURROGATE DC00"; # this should never be possible
    24  say $\N{U+DC00};
    25
    26  $\N{COMBINING DIAERESIS}\N{COMBINING DIAERESIS} = "COMBINING DIARESIS";
    27  say $\N{COMBINING DIAERESIS}\N{COMBINING DIAERESIS} ;
    28
    29  $\N{COMBINING ENCLOSING CIRCLE BACKSLASH} = "COMBINING ENCLOSING CIRCLE BACKSLASH";
    30  say $\N{COMBINING ENCLOSING CIRCLE BACKSLASH} ;
    31
    32  say "That\N{RIGHT SINGLE QUOTATION MARK}s all, folks!";
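For anyone who wants to check how a particular perl treats these names without saving a UTF-8 source file, here is a minimal sketch. It builds each one-character name with chr and hands the code to eval as a character string. The helper name punct_var_compiles is hypothetical, and the answers are expected to differ between perl versions, so no expected output is shown.

```perl
use strict;
use warnings;
no warnings 'utf8';    # surrogates and above-Unicode code points are used on purpose

# Hypothetical helper, not part of perl: returns 1 if this perl compiles a
# read of the scalar whose one-character name is $name, 0 otherwise.
sub punct_var_compiles {
    my ($name) = @_;
    # The code reaches eval as a string of characters, so no encoded source
    # file (and no "use utf8") is involved.
    return eval("no strict; my \$probe = \$$name; 1") ? 1 : 0;
}

# Code points taken from the report above.
my @candidates = (0x2014, 0xF8FF, 0xA3, 0xAD, 0x2009, 0x111111, 0xDC00, 0x20E0);
for my $cp (@candidates) {
    printf "U+%04X %s\n", $cp,
        punct_var_compiles(chr $cp) ? "compiles" : "rejected";
}
```

Reading the variable rather than assigning to it keeps the probe from changing any special variable as a side effect.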

p5pRT commented 12 years ago

From @ikegami

On Wed, Jun 13, 2012 at 1:16 PM, tchrist1 <perlbug-followup@perl.org> wrote:

What's next, dollar tab?

Dollar-tab does not appear to be valid *code*, but the *variable* has existed for quite some time! It's set by -i, and it's usually accessed as $^I.

  $ perl -i'Just_Another_Perl_hacker,' -E'say ${"\t"}'
  Just_Another_Perl_hacker,
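The same point as a small script, as a sketch: "\cI" is the tab character, so $^I and the symbolic reference ${"\t"} should name one and the same variable. The value '.orig' is just an illustrative backup extension.

```perl
use strict;
use warnings;

{
    no strict 'refs';        # ${"\t"} is a symbolic reference
    local $^I = '.orig';     # the in-place-edit extension normally set by -i
    # "\cI" is chr(9), the tab, so both spellings reach the same variable.
    print ${"\t"}, "\n";     # prints ".orig"
}
```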

p5pRT commented 12 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 12 years ago

From j.imrie1@virginmedia.com

Tom Wrote

But I see no place for control characters or combining marks of any sort, and I am really unhappy about whitespace variable names. What's next, dollar tab? Beyond that, I am *exceedingly* displeased with surrogates. That's just evil and wrong, and in so many ways.

Sorry Tom, but we have had control characters for as long as I can remember, which is admittedly only as far back as 5_005.

As I understood it, $^A was syntactic sugar for the variable $ followed by a literal control-A character, and you could in fact write a raw control-A character in your source code to access this.

John

p5pRT commented 12 years ago

From tchrist@perl.com

"John Imrie via RT" <perlbug-followup@perl.org> wrote on Wed, 13 Jun 2012 13:13:24 PDT:

Tom Wrote

But I see no place for control characters or combining marks of any sort, and I am really unhappy about whitespace variable names. What's next, dollar tab? Beyond that, I am *exceedingly* displeased with surrogates. That's just evil and wrong, and in so many ways.

Sorry Tom, but we have had control characters for as long as I can remember, which is admittedly only as far back as 5_005.

As I understood it, $^A was syntactic sugar for the variable $ followed by a literal control-A character, and you could in fact write a raw control-A character in your source code to access this.

I'm well aware of the ASCII controls. And you still can't write dollar tab unless you write ${"\t"}.

I mean that when we extended Perl to include Unicode in our source, I didn't think we were going to add more than the *reasonable* code points, meaning punctuation and symbols, and by necessity, PUA ones.

The rest can go hang.

--tom

p5pRT commented 12 years ago

From @rjbs

* tchrist1 <perlbug-followup@perl.org> [2012-06-13T13:16:09]

$— = "EM DASH";  # gc=Punctuation
say $—;

Sure, okay. I'm a little surprised, but I think I can cope.

$ = "APPLE LOGO";  # Private Use Area
say $;

Not very happy with that. Maybe. I think that code point itself may have been discussed in the past; I'm going to dig up the thread.

$£ = "POUND STERLING";  # gc=Symbol
say $£;

Still a little surprised, still coping.

$­ = "SOFT HYPHEN" ; # gc=Control
say $­;

No.

$  = "THIN SPACE";   # whitespace, can you believe it!?!?
say $  = "THIN SPACE";

No.

$� = "HYPER 0x11_1111";  # trans-Unicode
say $�;

Um. Uh.

I think no. I mean... um. What? I think this can't be used for anything but evil, right?

$� = "SURROGATE DC00"; # this should never be possible
say $�;

No.

$̈̈ = "COMBINING DIARESIS";
say $̈̈ ;

AWESOME. But no.

$⃠  = "COMBINING ENCLOSING CIRCLE BACKSLASH";
say $⃠ ;

my NonScalar $⃠ = @array;

Ha ha ha no.

This definitely needs fixing.

What characters can be a literal punctuation var? Possibly:

  (\pP | \pS) & \P{Space}

I think there's room to wiggle here, but it doesn't include surrogates or leading marks.

(Is $™́̈ a permissible punctuation variable?)

-- rjbs
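To see what the tentative rule (\pP | \pS) & \P{Space} would admit, here is a rough sketch of it as a Perl predicate. The sub name is_candidate_punct_char is hypothetical; this only illustrates the proposal and is not how perl's lexer decides anything.

```perl
use strict;
use warnings;
no warnings 'utf8';    # the test data includes a surrogate on purpose

# Hypothetical predicate for the proposed class: punctuation or symbol,
# and not whitespace.
sub is_candidate_punct_char {
    my ($char) = @_;
    return ( $char =~ /\A[\p{P}\p{S}]\z/ && $char !~ /\A\p{Space}\z/ ) ? 1 : 0;
}

# EM DASH, POUND SIGN, THIN SPACE, COMBINING DIAERESIS, a low surrogate.
for my $cp (0x2014, 0xA3, 0x2009, 0x0308, 0xDC00) {
    printf "U+%04X %s\n", $cp,
        is_candidate_punct_char(chr $cp) ? "allowed" : "excluded";
}
```

Under this rule the em dash and pound sign pass, while the thin space, the combining mark, and the surrogate are excluded because none of them is gc=P or gc=S. As far as I can tell no whitespace code point is punctuation or a symbol, so the Space test is kept only to mirror the formula.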

p5pRT commented 12 years ago

From @cpansprout

On Mon Jun 18 15:54:25 2012, perl.p5p@rjbs.manxome.org wrote:

* tchrist1 <perlbug-followup@perl.org> [2012-06-13T13:16:09]

$— = "EM DASH";  # gc=Punctuation
say $—;

Sure, okay. I'm a little surprised, but I think I can cope.

$ = "APPLE LOGO";  # Private Use Area
say $;

Not very happy with that. Maybe. I think that code point itself may have been discussed in the past; I'm going to dig up the thread.

All I remember being brought up was lexical punctuation variables.

No. ... No. ... Um. Uh.

I think no. I mean... um. What? I think this can't be used for anything but evil, right?

What about control characters? Perl, as far as I know, has always allowed $ followed by a literal control-A to refer to the accumulator.

All we did was fix the parser not to lose track of whether it was looking at bytes or utf8. So it no longer lies to itself. It used to be that $£ was permitted under ‘use utf8’ but only when written in Latin-1! (I.e., one had to have a mangled file for perl to parse it.)

I don’t see why we need to disallow some punctuation variables and not others. (The name ‘punctuation variable’ is misleading, since control characters are not punctuation. I’m using the term to refer to single-character non-identifier variables.) It would arbitrarily complicate things.

$� = "SURROGATE DC00"; # this should never be possible
say $�;

No.

I don’t see how that’s much different from $ followed by a literal control-C ($^C).

$̈̈ = "COMBINING DIARESIS";
say $̈̈ ;

AWESOME. But no.

$⃠  = "COMBINING ENCLOSING CIRCLE BACKSLASH";
say $⃠ ;

my NonScalar $⃠ = @array;

Ha ha ha no.

Now *that* no seems completely arbitrary to me.

This definitely needs fixing.

What characters can be a literal punctuation var? Possibly:

(\pP | \pS) & \P{Space}

I think there's room to wiggle here, but it doesn't include surrogates or leading marks.

(Is $™́̈ a permissible punctuation variable?)

That consists of more than one Perl character, so no.

--

Father Chrysostomos

p5pRT commented 12 years ago

From tchrist@perl.com

On Mon Jun 18 15:54:25 2012, perl.p5p@rjbs.manxome.org wrote:

* tchrist1 <perlbug-followup@perl.org> [2012-06-13T13:16:09]

$— = "EM DASH";  # gc=Punctuation
say $—;

Sure, okay. I'm a little surprised, but I think I can cope.

$ = "APPLE LOGO";  # Private Use Area
say $;

Not very happy with that. Maybe. I think that code point itself may have been discussed in the past; I'm going to dig up the thread.

All I remember being brought up was lexical punctuation variables.

No. ... No. ... Um. Uh.

I think no. I mean... um. What? I think this can't be used for anything but evil, right?

What about control characters? Perl, as far as I know, has always allowed $ followed by a literal control-A to refer to the accumulator.

All we did was fix the parser not to lose track of whether it was looking at bytes or utf8. So it no longer lies to itself. It used to be that $£ was permitted under ‘use utf8’ but only when written in Latin-1! (I.e., one had to have a mangled file for perl to parse it.)

I don’t see why we need to disallow some punctuation variables and not others. (The name ‘punctuation variable’ is misleading, since control characters are not punctuation. I’m using the term to refer to single-character non-identifier variables.) It would arbitrarily complicate things.

$� = "SURROGATE DC00"; # this should never be possible
say $�;

No.

I don’t see how that’s much different from $ followed by a literal control-C ($^C).

You don't see why U+0003 is a legal Unicode code point but U+DC00 is not? That isn't UTF-8. This is wrong. You lied when you were using UTF-8. Perl should not put up with illegal UTF-8 -- period.

--tom

p5pRT commented 12 years ago

From tchrist@perl.com

As for allowing arbitrary code points, I think it is very wrong to allow things that aren't printable characters, or which cause things to print wrong. There is no reason to propagate the original sin.

Let it be symbols and punctuation only, with PUAs grandfathered in because they might be such.

--tom

p5pRT commented 12 years ago

From @cpansprout

On Mon Jun 18 18:01:25 2012, tom christiansen wrote:

You don't see why U+0003 is a legal Unicode code point but U+DC00 is not? That isn't UTF-8. This is wrong. You lied when you were using UTF-8. Perl should not put up with illegal UTF-8 -- period.

If I have it inside a string that I pass to eval, there is no UTF-8 involved. It’s just a string of Perl characters, which is a superset of Unicode.

--

Father Chrysostomos

p5pRT commented 12 years ago

From tchrist@perl.com

Control characters were allowed only because of the possibility of the $^? = 1 notation.

That no longer applies, and they should therefore be banned.

--tom

p5pRT commented 12 years ago

From tchrist@perl.com

On Mon Jun 18 18:01:25 2012, tom christiansen wrote:

You don't see why U+0003 is a legal Unicode code point but U+DC00 is not? That isn't UTF-8. This is wrong. You lied when you were using UTF-8. Perl should not put up with illegal UTF-8 -- period.

If I have it inside a string that I pass to eval, there is no UTF-8 involved. It’s just a string of Perl characters, which is a superset of Unicode.

Irrelevant and immaterial.

If you say "use utf8", and Perl encounters illegal UTF-8 during compilation, it should abort.

--tom

p5pRT commented 12 years ago

From @cpansprout

On Mon Jun 18 18:08:43 2012, tom christiansen wrote:

On Mon Jun 18 18:01:25 2012, tom christiansen wrote:

You don't see why U+0003 is a legal Unicode code point but U+DC00 is not? That isn't UTF-8. This is wrong. You lied when you were using UTF-8. Perl should not put up with illegal UTF-8 -- period.

If I have it inside a string that I pass to eval, there is no UTF-8 involved. It’s just a string of Perl characters, which is a superset of Unicode.

Irrelevant and immaterial.

If you say "use utf8", and Perl encounters illegal UTF-8 during compilation, it should abort.

I don’t have to say ‘use utf8’ to pass a character string to eval.

I repeat: There doesn’t have to be any UTF-8 involved at all for me to pass a string of wide characters to eval.

Simply saying it is irrelevant doesn’t make it so.

Please consider this from the standpoint of parsing strings of Perl characters. I’m not saying you don’t have a point, but it has nothing to do with UTF-8. (Also, don’t forget that utf8 [without a hyphen] is different from UTF-8 [with a hyphen].)

--

Father Chrysostomos
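One place where the utf8 / UTF-8 distinction is visible from ordinary Perl code is the Encode module, which registers the two spellings as separate encodings. The following is only a sketch of that distinction, not a statement about how the lexer reads source files. It prints the canonical names Encode reports and then asks the strict decoder to accept the three bytes a UTF-8-style encoder would produce for U+DC00.

```perl
use strict;
use warnings;
use Encode qw(find_encoding decode);

# Encode treats the hyphenated and unhyphenated spellings as different encodings.
print find_encoding("UTF-8")->name, "\n";    # utf-8-strict
print find_encoding("utf8")->name,  "\n";    # utf8

# ED B0 80 is the UTF-8-style byte sequence for the surrogate U+DC00.
# LEAVE_SRC stops decode from modifying $bytes while it checks.
my $bytes = "\xED\xB0\x80";
my $ok = eval {
    decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "strict UTF-8 decode: ", ($ok ? "accepted" : "rejected"), "\n";    # rejected
```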

p5pRT commented 12 years ago

From @cpansprout

On Mon Jun 18 18:08:08 2012, tom christiansen wrote:

Control characters were allowed only because of the possibility of the $^? = 1 notation.

That no longer applies, and they should therefore be banned.

Er, what no longer applies? I don’t understand what you are saying.

--

Father Chrysostomos

p5pRT commented 12 years ago

From tchrist@perl.com

"Father Chrysostomos via RT" <perlbug-followup@perl.org> wrote on Mon, 18 Jun 2012 18:20:21 PDT:

On Mon Jun 18 18:08:08 2012, tom christiansen wrote:

Control characters were allowed only because of the possibility of the $^? = 1 notation.

That no longer applies, and they should therefore be banned.

Er, what no longer applies? I don’t understand what you are saying.

There is no way to specify the Unicode control characters using the caret notation with a non-control character in Perl.

For example, an LRE, meaning U+202A, LEFT-TO-RIGHT EMBEDDING, would need to be ^ in front of U+206A, because 0x202A ^ ord("@") is 0x206A. But U+206A is still a non-printing \p{Cf} type character. You can't xor your way out of it.

Why in the world are you trying to argue that allowing nonprinting variable names is a *good* thing?

--tom
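The arithmetic behind that can be checked with a short sketch. The helper name caret_partner is made up for illustration; it simply applies the XOR with ord("@") (64) that caret notation is based on.

```perl
use strict;
use warnings;

# Caret notation pairs a control character with the character whose code
# is the control's code XOR 64, i.e. XOR ord("@").
sub caret_partner {
    my ($cp) = @_;
    return $cp ^ ord("@");
}

printf "U+%04X\n", caret_partner(0x01);      # U+0041, "A": hence $^A for control-A
printf "U+%04X\n", caret_partner(0x202A);    # U+206A
print "\x{206A}" =~ /\p{Cf}/
    ? "U+206A is still a format character\n"
    : "U+206A is printable\n";
```

For the ASCII controls the partner is a printable character, which is what makes $^A readable; for U+202A the partner is itself gc=Cf, so the notation gains nothing.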

p5pRT commented 12 years ago

From tchrist@perl.com

I don’t have to say ‘use utf8’ to pass a character string to eval.

I repeat: There doesn’t have to be any UTF-8 involved at all for me to pass a string of wide characters to eval.

Simply saying it is irrelevant doesn’t make it so.

Please consider this from the standpoint of parsing strings of Perl characters. I’m not saying you don’t have a point, but it has nothing to do with UTF-8. (Also, don’t forget that utf8 [without a hyphen] is different from UTF-8 [with a hyphen].)

I am perfectly aware of the distinction. It is mere casuistry, and it is a security flaw. If I say "use utf8", then it is a security flaw to allow illegal UTF-8. Think non-shortest-forms, for example.

There are very good reasons that UTF-8 forbids all that crap. And there is no reason we should tolerate it. Do I really have to develop and publish a security exploit before this is fixed forever?

--tom

p5pRT commented 12 years ago

From @rjbs

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2012-06-18T20:57:20]

On Mon Jun 18 15:54:25 2012, perl.p5p@rjbs.manxome.org wrote:

No. ... No. ... Um. Uh.

I think no. I mean... um. What? I think this can't be used for anything but evil, right?

What about control characters? Perl, as far as I know, has always allowed $ followed by a literal control-A to refer to the accumulator.

Variables like that are, in my opinion, an historical curiosity. I don't think they form a good example to draw on, because their visual samey-samey-ness doesn't carry into the future. There's no $^x for an x that gets us sane "funny punctuation vars."

All we did was fix the parser not to lose track of whether it was looking at bytes or utf8. So it no longer lies to itself. It used to be that $£ was permitted under ‘use utf8’ but only when written in Latin-1! (I.e., one had to have a mangled file for perl to parse it.)

I am very glad to have it not get tripped up!

Now that it can properly identify the characters in the source, we should also make it do something sane with them. I'm saying that in many of the cases that Tom brought up, the sane thing is to say "no."

I don’t see why we need to disallow some punctuation variables and not others. (The name ‘punctuation variable’ is misleading, since control characters are not punctuation. I’m using the term to refer to single-character non-identifier variables.) It would arbitrarily complicate things.

There are a few kinds of cases to consider.

The first one is one that I think got omitted: ${"\N{U+2009}"}

Do we want space characters to be legal punctuation variables? Dollar thin space, dollar hair space, dollar en space, dollar ideographic space. For flair, dollar mongolian vowel separator.

I'm not saying you can't make this variable if you want to go putz around in the stash, but should it be directly written in a Perl program's source? No.

If we exclude Space, we probably need to exclude Default_Ignorable_Code_Point. Even for non-identifier-type variables. What happens, otherwise, when someone's source is dollar-right-to-left-embedding?

$� = "SURROGATE DC00"; # this should never be possible
say $�;

No.

I don’t see how that’s much different from $ followed by a literal control-C ($^C).

It is different because \cC is a valid character which can appear in a source file. A surrogate is not. ${surrogate} is never going to appear in a legal source file. Invalid UTF-8 in source files should be fatal.

Since it can never appear in a source file, I am loath to allow it in source text. We end up with "some Perl 5 source documents can be valid strings, but can't be represented as interpretable octets."

Alternately, if punctuation variables are "single-character non-identifier variables," we're *really* pushing the limit with surrogates, which "must not" be interpreted as abstract characters.

$̈̈ = "COMBINING DIARESIS";
say $̈̈ ;

AWESOME. But no.

$⃠  = "COMBINING ENCLOSING CIRCLE BACKSLASH";
say $⃠ ;

my NonScalar $⃠ = @array;

Ha ha ha no.

Now *that* no seems completely arbitrary to me.

Both of those noes? Do we want to allow variable names that form combining glyphs with their sigils? It's a readability nightmare.

Maybe the answer is "so don't do that," but if we're going to rule out characters based on their properties, I think being gc=Mark is a reasonable exclusion. Or, as I suggested, [\pP\pS] is a reasonable start for inclusions.

If we *aren't* going to rule out characters based on properties, what are we left with? Something like "any non-identifier character that did not previously have meaning after a symbol is now a punctuation variable if it could form a one-character token."

Non-ASCII identifiers made of "identifier characters" already make the source difficult to read reliably (because of $͜ ne $И and other visually similar glyphs), but allowing just about any codepoint, no matter how outré, to be used as a single-character identifier seems not only infelicitous, but dangerous.

(Is $™́̈ a permissible punctuation variable?)

That consists of more than one Perl character, so no.

:-)

-- rjbs
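The exclusions discussed in this message (Space, Default_Ignorable_Code_Point, gc=Mark, surrogates, and code points beyond Unicode) can be listed per code point with a small sketch like the one below. The sub name exclusion_reasons and the reason labels are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
no warnings 'utf8';    # a surrogate is examined on purpose

# Hypothetical helper: lists which of the exclusions floated in this thread
# apply to a code point. An empty list means none of them does.
sub exclusion_reasons {
    my ($cp) = @_;
    return ("not Unicode") if $cp > 0x10FFFF;
    my $char = chr $cp;
    my @reasons;
    push @reasons, "surrogate"         if $char =~ /\p{Cs}/;
    push @reasons, "space"             if $char =~ /\p{Space}/;
    push @reasons, "mark"              if $char =~ /\p{M}/;
    push @reasons, "default-ignorable" if $char =~ /\p{Default_Ignorable_Code_Point}/;
    return @reasons;
}

for my $cp (0x2014, 0xAD, 0x2009, 0x111111, 0xDC00, 0x0308, 0x202B) {
    my @why = exclusion_reasons($cp);
    printf "U+%04X: %s\n", $cp, @why ? join(", ", @why) : "no exclusion applies";
}
```

Note that the soft hyphen and RIGHT-TO-LEFT EMBEDDING are caught only by the Default_Ignorable_Code_Point test, which is the case this message raises.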

p5pRT commented 12 years ago

From @cpansprout

On Mon Jun 18 18:29:01 2012, tom christiansen wrote:

Why in the world are you trying to argue that allowing nonprinting variable names is a *good* thing?

(This answers Ricardo Signes at the same time:)

Internal consistency.

  $ perl5.8.1 -le 'eval qq|print q\x{d800}hello\x{d800}|'
  hello
  $ perl5.16.0 -le 'eval qq|print q\x{d800}hello\x{d800}|'
  hello

--

Father Chrysostomos

p5pRT commented 12 years ago

From @cpansprout

On Mon Jun 18 20:11:34 2012, perl.p5p@rjbs.manxome.org wrote:

Alternately, if punctuation variables are "single-character non-identifier variables," we're *really* pushing the limit with surrogates, which "must not" be interpreted as abstract characters.

Abstract Unicode characters. Perl characters are another thing.

Arbitrary delimiters, especially things like surrogates, are extremely useful in generated code. I just want things to be somewhat consistent, which they currently are (see my previous message).

Ha ha ha no.

Now *that* no seems completely arbitrary to me.

Both of those noes? Do we want to allow variable names that form combining glyphs with their sigils? It's a readability nightmare.

Just the second one. Firefox misrendered it, and I did not realise it was a combining character until I copied and pasted it into another program.

--

Father Chrysostomos

p5pRT commented 12 years ago

From @doy

On Mon, Jun 18, 2012 at 08:25:21PM -0700, Father Chrysostomos via RT wrote:

On Mon Jun 18 18:29:01 2012, tom christiansen wrote:

Why in the world are you trying to argue that allowing nonprinting variable names is a *good* thing?

(This answers Ricardo Signes at the same time:)

Internal consistency.

  $ perl5.8.1 -le 'eval qq|print q\x{d800}hello\x{d800}|'
  hello
  $ perl5.16.0 -le 'eval qq|print q\x{d800}hello\x{d800}|'
  hello

This should *also* die; it shouldn't be an argument for why it should be valid as an identifier.

-doy

p5pRT commented 12 years ago

From @cpansprout

On Mon Jun 18 18:32:36 2012, tom christiansen wrote:

Please consider this from the standpoint of parsing strings of Perl characters. I’m not saying you don’t have a point, but it has nothing to do with UTF-8. (Also, don’t forget that utf8 [without a hyphen] is different from UTF-8 [with a hyphen].)

I am perfectly aware of the distinction. It is mere casuistry, and it is a security flaw. If I say "use utf8", then it is a security flaw to allow illegal UTF-8. Think non-shortest-forms, for example.

Are those a security issue because something scanning for certain characters in UTF-8 can miss them?

I was not suggesting allowing those. I was speaking specifically about logical Perl characters, disregarding how they might be encoded in this or that encoding.

There are very good reasons that UTF-8 forbids all that crap.

Mostly because UTF-8 limits itself to Unicode scalar values.

And there is no reason we should tolerate it. Do I really have to develop and publish a security exploit before this is fixed forever?

Yes. :-)

--

Father Chrysostomos

p5pRT commented 12 years ago

From @rjbs

* Ricardo Signes <perl.p5p@rjbs.manxome.org> [2012-06-18T18:53:45]

What characters can be a literal punctuation var? Possibly:

(\pP | \pS) & \P{Space}

I think there's room to wiggle here, but it doesn't include surrogates or leading marks.

Letting people program in their native languages is cool:

  my $größe = 'groß';

...but we didn't translate "my." We don't translate the rest of Perl. Really, we usually just mean we're going to allow non-ASCII identifiers.

All the punctuation variables are reserved for Perl. If we're not going to use any more, we don't need to address the question of what punctuation variables are allowed, and can avoid the entire question of $≠.

Is there some reason to plan not only for XIDS-friendly non-ASCII identifiers, but also non-ASCII non-XIDS punctuation identifiers?

-- rjbs

p5pRT commented 12 years ago

From @cpansprout

On Sun Jun 24 05:11:38 2012, perl.p5p@rjbs.manxome.org wrote:

Is there some reason to plan not only for XIDS-friendly non-ASCII identifiers, but also non-ASCII non-XIDS punctuation identifiers?

I don’t think so. And I was actually planning to bring this up myself. :-) What really bothered me was the thought of making an arbitrary choice about which >255 characters can and can’t be punct vars, that brought little benefit. If we are going to restrict it, restrict it to 0-255.

--

Father Chrysostomos

p5pRT commented 12 years ago

From tchrist@perl.com

"Father Chrysostomos via RT" <perlbug-followup@perl.org> wrote on Sun, 24 Jun 2012 06:16:24 PDT:

On Sun Jun 24 05:11:38 2012, perl.p5p@rjbs.manxome.org wrote:

Is there some reason to plan not only for XIDS-friendly non-ASCII identifiers, but also non-ASCII non-XIDS punctuation identifiers?

I don’t think so. And I was actually planning to bring this up myself. :-) What really bothered me was the thought of making an arbitrary choice about which >255 characters can and can’t be punct vars, that brought little benefit. If we are going to restrict it, restrict it to 0-255.

You mean 0-127.

--tom

p5pRT commented 12 years ago

From @cpansprout

On Sun Jun 24 06:18:23 2012, tom christiansen wrote:

"Father Chrysostomos via RT" <perlbug-followup@perl.org> wrote on Sun, 24 Jun 2012 06:16:24 PDT:

On Sun Jun 24 05:11:38 2012, perl.p5p@rjbs.manxome.org wrote:

Is there some reason to plan not only for XIDS-friendly non-ASCII identifiers, but also non-ASCII non-XIDS punctuation identifiers?

I don’t think so. And I was actually planning to bring this up myself. :-) What really bothered me was the thought of making an arbitrary choice about which >255 characters can and can’t be punct vars, that brought little benefit. If we are going to restrict it, restrict it to 0-255.

You mean 0-127.

No, really. :-)

Since $£ has always been allowed in Latin-1 scripts, why should ‘use utf8’ restrict the character set? I just have a hard time getting my head around the idea that ‘use utf8’ extends the range of id chars, but restricts the range of punct vars.

--

Father Chrysostomos

p5pRT commented 12 years ago

From tchrist@perl.com

I don’t think so. And I was actually planning to bring this up myself. :-) What really bothered me was the thought of making an arbitrary choice about which >255 characters can and can’t be punct vars, that brought little benefit. If we are going to restrict it, restrict it to 0-255.

You mean 0-127.

No, really. :-)

Since $£ has always been allowed in Latin-1 scripts, why should ‘use utf8’ restrict the character set? I just have a hard time getting my head around the idea that ‘use utf8’ extends the range of id chars, but restricts the range of punct vars.

I agree that I don't see why Latin1 should be privileged.

--tom

p5pRT commented 12 years ago

From @doy

On Sun, Jun 24, 2012 at 07:25:49AM -0600, Tom Christiansen wrote:

I don’t think so. And I was actually planning to bring this up myself. :-) What really bothered me was the thought of making an arbitrary choice about which >255 characters can and can’t be punct vars, that brought little benefit. If we are going to restrict it, restrict it to 0-255.

You mean 0-127.

No, really. :-)

Since $£ has always been allowed in Latin-1 scripts, why should ‘use utf8’ restrict the character set? I just have a hard time getting my head around the idea that ‘use utf8’ extends the range of id chars, but restricts the range of punct vars.

I agree that I don't see why Latin1 should be privileged.

Sure, but I'm also not very comfortable with the logic behind "well, we allowed £ to be used in the past when we didn't really have any consistent idea of what we were doing, therefore we now have to allow all Unicode punctuation". If disallowing $£ at this point would be too big of a backwards-compatibility break (which I'm not especially convinced that it would be), I don't think that there would be too much wrong with just saying "we allow Latin1 punctuation characters as a historical curiosity, we discourage their use, and we will not be extending this usage past the Latin1 character set".

Honestly, naming actual variables in actual code with punctuation just seems like a really questionable thing to want to do, and so I don't really see why we need to be encouraging it.

-doy

p5pRT commented 12 years ago

From @davidnicol

On Sun, Jun 24, 2012 at 7:10 AM, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

Letting people program in their native languages is cool:

  my $größe = 'groß';

...but we didn't translate "my." We don't translate the rest of Perl. Really, we usually just mean we're going to allow non-ASCII identifiers.

But we *could.* Allowing a formal localization overlay won't be that tough to do, and feels like the kind of thing that if called for could generate some grant applications from the people able to do it.

Such a system would keep all the English names (for portability) and also allow local aliases. It would be a configure-time option, causing something like -DPERL_L11N_OVERLAY=klingon to get appended everywhere during the compile, and including something##PERL_L11N_OVERLAY##.h would define a mess of other macros that the lexer would use to extend the tokenization process to recognize alternatives.

Put that in your pipe and smoke it, ECMA!

-- Run it up the flagpole and see who salutes it

p5pRT commented 12 years ago

From @cpansprout

On Mon Jun 25 16:36:12 2012, davidnicol@gmail.com wrote:

On Sun, Jun 24, 2012 at 7:10 AM, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

Letting people program in their native languages is cool:

  my $größe = 'groß';

...but we didn't translate "my." We don't translate the rest of Perl. Really, we usually just mean we're going to allow non-ASCII identifiers.

But we *could.* Allowing a formal localization overlay won't be that tough to do, and feels like the kind of thing that if called for could generate some grant applications from the people able to do it.

Such a system would keep all the English names (for portability) and also allow local aliases. It would be a configure-time option, causing something like -DPERL_L11N_OVERLAY=klingon to get appended everywhere during the compile, and including something##PERL_L11N_OVERLAY##.h would define a mess of other macros that the lexer would use to extend the tokenization process to recognize alternatives.

Put that in your pipe and smoke it, ECMA!

Making all keywords overridable might be the better solution. Then it can be prototyped on CPAN. :-)

--

Father Chrysostomos

p5pRT commented 12 years ago

From @cpansprout

On Mon Jun 25 16:50:31 2012, sprout wrote:

Making all keywords overridable might be the better solution. Then it can be prototyped on CPAN. :-)

Er, never mind. What you are suggesting does not involve existing keywords.

--

Father Chrysostomos

p5pRT commented 12 years ago

From @lizmat

On Jun 26, 2012, at 1:35 AM, David Nicol wrote:

On Sun, Jun 24, 2012 at 7:10 AM, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

Letting people program in their native languages is cool:

  my $größe = 'groß';

...but we didn't translate "my." We don't translate the rest of Perl. Really, we usually just mean we're going to allow non-ASCII identifiers.

But we *could.* Allowing a formal localization overlay won't be that tough to do, and feels like the kind of thing that if called for could generate some grant applications from the people able to do it.

Such a system would keep all the English names (for portability) and also allow local aliases. It would be a configure-time option, causing something like -DPERL_L11N_OVERLAY=klingon to get appended everywhere during the compile, and including something##PERL_L11N_OVERLAY##.h would define a mess of other macros that the lexer would use to extend the tokenization process to recognize alternatives.

Put that in your pipe and smoke it, ECMA!

As someone who has been exposed to similar attempts in other programming languages in the past: please don't. It's not worth it. The confusion of having differently named keywords is just way too much to handle. Look at the effort it already takes to maintain Perl documentation in Spanish.

It always seems like a nice idea. But in the end, it will cost just way too many tuits.

My 2c worth.

Liz

p5pRT commented 12 years ago

From @Hugmeir

On Wed, Jun 13, 2012 at 10:15 AM, Tom Christiansen <tchrist@perl.com> wrote:

Call me a Luddite\, but the following program is to my mind being very naughty -- which is the very *nicest* thing I can say about it.

use v5.16; use utf8;

$— = "EM DASH"; # gc=Punctuation say $—;

$ = "APPLE LOGO"; # Private Use Area say $;

$ÂŁ = "POUND STERLING"; # gc=Symbol say $ÂŁ;

$­ = "SOFT HYPHEN" ; # gc=Control say $­;

$ = "THIN SPACE"; # whitespace\, can you believe it!?!? say $ = "THIN SPACE";

$ďż˝ = "HYPER 0x11_1111"; # trans-Unicode say $ďż˝;

$ďż˝ = "SURROGATE DC00"; # this should never be possible say $ďż˝;

$̈̈ = "COMBINING DIARESIS"; say $̈̈ ;

$⃠ = "COMBINING ENCLOSING CIRCLE BACKSLASH"; say $⃠ ;

say "That’s all\, folks!";

Because it in fact compiles and runs. Messily\, yes\, but it runs. I don't understand why it even compiles.

    % ~/blead/perl -I ~/blead/lib /tmp/testu
    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 20.
    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 20.
    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 20.
    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 20.
    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 21.
    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 21.
    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed at /tmp/testu line 21.
    Code point 0x111111 is not Unicode, all \p{} matches fail; all \P{} matches succeed.
    EM DASH
    APPLE LOGO
    POUND STERLING
    SOFT HYPHEN
    THIN SPACE
    HYPER 0x11_1111
    SURROGATE DC00
    COMBINING DIARESIS
    COMBINING ENCLOSING CIRCLE BACKSLASH
    That’s all, folks!

% blead -v says that

This is perl 5\, version 17\, subversion 0 (v5.17.0-352-g3630f57) built for darwin-2level

I can handle punctuation. I can handle symbols.

I can even handle private use area.

I don't know what I think about hypers. Probably yes ok.

But I see no place for control characters or combining marks of any sort\, and I am really unhappy about whitespace variable names. What's next\, dollar tab? Beyond that\, I am *exceedingly* displeased with surrogates. That's just evil and wrong\, and in so many ways.

--tom

Lest there be any question\, here is a verbosely uniquoted version​:

     1
     2  use v5.16;
     3  use utf8;
     4
     5  $\N{EM DASH} = "EM DASH";  # gc=Punctuation
     6  say $\N{EM DASH};
     7
     8  $\N{U+F8FF} = "APPLE LOGO";  # Private Use Area
     9  say $\N{U+F8FF};
    10
    11  $\N{POUND SIGN} = "POUND STERLING"; # gc=Symbol
    12  say $\N{POUND SIGN};
    13
    14  $\N{SOFT HYPHEN} = "SOFT HYPHEN" ; # gc=Control
    15  say $\N{SOFT HYPHEN};
    16
    17  $\N{THIN SPACE} = "THIN SPACE"; # whitespace, can you believe it!?!?
    18  say $\N{THIN SPACE} = "THIN SPACE";
    19
    20  $\N{U+111111} = "HYPER 0x11_1111"; # trans-Unicode
    21  say $\N{U+111111};
    22
    23  $\N{U+DC00} = "SURROGATE DC00"; # this should never be possible
    24  say $\N{U+DC00};
    25
    26  $\N{COMBINING DIAERESIS}\N{COMBINING DIAERESIS} = "COMBINING DIARESIS";
    27  say $\N{COMBINING DIAERESIS}\N{COMBINING DIAERESIS} ;
    28
    29  $\N{COMBINING ENCLOSING CIRCLE BACKSLASH} = "COMBINING ENCLOSING CIRCLE BACKSLASH";
    30  say $\N{COMBINING ENCLOSING CIRCLE BACKSLASH} ;
    31
    32  say "That\N{RIGHT SINGLE QUOTATION MARK}s all, folks!";

https://github.com/Hugmeir/utf8mess/tree/restrict_variable_names

So, I've taken a few liberties implementing this. Here's the executive summary of the branch: Length-one variables must match (?: (?=Word) [\p{XIDS}_] | [\p{POSIX_Punct}\p{POSIX_Digit}\p{POSIX_Cntrl}] ). This holds regardless of whether 'use utf8;' is in effect, so $£ is now always illegal, though expanding this to use some broader definition of punctuation/controls should be simple; it's just changing one macro.

And as mentioned before, valid characters in an identifier no longer vary depending on 'use utf8', except for the obvious restriction that under 'no utf8;' the characters belong solely to the Latin-1 range.

pod/perldata.pod has a section streamlining the rules. As a side effect, 'no utf8; use strict; $à' now has to declare $à with my(), as it well should.

The branch also fixes a bug in word and identifier parsing, where ASCII alphanumerics would be eaten up without checking if the next character matched \p{XIDC}. This led to qq\N{MIDDLE DOT} test \N{MIDDLE DOT} working in previous versions, but MIDDLE DOT is an XIDC character, so now that's parsed as bareword( qq\N{MIDDLE DOT} ), bareword( test ), ???? XIDC character on its own, syntax error. To get the previous behavior, you need a space before the delimiter, which is consistent with how 'q mfoom' works.

Internally, three things might be sorta icky and really need someone to look them over. First, I changed the definition of isIDFIRST_lazy_if and isALNUM_lazy_if to use isIDFIRST_L1(*s) and (isALNUMC_L1(*s) || *s == '_'), respectively, if we aren't under UTF mode. Second, to fix the "ascii letters being consumed too early" bug above, I had to turn around how scan_ident and scan_word work, by putting the UTF case first. This probably leads to some slowdowns. Third, I've changed several spots from using isALNUM_lazy_if to isIDFIRST_lazy_if -- This made sense to me at the time, but an extra pair of eyes would be welcome.
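
As a rough illustration of the length-one rule summarized in this message, the pattern can be tried against a few of the characters from the original report. This is only a sketch: the helper name `is_length_one_name` is hypothetical, the bare `(?=Word)` is read as `(?=\p{Word})`, and `\p{PosixPunct}`, `\p{PosixDigit}` and `\p{PosixCntrl}` are assumed to be the intended POSIX classes. It says nothing about how the branch implements the check in toke.c.

```perl
use strict;
use warnings;

# Length-one name rule as summarized in the message above.
# Assumption: (?=Word) stands for (?=\p{Word}).
my $LENGTH_ONE = qr/
    \A (?:
        (?=\p{Word}) [\p{XIDS}_]
      | [\p{PosixPunct}\p{PosixDigit}\p{PosixCntrl}]
    ) \z
/x;

# Hypothetical helper: true if $char on its own would be an accepted name.
sub is_length_one_name {
    my ($char) = @_;
    return $char =~ $LENGTH_ONE ? 1 : 0;
}

my @samples = (
    [ 'LATIN SMALL LETTER A'          => 'a' ],
    [ 'LOW LINE'                      => '_' ],
    [ 'EXCLAMATION MARK'              => '!' ],
    [ 'DIGIT ZERO'                    => '0' ],
    [ 'LATIN SMALL LETTER A W/ GRAVE' => "\x{E0}" ],
    [ 'POUND SIGN'                    => "\x{A3}" ],
    [ 'SOFT HYPHEN'                   => "\x{AD}" ],
    [ 'EM DASH'                       => "\x{2014}" ],
    [ 'THIN SPACE'                    => "\x{2009}" ],
    [ 'COMBINING DIAERESIS'           => "\x{308}" ],
);

for my $sample (@samples) {
    my ($label, $char) = @$sample;
    printf "%-30s %s\n", $label,
        is_length_one_name($char) ? 'accepted' : 'rejected';
}
```

Under this reading, the ASCII punctuation, digit, and control classes keep the traditional special variables, while the non-ASCII punctuation, whitespace, and combining marks from the report are rejected.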

p5pRT commented 12 years ago

From @Hugmeir

On Tue Jun 26 15​:18​:01 2012\, Hugmeir wrote​:

https://github.com/Hugmeir/utf8mess/tree/restrict_variable_names

...and I've just pushed a new version\, which I think deals with the last of the outstanding bugs in identifier parsing. Would be terrific if someone with more C experience than me (or\, let me rephrase\, someone with C experience) could take a look at toke.c's S_parse_ident()\, which the last commit introduces; It's basically the pseudo-repeated code that was in scan_ident\, but with 400% more me fumbling with C.

--hugmeir

p5pRT commented 12 years ago

From @cpansprout

On Tue Jun 26 15​:18​:01 2012\, Hugmeir wrote​:

https://github.com/Hugmeir/utf8mess/tree/restrict_variable_names

So, I've taken a few liberties implementing this. Here's the executive summary of the branch: Length-one variables must match (?: (?=Word) [\p{XIDS}_] | [\p{POSIX_Punct}\p{POSIX_Digit}\p{POSIX_Cntrl}] ). This holds regardless of whether 'use utf8;' is in effect, so $£ is now always illegal, though expanding this to use some broader definition of punctuation/controls should be simple; it's just changing one macro.

Did you see the last few messages in this thread? I think we should be restricting it to the Latin-1 range\, allowing all 0-255 characters as punct vars\, including \xad\, unless they are id chars.

And like mentioned before\, valid characters in an identifier no longer vary depending on 'use utf8'\, except for the obvious restriction that under 'no utf8;' the characters belong solely to the Latin-1 range.

I am not sure that is such a good idea. Formerly, any Latin-1 characters could be used as pyoq delimiters. There are many old scripts still in use that have never needed to be rewritten.

‘use utf8’ is a pragma after all. :-)

pod/perldata.pod has a section streamlining the rules. As a side effect, 'no utf8; use strict; $à' now has to declare $à with my(), as it well should.

I am the backward compatibility police\, so I disagree. :-)

The branch also fixes a bug in word and identifier parsing\, where ASCII alphanumerics would be eaten up without checking if the next character matched \p{XIDC}. This lead to qq\N{MIDDLE DOT} test \N{MIDDLE DOT} to work in previous versions\, but MIDDLE DOT is an XIDC character\, so now that's parsed as bareword( qq\N{MIDDLE DOT} )\, bareword( test )\, ???? XIDC character on it's own\, syntax error. To get the previous behavior\, you need a space before the delimiter\, which is consistent with how 'q mfoom' works.

As I mentioned some time before\, changing the rules for what is an identifier is not a backward-compatible change. If we want to make Perl’s syntax conform more closely with Unicode recommendations\, we should do it all at once (ids\, whitespace\, and Pattern_Syntax for delimiters) with a single feature feature.

Someone may mention (and someone has mentioned) that Perl isn’t ‘doing it right’. But Perl has been ‘doing it’ since before the current definition of ‘right’ existed.

Internally\, three things might be sorta icky and really need someone to look them over; First\, I changed the definition of isIDFIRST_lazy_if and isALNUM_lazy_if to use isIDFIRST_L1(*s) and (isALNUMC_L1(*s) || *s == '_')\, respectively\, if we aren't under UTF mode. Second\, to fix the "ascii letters being consumed too early" bug above\, I had to turn around how scan_ident and scan_word work\, by putting the UTF case first. This probably leads to some slowdowns. Third\, I've changed several spots from using isALNUM_lazy_if to isIDFIRST_lazy_if -- This made sense to me at the time\, but an extra pair of eyes would be welcome.

Well\, I haven’t read your patch because I disagree with it in principle.

--

Father Chrysostomos
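
For comparison, the alternative suggested in this reply (any character in the 0-255 range may be a punctuation variable unless it is an identifier character) might look like the sketch below. The helper name is hypothetical, and "id chars" is assumed here to mean anything matching XID_Start or XID_Continue; the thread does not pin that definition down.

```perl
use strict;
use warnings;

# Hypothetical helper for the Latin-1 proposal: accept any single character
# in the range 0-255 that is not an identifier character.
# Assumption: "id chars" means XID_Start or XID_Continue.
sub is_latin1_punct_var_char {
    my ($char) = @_;
    return 0 if length($char) != 1 || ord($char) > 0xFF;
    return $char =~ /\A[\p{XIDS}\p{XIDC}]\z/ ? 0 : 1;
}

for my $char ('!', "\x{A3}", "\x{AD}", 'a', "\x{E0}", "\x{2014}") {
    printf "U+%04X %s\n", ord($char),
        is_latin1_punct_var_char($char) ? 'punct var' : 'not a punct var';
}
```

The difference from the branch's rule is confined to the 128-255 range: POUND SIGN and SOFT HYPHEN are accepted here, while letters such as U+00E0 remain identifier characters.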

p5pRT commented 12 years ago

From tchrist@perl.com

I really need to read through Brian's patch. My apologies that I have not yet done so.

--tom

p5pRT commented 12 years ago

From @doy

On Sun\, Jul 01\, 2012 at 02​:03​:30PM -0700\, Father Chrysostomos via RT wrote​:

On Tue Jun 26 15​:18​:01 2012\, Hugmeir wrote​:

https://github.com/Hugmeir/utf8mess/tree/restrict_variable_names

So, I've taken a few liberties implementing this. Here's the executive summary of the branch: Length-one variables must match (?: (?=Word) [\p{XIDS}_] | [\p{POSIX_Punct}\p{POSIX_Digit}\p{POSIX_Cntrl}] ). This holds regardless of whether 'use utf8;' is in effect, so $£ is now always illegal, though expanding this to use some broader definition of punctuation/controls should be simple; it's just changing one macro.

Did you see the last few messages in this thread? I think we should be restricting it to the Latin-1 range\, allowing all 0-255 characters as punct vars\, including \xad\, unless they are id chars.

I think this is probably a reasonable thing to do\, in the name of backwards compatibility. It doesn't add a huge amount of complexity\, and can be explained pretty easily by "historical reasons".

And like mentioned before\, valid characters in an identifier no longer vary depending on 'use utf8'\, except for the obvious restriction that under 'no utf8;' the characters belong solely to the Latin-1 range.

I am not sure that is such a good idea. Formerly, any Latin-1 characters could be used as pyoq delimiters. There are many old scripts still in use that have never needed to be rewritten.

‘use utf8’ is a pragma after all. :-)

I don't understand how this is related - can you give an example of a specific issue you're concerned about?

pod/perldata.pod has a section streamlining the rules. As a side effect, 'no utf8; use strict; $à' now has to declare $à with my(), as it well should.

I am the backward compatibility police\, so I disagree. :-)

This is clearly a bugfix.

The branch also fixes a bug in word and identifier parsing\, where ASCII alphanumerics would be eaten up without checking if the next character matched \p{XIDC}. This lead to qq\N{MIDDLE DOT} test \N{MIDDLE DOT} to work in previous versions\, but MIDDLE DOT is an XIDC character\, so now that's parsed as bareword( qq\N{MIDDLE DOT} )\, bareword( test )\, ???? XIDC character on it's own\, syntax error. To get the previous behavior\, you need a space before the delimiter\, which is consistent with how 'q mfoom' works.

As I mentioned some time before\, changing the rules for what is an identifier is not a backward-compatible change. If we want to make Perl’s syntax conform more closely with Unicode recommendations\, we should do it all at once (ids\, whitespace\, and Pattern_Syntax for delimiters) with a single feature feature.

Someone may mention (and someone has mentioned) that Perl isn’t ‘doing it right’. But Perl has been ‘doing it’ since before the current definition of ‘right’ existed.

We have changed the rules for what is an identifier several times in the past couple releases. I don't see why this case in particular is different.

-doy

p5pRT commented 12 years ago

From @cpansprout

On Sun Jul 01 14​:23​:36 2012\, doy@​tozt.net wrote​:

On Sun\, Jul 01\, 2012 at 02​:03​:30PM -0700\, Father Chrysostomos via RT wrote​:

On Tue Jun 26 15​:18​:01 2012\, Hugmeir wrote​:

https://github.com/Hugmeir/utf8mess/tree/restrict_variable_names

So, I've taken a few liberties implementing this. Here's the executive summary of the branch: Length-one variables must match (?: (?=Word) [\p{XIDS}_] | [\p{POSIX_Punct}\p{POSIX_Digit}\p{POSIX_Cntrl}] ). This holds regardless of whether 'use utf8;' is in effect, so $£ is now always illegal, though expanding this to use some broader definition of punctuation/controls should be simple; it's just changing one macro.

Did you see the last few messages in this thread? I think we should be restricting it to the Latin-1 range\, allowing all 0-255 characters as punct vars\, including \xad\, unless they are id chars.

I think this is probably a reasonable thing to do\, in the name of backwards compatibility. It doesn't add a huge amount of complexity\, and can be explained pretty easily by "historical reasons".

And like mentioned before\, valid characters in an identifier no longer vary depending on 'use utf8'\, except for the obvious restriction that under 'no utf8;' the characters belong solely to the Latin-1 range.

I am not sure that is such a good idea. Formerly, any Latin-1 characters could be used as pyoq delimiters. There are many old scripts still in use that have never needed to be rewritten.

‘use utf8’ is a pragma after all. :-)

I don't understand how this is related - can you give an example of a specific issue you're concerned about?

qÿfooÿ. Of course\, no one would write it like that. But if the script is in another ASCII-based encoding (without perl’s knowledge; perl doesn’t always have to know) it might be a punctuation character.

This is another issue I’m concerned about​:

As a side effect, 'no utf8; use strict; $à' now has to declare $à with my(), as it well should.

I am the backward compatibility police\, so I disagree. :-)

This is clearly a bugfix.

I’m not sure it’s so clear. Old scripts written before utf8 support that do not themselves deal with text should continue to work seamlessly. Syntax changes like this should be kept inside a pragma.

The branch also fixes a bug in word and identifier parsing\, where ASCII alphanumerics would be eaten up without checking if the next character matched \p{XIDC}. This lead to qq\N{MIDDLE DOT} test \N{MIDDLE DOT} to work in previous versions\, but MIDDLE DOT is an XIDC character\, so now that's parsed as bareword( qq\N{MIDDLE DOT} )\, bareword( test )\, ???? XIDC character on it's own\, syntax error. To get the previous behavior\, you need a space before the delimiter\, which is consistent with how 'q mfoom' works.

As I mentioned some time before\, changing the rules for what is an identifier is not a backward-compatible change. If we want to make Perl’s syntax conform more closely with Unicode recommendations\, we should do it all at once (ids\, whitespace\, and Pattern_Syntax for delimiters) with a single feature feature.

Someone may mention (and someone has mentioned) that Perl isn’t ‘doing it right’. But Perl has been ‘doing it’ since before the current definition of ‘right’ existed.

We have changed the rules for what is an identifier several times in the past couple releases.

Once\, as far as I can remember\, and I complained at the time.

I don't see why this case in particular is different.

At the time\, no tests broke. This one breaks something explicitly tested for.

--

Father Chrysostomos

p5pRT commented 12 years ago

From @Hugmeir

On Sun, Jul 1, 2012 at 2:03 PM, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:

On Tue Jun 26 15​:18​:01 2012\, Hugmeir wrote​:

https://github.com/Hugmeir/utf8mess/tree/restrict_variable_names

So, I've taken a few liberties implementing this. Here's the executive summary of the branch: Length-one variables must match (?: (?=Word) [\p{XIDS}_] | [\p{POSIX_Punct}\p{POSIX_Digit}\p{POSIX_Cntrl}] ). This holds regardless of whether 'use utf8;' is in effect, so $£ is now always illegal, though expanding this to use some broader definition of punctuation/controls should be simple; it's just changing one macro.

Did you see the last few messages in this thread? I think we should be restricting it to the Latin-1 range\, allowing all 0-255 characters as punct vars\, including \xad\, unless they are id chars.

I did. I apologize\, I didn't mean to ignore you\, but we've gone through at least four iterations of this same thread over the last year\, and we always get stuck deciding on a sane set of rules for identifiers; then the thread dies and a stable release goes out\, with the worst possible scenario implemented. So I just wanted to get a patch out; ironing out the rules can happen on top of an implementation\, rather than in the vacuum.

You are saying that we should allow 0-255 for punctuation/length-one variables, that I understand, but what about your normal, run of the mill identifiers under 'no utf8'? Should $asd£asd be a legal variable, or _à_ a legal bareword?

The rules in the branch are\, more or less​:

    qr/
      (?(DEFINE)
        (?\
          (?&sigil)
          (?:
              (?&specials)
            | (?&ident)
            | (?&bracket_ident)
          )
        )
        (?<ident>          (?: :: )* (?&normal_ident) (?: (?: :: | ' ) (?&ident) )* (?: :: )* )
        (?<bracket_ident>  \{ \s* (?&ident) \s* \} )
        (?<normal_ident>   (?: (?&Perl_XIDS) \p{XIDC}* )+ )
        (?<specials>       (?: (?&length_one) | (?&other_specials) ) (?!::) )
        (?<length_one>     (?&Perl_XIDS) | [\p{POSIX_Digit}\p{POSIX_Cntrl}] )
        (?<other_specials> \^[A-Z] | \{ \^ [A-Z][A-Z_]+ \} | [0-9]+ )
        (?<Perl_XIDS>      (?=\p{Word}) [\p{XIDS}_] )
        (?<sigil>          [&%*\@\$] )
      )
    /x;

I tried to write a similar regex for 5.16\, but there's honestly so many edge cases that I've never mustered the motivation to finish it.

Basing myself on that\, here's the modifications that I think you're going for​:

    (?<normal_ident>
        (?(?{UTF})
            (?:(?&Perl_XIDS) \p{XIDC}*)+
          | (?aa) [\w\x80-\xff]+
        )
    )
    (?<length_one> (?&Perl_XIDS) | [\p{POSIX_Digit}\p{POSIX_Cntrl}] | [\x80-\xff] )

Although I'm a bit confused about whether you want the change to the normal_ident case. I suppose not, since that's not backwards compatible. But then I have to ask: What SHOULD be the rules for "normal" identifiers under no utf8? DO those apply the same to variables and barewords?

Uhm\, in fact\, if you have the time/motivation\, if you could modify (or rewrite from scratch\, whatever works for you) the regex above to show what you mean\, that would be extremely helpful to me.
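
To make the branch grammar quoted earlier in this message easier to experiment with, here is a runnable sketch of it. Several assumptions apply: the name of the top-level rule did not survive in the quoted text, so `variable` is a hypothetical stand-in; `\p{PosixDigit}` and `\p{PosixCntrl}` are assumed for the POSIX classes; and `is_variable` is an illustrative helper, not part of the branch.

```perl
use strict;
use warnings;

# Sketch of the quoted (?(DEFINE) ...) grammar. "variable" is a hypothetical
# name for the top-level rule; the other rule names follow the quoted text.
my $GRAMMAR = qr/
  (?(DEFINE)
    (?<variable>       (?&sigil) (?: (?&specials) | (?&ident) | (?&bracket_ident) ) )
    (?<ident>          (?: :: )* (?&normal_ident) (?: (?: :: | ' ) (?&ident) )* (?: :: )* )
    (?<bracket_ident>  \{ \s* (?&ident) \s* \} )
    (?<normal_ident>   (?: (?&Perl_XIDS) \p{XIDC}* )+ )
    (?<specials>       (?: (?&length_one) | (?&other_specials) ) (?! :: ) )
    (?<length_one>     (?&Perl_XIDS) | [\p{PosixDigit}\p{PosixCntrl}] )
    (?<other_specials> \^[A-Z] | \{ \^ [A-Z][A-Z_]+ \} | [0-9]+ )
    (?<Perl_XIDS>      (?=\p{Word}) [\p{XIDS}_] )
    (?<sigil>          [&%*\@\$] )
  )
/x;

# Hypothetical helper: does the whole string parse as one variable?
sub is_variable {
    my ($text) = @_;
    return $text =~ /\A (?&variable) \z $GRAMMAR/x ? 1 : 0;
}

for my $text ('$foo', '$Foo::Bar', '${foo}', '$0', '$^W', '${^WARNING_BITS}',
              '@array', '$!', '$1foo', "\$\x{2014}") {
    my $shown = join '', map { ord($_) > 0x7E ? sprintf('\x{%X}', ord $_) : $_ }
                         split //, $text;
    printf "%-20s %s\n", $shown, is_variable($text) ? 'matches' : 'no match';
}
```

One thing this makes visible: as quoted, `length_one` has no POSIX_Punct alternative, so `$!` does not match, even though the earlier prose summary of the branch includes `\p{POSIX_Punct}`. That is consistent with the grammar being described as "more or less" the rules.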

And like mentioned before\, valid characters in an identifier no longer vary depending on 'use utf8'\, except for the obvious restriction that under 'no utf8;' the characters belong solely to the Latin-1 range.

I am not sure that is such a good idea. Formerly, any Latin-1 characters could be used as pyoq delimiters. There are many old scripts still in use that have never needed to be rewritten.

‘use utf8’ is a pragma after all. :-)

pyoq?

This might be less of a problem than it may appear to be: it only affects latin1 characters that are also XIDS characters. There's 65 of those, yes, but they all also match \w (well, you know, sometimes :D), which is how we've always offhandedly defined identifiers! So those delimiters were buggy by our own definition (again, sometimes). And on a personal note, they do strike me as unusual choices for delimiters: qqÇthings like thisÇ. I'll grant you that the prospect of rewriting scripts -- and I have several that use qqàà, because they could -- isn't exactly fun, but the cost to make them work in every version of perl, present and future, is just a space, which seems entirely reasonable to me.
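
The "just a space" workaround is the same one ASCII word characters have always needed as delimiters. A small sketch using the thread's own 'q mfoom' example (the variable names are illustrative):

```perl
use strict;
use warnings;

# With whitespace after the operator, a word character can be the delimiter.
my $spaced = q mfoom;

# Without the space, "qmfoom" is lexed as a single bareword. Under strict
# that is a compile-time error, so it is tried inside a string eval.
my $unspaced_ok = eval q{ my $value = qmfoom; 1 } ? 1 : 0;

print "spaced:   '$spaced'\n";
print "unspaced: ", ($unspaced_ok ? "compiled" : "failed to compile"), "\n";
```

The proposal in this message amounts to giving Latin-1 letters the same treatment that `m` gets here.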

pod/perldata.pod has a section streamlining the rules. As a side effect, 'no utf8; use strict; $à' now has to declare $à with my(), as it well should.

I am the backward compatibility police\, so I disagree. :-)

Well, why? The current situation is that $y dies under strict, and so does $ÿ.. if you're under use utf8. Oh, except that in 5.16, because of totally unrelated changes, one XIDS character in the latin1 range started dying too.

For future reference\, the isIDFIRST_lazy() in gv_fetchpvn_flags is what caused this. The branch fixed it by turning that into isIDFIRST_lazy_if()\, and then by redefining isIDFIRST_lazy_if to use isIDFIRST_L1 if not under use utf8. So to get the behavior that Father C wants\, that's the conditional that needs changing.

The branch also fixes a bug in word and identifier parsing\, where ASCII alphanumerics would be eaten up without checking if the next character matched \p{XIDC}. This lead to qq\N{MIDDLE DOT} test \N{MIDDLE DOT} to work in previous versions\, but MIDDLE DOT is an XIDC character\, so now that's parsed as bareword( qq\N{MIDDLE DOT} )\, bareword( test )\, ???? XIDC character on it's own\, syntax error. To get the previous behavior\, you need a space before the delimiter\, which is consistent with how 'q mfoom' works.

As I mentioned some time before\, changing the rules for what is an identifier is not a backward-compatible change. If we want to make Perl’s syntax conform more closely with Unicode recommendations\, we should do it all at once (ids\, whitespace\, and Pattern_Syntax for delimiters) with a single feature feature.

But the patch doesn't change any rules\, only enforces what we've been saying for years were the rules. perlvar and perldata both say alphanumerics & letters. That is exactly what is being enforced by the patch.

Someone may mention (and someone has mentioned) that Perl isn’t ‘doing it right’. But Perl has been ‘doing it’ since before the current definition of ‘right’ existed.

Sure\, but Perl hasn't been doing it right by our own definition of right either. It's a big mess.

Internally\, three things might be sorta icky and really need someone to look them over; First\, I changed the definition of isIDFIRST_lazy_if and isALNUM_lazy_if to use isIDFIRST_L1(*s) and (isALNUMC_L1(*s) || *s == '_')\, respectively\, if we aren't under UTF mode. Second\, to fix the "ascii letters being consumed too early" bug above\, I had to turn around how scan_ident and scan_word work\, by putting the UTF case first. This probably leads to some slowdowns. Third\, I've changed several spots from using isALNUM_lazy_if to isIDFIRST_lazy_if -- This made sense to me at the time\, but an extra pair of eyes would be welcome.

Well\, I haven’t read your patch because I disagree with it in principle.

--

Father Chrysostomos

p5pRT commented 12 years ago

From @doy

On Sun\, Jul 01\, 2012 at 08​:01​:57PM -0300\, Brian Fraser wrote​:

pyoq?

"pick your own quote"

-doy

p5pRT commented 12 years ago

From tchrist@perl.com

"Brian Fraser via RT" <perlbug-followup@perl.org> wrote on Sun, 01 Jul 2012 16:02:46 PDT:

Sure\, but Perl hasn't been doing it right by our own definition of right either. It's a big mess.

This\, I think\, is the take-home message from all this. It is ridiculously difficult to describe the spec for a Perl variable by name. It shouldn't be.

--tom

p5pRT commented 12 years ago

From @jplinderman

---------- Forwarded message ----------
From: Tom Christiansen <tchrist@perl.com>
To: perlbug-followup@perl.org
Cc:
Date: Mon, 02 Jul 2012 08:11:42 -0600
Subject: Re: [perl #113620] highly illegal variable names are now accidentally legal

"Brian Fraser via RT" <perlbug-followup@perl.org> wrote on Sun, 01 Jul 2012 16:02:46 PDT:

Sure\, but Perl hasn't been doing it right by our own definition of right either. It's a big mess.

This\, I think\, is the take-home message from all this. It is ridiculously difficult to describe the spec for a Perl variable by name. It shouldn't be.

--tom

Probably a sign of the end-times when Tom and I agree to agree :-)\, but the same is also true of Perl numbers​:

http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2005-01/msg00434.html

Chip's follow-up

http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2005-01/msg00468.html

is one of my all-time favorites. But it's nothing to be terribly proud of. -- jpl

p5pRT commented 12 years ago

From @khwilliamson

On 07/02/2012 08​:11 AM\, Tom Christiansen wrote​:

"Brian Fraser via RT" <perlbug-followup@perl.org> wrote on Sun, 01 Jul 2012 16:02:46 PDT:

Sure\, but Perl hasn't been doing it right by our own definition of right either. It's a big mess.

This\, I think\, is the take-home message from all this. It is ridiculously difficult to describe the spec for a Perl variable by name. It shouldn't be.

--tom

It seems to me\, therefore\, that "The Way Forward"(Tm) is to come to some consensus about what should be the semantics\, and then implement that\, modified by any appropriate deprecations.

I'd like to hear some concrete proposals from the simplify folks.

p5pRT commented 12 years ago

From @doy

On Mon\, Jul 02\, 2012 at 08​:56​:59PM -0600\, Karl Williamson wrote​:

On 07/02/2012 08​:11 AM\, Tom Christiansen wrote​:

"Brian Fraser via RT" <perlbug-followup@perl.org> wrote on Sun, 01 Jul 2012 16:02:46 PDT:

Sure\, but Perl hasn't been doing it right by our own definition of right either. It's a big mess.

This\, I think\, is the take-home message from all this. It is ridiculously difficult to describe the spec for a Perl variable by name. It shouldn't be.

--tom

It seems to me\, therefore\, that "The Way Forward"(Tm) is to come to some consensus about what should be the semantics\, and then implement that\, modified by any appropriate deprecations.

I'd like to hear some concrete proposals from the simplify folks.

I think that Brian's proposal (possibly modified by the Latin1 range changes for backcompat) is quite concrete (there's code for it!) and quite simple.

-doy

p5pRT commented 12 years ago

From @nwc10

On Sun\, Jul 01\, 2012 at 08​:01​:57PM -0300\, Brian Fraser wrote​:

On Sun, Jul 1, 2012 at 2:03 PM, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:

On Tue Jun 26 15​:18​:01 2012\, Hugmeir wrote​:

https://github.com/Hugmeir/utf8mess/tree/restrict_variable_names

So, I've taken a few liberties implementing this. Here's the executive summary of the branch: Length-one variables must match (?: (?=Word) [\p{XIDS}_] | [\p{POSIX_Punct}\p{POSIX_Digit}\p{POSIX_Cntrl}] ). This holds regardless of whether 'use utf8;' is in effect, so $£ is now always illegal, though expanding this to use some broader definition of punctuation/controls should be simple; it's just changing one macro.

Did you see the last few messages in this thread? I think we should be restricting it to the Latin-1 range\, allowing all 0-255 characters as punct vars\, including \xad\, unless they are id chars.

I did. I apologize\, I didn't mean to ignore you\, but we've gone through at least four iterations of this same thread over the last year\, and we always get stuck deciding on a sane set of rules for identifiers; then the thread dies and a stable release goes out\, with the worst possible scenario implemented. So I just wanted to get a patch out; ironing out the rules can happen on top of an implementation\, rather than in the vacuum.

You are saying that we should allow 0-255 for punctuation/length-one variables, that I understand, but what about your normal, run of the mill identifiers under 'no utf8'? Should $asd£asd be a legal variable, or _à_ a legal bareword?

As far as I can work out from some superficial experimentation\, 5.005_03 and 5.6.1 (not under utf8) both *only* allow octets in the range 128-255 to be used in source code as punctuation character variables (Effectively\, punctuation variables)\, in comments\, in string constants\, and as the single character delimiter for quoting operators.

5.6.0 expanded the "rules" on punctuation variables to permit multi-character punctuation variables. From experimentation\, it seems that

Seems to be the same rules for (some subset of) octets 1-31 and 127. 0 doesn't seem to work reliably.

Everywhere else they are not legal. Which means no barewords\, and not in any part of multi-character variables. Effectively\, (some subset of) code points 1-31 and 127-255 (outside of utf8) are all treated consistently\, as non-printing control characters.

And\, unless and until deprecated\, I think this should be how behaviour stays when outside use utf8;

I also think that it would be sane to deprecate the use of (at least) literal code points 128-255 in the source for punctuation variables and delimiters (outside of use utf8) in the scope of use 5.018; and later.

Although I'm a bit confused about whether you want the change to the normal_ident case. I suppose not, since that's not backwards compatible. But then I have to ask: What SHOULD be the rules for "normal" identifiers under no utf8? DO those apply the same to variables and barewords?

I think that the rules (outside of use utf8) historically always were that the octets were all treated consistently as if they were an 8 bit clean superset of ASCII\, as for the "POSIX" (or "C") locale\, except that all characters in the range 128-255 were treated as controls\, rather than unknowns.

So everything in the range 128-255 was treated equivalently\, whatever interpretation ISO-8859-1 gives to it\, and equivalent to character 127 (and the subset of 0-31).

But the patch doesn't change any rules\, only enforces what we've been saying for years were the rules. perlvar and perldata both say alphanumerics & letters. That is exactly what is being enforced by the patch.

Right\, but does the documentation *also* say that your source code is treated as ISO-8859-1 if 'use utf8' isn't mentioned?

Because historically it *wasn't*. It was treated as ASCII + 128 more controls.

Someone may mention (and someone has mentioned) that Perl isn't 'doing it right'. But Perl has been 'doing it' since before the current definition of 'right' existed.

Sure\, but Perl hasn't been doing it right by our own definition of right either. It's a big mess.

Yes\, which dates back to a mistaken assumption when Unicode support was first added - a confusion between ASCII and ISO-8859-1\, and that the perl interpreter was treating input as ISO-8859-1.

Which we are still suffering from.

Nicholas Clark
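
The two readings of a high octet contrasted in this message can be put side by side in a short sketch. Both helper names are hypothetical, and the "historical" model is deliberately simplified to "ASCII word characters are identifier characters, everything else is not"; it does not try to capture which control octets were actually accepted as punctuation variables.

```perl
use strict;
use warnings;

# Model 1 (simplified from the message above): source is ASCII plus 128
# control-like octets, so only ASCII word characters are identifier chars.
sub classify_historical {
    my ($octet) = @_;
    return $octet =~ /\A[A-Za-z0-9_]\z/ ? 'identifier' : 'non-identifier';
}

# Model 2: the same octet read as an ISO-8859-1 character and tested
# against XID_Start / XID_Continue.
sub classify_latin1 {
    my ($octet) = @_;
    return $octet =~ /\A[\p{XIDS}\p{XIDC}]\z/ ? 'identifier' : 'non-identifier';
}

for my $octet ('a', "\xA3", "\xE0") {
    printf "0x%02X  historical: %-14s  latin1: %s\n",
        ord($octet), classify_historical($octet), classify_latin1($octet);
}
```

The octet 0xE0 is where the two models disagree, which is the case behind the `$à` and Latin-1 delimiter arguments earlier in the thread.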

p5pRT commented 12 years ago

From @rjbs

* Brian Fraser <fraserbn@gmail.com> [2012-07-01T19:01:57]

The rules in the branch are\, more or less​:

First off\, thanks very much for doing work on this. I'm interested in seeing it through.

    (?<normal_ident>    (?: (?&Perl_XIDS) \p{XIDC}* )+                          )
    (?<specials>        (?: (?&length_one) | (?&other_specials) ) (?!::)         )
    (?<length_one>      (?&Perl_XIDS) | [\p{POSIX_Digit}\p{POSIX_Cntrl}]         )
    (?<other_specials>  \^[A-Z] | \{ \^ [A-Z][A-Z_]+ \} | [0-9]+                 )
    (?<Perl_XIDS>       (?=\p{Word}) [\p{XIDS}_]                                 )
    (?<sigil>           [&%*\@\$]                                                )

I'm just a bit confused by the definition of Perl_XIDS, here.

Surely, Perl adds _ to its XIDS. So, have I forgotten how regexes work? It's a definite possibility. I read your Perl_XIDS as "any XIDS character, plus LOW LINE, but they also have to be Word characters."

So, what are you excluding with that lookahead?

    $ unichars '\P{Word}' '\p{XIDS}'
    ℘ U+2118 WEIERSTRASS ELLIPTIC FUNCTION
    ℮ U+212E ESTIMATED SYMBOL

Can you set me straight, if I'm confused? Is the idea to remain consistent with the longstanding (at least in my mind) notion that \w characters are the ones you can use in an identifier?
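For readers without the unichars tool, the same question can be asked of perl directly. The following is only a rough sketch: the helper name `xids_not_word` is made up for illustration, and the exact list it prints depends on the Unicode version bundled with the perl that runs it, so it may contain more than the two characters shown above.

```perl
use strict;
use warnings;

# Collect every code point that is XID_Start but not a \w (Word) character.
# Surrogates are skipped so that older perls do not warn about them.
sub xids_not_word {
    my @found;
    for my $cp (0 .. 0x10FFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;
        my $ch = chr $cp;
        push @found, $cp if $ch =~ /\p{XIDS}/ && $ch !~ /\p{Word}/;
    }
    return @found;
}

# Print the code points only, which keeps the output plain ASCII.
printf "U+%04X\n", $_ for xids_not_word();
```

The loop visits every code point, so it takes a moment to run; the lookahead in Perl_XIDS excludes exactly the characters this prints.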

In general, I like your rules, although as Nick says, I think we need to restrict them to the scope of utf8, and have a different set of rules for non-utf8.

I am tempted to say that we deprecate non-ASCII outside of literals, but I think that's just a hobgoblin talking to me, and that we should simply emulate and codify the old rules, and only talk about changing it if we find very problematic cases.

-- rjbs

p5pRT commented 12 years ago

From @cpansprout

On Sun Jul 01 16:02:45 2012, Hugmeir wrote:

On Sun, Jul 1, 2012 at 2:03 PM, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:

On Tue Jun 26 15:18:01 2012, Hugmeir wrote:

https://github.com/Hugmeir/utf8mess/tree/restrict_variable_names

So, I've taken a few liberties implementing this. Here's the executive summary of the branch: Length-one variables must match (?: (?=Word) [\p{XIDS}_] | [\p{POSIX_Punct}\p{POSIX_Digit}\p{POSIX_Cntrl}] ). This is regardless of whether 'use utf8;' is in effect, so $£ is now always illegal, though expanding this to use some broader definition of punctuation/controls should be simple, it's just changing one macro.

Did you see the last few messages in this thread? I think we should be restricting it to the Latin-1 range, allowing all 0-255 characters as punct vars, including \xad, unless they are id chars.

I did. I apologize, I didn't mean to ignore you, but we've gone through at least four iterations of this same thread over the last year, and we always get stuck deciding on a sane set of rules for identifiers; then the thread dies and a stable release goes out, with the worst possible scenario implemented. So I just wanted to get a patch out; ironing out the rules can happen on top of an implementation, rather than in a vacuum.

In that case, thank you for the patch.

You are saying that we should allow 0-255 for punctuation/length-one variables, that I understand, but what about your normal, run of the mill identifiers under 'no utf8'? Should $asd£asd be a legal variable, or _à_ a legal bareword?

No.

The rules in the branch are, more or less:

qr/ (?(DEFINE) (?\ (?&sigil) (?​: (?&specials) | (?&ident) | (?&bracket_ident) ) ) (?\ (?​: :​: )* (?&normal_ident) (?​: (?​: :​: | ' ) (?&ident) )* (?​: :​: )* )

I would change ' to '(?!::)

    (?<bracket_ident>  \{ \s* (?&ident)  \s* \}
           )
    (?<normal_ident>    (?: (?&Perl_XIDS) \p{XIDC}* )+

The XIDC part worries me. I’ll explain below.

    )
    (?<specials>        (?: (?&length_one) | (?&other_specials) ) (?!::)  )
    (?<length_one>      (?&Perl_XIDS) | [\p{POSIX_Digit}\p{POSIX_Cntrl}] )
    (?<other_specials>  \^[A-Z] | \{ \^ [A-Z][A-Z_]+ \} | [0-9]+         )
    (?<Perl_XIDS>       (?=\p{Word}) [\p{XIDS}_]

I don’t understand (or maybe I just don’t remember) the (?=\p{Word}) part. I thought Karl Williamson changed that.

    ) (?<sigil> [&%*\@\$] ) ) /x;

I tried to write a similar regex for 5.16, but there's honestly so many edge cases that I've never mustered the motivation to finish it.

Basing myself on that, here's the modifications that I think you're going for:

    (?<normal_ident>    (?(?{UTF})
                               (?:(?&Perl_XIDS) \p{XIDC}*)+
                           |   (?aa) [\w\x80-\xff]+

Change ‘[\w\x80-\xff]+’ to ‘(?!\d)\w+’.

                        )
    )
    (?<length_one>      (?&Perl_XIDS) |

[\p{POSIX_Digit}\p{POSIX_Cntrl}] | [\x80-\xff] )

Yes\, exactly.

Although I'm a bit confused on whether you want the change to the normal_ident case. I suppose not, since that's not backwards compatible. But then I have to ask: What SHOULD be the rules for "normal" identifiers under no utf8? DO those apply the same to variables and barewords?

Uhm, in fact, if you have the time/motivation, if you could modify (or rewrite from scratch, whatever works for you) the regex above to show what you mean, that would be extremely helpful to me.

I believe I have just done that.

And like mentioned before, valid characters in an identifier no longer vary depending on 'use utf8', except for the obvious restriction that under 'no utf8;' the characters belong solely to the Latin-1 range.

I am not sure that is such a good idea. Formerly, any Latin-1 characters could be used as pyoq delimiters. There are many old scripts still in use that have never needed to be rewritten.

‘use utf8’ is a pragma after all. :-)

pyoq?

This might be less of a problem than it may appear to be: it only affects latin1 characters that are also XIDS characters; There's 65 of those, yes, but they all also match \w (well, you know, sometimes : D), which is how we've always offhandedly defined identifiers! So those delimiters were buggy by our own definition (again, sometimes). And on a personal note, they do strike me as unusual choices for delimiters: qqÇthings like thisÇ. I'll grant you that the prospect of rewriting scripts -- and I have several that use qqàà, because they could -- isn't exactly fun, but the cost to make them work in every version of perl, present and future, is just a space, which seems entirely reasonable to me.

I think Nicholas Clark answered this well enough.

pod/perldata.pod has a section streamlining the rules. As a side effect, 'no utf8; use strict; $à' now has to declare $à with my(), as it well should.

I am the backward compatibility police, so I disagree. :-)

Well, why? The current situation is that $y dies under strict, and so does $ÿ.. if you're under use utf8. Oh, except that in 5.16, because of totally unrelated changes, one XIDS character in the latin1 range started dying too.

I concede this point.

For future reference, the isIDFIRST_lazy() in gv_fetchpvn_flags is what caused this. The branch fixed it by turning that into isIDFIRST_lazy_if(), and then by redefining isIDFIRST_lazy_if to use isIDFIRST_L1 if not under use utf8. So to get the behavior that Father C wants, that's the conditional that needs changing.

The branch also fixes a bug in word and identifier parsing, where ASCII alphanumerics would be eaten up without checking if the next character matched \p{XIDC}. This led to qq\N{MIDDLE DOT} test \N{MIDDLE DOT} working in previous versions, but MIDDLE DOT is an XIDC character, so now that's parsed as bareword( qq\N{MIDDLE DOT} ), bareword( test ), ???? XIDC character on its own, syntax error. To get the previous behavior, you need a space before the delimiter, which is consistent with how 'q mfoom' works.
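As a small illustration of the 'q mfoom' comparison, here is a sketch using only ASCII, so it does not depend on the branch under discussion. The variable name `$spaced` is just for this example.

```perl
use strict;
use warnings;

# With whitespace after the operator, the word character "m" serves as the
# delimiter, so this is the string "foo".
my $spaced = q mfoom;

# Without the whitespace, "qmfoom" would be read as a single bareword,
# which is why it is not written out here.
print "$spaced\n";
```

The point being made in the thread is that a word-character delimiter has always needed that separating space, and the branch extends the same requirement to delimiters that are XIDC characters.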

As I mentioned some time before, changing the rules for what is an identifier is not a backward-compatible change. If we want to make Perl’s syntax conform more closely with Unicode recommendations, we should do it all at once (ids, whitespace, and Pattern_Syntax for delimiters) with a single feature feature.

But the patch doesn't change any rules, only enforces what we've been saying for years were the rules. perlvar and perldata both say alphanumerics & letters.

But XIDC contains things that are neither. It contains punctuation marks. Yes, I do think Unicode screwed up here. And I do think there is a high possibility for breakage.

But instead of trying to fight with Unicode, we can just put it under a pragma. Outside of ‘use 5.018’ we can stick with our current rules, but under ‘use 5.018’ follow Unicode.

That is exactly what is being enforced by the patch.

Someone may mention (and someone has mentioned) that Perl isn’t ‘doing it right’. But Perl has been ‘doing it’ since before the current definition of ‘right’ existed.

Sure, but Perl hasn't been doing it right by our own definition of right either. It's a big mess.

Yes, but so is /d. We have to live with it.

--

Father Chrysostomos

p5pRT commented 11 years ago

From @khwilliamson

I think we need to do something on this for 5.18. We can't, IMO, continue, for example, to allow surrogates to be identifier names.

If you read below, it looks to me that consensus was being approached. FC had a few modifications he wanted in Brian's proposal, but then Brian vanished without replying, and is just now starting to catch up on his emails, being on summer break from school.

I do note that his patches showed that we have a bug currently in that we have the same code mostly, but not entirely, repeated in 2 places, and so the parsing results differ based on no good reason.

On 07/08/2012 03:00 PM, Father Chrysostomos via RT wrote:

On Sun Jul 01 16:02:45 2012, Hugmeir wrote:

On Sun, Jul 1, 2012 at 2:03 PM, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:

On Tue Jun 26 15:18:01 2012, Hugmeir wrote:

https://github.com/Hugmeir/utf8mess/tree/restrict_variable_names

So, I've taken a few liberties implementing this. Here's the executive summary of the branch: Length-one variables must match (?: (?=Word) [\p{XIDS}_] | [\p{POSIX_Punct}\p{POSIX_Digit}\p{POSIX_Cntrl}] ). This is regardless of whether 'use utf8;' is in effect, so $£ is now always illegal, though expanding this to use some broader definition of punctuation/controls should be simple, it's just changing one macro.

Did you see the last few messages in this thread? I think we should be restricting it to the Latin-1 range, allowing all 0-255 characters as punct vars, including \xad, unless they are id chars.

I did. I apologize, I didn't mean to ignore you, but we've gone through at least four iterations of this same thread over the last year, and we always get stuck deciding on a sane set of rules for identifiers; then the thread dies and a stable release goes out, with the worst possible scenario implemented. So I just wanted to get a patch out; ironing out the rules can happen on top of an implementation, rather than in a vacuum.

In that case, thank you for the patch.

You are saying that we should allow 0-255 for punctuation/length-one variables, that I understand, but what about your normal, run of the mill identifiers under 'no utf8'? Should $asd£asd be a legal variable, or _à_ a legal bareword?

No.

The rules in the branch are, more or less:

qr/ (?(DEFINE) (?\ (?&sigil) (?​: (?&specials) | (?&ident) | (?&bracket_ident) ) ) (?\ (?​: :​: )* (?&normal_ident) (?​: (?​: :​: | ' ) (?&ident) )* (?​: :​: )* )

I would change ' to '(?!::)

    (?<bracket_ident>  \{ \s* (?&ident)  \s* \}
           )
    (?<normal_ident>    (?: (?&Perl_XIDS) \p{XIDC}* )+

The XIDC part worries me. I’ll explain below.

    )
    (?<specials>        (?: (?&length_one) | (?&other_specials) ) (?!::)  )
    (?<length_one>      (?&Perl_XIDS) | [\p{POSIX_Digit}\p{POSIX_Cntrl}] )
    (?<other_specials>  \^[A-Z] | \{ \^ [A-Z][A-Z_]+ \} | [0-9]+         )
    (?<Perl_XIDS>       (?=\p{Word}) [\p{XIDS}_]

I don’t understand (or maybe I just don’t remember) the (?=\p{Word}) part. I thought Karl Williamson changed that.

I thought it was FC who changed this. But in any case, it was because of http://rt.perl.org/rt3/Ticket/Display.html?id=74022 As a result, there are now non-public properties \p{_Perl_IDStart} and \p{_Perl_IDCont} that take the standard Unicode ones and intersect each with \p{Word}. _Perl_IDStart also adds the underscore character (but no other connector punctuation). The net result of this is that it differs from the Unicode IDStart only by allowing the underscore, and by disallowing the two characters U+2118 WEIERSTRASS ELLIPTIC FUNCTION and U+212E ESTIMATED SYMBOL.

The Perl version of IDCont has a few more differences. It doesn't match

    U+00B7 MIDDLE DOT
    U+0387 GREEK ANO TELEIA
    U+1369..U+1371 ETHIOPIC DIGIT ONE .. ETHIOPIC DIGIT NINE
    U+19DA NEW TAI LUE THAM DIGIT ONE

in addition to not matching U+2118 and U+212E.
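The intersection with \p{Word} described here can be probed from ordinary code without touching the non-public properties. This is a hedged sketch: `xidc_but_not_word` is a hypothetical helper, it uses only the public \p{XIDC} and \p{Word} properties, and the answer for any given code point depends on the Unicode version of the perl running it.

```perl
use strict;
use warnings;

# Hypothetical helper: true when a code point is XID_Continue but not \w,
# i.e. one that intersecting with \p{Word} would drop.
sub xidc_but_not_word {
    my ($cp) = @_;
    my $ch = chr $cp;
    return ($ch =~ /\p{XIDC}/ && $ch !~ /\p{Word}/) ? 1 : 0;
}

# Probe the characters listed above, plus "A" as an ordinary control case.
for my $cp (0x00B7, 0x0387, 0x1369, 0x19DA, 0x0041) {
    printf "U+%04X %s\n", $cp,
        xidc_but_not_word($cp) ? "XIDC but not Word" : "not affected";
}
```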

toke.c isn't using the Perl version of IDCont, and I think it should. What it does looks very suspicious to me.

    (?<sigil>           [&%*\@\$]

    ) ) /x;

I tried to write a similar regex for 5.16, but there's honestly so many edge cases that I've never mustered the motivation to finish it.

Basing myself on that, here's the modifications that I think you're going for:

    (?<normal_ident>    (?(?{UTF})
                               (?:(?&Perl_XIDS) \p{XIDC}*)+
                           |   (?aa) [\w\x80-\xff]+

Change ‘[\w\x80-\xff]+’ to ‘(?!\d)\w+’.

                        )
    )
    (?<length_one>      (?&Perl_XIDS) |

[\p{POSIX_Digit}\p{POSIX_Cntrl}] | [\x80-\xff] )

Yes\, exactly.

Although I'm a bit confused on whether you want the change to the normal_ident case. I suppose not, since that's not backwards compatible. But then I have to ask: What SHOULD be the rules for "normal" identifiers under no utf8? DO those apply the same to variables and barewords?

Uhm, in fact, if you have the time/motivation, if you could modify (or rewrite from scratch, whatever works for you) the regex above to show what you mean, that would be extremely helpful to me.

I believe I have just done that.

And like mentioned before, valid characters in an identifier no longer vary depending on 'use utf8', except for the obvious restriction that under 'no utf8;' the characters belong solely to the Latin-1 range.

I am not sure that is such a good idea. Formerly, any Latin-1 characters could be used as pyoq delimiters. There are many old scripts still in use that have never needed to be rewritten.

‘use utf8’ is a pragma after all. :-)

pyoq?

This might be less of a problem than it may appear to be: it only affects latin1 characters that are also XIDS characters; There's 65 of those, yes, but they all also match \w (well, you know, sometimes : D), which is how we've always offhandedly defined identifiers! So those delimiters were buggy by our own definition (again, sometimes). And on a personal note, they do strike me as unusual choices for delimiters: qqÇthings like thisÇ. I'll grant you that the prospect of rewriting scripts -- and I have several that use qqàà, because they could -- isn't exactly fun, but the cost to make them work in every version of perl, present and future, is just a space, which seems entirely reasonable to me.

I think Nicholas Clark answered this well enough.

For those of you who don't want to dig up his response (which is in the thread for this bug report), I believe the crux of it to be:

"Effectively, (some subset of) code points 1-31 and 127-255 (outside of utf8) are all treated consistently, as non-printing control characters. And, unless and until deprecated, I think this should be how behaviour stays when outside use utf8; I also think that it would be sane to deprecate the use of (at least) literal code points 128-255 in the source for punctuation variables and delimiters (outside of use utf8) in the scope of use 5.018; and later."

pod/perldata.pod has a section streamlining the rules. As a side effect, 'no utf8; use strict; $à' now has to declare $à with my(), as it well should.

I am the backward compatibility police, so I disagree. :-)

Well, why? The current situation is that $y dies under strict, and so does $ÿ.. if you're under use utf8. Oh, except that in 5.16, because of totally unrelated changes, one XIDS character in the latin1 range started dying too.

I concede this point.

For future reference, the isIDFIRST_lazy() in gv_fetchpvn_flags is what caused this. The branch fixed it by turning that into isIDFIRST_lazy_if(), and then by redefining isIDFIRST_lazy_if to use isIDFIRST_L1 if not under use utf8. So to get the behavior that Father C wants, that's the conditional that needs changing.

The branch also fixes a bug in word and identifier parsing, where ASCII alphanumerics would be eaten up without checking if the next character matched \p{XIDC}. This led to qq\N{MIDDLE DOT} test \N{MIDDLE DOT} working in previous versions, but MIDDLE DOT is an XIDC character, so now that's parsed as bareword( qq\N{MIDDLE DOT} ), bareword( test ), ???? XIDC character on its own, syntax error. To get the previous behavior, you need a space before the delimiter, which is consistent with how 'q mfoom' works.

As I mentioned some time before, changing the rules for what is an identifier is not a backward-compatible change. If we want to make Perl’s syntax conform more closely with Unicode recommendations, we should do it all at once (ids, whitespace, and Pattern_Syntax for delimiters) with a single feature feature.

But the patch doesn't change any rules, only enforces what we've been saying for years were the rules. perlvar and perldata both say alphanumerics & letters.

But XIDC contains things that are neither. It contains punctuation marks. Yes, I do think Unicode screwed up here. And I do think there is a high possibility for breakage.

It turns out that the Greek ANO TELEIA is the only punctuation character in XIDC. It is not correct for it to be a continuation character, and it is a screw-up, based on the usual, historical reasons, which Unicode people point out predate its inception, and they had to follow to be compatible. But note that the Perl version of XIDC doesn't include this code point.

But instead of trying to fight with Unicode, we can just put it under a pragma. Outside of ‘use 5.018’ we can stick with our current rules, but under ‘use 5.018’ follow Unicode.

That is exactly what is being enforced by the patch.

Someone may mention (and someone has mentioned) that Perl isn’t ‘doing it right’. But Perl has been ‘doing it’ since before the current definition of ‘right’ existed.

Sure, but Perl hasn't been doing it right by our own definition of right either. It's a big mess.

Yes, but so is /d. We have to live with it.

p5pRT commented 11 years ago

From @rjbs

* Karl Williamson <public@khwilliamson.com> [2013-01-14T16:00:51]

I think we need to do something on this for 5.18. We can't, IMO, continue, for example, to allow surrogates to be identifier names.

I agree.

A release of 5.18.0 with this awful behavior would not be a catastrophe, but it would be a shame, especially since it looks like we can avoid it.

If you read below, it looks to me that consensus was being approached. FC had a few modifications he wanted in Brian's proposal, but then Brian vanished without replying, and is just now starting to catch up on his emails, being on summer break from school.

I do note that his patches showed that we have a bug currently in that we have the same code mostly, but not entirely, repeated in 2 places, and so the parsing results differ based on no good reason.

Yes, I supported Brian's proposal. I think FC's amendments are fine. Nick's points about 127-255 are also important.

Please let me know how I can help, if it's anything more than saying, "I agree," which I have now done.

Good night!

-- rjbs

p5pRT commented 11 years ago

From @Hugmeir

On Fri, Jan 25, 2013 at 1:06 AM, Ricardo Signes via RT <perlbug-followup@perl.org> wrote:

* Karl Williamson <public@khwilliamson.com> [2013-01-14T16:00:51]

I think we need to do something on this for 5.18. We can't, IMO, continue, for example, to allow surrogates to be identifier names.

I agree.

A release of 5.18.0 with this awful behavior would not be a catastrophe, but it would be a shame, especially since it looks like we can avoid it.

If you read below, it looks to me that consensus was being approached. FC had a few modifications he wanted in Brian's proposal, but then Brian vanished without replying, and is just now starting to catch up on his emails, being on summer break from school.

I do note that his patches showed that we have a bug currently in that we have the same code mostly, but not entirely, repeated in 2 places, and so the parsing results differ based on no good reason.

Yes, I supported Brian's proposal. I think FC's amendments are fine. Nick's points about 127-255 are also important.

Please let me know how I can help, if it's anything more than saying, "I agree," which I have now done.

Someone kicking my university's arse for leeching my time would be welcome, of course, but failing that.. : )

That aside, apologies for the overlong wait. I'm changing the branch right now, so as a sort of recap, here's how things should stand in a bit:

Both:

No more single colons when inside ${}.

Several other pretty obscure parsing bugs regarding :: and ' should be mostly fixed. Now they basically work like (?:(?: ::)* '? (?&ident))+ (?: ::)* That is, any number of double colons, with at most a single quote either separating packages, or after the sigil / braces, or after a series of colons, but not at the end of the identifier.

Under nothing / no utf8, or evalbytes:

We treat the source as ASCII + 128 controls, not as Latin1 (which Nicholas pointed out is how we've done it, historically)

Variables of length 2 and more can only have characters that match /\w/aa

Variables of length 1 should match /(?[ \p{POSIX_Cntrl} + \p{POSIX_Punct} + \w + [\x80-\xff] ])/aa

Outside of length 1 idents, comments, regexen and strings, you can only use the extra controls as quotes for qq and friends.

Under use utf8, or evaling a UTF-8 flagged string:

Identifiers (both variables and barewords) match /\p{Perl_XIDS}+\p{XIDC}*/

Variables of length 1 match /(?[ \p{POSIX_Cntrl} + \p{POSIX_Punct} + \p{Perl_XIDS} + [0-9] ])
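To make the "no utf8" half of this recap concrete, the two rules can be sketched as plain checks on a name with its sigil removed. The helper names below are hypothetical and are not taken from the branch; the sketch uses ordinary POSIX bracket classes under /aa instead of the (?[ ]) syntax, and it ignores the separate handling of numeric and ^-prefixed names.

```perl
use strict;
use warnings;

# Hypothetical helper: names of length 2 or more may only contain
# characters matching \w under /aa, that is ASCII word characters.
sub is_long_name_ok {
    my ($name) = @_;
    return (length($name) >= 2 && $name =~ /\A\w+\z/aa) ? 1 : 0;
}

# Hypothetical helper: a length-one name may be an ASCII control, ASCII
# punctuation, an ASCII word character, or any byte in 0x80-0xFF.
sub is_single_char_name_ok {
    my ($name) = @_;
    return ($name =~ /\A[[:cntrl:][:punct:]\w\x80-\xff]\z/aa) ? 1 : 0;
}

print is_long_name_ok("foo_1"),          "\n";   # ASCII word characters only
print is_long_name_ok("asd\xA3asd"),     "\n";   # contains byte 0xA3
print is_single_char_name_ok("\xA3"),    "\n";   # a lone byte 0xA3
```

Under these assumptions the second call is rejected while the third is accepted, which mirrors the answers given earlier in the thread for $asd£asd and $£ outside of use utf8.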

Another issue that Father C objected to was that ./perl -Ilib -le 'use utf8; print eval qq{qq\xc2\xb7 test \xc2\xb7};' would no longer work, since the qq would get parsed alongside the \N{MIDDLE DOT} as a bareword. I'm looking into this right now, but it looks like I may have spoken too soon, since that's currently working. I do think that parsing it as a bareword is the way to go, but since it's contentious it could be left until 5.20.

p5pRT commented 11 years ago

From @Hugmeir


Only tangentially related to the above, I've been doing some digging as to why we even allow these:

    $::'foo
    foo::::bar
    foo::::::::bar
    foo::'bar

Or why $::1:: is legal, but $1:: is not.

I can't find anything beyond "because the parser allows it." We have a handful of tests for things like this:

    package Foo::;
    sub bar{1}
    package main;
    sub foo::::::bar{1}
    Foo::::bar();
    foo::::::bar();

But besides the novelty, is that of any worth?

I bring this up because it should be rather simple to remove the insanity, so that we go from this:

    (?<normal_identifier>
        (?: :: )* '?
        (?&basic_identifier)
        (?: (?= (?: :: )+ '? | (?: :: )* ' ) (?&normal_identifier) )*
        (?: :: )*
    )

To this:

    (?<normal_identifier>
        (?: :: | ' )?
        (?&basic_identifier)
        (?: (?= :: | ' ) (?&normal_identifier) )*
        (?: :: )?
    )
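The effect of the simplified rule can be tried out in isolation. This is a sketch under stated assumptions: `basic_identifier` is not defined in the message, so an ASCII-only stand-in is used here, and `$simplified` is just a name for this example rather than anything in the parser.

```perl
use strict;
use warnings;

# The simplified normal_identifier, anchored so that the whole name must match.
# basic_identifier is an assumed ASCII-only placeholder.
my $simplified = qr{
    \A (?&normal_identifier) \z
    (?(DEFINE)
        (?<basic_identifier> [A-Za-z_] \w* )
        (?<normal_identifier>
            (?: :: | ' )?
            (?&basic_identifier)
            (?: (?= :: | ' ) (?&normal_identifier) )*
            (?: :: )?
        )
    )
}xa;

for my $name ("foo::bar", "foo'bar", "foo::::bar", "foo::'bar") {
    printf "%-12s %s\n", $name, $name =~ $simplified ? "accepted" : "rejected";
}
```

With this stand-in, ordinary names such as foo::bar still match, while the doubled separators that the current parser tolerates, such as foo::::bar, no longer do.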

That being said, maintainability-wise, either way doesn't impact anything, although removing these weird edge cases might make the eventual refactoring of gv_fetchpvn_flags less painful.