Guarantee 0-9, A-Z, a-z character classes

p5pRT commented 10 years ago

Migrated from rt.perl.org#122853 (status was 'resolved')

Searchable as RT122853$

p5pRT commented 10 years ago

From @epa

For a long time Perl programmers have used [A-Z] in regexps to match an uppercase letter and [a-z] for lowercase. For almost as long, there have been admonishments in various places that these constructs won't work on non-ASCII systems where the letters of the alphabet don't have consecutive character codes, and that something like \w or [[:upper:]] should be used instead.

However, \w changes behaviour depending on locale and the /a and /aa modifiers; [[:upper:]] and friends seem okay, but not everyone will be familiar with the POSIX spec or know where to look them up. A-Z and a-z seem like a clearer way to say what you mean.

There is also the question of whether to use 0-9 instead of \d. Again, \d changes based on locale or ASCII flags, while 0-9 always matches exactly the ten digits and nothing else. However, there's a nagging feeling that some system out there might (in principle) not use consecutive character codes for the digits, so it might not work.
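
The difference between the shorthand classes and the explicit ranges can be seen with a small script. This is only an illustrative sketch, assuming Perl 5.14 or later (for the /a and /u modifiers) on an ASCII platform; the two sample characters and the `%result` hash are chosen for this example and are not part of the report.

```perl
use strict;
use warnings;

my $e_acute      = "\x{E9}";     # LATIN SMALL LETTER E WITH ACUTE
my $arabic_three = "\x{663}";    # ARABIC-INDIC DIGIT THREE

# 1 if the pattern matched, 0 otherwise.
my %result = (
    'word /u'   => ($e_acute      =~ /\w/u   ? 1 : 0),   # Unicode rules
    'word /a'   => ($e_acute      =~ /\w/a   ? 1 : 0),   # ASCII-restricted
    'digit \d'  => ($arabic_three =~ /\d/    ? 1 : 0),
    'digit /a'  => ($arabic_three =~ /\d/a   ? 1 : 0),
    'digit 0-9' => ($arabic_three =~ /[0-9]/ ? 1 : 0),
);

printf "%-10s %d\n", $_, $result{$_} for sort keys %result;
```

Here \w and \d accept the non-ASCII characters unless /a is given, while [0-9] rejects the Arabic-Indic digit regardless of modifiers.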

This bug report is to request that Perl guarantee in its documentation the following equivalences:

    [A-Z]   [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
    [a-z]   [abcdefghijklmnopqrstuvwxyz]
    [0-9]   [0123456789]
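
One way to check these equivalences on a given platform is to enumerate what each class matches among the 256 single-byte characters. The sketch below is an assumption-laden illustration: the helper name `members` is made up for this example, and it has only been reasoned about for an ASCII platform.

```perl
use strict;
use warnings;

# Return, in code point order, every character in 0..255 matched by $class.
sub members {
    my ($class) = @_;
    return join '', grep { $_ =~ $class } map { chr($_) } 0 .. 255;
}

my $upper  = members(qr/[A-Z]/);
my $lower  = members(qr/[a-z]/);
my $digits = members(qr/[0-9]/);

print "upper:  $upper\n";
print "lower:  $lower\n";
print "digits: $digits\n";
```

If the requested guarantee holds, each line lists exactly the 26 letters or the ten digits and nothing else.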

If the EBCDIC port is still active, then some programming work might be needed to make sure these ranges do as documented on EBCDIC. That will have the nice side-effect of making plenty of existing code that does use A-Z work correctly on EBCDIC systems.

If there are currently no Perl ports to non-ASCII systems (and there aren't likely to be any new ones in the future), then no code or behaviour change is needed, just a note in the documentation.

Note that currently perlre does talk about

"\w" means the 63 characters "[A-Za-z0-9_]"

So it is implicit that this character class means the alphanumeric characters, unless the documentation is really saying that \w does something funny on EBCDIC systems.

Perl Info

``` Flags: category=core severity=wishlist Site configuration information for perl 5.18.2: Configured by Red Hat, Inc. at Tue Jan 7 14:45:19 UTC 2014. Summary of my perl5 (revision 5 version 18 subversion 2) configuration: Platform: osname=linux, osvers=3.11.9-200.fc19.x86_64, archname=x86_64-linux-thread-multi uname='linux buildvm-12.phx2.fedoraproject.org 3.11.9-200.fc19.x86_64 #1 smp wed nov 20 21:22:24 utc 2013 x86_64 x86_64 x86_64 gnulinux ' config_args='-des -Doptimize=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Dccdlflags=-Wl,--enable-new-dtags -Dlddlflags=-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wl,-z,relro -Dshrpdir=/usr/lib64 -DDEBUGGING=-g -Dversion=5.18.2 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5 -Dprivlib=/usr/share/perl5 -Dvendorlib=/usr/share/perl5/vendor_perl -Darchlib=/usr/lib64/perl5 -Dvendorarch=/usr/lib64/perl5/vendor_perl -Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Duseshrplib -Dusethreads -Duseithreads -Dusedtrace=/usr/bin/dtrace -Duselargefiles -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dscriptdir=/usr/bin -Dusesitecustomize' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=define, use64bitall=define, 
uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic', cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include' ccversion='', gccversion='4.8.2 20131212 (Red Hat 4.8.2-7)', gccosandvers='' intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16 ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='gcc', ldflags =' -fstack-protector' libpth=/usr/local/lib64 /lib64 /usr/lib64 libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc libc=, so=so, useshrplib=true, libperl=libperl.so gnulibc_version='2.18' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,--enable-new-dtags' cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wl,-z,relro ' Locally applied patches: Fedora Patch1: Removes date check, Fedora/RHEL specific Fedora Patch3: support for libdir64 Fedora Patch4: use libresolv instead of libbind Fedora Patch5: USE_MM_LD_RUN_PATH Fedora Patch6: Skip hostname tests, due to builders not being network capable Fedora Patch7: Dont run one io test due to random builder failures Fedora Patch9: Fix find2perl to translate ? 
glob properly (RT#113054) Fedora Patch10: Update h2ph(1) documentation (RT#117647) Fedora Patch11: Update pod2html(1) documentation (RT#117623) Fedora Patch12: Disable ornaments on perl5db AutoTrace tests (RT#118817) Fedora Patch14: Do not use system Term::ReadLine::Gnu in tests (RT#118821) Fedora Patch15: Define SONAME for libperl.so Fedora Patch16: Install libperl.so to -Dshrpdir value Fedora Patch18: Fix crash with \\&$glob_copy (RT#119051) Fedora Patch19: Fix coreamp.t rand test (RT#118237) Fedora Patch20: Reap child in case where exception has been thrown (RT#114722) Fedora Patch21: Fix using regular expressions containing multiple code blocks (RT#117917) Fedora Patch200: Link XS modules to libperl.so with EU::CBuilder on Linux Fedora Patch201: Link XS modules to libperl.so with EU::MM on Linux @INC for perl 5.18.2: /home/eda/lib/perl5/ /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 . Environment for perl 5.18.2: HOME=/home/eda LANG=en_GB.UTF-8 LANGUAGE (unset) LC_COLLATE=C LC_CTYPE=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_NUMERIC=en_GB.UTF-8 LC_TIME=en_GB.UTF-8 LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/eda/bin:/home/eda/bin:/usr/local/bin:/usr/bin:/sbin:/usr/sbin:/sbin:/usr/sbin PERL5LIB=/home/eda/lib/perl5/ PERL_BADLANG (unset) SHELL=/bin/bash -- Ed Avis ```

p5pRT commented 10 years ago

From @rgarcia

On 26 September 2014 10:49, Ed Avis <perlbug-followup@perl.org> wrote:

> This bug report is to request that Perl guarantee in its documentation the following equivalences:
>
>     [A-Z]   [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
>     [a-z]   [abcdefghijklmnopqrstuvwxyz]
>     [0-9]   [0123456789]
>
> If the EBCDIC port is still active, then some programming work might be needed to make sure these ranges do as documented on EBCDIC.

Actually perlebcdic documents those special cases already ([0-9] not being a problem there):

=head1 REGULAR EXPRESSION DIFFERENCES

As of perl 5.005_03 the letter range regular expressions such as [A-Z] and [a-z] have been especially coded to not pick up gap characters. For example, characters such as E\ C\ that lie between I and J would not be matched by the regular expression range C</[H-K]/>. [...]

p5pRT commented 10 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 10 years ago

From @epa

Ah, so even on EBCDIC the A-Z range will match 26 alphabet letters and only that. In that case, perhaps only a brief note is needed in perlre to reassure those who learned Perl a long time ago, back in the days when A-Z was a nonportable construct.

(I've never used an EBCDIC system and so I have never read perlebcdic, but I still want to write portable code.)

-- Ed Avis <eda@waniasset.com>

p5pRT commented 10 years ago

From @abigail

On Fri, Sep 26, 2014 at 09:03:15AM +0000, Ed Avis wrote:

> Ah, so even on EBCDIC the A-Z range will match 26 alphabet letters and only that. In that case, perhaps only a brief note is needed in perlre to reassure those who learned Perl a long time ago, back in the days when A-Z was a nonportable construct.
>
> (I've never used an EBCDIC system and so I have never read perlebcdic, but I still want to write portable code.)

I've added a remark in perlrecharclass.pod. See commit 2a2f23e4f8a50bdcdd10563dc5d933684cb70954

Abigail

p5pRT commented 10 years ago

From @epa

Abigail <abigail@abigail.be> writes:

> I've added a remark in perlrecharclass.pod. See commit 2a2f23e4f8a50bdcdd10563dc5d933684cb70954

Thanks. That adds

    +The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense
    +they always match exactly the 26 upper/lower case letters, regardless
    +of the platform (this only effects EBCDIC, which would otherwise include
    +some non-letters).

I would also add

    Digit sequences are and will be consecutive on all platforms Perl
    supports, so C<< [0-3] >> always matches the digits 0123, and so on.

just to cover all the bases.

-- Ed Avis <eda@waniasset.com>

p5pRT commented 10 years ago

From @abigail

On Mon, Sep 29, 2014 at 10:13:21AM +0000, Ed Avis wrote:

> Abigail <abigail@abigail.be> writes:
>
>> I've added a remark in perlrecharclass.pod. See commit 2a2f23e4f8a50bdcdd10563dc5d933684cb70954
>
> Thanks. That adds
>
>     +The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense
>     +they always match exactly the 26 upper/lower case letters, regardless
>     +of the platform (this only effects EBCDIC, which would otherwise include
>     +some non-letters).
>
> I would also add
>
>     Digit sequences are and will be consecutive on all platforms Perl
>     supports, so C<< [0-3] >> always matches the digits 0123, and so on.
>
> just to cover all the bases.

I disagree.

Because that gives the expectation that C<< [D-N] >> will do that as well, but it does not.

Abigail

p5pRT commented 10 years ago

From @epa

OK, how about this wording:

The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense they always match exactly the 26 upper/lower case letters, regardless of the platform (this only affects EBCDIC, which would otherwise include some non-letters). This only applies to the whole alphabet A-Z; a shorter range needs to be written out in full, as [abcde], to be portable.

Digit sequences are and will be consecutive on all platforms Perl supports, so C<< [0-3] >> always matches the digits 0123, and so on.

-- Ed Avis <eda@waniasset.com>

p5pRT commented 10 years ago

From @demerphq

On 29 September 2014 12:43, Abigail <abigail@abigail.be> wrote:

> On Mon, Sep 29, 2014 at 10:13:21AM +0000, Ed Avis wrote:
>> Abigail <abigail@abigail.be> writes:
>>
>>> I've added a remark in perlrecharclass.pod. See commit 2a2f23e4f8a50bdcdd10563dc5d933684cb70954
>>
>> Thanks. That adds
>>
>>     +The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense
>>     +they always match exactly the 26 upper/lower case letters, regardless
>>     +of the platform (this only effects EBCDIC, which would otherwise include
>>     +some non-letters).
>>
>> I would also add
>>
>>     Digit sequences are and will be consecutive on all platforms Perl
>>     supports, so C<< [0-3] >> always matches the digits 0123, and so on.
>>
>> just to cover all the bases.
>
> I disagree.
>
> Because that gives the expectation that C<< [D-N] >> will do that as well, but it does not.

But it probably should.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 10 years ago

From @demerphq

On 29 September 2014 12:47, Ed Avis <eda@waniasset.com> wrote:

> OK, how about this wording:
>
> The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense they always match exactly the 26 upper/lower case letters, regardless of the platform (this only affects EBCDIC, which would otherwise include some non-letters). This only applies to the whole alphabet A-Z; a shorter range needs to be written out in full, as [abcde], to be portable.

IMO this is horrible. Instead of making this change let's just fix the underlying issue.

Interpreting all character classes to mean the same thing everywhere is a much simpler rule than the various bodges discussed in this thread.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 10 years ago

From @epa

demerphq <demerphq@gmail.com> writes:

> IMO this is horrible. Instead of making this change let's just fix the underlying issue.

I agree, but could I suggest

1. apply the documentation patch, since it correctly documents the current horrible behaviour;

2. file a second bug to change the behaviour (at which point the doc will also change);

3. perhaps create a new warning on nonportable character classes?

-- Ed Avis <eda@waniasset.com>

p5pRT commented 10 years ago

From @abigail

On Mon, Sep 29, 2014 at 12:55:15PM +0200, demerphq wrote:

> On 29 September 2014 12:43, Abigail <abigail@abigail.be> wrote:
>> On Mon, Sep 29, 2014 at 10:13:21AM +0000, Ed Avis wrote:
>>> Abigail <abigail@abigail.be> writes:
>>>
>>>> I've added a remark in perlrecharclass.pod. See commit 2a2f23e4f8a50bdcdd10563dc5d933684cb70954
>>>
>>> Thanks. That adds
>>>
>>>     +The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense
>>>     +they always match exactly the 26 upper/lower case letters, regardless
>>>     +of the platform (this only effects EBCDIC, which would otherwise include
>>>     +some non-letters).
>>>
>>> I would also add
>>>
>>>     Digit sequences are and will be consecutive on all platforms Perl
>>>     supports, so C<< [0-3] >> always matches the digits 0123, and so on.
>>>
>>> just to cover all the bases.
>>
>> I disagree.
>>
>> Because that gives the expectation that C<< [D-N] >> will do that as well, but it does not.
>
> But it probably should.

Well, that's another whole kettle of fish.

For now, I'm just concerned about documenting what Perl currently does, and if it does something DWIM for [A-Z] and [a-z] on EBCDIC, then it should be documented, independent of whether we want to change the meaning of [D-N] in the future or not.

Abigail

p5pRT commented 10 years ago

From @khwilliamson

On 09/29/2014 07:13 AM, Abigail wrote:

> On Mon, Sep 29, 2014 at 12:55:15PM +0200, demerphq wrote:
>> On 29 September 2014 12:43, Abigail <abigail@abigail.be> wrote:
>>> On Mon, Sep 29, 2014 at 10:13:21AM +0000, Ed Avis wrote:
>>>> Abigail <abigail@abigail.be> writes:
>>>>
>>>>> I've added a remark in perlrecharclass.pod. See commit 2a2f23e4f8a50bdcdd10563dc5d933684cb70954
>>>>
>>>> Thanks. That adds
>>>>
>>>>     +The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense
>>>>     +they always match exactly the 26 upper/lower case letters, regardless
>>>>     +of the platform (this only effects EBCDIC, which would otherwise include
>>>>     +some non-letters).
>>>>
>>>> I would also add
>>>>
>>>>     Digit sequences are and will be consecutive on all platforms Perl
>>>>     supports, so C<< [0-3] >> always matches the digits 0123, and so on.
>>>>
>>>> just to cover all the bases.
>>>
>>> I disagree.
>>>
>>> Because that gives the expectation that C<< [D-N] >> will do that as well, but it does not.
>>
>> But it probably should.
>
> Well, that's another whole kettle of fish.
>
> For now, I'm just concerned about documenting what Perl currently does, and if it does something DWIM for [A-Z] and [a-z] on EBCDIC, then it should be documented, independent of whether we want to change the meaning of [D-N] in the future or not.
>
> Abigail

[D-N] means [DEFGHIJKLMN] on EBCDIC platforms, and that is how it has worked, according to perlebcdic, since 5.005_03.
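
On an ASCII platform the same equivalence is easy to confirm; on EBCDIC the comment above says the result is identical, which this sketch cannot itself verify. The helper name `members` is hypothetical, introduced only for this illustration.

```perl
use strict;
use warnings;

# List the characters in 0..255 matched by a class, in code point order.
sub members {
    my ($class) = @_;
    return join '', grep { $_ =~ $class } map { chr($_) } 0 .. 255;
}

my $range   = members(qr/[D-N]/);
my $spelled = members(qr/[DEFGHIJKLMN]/);

print "[D-N]         matches: $range\n";
print "[DEFGHIJKLMN] matches: $spelled\n";
print "same set: ", ($range eq $spelled ? "yes" : "no"), "\n";
```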

p5pRT commented 10 years ago

From @khwilliamson

On 09/29/2014 10:53 AM, Karl Williamson wrote:

> On 09/29/2014 07:13 AM, Abigail wrote:
>> On Mon, Sep 29, 2014 at 12:55:15PM +0200, demerphq wrote:
>>> On 29 September 2014 12:43, Abigail <abigail@abigail.be> wrote:
>>>> On Mon, Sep 29, 2014 at 10:13:21AM +0000, Ed Avis wrote:
>>>>> Abigail <abigail@abigail.be> writes:
>>>>>
>>>>>> I've added a remark in perlrecharclass.pod. See commit 2a2f23e4f8a50bdcdd10563dc5d933684cb70954
>>>>>
>>>>> Thanks. That adds
>>>>>
>>>>>     +The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense
>>>>>     +they always match exactly the 26 upper/lower case letters, regardless
>>>>>     +of the platform (this only effects EBCDIC, which would otherwise include
>>>>>     +some non-letters).
>>>>>
>>>>> I would also add
>>>>>
>>>>>     Digit sequences are and will be consecutive on all platforms Perl
>>>>>     supports, so C<< [0-3] >> always matches the digits 0123, and so on.
>>>>>
>>>>> just to cover all the bases.
>>>>
>>>> I disagree.
>>>>
>>>> Because that gives the expectation that C<< [D-N] >> will do that as well, but it does not.
>>>
>>> But it probably should.
>>
>> Well, that's another whole kettle of fish.
>>
>> For now, I'm just concerned about documenting what Perl currently does, and if it does something DWIM for [A-Z] and [a-z] on EBCDIC, then it should be documented, independent of whether we want to change the meaning of [D-N] in the future or not.
>>
>> Abigail
>
> [D-N] means [DEFGHIJKLMN] on EBCDIC platforms, and that is how it has worked, according to perlebcdic, since 5.005_03.

I'm not understanding where the idea that we currently have horrible behavior is coming from.

Any subset of the ranges [a-z] and [A-Z] is (and has been) specially handled to match on EBCDIC platforms the same equivalent characters it matches on ASCII platforms. Hence qr/[i-j]/i matches [ijIJ] on both ASCII and EBCDIC platforms.

The special handling is only valid if both ends of the range are literals. In EBCDIC, \xC9 is 'I' and \xD1 is 'J'. If you specify any of [\xC9-J], [I-\xD1], or [\xC9-\xD1], you get all the code points C9, CA, CB, CC, CD, CE, CF, D0, and D1. This is how it has worked since apparently 5.005_03, and is how I think it should continue to work. In other words, I think we got the design right.

No special handling is required for 0-9, as they are contiguous on both ASCII and EBCDIC. This is likely true in any native character set. The POSIX standard effectively mandates that the digits in any locale should be in 1 or 2 groups of 10 consecutive code points whose numerical values are also consecutive, starting with zero. Unicode now does the same. (There was an exception to this that I brought to their attention, and they quickly changed it, without the usual dramas.)
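
The code point layout described here can be inspected from an ASCII machine using the EBCDIC tables shipped with the core Encode module. This is a sketch under assumptions: it looks up IBM code page 1047 (one EBCDIC variant, picked for illustration) rather than running Perl on an EBCDIC platform, and the helper names `ebcdic_ord` and `contiguous` are invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Code point of a character in IBM code page 1047.
sub ebcdic_ord {
    my ($char) = @_;
    return ord(encode('cp1047', $char));
}

# True if the given characters occupy consecutive code points.
sub contiguous {
    my @cp     = map { ebcdic_ord($_) } @_;
    my $breaks = grep { $cp[$_] != $cp[$_ - 1] + 1 } 1 .. $#cp;
    return $breaks == 0;
}

printf "I = 0x%02X, J = 0x%02X\n", ebcdic_ord('I'), ebcdic_ord('J');
printf "A..Z contiguous: %s\n", (contiguous('A' .. 'Z') ? 'yes' : 'no');
printf "0..9 contiguous: %s\n", (contiguous('0' .. '9') ? 'yes' : 'no');
```

The gap between I and J is what makes a naive code point range pick up non-letters, while the digits need no special treatment.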

p5pRT commented 10 years ago

From @demerphq

On 29 September 2014 19:34, Karl Williamson <public@khwilliamson.com> wrote:

> On 09/29/2014 10:53 AM, Karl Williamson wrote:
>> On 09/29/2014 07:13 AM, Abigail wrote:
>>> On Mon, Sep 29, 2014 at 12:55:15PM +0200, demerphq wrote:
>>>> On 29 September 2014 12:43, Abigail <abigail@abigail.be> wrote:
>>>>> On Mon, Sep 29, 2014 at 10:13:21AM +0000, Ed Avis wrote:
>>>>>> Abigail <abigail@abigail.be> writes:
>>>>>>
>>>>>>> I've added a remark in perlrecharclass.pod. See commit 2a2f23e4f8a50bdcdd10563dc5d933684cb70954
>>>>>>
>>>>>> Thanks. That adds
>>>>>>
>>>>>>     +The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense
>>>>>>     +they always match exactly the 26 upper/lower case letters, regardless
>>>>>>     +of the platform (this only effects EBCDIC, which would otherwise include
>>>>>>     +some non-letters).
>>>>>>
>>>>>> I would also add
>>>>>>
>>>>>>     Digit sequences are and will be consecutive on all platforms Perl
>>>>>>     supports, so C<< [0-3] >> always matches the digits 0123, and so on.
>>>>>>
>>>>>> just to cover all the bases.
>>>>>
>>>>> I disagree.
>>>>>
>>>>> Because that gives the expectation that C<< [D-N] >> will do that as well, but it does not.
>>>>
>>>> But it probably should.
>>>
>>> Well, that's another whole kettle of fish.
>>>
>>> For now, I'm just concerned about documenting what Perl currently does, and if it does something DWIM for [A-Z] and [a-z] on EBCDIC, then it should be documented, independent of whether we want to change the meaning of [D-N] in the future or not.
>>>
>>> Abigail
>>
>> [D-N] means [DEFGHIJKLMN] on EBCDIC platforms, and that is how it has worked, according to perlebcdic, since 5.005_03.
>
> I'm not understanding where the idea that we currently have horrible behavior is coming from.

The docs aren't very clear on this. I don't see anything that spells this issue out like you have below.

> Any subset of the ranges [a-z] and [A-Z] is (and has been) specially handled to match on EBCDIC platforms the same equivalent characters it matches on ASCII platforms. Hence qr/[i-j]/i matches [ijIJ] on both ASCII and EBCDIC platforms.

I think this is the problem. Why does this apply to [a-z] and [A-Z] only? Why not to all literals?

> The special handling is only valid if both ends of the range are literals. In EBCDIC, \xC9 is 'I' and \xD1 is 'J'. If you specify any of [\xC9-J], [I-\xD1], or [\xC9-\xD1], you get all the code points C9, CA, CB, CC, CD, CE, CF, D0, and D1. This is how it has worked since apparently 5.005_03, and is how I think it should continue to work. In other words, I think we got the design right.

For ranges involving non-literals I agree. But I don't think this design is sane for literals.

In other words, I think a rule that said that "literals in character classes will be interpreted according to the Unicode specification" is a better rule than what you described.

I don't suppose we can change it now but the current rules seem unnecessarily confusing.

The docs on ranges in perlrecharclass.pod say this:

> Character Ranges
>
> It is not uncommon to want to match a range of characters. Luckily, instead of listing all characters in the range, one may use the hyphen ("-"). If inside a bracketed character class you have two characters separated by a hyphen, it's treated as if all characters between the two were in the class. For instance, "[0-9]" matches any ASCII digit, and "[a-m]" matches any lowercase letter from the first half of the old ASCII alphabet.
>
> Note that the two characters on either side of the hyphen are not necessarily both letters or both digits. Any character is possible, although not advisable. "['-?]" contains a range of characters, but most people will not know which characters that means. Furthermore, such ranges may lead to portability problems if the code has to run on a platform that uses a different character set, such as EBCDIC.
>
> If a hyphen in a character class cannot syntactically be part of a range, for instance because it is the first or the last character of the character class, or if it immediately follows a range, the hyphen isn't special, and so is considered a character to be matched literally. If you want a hyphen in your set of characters to be matched and its position in the class is such that it could be considered part of a range, you must escape that hyphen with a backslash.
>
> Examples:
>
>     [a-z]       # Matches a character that is a lower case ASCII letter.
>     [a-fz]      # Matches any letter between 'a' and 'f' (inclusive) or
>                 # the letter 'z'.
>     [-z]        # Matches either a hyphen ('-') or the letter 'z'.
>     [a-f-m]     # Matches any letter between 'a' and 'f' (inclusive), the
>                 # hyphen ('-'), or the letter 'm'.
>     ['-?]       # Matches any of the characters  '()*+,-./0123456789:;<=>?
>                 # (But not on an EBCDIC platform).
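
The hyphen rules quoted above can be checked by counting what each class matches. This is a rough sketch for an ASCII platform only; the `members` helper and the `%count` hash are names made up for the illustration, and the list of classes mixes the perlrecharclass examples with the three-character variants mentioned in perlre.

```perl
use strict;
use warnings;

# List the characters in 0..255 matched by a class, in code point order.
sub members {
    my ($class) = @_;
    return join '', grep { $_ =~ $class } map { chr($_) } 0 .. 255;
}

my %count;
for my $class ('[a-z]', '[-az]', '[az-]', '[a\-z]', '[a-f-m]', q{['-?]}) {
    my $set = members(qr/$class/);
    $count{$class} = length($set);
    printf "%-8s %2d  %s\n", $class, length($set), $set;
}
```

A hyphen at the start or end of the class, after a complete range, or behind a backslash is counted as an ordinary character.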

If I read this carefully, with your mails fully in mind, I can see how what you say and what it says agree, or perhaps better, do not disagree. However, a quick reading of the second paragraph might lead someone to think that character class ranges are in general not portable, or to miss the significance of ASCII in the descriptions.

Also in perlre:

> (The following all specify the same class of three characters: "[-az]", "[az-]", and "[a\-z]". All are different from "[a-z]", which specifies a class containing twenty-six characters, even on EBCDIC-based character sets.) Also, if you try to use the character classes "\w", "\W", "\s", "\S", "\d", or "\D" as endpoints of a range, the "-" is understood literally.
>
> Note also that the whole range idea is rather unportable between character sets--and even within character sets they may cause results you probably didn't expect. A sound principle is to use only ranges that begin from and end at either alphabetics of equal case ([a-e], [A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt, spell out the character sets in full.

Now again, when I read that with what you said in mind I understand that they are in agreement.

But your mail spelled it out a whole lot more clearly than any of the docs I found.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 10 years ago

From @khwilliamson

On 09/29/2014 12:26 PM, demerphq wrote:

>> Any subset of the ranges [a-z] and [A-Z] is (and has been) specially handled to match on EBCDIC platforms the same equivalent characters it matches on ASCII platforms. Hence qr/[i-j]/i matches [ijIJ] on both ASCII and EBCDIC platforms.
>
> I think this is the problem. Why does this apply to [a-z] and [A-Z] only? Why not to all literals?
>
>> The special handling is only valid if both ends of the range are literals. In EBCDIC, \xC9 is 'I' and \xD1 is 'J'. If you specify any of [\xC9-J], [I-\xD1], or [\xC9-\xD1], you get all the code points C9, CA, CB, CC, CD, CE, CF, D0, and D1. This is how it has worked since apparently 5.005_03, and is how I think it should continue to work. In other words, I think we got the design right.
>
> For ranges involving non-literals I agree. But I don't think this design is sane for literals.
>
> In other words, I think a rule that said that "literals in character classes will be interpreted according to the Unicode specification" is a better rule than what you described.
>
> I don't suppose we can change it now but the current rules seem unnecessarily confusing.

I'm not sure I understand your point here. [%] matches an ASCII percent on an ASCII platform, and an EBCDIC percent on an EBCDIC platform. The code is perfectly portable. All literal characters match properly on both platforms, and would continue to do so if Perl were ever ported to yet another platform. (The odds of that happening are infinitesimal, I realize.)

But there are only three cases where it is obvious what should be in a range of literals. Those are any subsets of A-Z, a-z, and 0-9. Perl takes special action to handle those as DWIM.

The only other ASCII literal characters are punctuation and space. There is no natural language intrinsic ordering of them, and hence ranges with these as end points are obfuscations of what is really happening.

Perl need not take special efforts to handle obfuscated code. I doubt that there is anybody on this list who knows immediately what [%-{] matches, or [|-&]. These match differently on EBCDIC than ASCII. It would be too late to change this behavior, nor do I think it would be desirable to do so.
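
How platform-dependent such punctuation ranges are can be seen even without an EBCDIC machine. In ASCII, `|` (0x7C) sorts after `&` (0x26), so the second range is reversed and Perl refuses to compile it. The sketch below assumes an ASCII platform; `compile_error` is a hypothetical helper written for this example.

```perl
use strict;
use warnings;

# Compile a pattern held in a string; return '' on success or the error text.
sub compile_error {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    return defined $re ? '' : $@;
}

my $percent_brace = compile_error('[%-{]');
my $pipe_amp      = compile_error('[|-&]');

print "[%-{] : ", ($percent_brace eq '' ? "compiles" : $percent_brace), "\n";
print "[|-&] : ", ($pipe_amp      eq '' ? "compiles" : $pipe_amp),      "\n";
```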

This from the docs you quoted is right: "A sound principle is to use only ranges that begin from and end at either alphabetics of equal case ([a-e], [A-E]), or digits ([0-9])". Perl should support doing that, but no more, at least in the ASCII range.

Above ASCII, there may be scripts where there are ranges that might benefit from similar handling. One possibility is Greek, where there is a tradition of viewing things as a range ("I am the alpha and the omega", for example). And there is a hole in the upper case version of these, which Perl could exclude from matches in subsets of [Α-Ω]. But we run into trouble with the lowercase ones, as there are two versions of sigma in the middle (which are really glyph variants of each other, and so should not have been encoded separately in Unicode, but were for compatibility with earlier standards). I think that probably the number of scripts where this makes sense is relatively small, so it might create more confusion than it's worth to take special action for just those. So, I'm certainly not going to propose doing it.
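
The Greek case can be examined directly, since these ranges are plain code point ranges in Perl. This is only a sketch of the current behaviour, assuming a Perl whose Unicode tables leave U+03A2 unassigned; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Code points in the Greek block matched by the upper and lower case ranges.
my @upper = grep { chr($_) =~ /[\x{391}-\x{3A9}]/ } 0x370 .. 0x3FF;   # Alpha .. Omega
my @lower = grep { chr($_) =~ /[\x{3B1}-\x{3C9}]/ } 0x370 .. 0x3FF;   # alpha .. omega

# The hole in the upper case run, and the two sigmas in the lower case run.
my $hole_unassigned = chr(0x3A2) =~ /\p{Unassigned}/ ? 1 : 0;
my @sigmas          = grep { $_ == 0x3C2 || $_ == 0x3C3 } @lower;

printf "upper range: %d code points (U+03A2 unassigned: %s)\n",
    scalar(@upper), ($hole_unassigned ? 'yes' : 'no');
printf "lower range: %d code points, %d of them sigmas\n",
    scalar(@lower), scalar(@sigmas);
```

Both ranges span 25 code points for a 24-letter alphabet: the upper case one because of the hole, the lower case one because of the final sigma.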

p5pRT commented 10 years ago

From @cpansprout

On Wed Oct 01 20:42:11 2014, public@khwilliamson.com wrote:

> But we run into trouble with the lowercase ones, as there are two versions of sigma in the middle (which are really glyph variants of each other, and so should not have been encoded separately in Unicode, but were for compatibility with earlier standards).

On the contrary, the distinction between σ and ς is crucial when it comes to abbreviations. Κος. and Κοσ. are not interchangeable.

--

Father Chrysostomos

p5pRT commented 10 years ago

From @demerphq

On 2 October 2014 05:41, Karl Williamson <public@khwilliamson.com> wrote:

> On 09/29/2014 12:26 PM, demerphq wrote:
>>> Any subset of the ranges [a-z] and [A-Z] is (and has been) specially handled to match on EBCDIC platforms the same equivalent characters it matches on ASCII platforms. Hence qr/[i-j]/i matches [ijIJ] on both ASCII and EBCDIC platforms.
>>
>> I think this is the problem. Why does this apply to [a-z] and [A-Z] only? Why not to all literals?
>>
>>> The special handling is only valid if both ends of the range are literals. In EBCDIC, \xC9 is 'I' and \xD1 is 'J'. If you specify any of [\xC9-J], [I-\xD1], or [\xC9-\xD1], you get all the code points C9, CA, CB, CC, CD, CE, CF, D0, and D1. This is how it has worked since apparently 5.005_03, and is how I think it should continue to work. In other words, I think we got the design right.
>>
>> For ranges involving non-literals I agree. But I don't think this design is sane for literals.
>>
>> In other words, I think a rule that said that "literals in character classes will be interpreted according to the Unicode specification" is a better rule than what you described.
>>
>> I don't suppose we can change it now but the current rules seem unnecessarily confusing.
>
> I'm not sure I understand your point here. [%] matches an ASCII percent on an ASCII platform, and an EBCDIC percent on an EBCDIC platform. The code is perfectly portable. All literal characters match properly on both platforms, and would continue to do so if Perl were ever ported to yet another platform. (The odds of that happening are infinitesimal, I realize.)
>
> But there are only three cases where it is obvious what should be in a range of literals. Those are any subsets of A-Z, a-z, and 0-9. Perl takes special action to handle those as DWIM.
>
> The only other ASCII literal characters are punctuation and space. There is no natural language intrinsic ordering of them, and hence ranges with these as end points are obfuscations of what is really happening.

Whether or not they are an obfuscation is a personal aesthetic opinion. And since there are many natural language orderings of the characters in A-Z, I don't feel you are on particularly firm ground suggesting there is something intrinsically more sensible about A-Z than %-{.

> Perl need not take special efforts to handle obfuscated code.

I think this is a terrible justification for the language not being well defined.

I mean, this case is rather different from "The CPU does math in a different endianness than your code expects" type undefined behaviour that cannot be avoided. With character class ranges the damage is self-inflicted. I think that is sad and unnecessary.

I doubt that there is anybody on this list who knows immediately what [%-{] matches\, or [|-&].

I don't think whether people offhand know how many characters are in the Unicode character set [%-{] is relevant. The point is that once you have looked it up you should be able to rely on it everywhere Perl runs. And if you took this kind of argument to the extreme it would lead to seriously bizarre consequences.

Heck\, I'm not sure that many people could tell you how many characters there are between "P" and "W" off the top of their head\, and I bet a lot of people from non-English backgrounds would *disagree* on the subject.

IOW\, I think the position you take differentiating between A-Z and %-{ is rooted in the fact that you and ASCII share a common cultural background. If you were Icelandic you would expect to find "á" after "a"\, but ASCII doesn't do that. In fact strictly speaking ASCII can't even represent "á".

So I think you are manufacturing a distinction between A-Z and %-{ that is not really there\, and to the extent that it does exist\, is culturally specific.

I think that is a pretty terrible basis to decide that one part of a regex pattern is well defined and others are not.

These match differently on EBCDIC than ASCII.

Yes\, well that is the problem right? They are only poorly defined *because* they are different on EBCDIC and ASCII.

It would be too late to change this behavior\, nor do I think it would be desirable to do so.

Yes\, I suspect you are right. Sadly.

On the other hand what would we do if we targeted a different platform that also used a different native character set? IMO we would be *nuts* to repeat this design decision for said hypothetical platform.

This from the docs you quoted is right: "A sound principle is to use only ranges that begin from and end at either alphabetics of equal case ([a-e]\, [A-E])\, or digits ([0-9])" Perl should support doing that\, but no more\, at least in the ASCII range.

In an ideal world we would delete that sentence and replace it with "character class ranges composed of literals are always interpreted according to the Unicode standard\, so [%-{] will always match 87 characters regardless of native encoding\, although the actual codepoints matched may differ from Unicode where appropriate".
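To make the difference concrete, here is a minimal sketch that compares the two orderings of `%` through `{`. It assumes the `cp1047` table shipped with the core `Encode` module is a fair stand-in for a native EBCDIC platform; it is an illustration run on an ASCII machine, not a statement of what any particular Perl build does.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Unicode/ASCII view: everything from '%' (0x25) through '{' (0x7B).
my @unicode = map { chr($_) } 0x25 .. 0x7B;

# Native EBCDIC view. Assumption: Encode's 'cp1047' table approximates
# the code page of an EBCDIC platform.
my ($lo, $hi) = map { ord(encode('cp1047', $_)) } ('%', '{');
my @native = map { decode('cp1047', chr($_)) } $lo .. $hi;

my $unicode_digits = grep { /\A[0-9]\z/ } @unicode;
my $native_digits  = grep { /\A[0-9]\z/ } @native;

printf "Unicode order: %d characters, %d of them digits\n",
    scalar(@unicode), $unicode_digits;
printf "cp1047 order:  %d characters (0x%02X..0x%02X), %d of them digits\n",
    scalar(@native), $lo, $hi, $native_digits;
```

The two ranges differ in both size and content, which is the portability problem under discussion.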

IOW\, the problem here is that when we ported the regex engine to EBCDIC we did not properly separate out "code points in the pattern as expressed as literals" and "native representation of those code points". Which I suppose is natural given our EBCDIC port predates Unicode\, but it is still unfortunate.

I do not think we should have any platform specific behaviour other than that which is forced upon us.

And I do not think it is good that a *scripting* language like Perl has portability issues which are not forced upon us.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 10 years ago

From @abigail

On Thu\, Oct 02\, 2014 at 09:30:09AM +0200\, demerphq wrote:

[Much discussion snipped]

I don't think whether people offhand know how many characters are in the Unicode character set [%-{] is relevant. The point is that once you have looked it up you should be able to rely on it everywhere Perl runs. And if you took this kind of argument to the extreme it would lead to seriously bizarre consequences.

Let's not blow things out of proportion.

How much code is actually affected by this? EBCDIC isn't exactly a major platform\, and I'd bet that most Perl programmers have never written code that needs to run on both EBCDIC and non-EBCDIC platforms. On top of that\, [%-{] and friends aren't that common either. I doubt there's a sizeable corpus of code that's affected by this.

Sure\, that [%-{] matches a different set of characters on EBCDIC and non-EBCDIC is unfortunate\, but in practice\, has that ever led to problems? Do we have a list of bug reports related to this? Are there (m)any questions related to this on Perlmonks\, Stackoverflow\, Usenet?

In theory\, we could "fix" this\, but is that worth the effort? Hasn't EBCDIC been on the edge of being dropped as a supported platform for quite some time now\, due to the inability to test the platform?

My advice: if you write code that you think will run on EBCDIC\, don't use [%-{]. (I would say\, not using [%-{] ever in serious code is a smart thing to do anyway).

Abigail

p5pRT commented 10 years ago

From @ap

* Abigail \<abigail@abigail.be> [2014-10-02 13:45]:

Let's not blow things out of proportion. How much code is actually affected by this? […] in practice\, has that ever led to problems? […] In theory\, we could "fix" this\, but is that worth the effort?

I think everyone is agreed that this ship has sailed\, Yves included.

I still agree with Yves that if one were designing this from scratch now\, the only reasonable approach would be to say that the semantics for ranges between verbatim characters always follow Unicode\, rather than special-casing particular classes of them.

p5pRT commented 10 years ago

From @khwilliamson

On 10/01/2014 10:48 PM\, Father Chrysostomos via RT wrote:

On Wed Oct 01 20:42:11 2014\, public@khwilliamson.com wrote:

But we run into trouble with the lowercase ones\, as there are two versions of sigma in the middle (which are really glyph variants of each other\, and so should not have been encoded separately in Unicode\, but were for compatibility with earlier standards).

On the contrary\, the distinction between σ and ς is crucial when it comes to abbreviations. Κος. and Κοσ. are not interchangeable.

I stand corrected

p5pRT commented 10 years ago

From @khwilliamson

I have now documented the behavior in 09e4339761388239d17da23bf3fa0c882a0b04bf.

In looking at this\, I found some bugs in the EBCDIC handling\, which are now fixed.

I also changed things so that /[\N{LATIN SMALL LETTER I}-j]/ (and similar) will receive the special handling that /[i-j]/ receives to guarantee the behavior. This includes the \N{U+...} form\, as the "U" means Unicode\, and so the behavior should be like what would happen with Unicode. -- Karl Williamson
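A small sketch of why `[i-j]` needs special handling at all, and of the named form behaving like the literal form. The first half assumes `Encode`'s `cp1047` table stands in for an EBCDIC platform; the second half simply runs on whatever platform executes it and assumes Perl 5.16 or later so that `\N{...}` names load automatically.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: 'cp1047' approximates the native EBCDIC code page.
my ($native_i, $native_j) = map { ord(encode('cp1047', $_)) } ('I', 'J');
printf "native I = 0x%02X, native J = 0x%02X\n", $native_i, $native_j;

# Decode every native code point between the two letters and count
# how many are actually uppercase ASCII letters.
my @between = map { decode('cp1047', chr($_)) } $native_i .. $native_j;
my @letters = grep { /\A[A-Z]\z/ } @between;
printf "%d native code points in the gap, %d of them letters\n",
    scalar(@between), scalar(@letters);

# On the running platform, the literal and named forms of the range
# should select the same characters.
my @literal = grep { chr($_) =~ /[i-j]/ } 0 .. 255;
my @named   = grep { chr($_) =~ /[\N{LATIN SMALL LETTER I}-j]/ } 0 .. 255;
my $folded  = join '', grep { /[i-j]/i } map { chr($_) } 0 .. 255;

print "literal and named forms agree\n" if "@literal" eq "@named";
print "case-insensitive match set: $folded\n";
```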

p5pRT commented 10 years ago

@khwilliamson - Status changed from 'open' to 'resolved'

p5pRT commented 10 years ago

From @khwilliamson

On 10/02/2014 01:30 AM\, demerphq wrote:

On 2 October 2014 05:41\, Karl Williamson \<public@khwilliamson.com> wrote:

On 09/29/2014 12:26 PM\, demerphq wrote:

         Any subset of the ranges [a-z] and [A-Z] is (and has been) specially
         handled to match on EBCDIC platforms the same equivalent characters
         it matches on ASCII platforms.  Hence qr/[i-j]/i, matches [ijIJ] on
         both ASCII and EBCDIC platforms.

    I think this is the problem. Why does this apply to [a-z] and [A-Z]
    only? Why not to all literals?

         The special handling is only valid if both ends of the range are
         literals.  In EBCDIC, \xC9 is 'I' and \xD1 is 'J'.  If you specify
         any of [\xC9-J], [I-\xD1], or [\xC9-\xD1], you get all the code
         points C9, CA, CB, CC, CD, CE, CF, and D1.  This is how it has
         worked since apparently 5.005_03, and is how I think it should
         continue to work.  In other words, I think we got the design right.

    For ranges involving non-literals I agree. But I don't think this design
    is sane for literals.

    In other words, I think a rule that said that "literals in character
    classes will be interpreted according to the Unicode specification" is a
    better rule than what you described.

    I don't suppose we can change it now but the current rules seem
    unnecessarily confusing.

I'm not sure I understand your point here. [%] matches an ASCII percent on an ASCII platform\, and an EBCDIC percent on an EBCDIC platform. The code is perfectly portable. All literal characters match properly on both platforms\, and would continue to do so if Perl were ever ported to yet another platform. (The odds of that happening are infinitesimal\, I realize.)

But there are only three cases where it is obvious what should be in a range of literals. Those are any subsets of A-Z\, a-z\, and 0-9. Perl takes special action to handle those as DWIM.

The only other ASCII literal characters are punctuation and space. There is no natural language intrinsic ordering of them\, and hence ranges with these as end points are obfuscations of what is really happening.

Whether or not they are an obfuscation is a personal aesthetic opinion. And since there are many natural language orderings of the characters in A-Z\, I don't feel you are on particularly firm ground suggesting there is something intrinsically more sensible about A-Z than %-{.

Perl need not take special efforts to handle obfuscated code.

I think this is a terrible justification for the language not being well defined.

I mean\, this case is rather different from "The CPU does math in a different endianness than your code expects" type undefined behaviour that cannot be avoided. With character class ranges the damage is self-inflicted. I think that is sad and unnecessary.

I doubt that there is anybody on this list who knows immediately what [%-{] matches\, or [|-&].

I don't think whether people offhand know how many characters are in the Unicode character set [%-{] is relevant. The point is that once you have looked it up you should be able to rely on it everywhere Perl runs. And if you took this kind of argument to the extreme it would lead to seriously bizarre consequences.

Heck\, I'm not sure that many people could tell you how many characters there are between "P" and "W" off the top of their head\, and I bet a lot of people from non-English backgrounds would *disagree* on the subject.

IOW\, I think the position you take differentiating between A-Z and %-{ is rooted in the fact that you and ASCII share a common cultural background. If you were Icelandic you would expect to find "á" after "a"\, but ASCII doesn't do that. In fact strictly speaking ASCII can't even represent "á".

So I think you are manufacturing a distinction between A-Z and %-{ that is not really there\, and to the extent that it does exist\, is culturally specific.

I think that is a pretty terrible basis to decide that one part of a regex pattern is well defined and others are not.

These match differently on EBCDIC than ASCII.

Yes\, well that is the problem right? They are only poorly defined *because* they are different on EBCDIC and ASCII.

It would be too late to change this behavior\, nor do I think it would be desirable to do so.

Yes\, I suspect you are right. Sadly.

On the other hand what would we do if we targeted a different platform that also used a different native character set? IMO we would be *nuts* to repeat this design decision for said hypothetical platform.

This from the docs you quoted is right: "A sound principle is to use only ranges that begin from and end at either alphabetics of equal case ([a-e]\, [A-E])\, or digits ([0-9])" Perl should support doing that\, but no more\, at least in the ASCII range.

In an ideal world we would delete that sentence and replace it with "character class ranges composed of literals are always interpreted according to the Unicode standard\, so [%-{] will always match 87 characters regardless of native encoding\, although the actual codepoints matched may differ from Unicode where appropriate".

IOW\, the problem here is that when we ported the regex engine to EBCDIC we did not properly separate out "code points in the pattern as expressed as literals" and "native representation of those code points". Which I suppose is natural given our EBCDIC port predates Unicode\, but it is still unfortunate.

I do not think we should have any platform specific behaviour other than that which is forced upon us.

And I do not think it is good that a *scripting* language like Perl has portability issues which are not forced upon us.

Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"

I agree that it would be nice to be able to portably specify ranges. But before I get to that\, I have a couple of points to make\, moot as they might be.

If one has to look up what's exactly in a range when coding\, then that person is unfairly burdening whoever might take up the maintenance of that code in the future.

You may very well be right about my cultural bias about what's in A-Z. I've tried to imagine what I would think if my first language had had other characters\, but I can't really.

But your idealized solution effectively says to people on EBCDIC that they have to use a foreign character set\, and that is just as chauvinistic as my A-Z bias. There are people who code solely on and for EBCDIC\, and Perl should accommodate their native way of thinking. So \x04 has to mean the character whose code point is natively 4 on whatever platform the code is being run on. If you want to specify the character whose *Unicode* code point is 4\, you can use \N{U+04}.

But then what about this range?

[\N{U+04}-\N{U+09}]

It seems obvious to me that what the coder meant is

[\N{U+04}\N{U+05}\N{U+06}\N{U+07}\N{U+08}\N{U+09}]

But on EBCDIC it currently doesn't mean that; it is an error because \N{U+04} is 0x37 and \N{U+09} is 0x05\, so we have a range whose first value is larger than the second value\, which is not allowed. I think this is a bug\, and I propose to fix it. The fix is not hard. The paradigm is that a range in any platform which is specified in terms of Unicode end-points should follow Unicode rules. That gives portability across all platforms.
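The inverted endpoints can be checked with a short sketch. It assumes, as an approximation only, that `Encode`'s `cp1047` table reflects the native code points of an EBCDIC platform; the hash name `%native` is purely illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Map each Unicode code point in U+04..U+09 to its native value.
# Assumption: 'cp1047' from Encode stands in for the EBCDIC platform.
my %native;
for my $cp (0x04 .. 0x09) {
    $native{$cp} = ord(encode('cp1047', chr($cp)));
    printf "U+%04X -> native 0x%02X\n", $cp, $native{$cp};
}

# A range built from the native values runs backwards when the first
# endpoint's native value is larger than the second's.
my $inverted = $native{0x04} > $native{0x09} ? 1 : 0;
print "native endpoints are inverted\n" if $inverted;
```

Treating both endpoints as Unicode and translating each member individually, as proposed above, avoids depending on the native order.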

By extension\, I think that using the Unicode name syntax should act identically as the U+ syntax. The above range could be specified using that syntax as

[\N{EOT}-\N{HT}]

and should include EOT (4 on ASCII)\, HT (9 on ASCII) plus U+05..U+08 (ENQ\, ACK\, BEL and BS\, which are 5\, 6\, 7 and 8 respectively in ASCII).

So\, by specifying a range in Unicode terminology\, one could get the portability Yves wants. [\N{PERCENT SIGN}-\N{LEFT CURLY BRACKET}] would match the same characters on all platforms that [%-{] does on ASCII.
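As a rough check of that equivalence on the platform at hand, the following sketch compares the literal and the named spelling of the range. It assumes Perl 5.16 or later, where `\N{...}` names are loaded automatically; on an ASCII platform the two are expected to agree regardless of the change being proposed.

```perl
use strict;
use warnings;

my $literal = qr/[%-{]/;
my $named   = qr/[\N{PERCENT SIGN}-\N{LEFT CURLY BRACKET}]/;

# Collect the code points in 0..255 matched by each spelling.
my @literal_hits = grep { chr($_) =~ $literal } 0 .. 255;
my @named_hits   = grep { chr($_) =~ $named   } 0 .. 255;

my $same = "@literal_hits" eq "@named_hits" ? 1 : 0;
printf "literal: %d code points, named: %d code points, %s\n",
    scalar(@literal_hits), scalar(@named_hits),
    $same ? "identical" : "different";
```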

The remaining question I have is what happens if only one end of the range is a Unicode construct?

[\N{U+04}-\x{09}] [\x{04}-\N{U+09}]

I think this should be deprecated\, and in the meantime\, the non-Unicode endpoint be considered to be the Unicode value. There are no such usages currently in CPAN. In fact\, there are only 2 modules that use \N{} in ranges\, and both look to be wanting the behavior I'm proposing here.

http://grep.cpan.me/?q=\[.*\\N{[^}]*}-+-file%3A%22\.pod%24%22 http://grep.cpan.me/?q=-\\N{[^}]*}+-file%3A%22\.pod%24%22

p5pRT commented 10 years ago

From @cpansprout

On Wed Oct 29 21:44:19 2014\, public@khwilliamson.com wrote:

The remaining question I have is what happens if only one end of the range is a Unicode construct?

[\N{U+04}-\x{09}] [\x{04}-\N{U+09}]

I think this should be deprecated\,

I don’t think it should be deprecated. Most of us don’t care whether our code runs on EBCDIC\, so things that just work on ASCII platforms should not be deprecated or removed because of EBCDIC-accommodating reasoning.

Everything else in your post I agree with.

--

Father Chrysostomos

p5pRT commented 10 years ago

From @epa

I suggest that "Most of us don’t care whether our code runs on EBCDIC\," is not the best way to frame the issue. For application code this may be true\, but library code usually has to be written more cautiously\, as you don't know where it will end up. I am not saying that nonportable constructs should be disallowed\, but that it should be an explicit choice for the programmer to use them\, where reasonably possible.

If deprecating [\N{U+04}-\x{09}] is not acceptable then just define it to treat both sides as Unicode. That leaves \x{}-\x{} as the explicit way to request a native (nonportable) range\, while everything else will be the same on all platforms.

p5pRT commented 10 years ago

From @ap

* Karl Williamson \<public@khwilliamson.com> [2014-10-30 05:45]:

You may very well be right about my cultural bias about what's in A-Z. I've tried to imagine what I would think if my first language had had other characters\, but I can't really.

But your idealized solution effectively says to people on EBCDIC that they have to use a foreign character set\, and that is just as chauvinistic as my A-Z bias.

This is conflating 2 arguments.

It’s cultural bias to give special rules to ranges in the Latin alphabet but nothing else. You could simply remove the special case if you wanted to be egalitarian.

Of course that would make the meaning of Perl programs more ambiguous than it is already. The reason the special case was added is so that Perl programs don’t mean one thing on ASCII/Unicode machines and another completely different one on EBCDIC machines. But they do mean different things – the special case just papers over the most glaring symptom. But to make Perl programs mean one thing\, universally\, you inherently have to pick one charset over every other as their character model. Unicode is only the obvious choice. (Heck\, z/OS has capitulated (re wrapper lib for porting Unicode-based programs); pretty much anything that comes in contact with the internet will have to capitulate eventually.)

But those two parts of the argument are separate points.

There are people who code solely on and for EBCDIC\, and Perl should accommodate their native way of thinking. So \x04 has to mean the character whose code point is natively 4 on whatever platform the code is being run on.

I’d say “all’s fair if you predeclare”\, as the Perl 6 do\, except\, well encoding.pm tried to offer that and it ended in tears. There would have to be a reason that it would turn out differently in this case.

So\, by specifying a range in Unicode terminology\, one could get the portability Yves wants.

Sounds good.

Absent the existing special case\, this would not suffice; I cannot imagine a lot of people would spell A-Z as \N{U+0041}-\N{U+005A} – not to mention that if clarity is your aim\, this is not the way to achieve it. And the clear way\, \N{LATIN CAPITAL LETTER A}-\N{LATIN CAPITAL LETTER Z}\, err\, well…

But since people can write the most common ranges portably anyway (even if only due to a culturally biased rule)\, this would only be needed for the harder-to-understand cases\, where it would at worst be no worse than the existing situation.

So\, given where we are\, it makes sense.

Regards\, -- Aristotle Pagaltzis // \<http://plasmasturm.org/>

p5pRT commented 10 years ago

From @ap

* Father Chrysostomos via RT \<perlbug-followup@perl.org> [2014-10-30 06:05]:

On Wed Oct 29 21:44:19 2014\, public@khwilliamson.com wrote:

The remaining question I have is what happens if only one end of the range is a Unicode construct?

[\N{U+04}-\x{09}] [\x{04}-\N{U+09}]

I think this should be deprecated\,

I don’t think it should be deprecated. Most of us don’t care whether our code runs on EBCDIC\, so things that just work on ASCII platforms should not be deprecated or removed because of EBCDIC-accommodating reasoning.

Are you arguing a principle here or do you have code that would break? (In which case\, how much?)

To me the principle behind this deprecation is not “this would not port to EBCDIC so you should not be doing this” but “we are making \x and \N mean different things that cannot semantically be mixed”.

Regards\, -- Aristotle Pagaltzis // \<http://plasmasturm.org/>

p5pRT commented 10 years ago

From @cpansprout

On Thu Oct 30 01:25:13 2014\, aristotle wrote:

* Father Chrysostomos via RT \<perlbug-followup@perl.org> [2014-10-30 06:05]:

On Wed Oct 29 21:44:19 2014\, public@khwilliamson.com wrote:

The remaining question I have is what happens if only one end of the range is a Unicode construct?

[\N{U+04}-\x{09}] [\x{04}-\N{U+09}]

I think this should be deprecated\,

I don’t think it should be deprecated. Most of us don’t care whether our code runs on EBCDIC\, so things that just work on ASCII platforms should not be deprecated or removed because of EBCDIC-accommodating reasoning.

Are you arguing a principle here

That.

or do you have code that would break? (In which case\, how much?)

To me the principle behind this deprecation is not “this would not port to EBCDIC so you should not be doing this” but “we are making \x and \N mean different things that cannot semantically be mixed”.

But on ASCII systems character ranges are simple (start at the Unicode codepoint specified by the left-hand character and iterate through them to the right-hand character). I don’t think making them more complex brings any benefit. On EBCDIC\, due to the model that Perl follows\, they are naturally complex\, but that complexity needn’t affect code and programmers that never come in contact with EBCDIC.

--

Father Chrysostomos

p5pRT commented 10 years ago

From @cpansprout

On Thu Oct 30 04:19:49 2014\, sprout wrote:

On Thu Oct 30 01:25:13 2014\, aristotle wrote:

* Father Chrysostomos via RT \<perlbug-followup@perl.org> [2014-10-30 06:05]:

On Wed Oct 29 21:44:19 2014\, public@khwilliamson.com wrote:

The remaining question I have is what happens if only one end of the range is a Unicode construct?

[\N{U+04}-\x{09}] [\x{04}-\N{U+09}]

I think this should be deprecated\,

I don’t think it should be deprecated. Most of us don’t care whether our code runs on EBCDIC\, so things that just work on ASCII platforms should not be deprecated or removed because of EBCDIC-accommodating reasoning.

Are you arguing a principle here

That.

or do you have code that would break? (In which case\, how much?)

To me the principle behind this deprecation is not “this would not port to EBCDIC so you should not be doing this” but “we are making \x and \N mean different things that cannot semantically be mixed”.

But on ASCII systems character ranges are simple (start at the Unicode codepoint specified by the left-hand character and iterate through them to the right-hand character). I don’t think making them more complex brings any benefit. On EBCDIC\, due to the model that Perl follows\, they are naturally complex\, but that complexity needn’t affect code and programmers that never come in contact with EBCDIC.

I’m ignoring locales. I don’t know whether they have any bearing on this issue.

--

Father Chrysostomos

p5pRT commented 10 years ago

From @jhi

[\x{04}-\N{U+09}]

I think people who ask for weird things like this should be expecting weird results. In other words\, I wouldn't feel bad outlawing them.

The start of the range says "the 0x4 in native"\, the end of the range is "the U+09\, in Unicode". It makes no sense. If they wanted native-native\, they can write that. If they wanted Unicode-Unicode\, they can write that.

Similarly\, think of ranges like [A-z] (that's upper-A-to-lower-z)\, or [0-z] (zero-to-lower-z). Just think in ASCII. Should these mean 0x41-0x7a\, and 0x30-0x7a? If so\, they *will* contain the [[\\\]^_`] in the first case\, and the [:;\<=>?@\[\\\]^_`] in the second.
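The extra characters can be listed mechanically. This is a small sketch that runs on the current platform; the counts in the usage note assume ASCII ordering, and the hash name `%extra` is just for illustration.

```perl
use strict;
use warnings;

# For each range, collect what it matches beyond ASCII letters and digits.
my %extra;
for my $range ('A-z', '0-z') {
    my $class = qr/[$range]/;
    $extra{$range} = [
        grep { /$class/ && !/[A-Za-z0-9]/ } map { chr($_) } 0 .. 127
    ];
    printf "[%s] also matches: %s\n", $range, join(' ', @{ $extra{$range} });
}
```

On an ASCII platform the first range picks up six punctuation characters and the second thirteen.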

There's a lot of magic in Perl\, but I think there are limits in trying to always understand what the heck the user meant. Aborting or warning at least lets the user be more explicit (many a time the better solution is to use character classes\, like \p{Alpha})\, instead of relying on our guesswork.

As for the non-English speaking view\, and how locales would affect things. Well\, it's complicated... (surprised?): a-z *could* probably mean "all the lowercase letters" for the languages where z sorts last. But for languages where z doesn't come last\, a-z doesn't feel like "all the lowercase letters".

-- There is this special biologist word we use for 'stable'. It is 'dead'. -- Jack Cohen

p5pRT commented 10 years ago

From @cpansprout

On Thu Oct 30 06:08:46 2014\, jhi wrote:

[\x{04}-\N{U+09}]

I think people who ask for weird things like this should be expecting weird results. In other words\, I wouldn't feel bad outlawing them.

The start of the range says "the 0x4 in native"\, the end of the range is "the U+09\, in Unicode". It makes no sense. If they wanted native-native\, they can write that. If they wanted Unicode-Unicode\, they can write that.

As a native ASCII speaker\, I might not understand the native/Unicode distinction.

Similarly\, think of ranges like [A-z] (that's upper-A-to-lower-z)\, or [0-z] (zero-to-lower-z). Just think in ASCII. Should these mean 0x41-0x7a\, and 0x30-0x7a? If so\, they *will* contain the [[\\\]^_`] in the first case\, and the [:;\<=>?@\[\\\]^_`] in the second.

Perl lets people do stupid things. That is one of its strengths.

--

Father Chrysostomos

p5pRT commented 10 years ago

From @jhi

So as a native ASCIIer\, when you see ABC you are actually seeing 0x41 0x42 0x43?

*Fascinating.*

You also cannot see characters with diacritics.

-- There is this special biologist word we use for 'stable'. It is 'dead'. -- Jack Cohen

p5pRT commented 10 years ago

From @jhi

... and if we step outside ASCII\, what do you think X-Χ should match?

-- There is this special biologist word we use for 'stable'. It is 'dead'. -- Jack Cohen

p5pRT commented 10 years ago

From @cpansprout

On Thu Oct 30 15:26:11 2014\, jhi wrote:

... and if we step outside ASCII\, what do you think X-Χ should match?

If that is the same character at both ends (assuming it is in a character class\, and that your message was not somehow scrambled)\, then it would match that one character.

On Thu\, Oct 30\, 2014 at 1:30 PM\, Jarkko Hietaniemi \<jhi@iki.fi> wrote:

So as a native ASCIIer\, when you see ABC you are actually seeing 0x41 0x42 0x43?

*Fascinating.*

You also cannot see characters with diacritics.

By ‘native ASCII speaker’\, I meant one who works with ASCII all the time\, for whom Unicode is just ASCII extended to a phenomenal degree\, being a superset of ASCII.

--

Father Chrysostomos

p5pRT commented 10 years ago

From @jhi

On Thursday-201410-30\, 19:12\, Father Chrysostomos via RT wrote:

... and if we step outside ASCII\, what do you think X-Χ should match?

If that is the same character at both ends (assuming it is in a character class\, and that your message was not somehow scrambled)\, then it would match that one character.

Look closer.

p5pRT commented 10 years ago

From gdg@zplane.com

Jarkko Hietaniemi \<jhi@iki.fi> [2014-10-30 19:14:25 -0400]:

On Thursday-201410-30\, 19:12\, Father Chrysostomos via RT wrote:

... and if we step outside ASCII\, what do you think X-Χ should match?

[X-X] 'ɹǝʇʇɐɯ ʇɐɥʇ ɹoɟ 'ɹO

p5pRT commented 10 years ago

From @cpansprout

On Thu Oct 30 16:14:54 2014\, jhi wrote:

On Thursday-201410-30\, 19:12\, Father Chrysostomos via RT wrote:

... and if we step outside ASCII\, what do you think X-Χ should match?

If that is the same character at both ends (assuming it is in a character class\, and that your message was not somehow scrambled)\, then it would match that one character.

Look closer.

Duh. It should match all Unicode codepoints between 0x58 and 0x3a7.
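For the curious, a quick sketch that counts the range on an ASCII/Unicode platform. The Greek endpoint is written as `\N{GREEK CAPITAL LETTER CHI}` so the example does not depend on the source file's encoding; that spelling and the variable names are choices made here, and Perl 5.16 or later is assumed.

```perl
use strict;
use warnings;

# Endpoints: Latin capital X and Greek capital chi.
my $lo = ord('X');
my $hi = ord("\N{GREEK CAPITAL LETTER CHI}");

# Count the code points up to U+03FF that fall inside the range.
my $class = qr/[X-\N{GREEK CAPITAL LETTER CHI}]/;
my $count = grep { chr($_) =~ $class } 0 .. 0x3FF;

printf "0x%X..0x%X covers %d code points\n", $lo, $hi, $count;
```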

--

Father Chrysostomos

p5pRT commented 10 years ago

From @cpansprout

On Thu Oct 30 16:28:09 2014\, gdg@zplane.com wrote:

Jarkko Hietaniemi \<jhi@iki.fi> [2014-10-30 19:14:25 -0400]:

On Thursday-201410-30\, 19:12\, Father Chrysostomos via RT wrote:

... and if we step outside ASCII\, what do you think X-Χ should match?

[X-X] 'ɹǝʇʇɐɯ ʇɐɥʇ ɹoɟ 'ɹO

·ǝɔᴉʍʇ ɹǝʇɔɐɹɐɥɔ ǝɯɐs ǝɥʇ pǝsn no⅄

--

Father Chrysostomos

p5pRT commented 10 years ago

From gdg@zplane.com

Father Chrysostomos via RT \<perlbug-followup@perl.org> [2014-10-30 16:37:43 -0700]:

On Thu Oct 30 16:28:09 2014\, gdg@zplane.com wrote:

Jarkko Hietaniemi \<jhi@iki.fi> [2014-10-30 19:14:25 -0400]:

On Thursday-201410-30\, 19:12\, Father Chrysostomos via RT wrote:

... and if we step outside ASCII\, what do you think X-Χ should match?

[X-X] 'ɹǝʇʇɐɯ ʇɐɥʇ ɹoɟ 'ɹO

·ǝɔᴉʍʇ ɹǝʇɔɐɹɐɥɔ ǝɯɐs ǝɥʇ pǝsn no⅄

¿(⇂)po ɥʇıʍ ɹo ʎʃʃɐnsıʌ pǝuıɯɹǝʇǝp ʇɐɥʇ sɐʍ ʇnq 'ɥƃnouǝ ǝnɹ⊥

In any case\, it's not obvious (to a human reader of the source code) whether X is or isn't the 24th element of [A-Z].

Otoh\, as a complete outsider\, I probably shouldn't even be injecting my uninformed opinion\, since I likely don't appreciate many of the subtleties. But the crystal clarity of ignorance is compelling\, so will simply opine that\, imo\, supporting non-ascii ranges using legacy syntax seems to make the same amount of sense as supporting non-ascii method names and operators (also recently debated): Superficially attractive -- egalitarian\, easy to root for because it's so fair-sounding and non-chauvinistic -- but in the long run\, potentially leading to source balkanization and maintenance issues\, which are non-obvious\, difficult to anticipate\, and probably tricky (perhaps even impossible) to resolve cleanly in the future.

Just my 2c. Please educate me if I'm off base.

p5pRT commented 10 years ago

From @abigail

On Thu\, Oct 30\, 2014 at 07:19:07AM +0000\, Ed Avis wrote:

I suggest that "Most of us don’t care whether our code runs on EBCDIC\," is not the best way to frame the issue. For application code this may be true\, but library code usually has to be written more cautiously\, as you don't know where it will end up. I am not saying that nonportable constructs should be disallowed\, but that it should be an explicit choice for the programmer to use them\, where reasonably possible.

I write tons of library code. In fact\, that's my job.

And I know damn well it's never ever going to end up on an EBCDIC platform.

Forbidding or warning about code that would have worked as intended on ASCII platforms\, just to accommodate the few coders who have to write code that needs to run on both ASCII and EBCDIC platforms\, is the wrong tradeoff IMO.

If you want your code to run on both ASCII and EBCDIC\, go ahead\, and don't mix [\x{..}-\N{..}] ranges. But there's no need to forbid it for people who don't (have to) care.

Abigail

p5pRT commented 10 years ago

From @ap

* Father Chrysostomos via RT <perlbug-followup@perl.org> [2014-10-30 12:25]:

On Thu Oct 30 01:25:13 2014\, aristotle wrote:

To me the principle behind this deprecation is not “this would not port to EBCDIC so you should not be doing this” but “we are making \x and \N mean different things that cannot semantically be mixed”.

But on ASCII systems character ranges are simple (start at the Unicode codepoint specified by the left-hand character and iterate through them to the right-hand character). I don’t think making them more complex brings any benefit. On EBCDIC\, due to the model that Perl follows\, they are naturally complex\, but that complexity needn’t affect code and programmers that never come in contact with EBCDIC.

How are they naturally complex? They are not any different in principle in EBCDIC than in ASCII and so don’t have to be any more complex. The complexity with EBCDIC is a choice\, made in the design of Perl\, out of the desire to preserve (some of!) the meaning of programs written under assumptions based on ASCII.

And here\, the proposed solution (which seems the only sensible one too) is that if you use two \x{}s\, then \x{} means one thing\, but if you use one \x{} and \N{} then \x{} means another thing. On ASCII platforms that is a distinction without a difference\, but on EBCDIC platforms it’s not.
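A minimal sketch of the native/Unicode distinction being discussed, using plain strings rather than ranges. It assumes only core Perl; the variable names are made up for illustration. `\N{U+41}` names the Unicode code point, while `chr(0x41)` (like a lone `\x{41}`) names whatever character sits at native code point 0x41, so the two agree on ASCII platforms and differ on EBCDIC ones.

```perl
use strict;
use warnings;

# \N{U+41} always names Unicode code point U+0041, i.e. "A".
my $unicode_a = "\N{U+41}";

# chr(0x41) is the character at *native* code point 0x41,
# which is "A" only where the native character set is ASCII-based.
my $native_41 = chr(0x41);

# ord("A") is 0x41 on ASCII platforms; EBCDIC code pages put "A" elsewhere.
my $is_ascii_platform = ord("A") == 0x41;

printf "platform looks ASCII-based: %s\n", $is_ascii_platform ? "yes" : "no";
printf "\\N{U+41} is 'A': %s\n", $unicode_a eq "A" ? "yes" : "no";
printf "chr(0x41) is 'A': %s\n", $native_41 eq "A" ? "yes" : "no";
```

On an ASCII platform both comparisons should print "yes", which is the "distinction without a difference" case; the second one is the part that is expected to change on EBCDIC.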

Hm.

I wonder if there’s a case for just allowing such mixed ranges on ASCII systems but warning about them on EBCDIC systems?

That way\, that group of users who are possibly affected at least get a chance to notice\, and can patch the code if they own it or else ask for a patch if e.g. they got it from CPAN.

OTOH\, if \x{} in mixed ranges is a synonym for \N{U+}\, then in 99.9% of cases the response will be to replace the \x{} with a \N{U+} because that’s what it did before\, so nothing about the program actually changes and so it ultimately is a pointless make-the-user-say-it-right warning.

So then we’re left with a lone \x{} meaning something distinct from an \x{} partnered with another \x{}\, which I can’t bring myself to like – even though it will admittedly be a distinction without a difference for all but a tiny minority of users.

Regards\, -- Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented 10 years ago

From @epa

Perhaps the answer is to introduce a new escape \X which is explicitly non-portable and looks up in the native character set. \x and \N look up in Unicode on all platforms. Mixing \X with the others in a single range is not allowed.

p5pRT commented 9 years ago

From @abigail

On Sat\, Nov 01\, 2014 at 09:14:33AM +0000\, Ed Avis wrote:

Perhaps the answer is to introduce a new escape \X which is explicitly non-portable and looks up in the native character set. \x and \N look up in Unicode on all platforms. Mixing \X with the others in a single range is not allowed.

Before we all go changing stuff (outlawing constructs\, throwing warnings\, changing meaning\, whatever else has been suggested)\, does anyone have actual examples of code (that is\, real code\, not constructed cases in a theoretical scenario) that doesn't behave as the programmer intended\, and is caused by the ASCII-EBCDIC differences? Any bug reports?

In other words\, is there an actual problem that needs to be tackled?

Abigail

p5pRT commented 9 years ago

From @khwilliamson

On 11/06/2014 02:42 AM\, Abigail wrote:

On Sat\, Nov 01\, 2014 at 09:14:33AM +0000\, Ed Avis wrote:

Perhaps the answer is to introduce a new escape \X which is explicitly non-portable and looks up in the native character set. \x and \N look up in Unicode on all platforms. Mixing \X with the others in a single range is not allowed.

Before we all go changing stuff (outlawing constructs\, throwing warnings\, changing meaning\, whatever else has been suggested)\, does anyone have actual examples of code (that is\, real code\, not constructed cases in a theoretical scenario) that doesn't behave as the programmer intended\, and is caused by the ASCII-EBCDIC differences? Any bug reports?

In other words\, is there an actual problem that needs to be tackled?

Abigail

On the contrary\, the absence of any real-world examples would argue for at least a warning\, as opposed to doing nothing.

First\, it means that little will break should we change behavior\, so it's pretty safe to do so.

Second\, it also means that people don't tend to do this in real life\, and so when it happens\, it is likely to be a mistake rather than intentional\, hence a likely bug that the programmer should be warned about. There are people with time to kill who don't want their mistakes pointed out sooner rather than being bitten by them later\, but the vast majority of programmers aren't like that.

As I said\, the only instances on CPAN of using \N{} in a range have both ends be \N{}.

Note also that recently a [A-z] was introduced into blead. And it was a typo\, not the intent of the programmer. I think that people don't tend to think in such ranges\, so when found\, it's much more likely to be a mistake\, worthy of warning about.

p5pRT commented 9 years ago

From @khwilliamson

On 10/30/2014 09:24 AM\, Father Chrysostomos via RT wrote:

On Thu Oct 30 06:08:46 2014\, jhi wrote:

[\x{04}-\N{U+09}]

I think people who ask for weird things like this should be expecting weird results. In other words\, I wouldn't feel bad outlawing them.

The start of the range says "the 0x4 in native"\, the end of the range is "the U+09\, in Unicode". It makes no sense. If they wanted native-native\, they can write that. If they wanted Unicode-Unicode\, they can write that.

As a native ASCII speaker\, I might not understand the native/Unicode distinction.

Similarly\, think of ranges like [A-z] (that's upper-A-to-lower-z)\, or [0-z] (zero-to-lower-z). Just think in ASCII. Should these mean 0x41-0x7a\, and 0x30-0x7a? If so\, they *will* contain the [[\\\]^_`] in the first case\, and the [:;\<=>?@\[\\\]^_`] in the second.

Perl lets people do stupid things. That is one of its strengths.

And it is one of its weaknesses. I believe this is a big part of the reason that Perl has the reputation of being just for toy programs\, and not for production use.

If we as a project really thought that not warning for stupid things is the right thing for production code\, we wouldn't compile perl itself with -Wall\, and even -Wextra. No\, we want all the warnings the compiler reasonably can give us\, even if some of them are bogus.

The discipline of software engineering is to try to get the best code with the fewest bugs with the least effort. The rule of thumb I was taught was (and may still be) that an error detected at a given stage in a product life-cycle is an order of magnitude more expensive to fix than one found at the immediately prior stage. As a developer with a long todo list\, I want the compiler to tell me that I'm doing something iffy\, along with a way to suppress the warning if I decide to do it anyway\, perhaps because the compiler is wrong. But the compiler should err towards more\, rather than less warning.

I don't think in terms of ASCII in such ranges\, except for a-z\, A-Z\, and 0-9. I don't believe that most programmers do either. An example is the recent introduction of [A-z] into blead. It was a typo\, rather than the intent of the programmer.
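To make the [A-z] case concrete, here is a small sketch that lists what the range picks up beyond the letters. It assumes an ASCII platform and checks only the 7-bit characters; `@extras` is just an illustrative name.

```perl
use strict;
use warnings;

# Every 7-bit character that [A-z] matches but [A-Za-z] does not.
my @extras = grep { /[A-z]/ && !/[A-Za-z]/ } map { chr } 0 .. 127;

print "Extra characters in [A-z]: @extras\n";
```

On an ASCII platform this should list the six characters between "Z" and "a" (code points 0x5B through 0x60), which is why a stray [A-z] is more likely a typo than a deliberate choice.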

p5pRT commented 9 years ago

From @ilmari

Karl Williamson <public@khwilliamson.com> writes:

On Thu Oct 30 06:08:46 2014\, jhi wrote:

Similarly\, think of ranges like [A-z] (that's upper-A-to-lower-z)\, or [0-z] (zero-to-lower-z). Just think in ASCII. Should these mean 0x41-0x7a\, and 0x30-0x7a? If so\, they *will* contain the [[\\\]^_`] in the first case\, and the [:;\<=>?@\[\\\]^_`] in the second.

[…] I don't think in terms of ASCII in such ranges\, except for a-z\, A-Z\, and 0-9. I don't believe that most programmers do either. An example is the recent introduction of [A-z] into blead. It was a typo\, rather than the intent of the programmer.

There's a bunch of uses of [ -~] and [!-~] to mean all printable ASCII characters (with or without space) on CPAN\, and I know of several DarkPAN uses too.

http://grep.cpan.me/?q=\[[+!]-~\]
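For anyone unfamiliar with the idiom, a quick sketch of what those two ranges cover. It assumes an ASCII platform and only looks at the 7-bit characters; the array names are illustrative.

```perl
use strict;
use warnings;

my @ascii = map { chr } 0 .. 127;

# Space (0x20) through tilde (0x7E): the printable ASCII characters.
my @with_space = grep { /[ -~]/ } @ascii;

# Exclamation mark (0x21) through tilde: the same set without the space.
my @without_space = grep { /[!-~]/ } @ascii;

printf "[ -~] matches %d characters\n", scalar @with_space;
printf "[!-~] matches %d characters\n", scalar @without_space;
```

On ASCII that is 95 and 94 characters respectively, so these ranges are deliberate uses of code-point order that fall outside a-z, A-Z and 0-9.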

-- "A disappointingly low fraction of the human race is\, at any given time\, on fire." - Stig Sandbeck Mathisen

p5pRT commented 9 years ago

From @cpansprout

On Thu Nov 13 21:29:51 2014\, public@khwilliamson.com wrote:

If we as a project really thought that not warning for stupid things is the right thing for production code\, we wouldn't compile perl itself with -Wall\, and even -Wextra. No\, we want all the warnings the compiler reasonably can give us\, even if some of them are bogus.

The discipline of software engineering is to try to get the best code with the fewest bugs with the least effort. The rule of thumb I was taught was (and may still be) that an error detected at a given stage in a product life-cycle is an order of magnitude more expensive to fix than one found at the immediately prior stage. As a developer with a long todo list\, I want the compiler to tell me that I'm doing something iffy\, along with a way to suppress the warning if I decide to do it anyway\, perhaps because the compiler is wrong. But the compiler should err towards more\, rather than less warning.

This is where I disagree with you. When you compile a C program\, if there are a few harmless warnings\, you can just ignore them. With Perl\, you get the same warnings every time you run the program. So we ought to err on the side of caution and avoid false positives. If there are too many nagging warnings\, people will just turn warnings off.
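As a rough illustration of the "turn warnings off" escape hatch, here is a sketch using a regexp-category warning that exists today (the false range in `[\w-z]`) as a stand-in for the proposed one. The pattern is compiled at run time so that a `__WARN__` handler can count the warnings; the names `@seen`, `$noisy` and `$quiet` are made up for the example.

```perl
use strict;
use warnings;

# Record warnings instead of printing them, so they can be counted.
my @seen;
local $SIG{__WARN__} = sub { push @seen, $_[0] };

# "\w-z" is not a real range; perl treats the "-" as a literal and warns.
my $pattern = '[\w-z]';

my $noisy = qr/$pattern/;        # compiled with 'regexp' warnings enabled

{
    no warnings 'regexp';        # lexically scoped opt-out
    my $quiet = qr/$pattern/;    # same pattern, no warning expected here
}

printf "warnings recorded: %d\n", scalar @seen;
```

The opt-out is per lexical scope, so the worry is less that it is hard to do and more that people reach for a blanket `no warnings` once the output gets noisy.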

--

Father Chrysostomos

p5pRT commented 9 years ago

From @khwilliamson

On 11/14/2014 11:10 PM\, Father Chrysostomos via RT wrote:

On Thu Nov 13 21:29:51 2014\, public@khwilliamson.com wrote:

If we as a project really thought that not warning for stupid things is the right thing for production code\, we wouldn't compile perl itself with -Wall\, and even -Wextra. No\, we want all the warnings the compiler reasonably can give us\, even if some of them are bogus.

The discipline of software engineering is to try to get the best code with the fewest bugs with the least effort. The rule of thumb I was taught was (and may still be) that an error detected at a given stage in a product life-cycle is an order of magnitude more expensive to fix than one found at the immediately prior stage. As a developer with a long todo list\, I want the compiler to tell me that I'm doing something iffy\, along with a way to suppress the warning if I decide to do it anyway\, perhaps because the compiler is wrong. But the compiler should err towards more\, rather than less warning.

This is where I disagree with you. When you compile a C program\, if there are a few harmless warnings\, you can just ignore them. With Perl\, you get the same warnings every time you run the program. So we ought to err on the side of caution and avoid false positives. If there are too many nagging warnings\, people will just turn warnings off.

This is a reasonable position that I had not considered.

p5pRT commented 9 years ago

From @rjbs

* Karl Williamson <public@khwilliamson.com> [2014-11-14T00:29:27]

I don't think in terms of ASCII in such ranges\, except for a-z\, A-Z\, and 0-9. I don't believe that most programmers do either. An example is the recent introduction of [A-z] into blead. It was a typo\, rather than the intent of the programmer.

I agree with you about issuing warnings when they can eliminate bugs early\, and that does mean warnings during compilation. I also agree with FC that we don't want to spew warnings a billion times during execution.

Possibly this could be overcome in practice\, for this warning\, but I think that in this case we'd annoy more people than we'd help. If this language was being built from the start\, I think it might have been a useful addition. Starting from here\, I think it will not be worth the gain.

MOZNION recently announced https://metacpan.org/pod/Regexp::Lexer -- maybe we'll be able to have a decent regexp linter in the near future...

-- rjbs
