Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

[PATCH] Use "UTF-8" consistently in perldelta #14710

Closed p5pRT closed 9 years ago

p5pRT commented 9 years ago

Migrated from rt.perl.org#125221 (status was 'resolved')

Searchable as RT125221$

p5pRT commented 9 years ago

From @ilmari

Created by @ilmari

The document was using "utf8", "UTF8" and "UTF-8" interchangeably, with the latter by far the most common. Make them all consistently "UTF-8", except of course when referring to the actual names of things (e.g. macros and functions).

Perl Info ``` Flags: category=docs severity=low Type=Patch PatchStatus=HasPatch Site configuration information for perl 5.20.2: Configured by Debian Project at Sun Mar 29 16:39:05 UTC 2015. Summary of my perl5 (revision 5 version 20 subversion 2) configuration: Platform: osname=linux, osvers=3.2.0-4-amd64, archname=x86_64-linux-gnu-thread-multi uname='linux babin 3.2.0-4-amd64 #1 smp debian 3.2.65-1+deb7u2 x86_64 gnulinux ' config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Dldflags= -Wl,-z,relro -Dlddlflags=-shared -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.20 -Darchlib=/usr/lib/x86_64-linux-gnu/perl/5.20 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/x86_64-linux-gnu/perl5/5.20 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.20.2 -Dsitearch=/usr/local/lib/x86_64-linux-gnu/perl/5.20.2 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.20.2 -des' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define use64bitint=define, use64bitall=define, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2 -g', cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include' ccversion='', gccversion='4.9.2', gccosandvers='' intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16 
ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='cc', ldflags =' -fstack-protector -L/usr/local/lib' libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.9/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt perllibs=-ldl -lm -lpthread -lc -lcrypt libc=libc-2.19.so, so=so, useshrplib=true, libperl=libperl.so.5.20 gnulibc_version='2.19' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector' Locally applied patches: DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN. DEBPKG:debian/db_file_ver - http://bugs.debian.org/340047 Remove overly restrictive DB_File version check. DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information. DEBPKG:debian/enc2xs_inc - http://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @INC directories. DEBPKG:debian/errno_ver - http://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes. DEBPKG:debian/libperl_embed_doc - http://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking DEBPKG:fixes/respect_umask - Respect umask during installation DEBPKG:debian/writable_site_dirs - Set umask approproately for site install directories DEBPKG:debian/extutils_set_libperl_path - EU:MM: set location of libperl.a under /usr/lib DEBPKG:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor DEBPKG:debian/prefix_changes - Fiddle with *PREFIX and variables written to the makefile DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets. 
DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor. DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy. DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable. DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian DEBPKG:debian/module_build_man_extensions - http://bugs.debian.org/479460 Adjust Module::Build manual page extensions for the Debian Perl policy DEBPKG:debian/prune_libs - http://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need. DEBPKG:fixes/net_smtp_docs - [rt.cpan.org #36038] http://bugs.debian.org/100195 Document the Net::SMTP 'Port' option DEBPKG:debian/perlivp - http://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local DEBPKG:debian/deprecate-with-apt - http://bugs.debian.org/747628 Point users to Debian packages of deprecated core modules DEBPKG:debian/squelch-locale-warnings - http://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts DEBPKG:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository DEBPKG:debian/patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.20.2-3 in patchlevel.h DEBPKG:debian/skip-kfreebsd-crash - http://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD DEBPKG:fixes/document_makemaker_ccflags - http://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags} DEBPKG:debian/find_html2text - http://bugs.debian.org/640479 Configure CPAN::Distribution with correct name of html2text DEBPKG:debian/perl5db-x-terminal-emulator.patch - http://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl DEBPKG:debian/cpan-missing-site-dirs - http://bugs.debian.org/688842 Fix CPAN::FirstTime defaults with nonexisting site dirs if a parent is writable 
DEBPKG:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http://bugs.debian.org/587650 Memoize::Storable: respect 'nstore' option not respected DEBPKG:debian/regen-skip - Skip a regeneration check in unrelated git repositories DEBPKG:fixes/regcomp-mips-optim - [perl #122817] http://bugs.debian.org/754054 Downgrade the optimization of regcomp.c on mips and mipsel due to a gcc-4.9 bug DEBPKG:debian/makemaker-pasthru - http://bugs.debian.org/758471 Pass LD settings through to subdirectories DEBPKG:fixes/perldoc-less-R - [rt.cpan.org #98636] http://bugs.debian.org/758689 Tell the 'less' pager to allow terminal escape sequences DEBPKG:fixes/pod_man_reproducible_date - http://bugs.debian.org/759405 Support POD_MAN_DATE in Pod::Man for the left-hand footer DEBPKG:fixes/io_uncompress_gunzip_inmemory - http://bugs.debian.org/747363 [rt.cpan.org #95494] Fix gunzip to in-memory file handle DEBPKG:fixes/socket_test_recv_fix - http://bugs.debian.org/758718 [perl #122657] Compare recv return value to peername in socket test DEBPKG:fixes/hurd_socket_recv_todo - http://bugs.debian.org/758718 [perl #122657] TODO checking the result of recv() on hurd DEBPKG:fixes/regexp-performance - [0fa70a0] http://bugs.debian.org/777556 [perl #123743] simpify and speed up /.*.../ handling DEBPKG:fixes/failed_require_diagnostics - http://bugs.debian.org/781120 [perl #123270] Report inaccesible file on failed require @INC for perl 5.20.2: /home/ilmari/perl5/lib/perl5/x86_64-linux-gnu-thread-multi /home/ilmari/perl5/lib/perl5 /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.20.2 /usr/local/share/perl/5.20.2 /usr/lib/x86_64-linux-gnu/perl5/5.20 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.20 /usr/share/perl/5.20 /usr/local/lib/site_perl . 
Environment for perl 5.20.2: HOME=/home/ilmari LANG=en_GB.utf8 LANGUAGE=en_GB:en LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/ilmari/perl5/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games PERL5LIB=/home/ilmari/perl5/lib/perl5 PERLDOC_PAGER=less -Ri PERL_AUTOINSTALL_PREFER_CPAN=1 PERL_BADLANG (unset) PERL_LOCAL_LIB_ROOT=/home/ilmari/perl5 PERL_MB_OPT=--install_base "/home/ilmari/perl5" PERL_MM_OPT=INSTALL_BASE=/home/ilmari/perl5 PERL_MM_PREFER_CPAN=1 PERL_MM_USE_DEFAULT=1 PERL_TEST_MEMORY=4 SHELL=/bin/bash ```
p5pRT commented 9 years ago

From @ilmari

0001-Use-UTF-8-consistently-in-perldelta.patch ```diff From 5f97f4504c5319d257c28ab6804b95d6a8ec9f60 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= Date: Wed, 20 May 2015 02:00:59 +0100 Subject: [PATCH] Use "UTF-8" consistently in perldelta Except when referring to actual names of things. --- pod/perldelta.pod | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/pod/perldelta.pod b/pod/perldelta.pod index b14a9b6..eacf210 100644 --- a/pod/perldelta.pod +++ b/pod/perldelta.pod @@ -116,10 +116,10 @@ C, and C. -=head2 Better heuristics on older platforms for determining locale UTF8ness +=head2 Better heuristics on older platforms for determining locale UTF-8ness On platforms that implement neither the C99 standard nor the POSIX 2001 -standard, determining if the current locale is UTF8 or not depends on +standard, determining if the current locale is UTF-8 or not depends on heuristics. These are improved in this release. =head2 Aliasing via reference @@ -1174,7 +1174,7 @@ L<\C is deprecated in regex|perldiag/"\C is deprecated in regex; marked by <-- H (D deprecated) The C<< /\C/ >> character class was deprecated in v5.20, and now emits a warning. It is intended that it will become an error in v5.24. This character class matches a single byte even if it appears within a -multi-byte character, breaks encapsulation, and can corrupt utf8 +multi-byte character, breaks encapsulation, and can corrupt UTF-8 strings. =item * @@ -1468,7 +1468,7 @@ L (W locale) While in a single-byte locale (I, a non-UTF-8 one), a multi-byte character was encountered. Perl considers this -character to be the specified Unicode code point. Combining non-UTF8 +character to be the specified Unicode code point. Combining non-UTF-8 locales and Unicode is dangerous. Almost certainly some characters will have two different representations. For example, in the ISO 8859-7 (Greek) locale, the code point 0xC3 represents a Capital Gamma. 
But so @@ -2133,7 +2133,7 @@ David Mitchell for future work on vtables. =item * The C function accepts C and C -flags, which specify whether the appended string is bytes or utf8, +flags, which specify whether the appended string is bytes or UTF-8, respectively. (These flags have in fact been present since 5.16.0, but were formerly not regarded as part of the API.) @@ -2240,7 +2240,7 @@ L<[perl #123223]|https://rt.perl.org/Ticket/Display.html?id=123223>. =item * -Pad names are now always UTF8. The C macro always returns +Pad names are now always UTF-8. The C macro always returns true. Previously, this was effectively the case already, but any support for two different internal representations of pad names has now been removed. @@ -2525,9 +2525,9 @@ L<[perl #108276]|https://rt.perl.org/Ticket/Display.html?id=108276>. =item * -In Perl 5.20.0, C<$^N> accidentally had the internal UTF8 flag turned off +In Perl 5.20.0, C<$^N> accidentally had the internal UTF-8 flag turned off if accessed from a code block within a regular expression, effectively -UTF8-encoding the value. This has been fixed. +UTF-8-encoding the value. This has been fixed. L<[perl #123135]|https://rt.perl.org/Ticket/Display.html?id=123135>. =item * @@ -2653,8 +2653,8 @@ contrary to the documentation, Now C always prevents inlining. =item * On some systems, such as VMS, C can return a non-ASCII string. If a -scalar assigned to had contained a UTF8 string previously, then C -would not turn off the UTF8 flag, thus corrupting the return value. This +scalar assigned to had contained a UTF-8 string previously, then C +would not turn off the UTF-8 flag, thus corrupting the return value. This would happen with C<$lexical = crypt ...>. =item * @@ -2749,7 +2749,7 @@ mirror character. =item * -C<< s///e >> on tainted utf8 strings corrupted C<< pos() >>. This bug, +C<< s///e >> on tainted UTF-8 strings corrupted C<< pos() >>. This bug, introduced in 5.20, is now fixed. 
L<[perl #122148]|https://rt.perl.org/Ticket/Display.html?id=122148>. @@ -2903,7 +2903,7 @@ false at compile time and true at run time. =item * -Loading UTF8 tables during a regular expression match could cause assertion +Loading UTF-8 tables during a regular expression match could cause assertion failures under debugging builds if the previous match used the very same regular expression. L<[perl #122747]|https://rt.perl.org/Ticket/Display.html?id=122747> -- 2.1.4 ```
p5pRT commented 9 years ago

From @ilmari

TonyC pointed out on IRC that the description of the «Wide character (U+%X) in %s» warning was copied from perldiag, so update it there too.

There are a few more instances of "UTF8" in perldiag, but those should be in a separate ticket/patch (possibly together with a more wide-reaching overhaul).

-- "I use RMS as a guide in the same way that a boat captain would use a lighthouse. It's good to know where it is, but you generally don't want to find yourself in the same spot." - Tollef Fog Heen

p5pRT commented 9 years ago

From @ilmari

Once more, with attachment!


p5pRT commented 9 years ago

From @ilmari

0001-Use-UTF-8-consistently-in-perldelta.patch ```diff From d3ee16907fd66b5dc8f0a632ec648b5378df8674 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= Date: Wed, 20 May 2015 02:00:59 +0100 Subject: [PATCH] Use "UTF-8" consistently in perldelta Except when referring to actual names of things. Also update the diagnostic description in perldiag. --- pod/perldelta.pod | 24 ++++++++++++------------ pod/perldiag.pod | 2 +- 2 files changed, 13 insertions(+), 13 deletions(-) diff --git a/pod/perldelta.pod b/pod/perldelta.pod index ec03317..f2371d2 100644 --- a/pod/perldelta.pod +++ b/pod/perldelta.pod @@ -116,10 +116,10 @@ C, and C. -=head2 Better heuristics on older platforms for determining locale UTF8ness +=head2 Better heuristics on older platforms for determining locale UTF-8ness On platforms that implement neither the C99 standard nor the POSIX 2001 -standard, determining if the current locale is UTF8 or not depends on +standard, determining if the current locale is UTF-8 or not depends on heuristics. These are improved in this release. =head2 Aliasing via reference @@ -1174,7 +1174,7 @@ L<\C is deprecated in regex|perldiag/"\C is deprecated in regex; marked by <-- H (D deprecated) The C<< /\C/ >> character class was deprecated in v5.20, and now emits a warning. It is intended that it will become an error in v5.24. This character class matches a single byte even if it appears within a -multi-byte character, breaks encapsulation, and can corrupt utf8 +multi-byte character, breaks encapsulation, and can corrupt UTF-8 strings. =item * @@ -1468,7 +1468,7 @@ L (W locale) While in a single-byte locale (I, a non-UTF-8 one), a multi-byte character was encountered. Perl considers this -character to be the specified Unicode code point. Combining non-UTF8 +character to be the specified Unicode code point. Combining non-UTF-8 locales and Unicode is dangerous. Almost certainly some characters will have two different representations. 
For example, in the ISO 8859-7 (Greek) locale, the code point 0xC3 represents a Capital Gamma. But so @@ -2133,7 +2133,7 @@ David Mitchell for future work on vtables. =item * The C function accepts C and C -flags, which specify whether the appended string is bytes or utf8, +flags, which specify whether the appended string is bytes or UTF-8, respectively. (These flags have in fact been present since 5.16.0, but were formerly not regarded as part of the API.) @@ -2240,7 +2240,7 @@ L<[perl #123223]|https://rt.perl.org/Ticket/Display.html?id=123223>. =item * -Pad names are now always UTF8. The C macro always returns +Pad names are now always UTF-8. The C macro always returns true. Previously, this was effectively the case already, but any support for two different internal representations of pad names has now been removed. @@ -2525,9 +2525,9 @@ L<[perl #108276]|https://rt.perl.org/Ticket/Display.html?id=108276>. =item * -In Perl 5.20.0, C<$^N> accidentally had the internal UTF8 flag turned off +In Perl 5.20.0, C<$^N> accidentally had the internal UTF-8 flag turned off if accessed from a code block within a regular expression, effectively -UTF8-encoding the value. This has been fixed. +UTF-8-encoding the value. This has been fixed. L<[perl #123135]|https://rt.perl.org/Ticket/Display.html?id=123135>. =item * @@ -2653,8 +2653,8 @@ contrary to the documentation, Now C always prevents inlining. =item * On some systems, such as VMS, C can return a non-ASCII string. If a -scalar assigned to had contained a UTF8 string previously, then C -would not turn off the UTF8 flag, thus corrupting the return value. This +scalar assigned to had contained a UTF-8 string previously, then C +would not turn off the UTF-8 flag, thus corrupting the return value. This would happen with C<$lexical = crypt ...>. =item * @@ -2749,7 +2749,7 @@ mirror character. =item * -C<< s///e >> on tainted utf8 strings corrupted C<< pos() >>. 
This bug, +C<< s///e >> on tainted UTF-8 strings corrupted C<< pos() >>. This bug, introduced in 5.20, is now fixed. L<[perl #122148]|https://rt.perl.org/Ticket/Display.html?id=122148>. @@ -2903,7 +2903,7 @@ false at compile time and true at run time. =item * -Loading UTF8 tables during a regular expression match could cause assertion +Loading UTF-8 tables during a regular expression match could cause assertion failures under debugging builds if the previous match used the very same regular expression. L<[perl #122747]|https://rt.perl.org/Ticket/Display.html?id=122747> diff --git a/pod/perldiag.pod b/pod/perldiag.pod index ab95152..93ae13b 100644 --- a/pod/perldiag.pod +++ b/pod/perldiag.pod @@ -7105,7 +7105,7 @@ filehandle with an encoding, see L and L. (W locale) While in a single-byte locale (I, a non-UTF-8 one), a multi-byte character was encountered. Perl considers this -character to be the specified Unicode code point. Combining non-UTF8 +character to be the specified Unicode code point. Combining non-UTF-8 locales and Unicode is dangerous. Almost certainly some characters will have two different representations. For example, in the ISO 8859-7 (Greek) locale, the code point 0xC3 represents a Capital Gamma. But so -- 2.1.4 ```
p5pRT commented 9 years ago

From @tonycoz

Thanks, applied as 50ea4745c8ab3dc6c2e7bfcf895c892b27dae6b4.

Tony

p5pRT commented 9 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 9 years ago

@tonycoz - Status changed from 'open' to 'resolved'

p5pRT commented 9 years ago

From @demerphq

On 20 May 2015 at 05:26, Tony Cook via RT <perlbug-followup@perl.org> wrote:

Thanks, applied as 50ea4745c8ab3dc6c2e7bfcf895c892b27dae6b4.

I don't know about this change, actually. Sorry to say.

See perldoc Encode and look for the section "UTF-8 vs. utf8 vs. UTF8" (quoted below).

In short, all three terms have subtly different definitions.

At least one of the changes in this patch appears to change the correct use of "utf8" to the incorrect "UTF-8":

@@ -1174,7 +1174,7 @@ L\<\C is deprecated in regex|perldiag/"\C is deprecated in regex; marked by \<-- H
 (D deprecated) The C\<\< /\C/ >> character class was deprecated in v5.20, and now emits a warning. It is intended that it will become an error in v5.24. This character class matches a single byte even if it appears within a
-multi-byte character, breaks encapsulation, and can corrupt utf8
+multi-byte character, breaks encapsulation, and can corrupt UTF-8
 strings.

The original was correct. The regex engine does not care about UTF-8; it cares about utf8.

I think this patch should be reverted until each change can be reviewed to see whether it refers to "UTF-8", the formal definition from Unicode; to "utf8", the internal encoding used by Perl (a superset of UTF-8); or to the UTF8 flag (which indicates the scalar contains "utf8", not "UTF-8").

cheers, Yves

UTF-8 vs. utf8 vs. UTF8

  ....We now view strings not as sequences of bytes, but as sequences
  of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
  computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.

  That has historically been Perl's notion of UTF-8, as that is how UTF-8
  was first conceived by Ken Thompson when he invented it. However,
  thanks to later revisions to the applicable standards, official UTF-8
  is now rather stricter than that. For example, its range is much
  narrower (0 .. 0x10_FFFF to cover only 21 bits instead of 32 or 64
  bits) and some sequences are not allowed, like those used in surrogate
  pairs, the 31 non-character code points 0xFDD0 .. 0xFDEF, the last two
  code points in any plane (0xXX_FFFE and 0xXX_FFFF), all non-shortest
  encodings, etc.

  The former default in which Perl would always use a loose
  interpretation of UTF-8 has now been overruled:

  From: Larry Wall <larry@wall.org>
  Date: December 04, 2004 11:51:58 JST
  To: perl-unicode@perl.org
  Subject: Re: Make Encode.pm support the real UTF-8
  Message-Id: <20041204025158.GA28754@wall.org>

  On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
  : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
  : but "UTF-8" is the name of the standard and should give the
  : corresponding behaviour.

  For what it's worth, that's how I've always kept them straight in my
  head.

  Also for what it's worth, Perl 6 will mostly default to strict but
  make it easy to switch back to lax.

  Larry

  Got that? As of Perl 5.8.7, "UTF-8" means UTF-8 in its current sense,
  which is conservative and strict and security-conscious, whereas "utf8"
  means UTF-8 in its former sense, which was liberal and loose and lax.
  "Encode" version 2.10 or later thus groks this subtle but critically
  important distinction between "UTF-8" and "utf8".

  encode("utf8", "\x{FFFF_FFFF}", 1); # okay
  encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

  In the "Encode" module, "UTF-8" is actually a canonical name for
  "utf-8-strict". That hyphen between the "UTF" and the "8" is critical;
  without it, "Encode" goes "liberal" and (perhaps overly-)permissive:

  find_encoding("UTF-8")->name # is 'utf-8-strict'
  find_encoding("utf-8")->name # ditto. names are case insensitive
  find_encoding("utf_8")->name # ditto. "_" are treated as "-"
  find_encoding("UTF8")->name # is 'utf8'.

  Perl's internal UTF8 flag is called "UTF8", without a hyphen. It
  indicates whether a string is internally encoded as "utf8", also
  without a hyphen.
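The strict/lax split described in the quoted Encode documentation can be checked with a small script. This is only a sketch, assuming a reasonably recent Encode; the variable names are illustrative, and U+110000 is picked here simply because it is the first code point past the Unicode range.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# The hyphen decides which encoding object Encode hands back.
my $strict_name = find_encoding("UTF-8")->name;   # strict, standards-conforming
my $lax_name    = find_encoding("UTF8")->name;    # Perl's permissive variant

# A code point just past the Unicode maximum of 0x10FFFF.
my $beyond_unicode = "\x{110000}";

# FB_CROAK makes failures fatal; LEAVE_SRC stops Encode from modifying the input.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $lax_octets = encode("utf8", $beyond_unicode, $check);
my $strict_ok  = eval { encode("UTF-8", $beyond_unicode, $check); 1 } ? 1 : 0;

printf "names: %s / %s\n", $strict_name, $lax_name;
printf "lax produced %d octets; strict accepted the string: %d\n",
    length($lax_octets), $strict_ok;
```

The lax encoding copies out Perl's internal bytes, while the strict one refuses the string, which is the difference Yves is pointing at.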

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 9 years ago

@tonycoz - Status changed from 'resolved' to 'open'

p5pRT commented 9 years ago

From @tonycoz

On Wed, May 20, 2015 at 07:16:52AM +0200, demerphq wrote:

In short, all three terms have subtly different definitions.

Using minor typographical differences to differentiate them is kind of dumb.

At least one of the changes in this patch appears to change the correct use of "utf8" to the incorrect "UTF-8":

@@ -1174,7 +1174,7 @@ L\<\C is deprecated in regex|perldiag/"\C is deprecated in regex; marked by \<-- H
 (D deprecated) The C\<\< /\C/ >> character class was deprecated in v5.20, and now emits a warning. It is intended that it will become an error in v5.24. This character class matches a single byte even if it appears within a
-multi-byte character, breaks encapsulation, and can corrupt utf8
+multi-byte character, breaks encapsulation, and can corrupt UTF-8
 strings.

The original was correct. The regex engine does not care about UTF-8; it cares about utf8.

Anyone reading it should realize it's working with perl's internal Unicode representation, which is an extended UTF-8. For anyone who doesn't, the difference is likely meaningless anyway.

I've reverted the commit.

Tony

p5pRT commented 9 years ago

From @demerphq

On 20 May 2015 at 08:27, Tony Cook <tony@develop-help.com> wrote:

Using minor typographical differences to differentiate them is kind of dumb.

I agree, more or less. On the other hand, the flag being named UTF8 is kinda unavoidable, as hyphens aren't allowed in C identifiers.

Anyone reading it should realize it's working with perl's internal Unicode representation, which is an extended UTF-8. For anyone who doesn't, the difference is likely meaningless anyway.

Sorry, I didn't have time to review it closely to see how much it really matters. My point was just that changing all 'utf8', 'UTF8' to 'UTF-8' will subtly change the meaning of what was written.

I've reverted the commit.

Sorry about that. I somehow feel like a party pooper for bringing this up.

cheers. Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 9 years ago

From @tonycoz

Oops, forgot to push the revert; I'll hold off on it for now.

On Wed, May 20, 2015 at 07:16:39PM +0200, demerphq wrote:

Sorry about that. I somehow feel like a party pooper for bringing this up.

I may have overreacted, sorry.

Here's the way I think about it:

- unless we need to specifically distinguish between them (as Encode does), calling perl's internal encoding UTF-8 is no big deal, since its intent is to represent Unicode. If we do need to distinguish between them in perldelta then something like "perl's extended UTF-8" is more useful to most readers than "utf8".

- the name of the flag is SVf_UTF8, but it can be described as the "UTF-8 flag", consider the comment in the source:

  #define SVf_UTF8 0x20000000 /* SvPV is UTF-8 encoded
                                 This is also set on RVs whose overloaded
                                 stringification is UTF-8. This might
                                 only happen as a side effect of SvPV() */

  Using "the UTF8 flag" seems silly to me - name it or describe it, not something half-way between.
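From the Perl side, the flag discussed here can be observed with the built-in `utf8::` functions. The sketch below uses illustrative variable names and only shows that the flag describes how the buffer is stored, not which characters the string holds.

```perl
use strict;
use warnings;

# "café" with e-acute as code point 0xE9: fits in one byte per character.
my $downgraded = "caf\xE9";

# Same characters, but stored in Perl's internal variable-length encoding.
my $upgraded = $downgraded;
utf8::upgrade($upgraded);

my $flag_before = utf8::is_utf8($downgraded) ? 1 : 0;   # flag off
my $flag_after  = utf8::is_utf8($upgraded)   ? 1 : 0;   # flag on
my $same_string = $downgraded eq $upgraded   ? 1 : 0;   # still equal as strings

printf "flag: %d -> %d, equal: %d, lengths: %d and %d\n",
    $flag_before, $flag_after, $same_string,
    length($downgraded), length($upgraded);
```

Both scalars compare equal and have four characters; only the internal representation, and therefore the flag, differs.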

Here's the chunks and my rationale:

-=head2 Better heuristics on older platforms for determining locale UTF8ness
+=head2 Better heuristics on older platforms for determining locale UTF-8ness

 On platforms that implement neither the C99 standard nor the POSIX 2001
-standard, determining if the current locale is UTF8 or not depends on
+standard, determining if the current locale is UTF-8 or not depends on
 heuristics. These are improved in this release.

In this case we're talking about whether the locales support UTF-8 or not. This has nothing to do with perl's internal SVf_UTF8 flag or internal encoding.

I think it belongs.
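For comparison, a script can ask the C library which codeset the current locale uses; that answer is about the locale, not about any scalar's internal flag. This is a rough sketch: `codeset_is_utf8` is a hypothetical helper invented for illustration, and it is not the heuristic perl itself uses on older platforms.

```perl
use strict;
use warnings;
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: accept the common spellings of the codeset name.
sub codeset_is_utf8 {
    my ($codeset) = @_;
    return 0 unless defined $codeset;
    return $codeset =~ /\Autf-?8\z/i ? 1 : 0;
}

# langinfo(CODESET) may be unavailable on some platforms, hence the eval.
my $codeset = eval { langinfo(CODESET) };

printf "codeset: %s, UTF-8 locale: %s\n",
    $codeset // '(unknown)',
    codeset_is_utf8($codeset) ? 'yes' : 'no';
```

The printed result depends on the environment the script runs in.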

 (D deprecated) The C\<\< /\C/ >> character class was deprecated in v5.20, and now emits a warning. It is intended that it will become an error in v5.24. This character class matches a single byte even if it appears within a
-multi-byte character, breaks encapsulation, and can corrupt utf8
+multi-byte character, breaks encapsulation, and can corrupt UTF-8
 strings.

This is probably a mistake if perldelta needs to distinguish utf8 vs UTF-8.

 (W locale) While in a single-byte locale (I\<i.e.>, a non-UTF-8 one), a multi-byte character was encountered. Perl considers this
-character to be the specified Unicode code point. Combining non-UTF8
+character to be the specified Unicode code point. Combining non-UTF-8
 locales and Unicode is dangerous. Almost certainly some characters will have two different representations. For example, in the ISO 8859-7 (Greek) locale, the code point 0xC3 represents a Capital Gamma. But so
@@ -2133,7 +2133,7 @@ David Mitchell for future work on vtables.

We're talking about whether locales are UTF-8 or not again, and the paragraph is inconsistent.

I think it belongs.

-Pad names are now always UTF8. The C\ macro always returns +Pad names are now always UTF-8. The C\ macro always returns true. Previously\, this was effectively the case already\, but any support for two different internal representations of pad names has now been removed.

This might need to be "utf8" instead of "UTF8" under the canon according to Encode\, but I think "UTF-8" is better.

-In Perl 5.20.0\, C\<$^N> accidentally had the internal UTF8 flag turned off +In Perl 5.20.0\, C\<$^N> accidentally had the internal UTF-8 flag turned off

Per my attitude above\, I think this change is correct. Or be "had the C\<SVf_UTF8> flag turned off".

if accessed from a code block within a regular expression\, effectively -UTF8-encoding the value. This has been fixed. +UTF-8-encoding the value. This has been fixed. L\<[perl #123135]|https://rt-archive.perl.org/perl5/Ticket/Display.html?id=123135>.

This would need to be "utf8-encoding".

On some systems\, such as VMS\, C\ can return a non-ASCII string. If a -scalar assigned to had contained a UTF8 string previously\, then C\ -would not turn off the UTF8 flag\, thus corrupting the return value. This +scalar assigned to had contained a UTF-8 string previously\, then C\ +would not turn off the UTF-8 flag\, thus corrupting the return value. This would happen with C\<$lexical = crypt ...>.

Under canon the first UTF8 was wrong and the second was correct. I think they should both be "UTF-8".

-C\<\< s///e >> on tainted utf8 strings corrupted C\<\< pos() >>. This bug\, +C\<\< s///e >> on tainted UTF-8 strings corrupted C\<\< pos() >>. This bug\, introduced in 5.20\, is now fixed. L\<[perl #122148]|https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122148>.

Correct under canon.

-Loading UTF8 tables during a regular expression match could cause assertion +Loading UTF-8 tables during a regular expression match could cause assertion failures under debugging builds if the previous match used the very same regular expression. L\<[perl #122747]|https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122747>

This one may have been just plain incorrect. If I understand correctly, we load tables that map Unicode code points to properties, not UTF-8 or perl-UTF-8 to properties.

So this should refer to "Loading Unicode tables".

Tony

p5pRT commented 9 years ago

From @khwilliamson

On 05/20/2015 06​:44 PM\, Tony Cook wrote​:

Oops\, forgot to push the revert\, I'll hold off on it for now.

On Wed\, May 20\, 2015 at 07​:16​:39PM +0200\, demerphq wrote​:

Sorry about that. I somehow feel like a party pooper for bringing this up.

I may have overreacted\, sorry.

Here's the way I think about it​:

- unless we need to specifically distinguish between them (as Encode does)\, calling perl's internal encoding UTF-8 is no big deal\, since its intent is to represent Unicode. If we do need to distinguish between them in perldelta then something like "perl's extended UTF-8" is more useful to most readers than "utf8".

+1

- the name of the flag is SVf_UTF8\, but it can be described as the "UTF-8 flag"\, consider the comment in the source​:

#define SVf_UTF8 0x20000000 /* SvPV is UTF-8 encoded This is also set on RVs whose overloaded stringification is UTF-8. This might only happen as a side effect of SvPV() */

Using "the UTF8 flag" seems silly to me - name it or describe it\, not something half-way between.

Here's the chunks and my rationale​:

-=head2 Better heuristics on older platforms for determining locale UTF8ness +=head2 Better heuristics on older platforms for determining locale UTF-8ness

On platforms that implement neither the C99 standard nor the POSIX 2001 -standard\, determining if the current locale is UTF8 or not depends on +standard\, determining if the current locale is UTF-8 or not depends on heuristics. These are improved in this release.

In this case we're talking about whether the locales support UTF-8 or not. This has nothing to do with perl's internal SVf_UTF8 flag or internal encoding.

I think it belongs.

+1

(D deprecated) The C\<\< /\C/ >> character class was deprecated in v5.20\, and now emits a warning. It is intended that it will become an error in v5.24. This character class matches a single byte even if it appears within a -multi-byte character\, breaks encapsulation\, and can corrupt utf8 +multi-byte character\, breaks encapsulation\, and can corrupt UTF-8 strings.

This is probably a mistake if perldelta needs to distinguish utf8 vs UTF-8.

I don't think perldelta needs to so distinguish. And in particular\, the above should be "UTF-8"

(W locale) While in a single-byte locale (I\<i.e.>\, a non-UTF-8 one)\, a multi-byte character was encountered. Perl considers this -character to be the specified Unicode code point. Combining non-UTF8 +character to be the specified Unicode code point. Combining non-UTF-8 locales and Unicode is dangerous. Almost certainly some characters will have two different representations. For example\, in the ISO 8859-7 (Greek) locale\, the code point 0xC3 represents a Capital Gamma. But so @​@​ -2133\,7 +2133\,7 @​@​ David Mitchell for future work on vtables.

We're talking about whether locales are UTF-8 or not again\, and the paragraph is inconsistent.

I think it belongs.

+1

-Pad names are now always UTF8. The C\ macro always returns +Pad names are now always UTF-8. The C\ macro always returns true. Previously\, this was effectively the case already\, but any support for two different internal representations of pad names has now been removed.

This might need to be "utf8" instead of "UTF8" under the canon according to Encode\, but I think "UTF-8" is better.

UTF-8 is better.

-In Perl 5.20.0\, C\<$^N> accidentally had the internal UTF8 flag turned off +In Perl 5.20.0\, C\<$^N> accidentally had the internal UTF-8 flag turned off

Per my attitude above\, I think this change is correct.

+1

Or be "had the C<SVf_UTF8> flag turned off".

if accessed from a code block within a regular expression\, effectively -UTF8-encoding the value. This has been fixed. +UTF-8-encoding the value. This has been fixed. L\<[perl #123135]|https://rt-archive.perl.org/perl5/Ticket/Display.html?id=123135>.

This would need to be "utf8-encoding".

I hate the sentence anyway. It doesn't make intuitive sense that turning off a flag is the same thing as 'encoding'. To me 'encoding' and 'decoding' have arbitrary non-intuitive meanings which I always have to look up. It's better to not use the terms\, but say something that makes sense to most of the readers who I don't believe have the definitions ingrained.

On some systems\, such as VMS\, C\ can return a non-ASCII string. If a -scalar assigned to had contained a UTF8 string previously\, then C\ -would not turn off the UTF8 flag\, thus corrupting the return value. This +scalar assigned to had contained a UTF-8 string previously\, then C\ +would not turn off the UTF-8 flag\, thus corrupting the return value. This would happen with C\<$lexical = crypt ...>.

Under canon the first UTF8 was wrong and the second was correct. I think they should both be "UTF-8".

+1

-C\<\< s///e >> on tainted utf8 strings corrupted C\<\< pos() >>. This bug\, +C\<\< s///e >> on tainted UTF-8 strings corrupted C\<\< pos() >>. This bug\, introduced in 5.20\, is now fixed. L\<[perl #122148]|https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122148>.

Correct under canon.

I prefer UTF-8.

-Loading UTF8 tables during a regular expression match could cause assertion +Loading UTF-8 tables during a regular expression match could cause assertion failures under debugging builds if the previous match used the very same regular expression. L\<[perl #122747]|https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122747>

This one may have been just plain incorrect. If I understand correctly, we load tables that map Unicode code points to properties, not UTF-8 or perl-UTF-8 to properties.

So this should refer to "Loading Unicode tables".

Yes

The bottom line is I think we should say UTF-8 in almost every circumstance. The whole Encode thing was a big mistake that should be corrected in 5.24. We now know the perils of not checking input UTF-8 for well-formedness\, and at the time those decisions were made\, those perils were not understood. To put it in terms currently in the news\, we should issue a safety recall on the Encode API in this regard.

p5pRT commented 9 years ago

From @demerphq

On 21 May 2015 at 04:01, Karl Williamson <public@khwilliamson.com> wrote:

On 05/20/2015 06​:44 PM\, Tony Cook wrote​:

Oops\, forgot to push the revert\, I'll hold off on it for now.

On Wed\, May 20\, 2015 at 07​:16​:39PM +0200\, demerphq wrote​:

Sorry about that. I somehow feel like a party pooper for bringing this up.

I may have overreacted\, sorry.

Here's the way I think about it​:

- unless we need to specifically distinguish between them (as Encode does)\, calling perl's internal encoding UTF-8 is no big deal\, since its intent is to represent Unicode. If we do need to distinguish between them in perldelta then something like "perl's extended UTF-8" is more useful to most readers than "utf8".

+1

I don't agree really. This is a long held distinction.

- the name of the flag is SVf_UTF8\, but it can be described as the "UTF-8 flag"\, consider the comment in the source​:

#define SVf_UTF8 0x20000000 /* SvPV is UTF-8 encoded This is also set on RVs whose overloaded stringification is UTF-8. This might only happen as a side effect of SvPV() */

Using "the UTF8 flag" seems silly to me - name it or describe it\, not something half-way between.

I would argue the comment is wrong and should be changed to "utf8".

Here's the chunks and my rationale​:

-=head2 Better heuristics on older platforms for determining locale UTF8ness +=head2 Better heuristics on older platforms for determining locale UTF-8ness

On platforms that implement neither the C99 standard nor the POSIX 2001 -standard\, determining if the current locale is UTF8 or not depends on +standard\, determining if the current locale is UTF-8 or not depends on heuristics. These are improved in this release.

In this case we're talking about whether the locales support UTF-8 or not. This has nothing to do with perl's internal SVf_UTF8 flag or internal encoding.

I think it belongs.

+1

No argument on this one.

(D deprecated) The C\<\< /\C/ >> character class was deprecated in v5.20\, and now emits a warning. It is intended that it will become an error in v5.24. This character class matches a single byte even if it appears within a -multi-byte character\, breaks encapsulation\, and can corrupt utf8 +multi-byte character\, breaks encapsulation\, and can corrupt UTF-8 strings.

This is probably a mistake if perldelta needs to distinguish utf8 vs UTF-8.

I don't think perldelta needs to so distinguish. And in particular\, the above should be "UTF-8"

I disagree.

(W locale) While in a single-byte locale (I\<i.e.>\, a non-UTF-8 one)\, a multi-byte character was encountered. Perl considers this -character to be the specified Unicode code point. Combining non-UTF8 +character to be the specified Unicode code point. Combining non-UTF-8 locales and Unicode is dangerous. Almost certainly some characters will have two different representations. For example\, in the ISO 8859-7 (Greek) locale\, the code point 0xC3 represents a Capital Gamma. But so @​@​ -2133\,7 +2133\,7 @​@​ David Mitchell for future work on vtables.

We're talking about whether locales are UTF-8 or not again\, and the paragraph is inconsistent.

I think it belongs.

+1

I have no objection to this.

-Pad names are now always UTF8. The C\ macro always returns +Pad names are now always UTF-8. The C\ macro always returns true. Previously\, this was effectively the case already\, but any support for two different internal representations of pad names has now been removed.

This might need to be "utf8" instead of "UTF8" under the canon according to Encode\, but I think "UTF-8" is better.

UTF-8 is better.

If we enforce that varnames must be valid UTF-8 (and I think we should) then fine. If we don't then not fine.

For the record (Karl I know you know this)\, UTF-8 is both an encoding\, and also a specification of which codepoints are legal. Not all utf8 sequences are valid UTF-8. I think the distinction is important.

-In Perl 5.20.0\, C\<$^N> accidentally had the internal UTF8 flag turned off +In Perl 5.20.0\, C\<$^N> accidentally had the internal UTF-8 flag turned off

Per my attitude above\, I think this change is correct.

+1

Disagree. It should be "utf8" unless $^N is guaranteed to contain valid UTF-8.

Or be "had the C<SVf_UTF8> flag turned off".

if accessed from a code block within a regular expression\, effectively -UTF8-encoding the value. This has been fixed. +UTF-8-encoding the value. This has been fixed. L\<[perl #123135]|https://rt-archive.perl.org/perl5/Ticket/Display.html?id=123135>.

This would need to be "utf8-encoding".

I hate the sentence anyway. It doesn't make intuitive sense that turning off a flag is the same thing as 'encoding'. To me 'encoding' and 'decoding' have arbitrary non-intuitive meanings which I always have to look up. It's better to not use the terms\, but say something that makes sense to most of the readers who I don't believe have the definitions ingrained.

I don't mind using a more descriptive sentence. I do mind conflating UTF-8 and utf8. I can just see someone saying "why does perl let me put surrogate pair code points in a UTF-8 string?".

On some systems\, such as VMS\, C\ can return a non-ASCII string. If a -scalar assigned to had contained a UTF8 string previously\, then C\ -would not turn off the UTF8 flag\, thus corrupting the return value. This +scalar assigned to had contained a UTF-8 string previously\, then C\ +would not turn off the UTF-8 flag\, thus corrupting the return value. This would happen with C\<$lexical = crypt ...>.

Under canon the first UTF8 was wrong and the second was correct. I think they should both be "UTF-8".

+1

Disagree.

-C\<\< s///e >> on tainted utf8 strings corrupted C\<\< pos() >>. This bug\, +C\<\< s///e >> on tainted UTF-8 strings corrupted C\<\< pos() >>. This bug\, introduced in 5.20\, is now fixed. L\<[perl #122148]|https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122148>.

Correct under canon.

I prefer UTF-8.

Disagree.

-Loading UTF8 tables during a regular expression match could cause assertion +Loading UTF-8 tables during a regular expression match could cause assertion failures under debugging builds if the previous match used the very same regular expression. L\<[perl #122747]|https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122747>

This one may have been just plain incorrect. If I understand correctly, we load tables that map Unicode code points to properties, not UTF-8 or perl-UTF-8 to properties.

So this should refer to "Loading Unicode tables".

Yes

The bottom line is I think we should say UTF-8 in almost every circumstance. The whole Encode thing was a big mistake that should be corrected in 5.24. We now know the perils of not checking input UTF-8 for well-formedness\,

When you say that do you mean that the sequences are wellformed\, or that the codepoints that they map to are properly validated?

I agree about sequence well-formedness; I don't agree about validation. I consider the following to be a perfectly valid program:

perl -wle'my $s=chr(0x10000);'

However $s will not contain UTF-8\, but instead contain utf8.
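
The utf8/UTF-8 distinction being argued here is the one the Encode module draws. The following is a rough sketch under stated assumptions: it uses `chr(0x110000)`, an above-Unicode code point chosen for illustration (U+10000 itself lies within the Unicode range), and the variable names are hypothetical. It shows how the two spellings select different encoders:

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Encode treats the two spellings as different encodings.
my $strict_name = find_encoding('UTF-8')->name;   # the strict variant
my $lax_name    = find_encoding('utf8')->name;    # Perl's lax variant

# An above-Unicode code point: legal in a Perl string, not in UTF-8 proper.
my $s = chr(0x110000);

# The lax encoder emits Perl's extended form of the code point.
my $lax_bytes = encode('utf8', $s);
my $lax_len   = length $lax_bytes;

# The strict encoder refuses it when asked to croak on unmappable input.
# A copy is passed because a CHECK argument may modify the source string.
my $copy      = $s;
my $strict_ok = eval { encode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;

printf "%s vs %s; lax length %d; strict accepted: %d\n",
    $strict_name, $lax_name, $lax_len, $strict_ok;
```

Whether documentation should mirror this spelling-based distinction, or describe the lax form as "perl's extended UTF-8", is the point in dispute in this thread.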

and at the time those decisions were made\, those perils were not understood. To put it in terms currently in the news\, we should issue a safety recall on the Encode API in this regard.

I think this is really going too far, and goes against *years* of practice in the Perl community. This is the age-old argument about whether strings are arrays of Unicode codepoints, or packed arrays of integers which happen to use the same encoding rules as UTF-8. I don't think we will ever settle that argument. And I don't think you can throw away a decade or more of this distinction just like that.

In short\, as long as we allow UTF-8 forbidden codepoints (eg\, surrogate pairs and codepoints higher than Unicode allows) in our strings then I don't think we should call it UTF-8.

cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 9 years ago

From @khwilliamson

Top posting to cut to the chase​:

The use of capitalization and presence or absence of a dash to indicate whether we accept malformed utf8 or not was wrong. Subtle distinctions\, especially ones like these that can be easily overlooked\, shouldn't have such severe consequences. The same argument applies as to whether we accept wellformed utf8 that is in one of the 3 problematic Unicode classes (surrogates\, non-characters\, and above-Unicode code points). That API should be fixed before more damage is done.
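
For reference, the three problematic classes can be told apart with simple range checks. This is an illustrative sketch only; `problem_class` is a hypothetical helper, not an existing Perl or Encode function:

```perl
use strict;
use warnings;

# Classify a code point into one of the three problematic classes, or 'ok'.
sub problem_class {
    my ($cp) = @_;
    return 'above-Unicode' if $cp > 0x10FFFF;
    return 'surrogate'     if $cp >= 0xD800 && $cp <= 0xDFFF;
    # Non-characters: U+FDD0..U+FDEF, plus the last two code points of
    # every plane (low 16 bits equal to FFFE or FFFF).
    return 'non-character' if ($cp >= 0xFDD0 && $cp <= 0xFDEF)
                           || ($cp & 0xFFFE) == 0xFFFE;
    return 'ok';
}

for my $cp (0x41, 0xD800, 0xFFFF, 0x10000, 0x110000) {
    # Perl strings accept all of these; ord() round-trips each one.
    my $round_trips = ord(chr($cp)) == $cp ? 'yes' : 'no';
    printf "U+%04X: %-13s round-trips: %s\n",
        $cp, problem_class($cp), $round_trips;
}
```

Perl strings can hold code points from all three classes, which is the behaviour the lax "utf8" name was meant to signal.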

I think we should apply "UTF-8" to everything\, and forget about the distinctions. I wouldn't object to uniformly getting rid of the dash. I don't believe we would get sued for doing these things.

The fact that these different spellings have shown up in our documentation is proof that even perl porters don't pay much attention to the distinctions\, so the average programmer is not going to notice at all.

I have no doubt that if Unicode ran out of code points, they would simply increase the number available, with all previous protestations to the contrary becoming null and void. And there are discussions in the Unicode mailing list about doing this that come up from time to time. But that's not going to happen anytime soon. They've assigned about a quarter of the 2**21 so far allocated, in more than 20 years. At that rate, it would be more than 60 more years before they would fill up, even ignoring the fact that the rate of assignment has been decreasing. There could always be some new technology that would gobble up code points: emoji might end up doing that, but Unicode is hoping to get out of the new emoji business, and there are prospects of new technology allowing this to happen.
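
The back-of-the-envelope estimate above can be reproduced as follows. This is a sketch using the round figures from the message (a quarter of 2**21 assigned in 20 years), not exact Unicode statistics:

```perl
use strict;
use warnings;

my $allocated    = 2**21;            # code space size as stated in the message
my $assigned     = $allocated / 4;   # "about a quarter" assigned so far
my $years_so_far = 20;               # "more than 20 years", taken as 20

# The remaining three quarters, filled at the same average rate.
my $remaining  = $allocated - $assigned;
my $years_left = $years_so_far * $remaining / $assigned;

print "Remaining code points: $remaining; years to fill: $years_left\n";
```

Since the elapsed time is "more than 20 years", the same ratio gives "more than 60" for the remainder.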

The bottom line IMO is that saying utf8 to mean one thing and UTF-8 to mean a more restricted thing is outrageously wrong\, although try as I might\, I can't quite blame World War I on this decision ;)

p5pRT commented 9 years ago

From @tonycoz

On Wed May 20 22​:39​:33 2015\, demerphq wrote​:

I think this is really going too far, and goes against *years* of practice in the Perl community. This is the age-old argument about whether strings are arrays of Unicode codepoints, or packed arrays of integers which happen to use the same encoding rules as UTF-8. I don't think we will ever settle that argument. And I don't think you can throw away a decade or more of this distinction just like that.

In short\, as long as we allow UTF-8 forbidden codepoints (eg\, surrogate pairs and codepoints higher than Unicode allows) in our strings then I don't think we should call it UTF-8.

How about the attached?

These only touch the sentences modified by the original patch that you didn't agree with, and make no attempt to correct other uses of "UTF-8".

Tony

p5pRT commented 9 years ago

From @tonycoz

0001-perhaps-a-sentence-khw-won-t-hate.patch

```diff
From 22d7e886408fce9067b935243253c1edfaa9c614 Mon Sep 17 00:00:00 2001
From: Tony Cook
Date: Mon, 25 May 2015 14:52:28 +1000
Subject: perhaps a sentence khw won't hate

---
 pod/perldelta.pod | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index c61596f..2549310 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -2657,9 +2657,9 @@ L<[perl #108276]|https://rt.perl.org/Ticket/Display.html?id=108276>.
 
 =item *
 
-In Perl 5.20.0, C<$^N> accidentally had the internal UTF-8 flag turned off
-if accessed from a code block within a regular expression, effectively
-UTF-8-encoding the value. This has been fixed.
+In Perl 5.20.0, the C flag was not properly propagated to C<$^N>
+if accessed from a code block within a regular expression, making the
+underlying extended UTF-8 encoding visible to perl code. This has been fixed.
 L<[perl #123135]|https://rt.perl.org/Ticket/Display.html?id=123135>.
 
 =item *
 
-- 
1.7.10.4
```

p5pRT commented 9 years ago

From @tonycoz

0002-be-more-explicit-about-perl-s-UTF-8-being-extended.patch

```diff
From a1aa16dd4ce94aa4bfc0ba756a7e3bbaff8039fc Mon Sep 17 00:00:00 2001
From: Tony Cook
Date: Mon, 25 May 2015 14:54:13 +1000
Subject: be more explicit about perl's UTF-8 being extended

---
 pod/perldelta.pod | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 2549310..9d8257a 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -1308,8 +1308,8 @@ L<\C is deprecated in regex|perldiag/"\C is deprecated in regex; marked by <-- H
 (D deprecated) The C<< /\C/ >> character class was deprecated in v5.20, and
 now emits a warning. It is intended that it will become an error in v5.24.
 This character class matches a single byte even if it appears within a
-multi-byte character, breaks encapsulation, and can corrupt UTF-8
-strings.
+multi-byte character, breaks encapsulation, and can corrupt strings using
+perl's extended UTF-8 encoding.
 
 =item *
 
@@ -2785,8 +2785,8 @@ contrary to the documentation, Now C always prevents inlining.
 =item *
 
 On some systems, such as VMS, C can return a non-ASCII string. If a
-scalar assigned to had contained a UTF-8 string previously, then C
-would not turn off the UTF-8 flag, thus corrupting the return value. This
+scalar assigned has the C flag previously, then C
+would not turn off the flag, thus corrupting the return value. This
 would happen with S>.
 
 =item *
@@ -2881,8 +2881,8 @@ mirror character.
 
 =item *
 
-C<< s///e >> on tainted UTF-8 strings corrupted C<< pos() >>. This bug,
-introduced in 5.20, is now fixed.
+C<< s///e >> on tainted C flagged strings corrupted C<< pos() >>.
+This bug, introduced in 5.20, is now fixed.
 L<[perl #122148]|https://rt.perl.org/Ticket/Display.html?id=122148>.
 
 =item *
 
-- 
1.7.10.4
```

p5pRT commented 9 years ago

From @rjbs

* Karl Williamson <public@khwilliamson.com> [2015-05-21T23:50:55]

Top posting to cut to the chase​:

The use of capitalization and presence or absence of a dash to indicate whether we accept malformed utf8 or not was wrong. Subtle distinctions\, especially ones like these that can be easily overlooked\, shouldn't have such severe consequences.

I agree completely.

(Also\, although I do not think anybody is arguing this\, I want to get it said that I'm not proposing we take any direct action against 5.22.0 as a result of this agreement!) :)

I think we should apply "UTF-8" to everything\, and forget about the distinctions. I wouldn't object to uniformly getting rid of the dash. I don't believe we would get sued for doing these things.

Well


Okay\, I think it's fair to say that I basically agree with this\, but I think it's something we have to do correctly. I don't think it's enough to just do a mechanical replacement. (I don't mean to imply that you suggested this.)

Here is a very quick from-the-hip summary of thoughts​:

* we actually need to talk about UTF-8 pretty rarely, since we normally only need to talk about "strings"
* when we talk about I/O boundaries where we truck in UTF-8, we should be explicit in what is allowed or not; "will decode UTF-8, permitting overlong sequences and trans-0x10FFFF characters" or other variants on that
* when we talk about representation in memory, we should do the same, but having a short name for the form of UTF-8-ish encoding used for all internal representations seems useful; it just shouldn't be "utf8"
* we absolutely must document the difference between UTF-8 and utf8 for the sake of understanding documentation that still uses this distinction
* I would *love* if we could always refer to "scalars with the SvUTF8 flag" as such, rather than anything that makes the flag sound like something useful to the person writing Perl. If I thought we could rename SvUTF8 to SvWIMR or something, I would point hungrily at that solution.
* probably we will find an impulse to agonize on each doc update; I predict that in most cases, it will not be worth it; the right answer will be obvious, if we have clearer terms
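
The second point, being explicit about what is accepted at an I/O boundary, might look like this in practice. This is a hedged sketch using Encode's strict `UTF-8` decoder with `FB_CROAK`; the variable names and sample bytes are illustrative assumptions:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Well-formed UTF-8 for U+00E9 decodes to a one-character string.
my $good_bytes = "\xC3\xA9";
my $good       = decode('UTF-8', $good_bytes, Encode::FB_CROAK());
my $good_len   = length $good;

# "\xC0\x80" is an overlong encoding of U+0000. Strict decoding rejects
# it rather than silently accepting it.
my $overlong    = "\xC0\x80";
my $overlong_ok = eval { decode('UTF-8', $overlong, Encode::FB_CROAK()); 1 } ? 1 : 0;

print "decoded length: $good_len; overlong accepted: $overlong_ok\n";
```

Stating the policy in the call (strict decoder, croak on error) makes the documentation question simpler: the text can say exactly what is and is not permitted, as the bullet suggests.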

So, I don't think this is a zero-effort change. I think it is worth it, though, because it is confusing to newcomers for no reason. This isn't a situation where "if you don't surprise the beginner sometimes, you'll end up surprising the experts instead." This is just a shibboleth.

In short, I think we can find clearer language, and I think that in many cases, we worry about distinctions where they are irrelevant, and can probably settle on UTF-8 in many more cases than we do.

-- rjbs

p5pRT commented 9 years ago

From @tonycoz

On Tue, May 26, 2015 at 07:53:53AM -0400, Ricardo Signes wrote:

In short, I think we can find clearer language, and I think that in many cases, we worry about distinctions where they are irrelevant, and can probably settle on UTF-8 in many more cases than we do.

I posted a patch to the ticket hopefully doing that, but it didn't seem to make it through to the list:

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=125221#txn-1350100

Tony

p5pRT commented 9 years ago

From @tonycoz

On Tue May 26 04:54:36 2015, perl.p5p@rjbs.manxome.org wrote:

* Karl Williamson <public@khwilliamson.com> [2015-05-21T23:50:55]

Top posting to cut to the chase:

The use of capitalization and presence or absence of a dash to indicate whether we accept malformed utf8 or not was wrong. Subtle distinctions, especially ones like these that can be easily overlooked, shouldn't have such severe consequences.

I agree completely.

(Also, although I do not think anybody is arguing this, I want to get it said that I'm not proposing we take any direct action against 5.22.0 as a result of this agreement!) :)

I think we should apply "UTF-8" to everything, and forget about the distinctions. I wouldn't object to uniformly getting rid of the dash. I don't believe we would get sued for doing these things.

Well


Okay, I think it's fair to say that I basically agree with this, but I think it's something we have to do correctly. I don't think it's enough to just do a mechanical replacement. (I don't mean to imply that you suggested this.)

Here is a very quick from-the-hip summary of thoughts:

* we actually need to talk about UTF-8 pretty rarely, since we normally only need to talk about "strings"
* when we talk about I/O boundaries where we truck in UTF-8, we should be explicit in what is allowed or not; "will decode UTF-8, permitting overlong sequences and trans-0x10FFFF characters" or other variants on that
* when we talk about representation in memory, we should do the same, but having a short name for the form of UTF-8-ish encoding used for all internal representations seems useful; it just shouldn't be "utf8"
* we absolutely must document the difference between UTF-8 and utf8 for the sake of understanding documentation that still uses this distinction
* I would *love* if we could always refer to "scalars with the SvUTF8 flag" as such, rather than anything that makes the flag sound like something useful to the person writing Perl. If I thought we could rename SvUTF8 to SvWIMR or something, I would point hungrily at that solution.
* probably we will find an impulse to agonize on each doc update; I predict that in most cases, it will not be worth it; the right answer will be obvious, if we have clearer terms

So, I don't think this is a zero-effort change. I think it is worth it, though, because it is confusing to newcomers for no reason. This isn't a situation where "if you don't surprise the beginner sometimes, you'll end up surprising the experts instead." This is just a shibboleth.

In short, I think we can find clearer language, and I think that in many cases, we worry about distinctions where they are irrelevant, and can probably settle on UTF-8 in many more cases than we do.

The original purpose of this ticket is complete, as the patch to the 5.22.0 perldelta has been applied, and 5.22.0 has been released.

Should a new ticket be opened for discussing the future use of UTF-8/utf8/Unicode?

Tony

p5pRT commented 9 years ago

From @tonycoz

On Mon Jul 20 22:37:40 2015, tonyc wrote:

The original purpose of this ticket is complete, as the patch to the 5.22.0 perldelta has been applied, and 5.22.0 has been released.

So closing this ticket.

Should a new ticket be opened for discussing the future use of UTF-8/utf8/Unicode?

No response, no new ticket.

Tony

p5pRT commented 9 years ago

@tonycoz - Status changed from 'open' to 'resolved'