Rename utf8::is_utf8() (and other functions)

p5pRT commented 7 years ago

Migrated from rt.perl.org#131685 (status was 'open')

Searchable as RT131685$

p5pRT commented 7 years ago

From @pali

Hi!

This is a continuation of the original discussion about renaming utf8::is_utf8() to utf8::is_upgraded(), which can be found at: https://www.nntp.perl.org/group/perl.perl5.porters/2017/02/msg243068.html

The problem is that many Perl modules use this incorrect code pattern:

    use utf8;

    my $value = func();
    if (utf8::is_utf8($value)) {
        utf8::encode($value);
    }

In most cases module developers think that utf8::is_utf8() returns true when the argument needs to be manually encoded into UTF-8 bytes, which is of course wrong.
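To make the failure mode concrete, here is a minimal sketch of why the flag check is the wrong test. The helper names `encode_if_flagged` and `encode_always` are made up for illustration; only the `utf8::` functions are real. It assumes an ASCII (non-EBCDIC) platform.

```perl
use strict;
use warnings;

# Hypothetical helper: the flawed pattern, which encodes only flagged strings.
sub encode_if_flagged {
    my ($value) = @_;
    utf8::encode($value) if utf8::is_utf8($value);
    return $value;
}

# Hypothetical helper: encode every character string, whatever its storage.
sub encode_always {
    my ($value) = @_;
    utf8::encode($value);
    return $value;
}

# The same one-character string (U+00E9) in both internal representations.
my $native = "\xE9";
my $wide   = "\xE9";
utf8::upgrade($wide);

for my $value ($native, $wide) {
    printf "flag=%d  flawed=%s  correct=%s\n",
        (utf8::is_utf8($value) ? 1 : 0),
        unpack("H*", encode_if_flagged($value)),
        unpack("H*", encode_always($value));
}
```

Both strings compare equal with `eq`, yet the flawed helper leaves the unflagged one as the single byte `e9`, which is not valid UTF-8, while the unconditional helper produces `c3a9` for both.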

The reason for this is the poor name of utf8::is_utf8() and its poor documentation.

The functions utf8::is_utf8(), utf8::upgrade() and utf8::downgrade() inspect or change the internal string representation, which is fully invisible to pure Perl code, and therefore I think all of them should be in the Internals namespace.

I'm proposing the following renames:

    utf8::is_utf8()   --> Internals::uses_string_wide_storage()
    utf8::upgrade()   --> Internals::upgrade_string_to_wide_storage()
    utf8::downgrade() --> Internals::downgrade_string_from_wide_storage()

Plus adding backward-compatible aliases so that existing code works as before.

As all those functions should be used only for debugging purposes (e.g. test cases for XS code) or when dealing with a buggy XS module, I'm proposing to start throwing a warning (e.g. since v5.28.0) when those functions are called. Those who are dealing with internals can turn the warning off with no warnings 'experimental::internal';

I'm attaching patches which:

* Add new warning category 'experimental::internal'
* Rename utf8 functions
* Update perldoc utf8 documentation

p5pRT commented 7 years ago

From @pali

0001-Add-new-warning-category-experimental-internal.patch

```diff From c7b1fcfd26a2500662a10e345691eda3f3f32039 Mon Sep 17 00:00:00 2001 From: Pali Date: Sat, 1 Jul 2017 17:33:45 +0200 Subject: [PATCH 1/3] Add new warning category experimental::internal This category is for internal perl functions which should not be used in normal perl code, unless dealing with perl internals. --- lib/warnings.pm | 19 +++++++++++++------ regen/warnings.pl | 2 ++ warnings.h | 4 ++++ 3 files changed, 19 insertions(+), 6 deletions(-) diff --git a/lib/warnings.pm b/lib/warnings.pm index 2ae1bb4..7b27e4a 100644 --- a/lib/warnings.pm +++ b/lib/warnings.pm @@ -96,10 +96,13 @@ our %Offsets = ( # Warnings Categories added in Perl 5.025 'experimental::declared_refs' => 132, + + # Warnings Categories added in Perl 5.028 + 'experimental::internal' => 134, ); our %Bits = ( - 'all' => "\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x15", # [0..66] + 'all' => "\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55\x55", # [0..67] 'ambiguous' => "\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [29] 'bareword' => "\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [30] 'closed' => "\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [6] @@ -109,10 +112,11 @@ our %Bits = ( 'digit' => "\x00\x00\x00\x00\x00\x00\x00\x40\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [31] 'exec' => "\x00\x40\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [7] 'exiting' => "\x40\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [3] - 'experimental' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40\x55\x51\x15\x10", # [51..56,58..62,66] + 'experimental' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40\x55\x51\x15\x50", # [51..56,58..62,66,67] 'experimental::bitwise' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00", # [58] 'experimental::const_attr' => 
"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40\x00\x00", # [59] 'experimental::declared_refs' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10", # [66] + 'experimental::internal' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40", # [67] 'experimental::lexical_subs' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00", # [52] 'experimental::postderef' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x40\x00\x00\x00", # [55] 'experimental::re_strict' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00", # [60] @@ -169,7 +173,7 @@ our %Bits = ( ); our %DeadBits = ( - 'all' => "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\x2a", # [0..66] + 'all' => "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa", # [0..67] 'ambiguous' => "\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [29] 'bareword' => "\x00\x00\x00\x00\x00\x00\x00\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [30] 'closed' => "\x00\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [6] @@ -179,10 +183,11 @@ our %DeadBits = ( 'digit' => "\x00\x00\x00\x00\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [31] 'exec' => "\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [7] 'exiting' => "\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", # [3] - 'experimental' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\xaa\xa2\x2a\x20", # [51..56,58..62,66] + 'experimental' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\xaa\xa2\x2a\xa0", # [51..56,58..62,66,67] 'experimental::bitwise' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x20\x00\x00", # [58] 'experimental::const_attr' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x00\x00", # [59] 'experimental::declared_refs' => 
"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x20", # [66] + 'experimental::internal' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80", # [67] 'experimental::lexical_subs' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00", # [52] 'experimental::postderef' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x00\x00\x00", # [55] 'experimental::re_strict' => "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00", # [60] @@ -240,8 +245,8 @@ our %DeadBits = ( # These are used by various things, including our own tests our $NONE = "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"; -our $DEFAULT = "\x10\x01\x00\x00\x00\x50\x04\x00\x00\x00\x00\x00\x00\x55\x51\x55\x10", # [2,4,22,23,25,52..56,58..63,66] -our $LAST_BIT = 134 ; +our $DEFAULT = "\x10\x01\x00\x00\x00\x50\x04\x00\x00\x00\x00\x00\x00\x55\x51\x55\x50", # [2,4,22,23,25,52..56,58..63,66,67] +our $LAST_BIT = 136 ; our $BYTES = 17 ; our $All = "" ; vec($All, $Offsets{'all'}, 2) = 3 ; @@ -732,6 +737,8 @@ The current hierarchy is: | | | +- experimental::declared_refs | | + | +- experimental::internal + | | | +- experimental::lexical_subs | | | +- experimental::postderef diff --git a/regen/warnings.pl b/regen/warnings.pl index 5721c17..36ce14b 100644 --- a/regen/warnings.pl +++ b/regen/warnings.pl @@ -107,6 +107,8 @@ my $tree = { [ 5.021, DEFAULT_ON ], 'experimental::declared_refs' => [ 5.025, DEFAULT_ON ], + 'experimental::internal' => + [ 5.028, DEFAULT_ON ], }], 'missing' => [ 5.021, DEFAULT_OFF], diff --git a/warnings.h b/warnings.h index 0166837..72e27a2 100644 --- a/warnings.h +++ b/warnings.h @@ -115,6 +115,10 @@ #define WARN_EXPERIMENTAL__DECLARED_REFS 66 +/* Warnings Categories added in Perl 5.028 */ + +#define WARN_EXPERIMENTAL__INTERNAL 67 + #define WARNsize 17 #define WARN_ALLstring "\125\125\125\125\125\125\125\125\125\125\125\125\125\125\125\125\125" #define WARN_NONEstring "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0" -- 1.7.9.5 ```

p5pRT commented 7 years ago

From @pali

0002-Mark-functions-utf8-is_utf8-utf8-upgrade-utf8-downgr.patch

```diff
From d763a8a4b85b53ebc5b05ba1b0a64daf9df6c2e2 Mon Sep 17 00:00:00 2001
From: Pali
Date: Sat, 1 Jul 2017 17:41:15 +0200
Subject: [PATCH 2/3] Mark functions utf8::is_utf8(), utf8::upgrade(), utf8::downgrade() as Internal

Move all those functions into Internals namespace, throw new warning
experimental::internal warning when used and provide backward compatible
deprecated aliases (for make existing code still work). In most cases all
those functions are incorrectly used due to poor names and not proper
documentation. Those functions are internal and should not be used unless
debugging perl or dealing with broken XS modules.
---
 universal.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 55 insertions(+), 10 deletions(-)

diff --git a/universal.c b/universal.c
index be39310..20f1d53 100644
--- a/universal.c
+++ b/universal.c
@@ -422,10 +422,24 @@ XS(XS_UNIVERSAL_DOES)
     }
 }
 
-XS(XS_utf8_is_utf8); /* prototype to pass -Wmissing-prototypes */
-XS(XS_utf8_is_utf8)
+XS(XS_Internals_uses_string_wide_storage); /* prototype to pass -Wmissing-prototypes */
+XS(XS_Internals_uses_string_wide_storage)
 {
-    dXSARGS;
+    dXSARGS;
+    const GV *const gv = CvGV(cv);
+    const HV *const stash = gv ? GvSTASH(gv) : NULL;
+    const char *const hvname = stash ? HvNAME(stash) : NULL;
+
+    if (hvname && strcmp(hvname, "utf8") == 0) {
+        Perl_ck_warner_d(aTHX_
+                         packWARN(WARN_DEPRECATED),
+                         "utf8::is_utf8() is internal and deprecated function, look into perldoc utf8");
+    } else {
+        Perl_ck_warner_d(aTHX_
+                         packWARN(WARN_INTERNAL),
+                         "Internals::uses_string_wide_storage() is experimental internal perl function");
+    }
+
     if (items != 1)
         croak_xs_usage(cv, "sv");
     else {
@@ -485,10 +499,24 @@ XS(XS_utf8_decode)
     XSRETURN(1);
 }
 
-XS(XS_utf8_upgrade); /* prototype to pass -Wmissing-prototypes */
-XS(XS_utf8_upgrade)
+XS(XS_Internals_upgrade_string_to_wide_storage); /* prototype to pass -Wmissing-prototypes */
+XS(XS_Internals_upgrade_string_to_wide_storage)
 {
     dXSARGS;
+    const GV *const gv = CvGV(cv);
+    const HV *const stash = gv ? GvSTASH(gv) : NULL;
+    const char *const hvname = stash ? HvNAME(stash) : NULL;
+
+    if (hvname && strcmp(hvname, "utf8") == 0) {
+        Perl_ck_warner_d(aTHX_
+                         packWARN(WARN_DEPRECATED),
+                         "utf8::upgrade() is internal and deprecated function, look into perldoc utf8");
+    } else {
+        Perl_ck_warner_d(aTHX_
+                         packWARN(WARN_INTERNAL),
+                         "Internals::upgrade_string_to_wide_storage() is experimental internal perl function");
+    }
+
     if (items != 1)
         croak_xs_usage(cv, "sv");
     else {
@@ -502,10 +530,24 @@ XS(XS_utf8_upgrade)
     XSRETURN(1);
 }
 
-XS(XS_utf8_downgrade); /* prototype to pass -Wmissing-prototypes */
-XS(XS_utf8_downgrade)
+XS(XS_Internals_downgrade_string_from_wide_storage); /* prototype to pass -Wmissing-prototypes */
+XS(XS_Internals_downgrade_string_from_wide_storage)
 {
     dXSARGS;
+    const GV *const gv = CvGV(cv);
+    const HV *const stash = gv ? GvSTASH(gv) : NULL;
+    const char *const hvname = stash ? HvNAME(stash) : NULL;
+
+    if (hvname && strcmp(hvname, "utf8") == 0) {
+        Perl_ck_warner_d(aTHX_
+                         packWARN(WARN_DEPRECATED),
+                         "utf8::downgrade() is internal and deprecated function, look into perldoc utf8");
+    } else {
+        Perl_ck_warner_d(aTHX_
+                         packWARN(WARN_EXPERIMENTAL__INTERNAL),
+                         "Internals::downgrade_string_from_wide_storage() is experimental internal function");
+    }
+
     if (items < 1 || items > 2)
         croak_xs_usage(cv, "sv, failok=0");
     else {
@@ -1000,14 +1042,17 @@ static const struct xsub_details details[] = {
 #define VXS_XSUB_DETAILS
 #include "vxs.inc"
 #undef VXS_XSUB_DETAILS
-    {"utf8::is_utf8", XS_utf8_is_utf8, NULL},
+    {"utf8::is_utf8", XS_Internals_uses_string_wide_storage, NULL},
     {"utf8::valid", XS_utf8_valid, NULL},
     {"utf8::encode", XS_utf8_encode, NULL},
     {"utf8::decode", XS_utf8_decode, NULL},
-    {"utf8::upgrade", XS_utf8_upgrade, NULL},
-    {"utf8::downgrade", XS_utf8_downgrade, NULL},
+    {"utf8::upgrade", XS_Internals_upgrade_string_to_wide_storage, NULL},
+    {"utf8::downgrade", XS_Internals_downgrade_string_from_wide_storage, NULL},
     {"utf8::native_to_unicode", XS_utf8_native_to_unicode, NULL},
     {"utf8::unicode_to_native", XS_utf8_unicode_to_native, NULL},
+    {"Internals::uses_string_wide_storage", XS_Internals_uses_string_wide_storage, NULL},
+    {"Internals::upgrade_string_to_wide_storage", XS_Internals_upgrade_string_to_wide_storage, NULL},
+    {"Internals::downgrade_string_from_wide_storage", XS_Internals_downgrade_string_from_wide_storage, NULL},
     {"Internals::SvREADONLY", XS_Internals_SvREADONLY, "\\[$%@];$"},
     {"Internals::SvREFCNT", XS_Internals_SvREFCNT, "\\[$%@];$"},
     {"Internals::hv_clear_placeholders", XS_Internals_hv_clear_placehold, "\\%"},
-- 
1.7.9.5
```

p5pRT commented 7 years ago

From @pali

0003-Update-documentation-in-perldoc-utf8.patch

```diff From e5b0bbd18075ea178708f5da32beee3570751f0e Mon Sep 17 00:00:00 2001 From: Pali Date: Sat, 1 Jul 2017 17:46:05 +0200 Subject: [PATCH 3/3] Update documentation in perldoc utf8 Add information about new internal functions and update documentation for wide string storage functions. --- lib/utf8.pm | 93 ++++++++++++++++++++++++++++++++++------------------------- 1 file changed, 54 insertions(+), 39 deletions(-) diff --git a/lib/utf8.pm b/lib/utf8.pm index 324cb87..84a96ae 100644 --- a/lib/utf8.pm +++ b/lib/utf8.pm @@ -31,14 +31,8 @@ utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code use utf8; no utf8; - # Convert the internal representation of a Perl scalar to/from UTF-8. - - $num_octets = utf8::upgrade($string); - $success = utf8::downgrade($string[, $fail_ok]); - # Change each character of a Perl scalar to/from a series of # characters that represent the UTF-8 bytes of each original character. - utf8::encode($string); # "\x{100}" becomes "\xc4\x80" utf8::decode($string); # "\xc4\x80" becomes "\x{100}" @@ -51,7 +45,6 @@ utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code # platforms; 193 on # EBCDIC - $flag = utf8::is_utf8($string); # since Perl 5.8.1 $flag = utf8::valid($string); =head1 DESCRIPTION @@ -105,39 +98,46 @@ you should not say that unless you really want to have UTF-8 source code. =item * C<$num_octets = utf8::upgrade($string)> -(Since Perl v5.8.0) -Converts in-place the internal representation of the string from an octet -sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The -logical character sequence itself is unchanged. If I<$string> is already -stored as UTF-8, then this is a no-op. Returns the -number of octets necessary to represent the string as UTF-8. Can be -used to make sure that the UTF-8 flag is on, so that C<\w> or C -work as Unicode on strings containing non-ASCII characters whose code points -are below 256. 
+[INTERNAL] (Since Perl v5.8.0) Deprecated compatibility-supporting alias of +C -B; -use L instead. +=item * C<$num_octets = Internals::upgrade_string_to_wide_storage($string)> -=item * C<$success = utf8::downgrade($string[, $fail_ok])> +[INTERNAL] (Since Perl v5.28.0) -(Since Perl v5.8.0) -Converts in-place the internal representation of the string from -UTF-8 to the equivalent octet sequence in the native encoding (Latin-1 -or EBCDIC). The logical character sequence itself is unchanged. If -I<$string> is already stored as native 8 bit, then this is a no-op. Can -be used to -make sure that the UTF-8 flag is off, e.g. when you want to make sure -that the substr() or length() function works with the usually faster -byte algorithm. - -Fails if the original UTF-8 sequence cannot be represented in the -native 8 bit encoding. On failure dies or, if the value of I<$fail_ok> is -true, returns false. +Converts in-place the internal representation of the string to wide storage +(which can store characters above U+0000FF). The logical character sequence +itself is unchanged. If I<$string> is already stored in wide storage then +this is a no-op. Returns the number of bytes necessary to represent the +string in wide storage. + +Internal string storage is invisible for pure perl code and perl itself call +this function automatically when needed. Therefore there is no reason to call +this function unless you are debugging internal perl C or XS code. + +=item * C<$num_octets = Internals::downgrade_string_from_wide_storage($string[, $fail_ok])> + +[INTERNAL] (Since Perl v5.28.0) + +Converts in-place the internal representation of the string from wide storage +(which can store characters above U+0000FF) to small non-wide 8 bit storage +(which can store only 8 bit characters). The logical character sequence +itself is unchanged. If I<$string> is already stored in non-wide 8 bit storage, +then this is a no-op. 
+ +Fails if the original I<$string> cannot be represented in the native 8 bit +encoding. On failure dies or, if the value of I<$fail_ok> is true, returns false. Returns true on success. -B; -use L instead. +Internal string storage is invisible for pure perl code and perl itself call +this function automatically when needed. Therefore there is no reason to call +this function unless you are debugging internal perl C or XS code. + +=item * C<$success = utf8::downgrade($string[, $fail_ok])> + +[INTERNAL] (Since Perl v5.8.0) Deprecated compatibility-supporting alias of +C =item * C @@ -207,17 +207,32 @@ platforms, so there is no performance hit in using it there. =item * C<$flag = utf8::is_utf8($string)> -(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in -UTF-8. Functionally the same as C. +[INTERNAL] (Since Perl v5.8.1) Deprecated compatibility-supporting (but poorly +named) alias of C. It does B check +if string is encoded in UTF-8. + +=item * C<$flag = Internals::uses_string_wide_storage($string)> + +[INTERNAL] (Since Perl v5.28.0) + +Test whether C<$string>'s internal representation storage is wide (which can +store characters above U+0000FF). Note that C<$string> can, but does not have +to contain wide characters. It bears no impact on whether that string is +actually utf8 or not. + +Internal string storage is invisible for pure perl code and perl itself call +change storage automatically when needed. This function should not be used +unless you are debugging internal perl C or XS code. =item * C<$flag = utf8::valid($string)> -[INTERNAL] Test whether I<$string> is in a consistent state regarding +[INTERNAL] (Since Perl v5.8.0) + +Test whether I<$string> is in a consistent state regarding UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8 flag on B if I<$string> is held as bytes (both these states are 'consistent'). 
Main reason for this routine is to allow Perl's test suite to check -that operations have left strings in a consistent state. You most -probably want to use C instead. +that operations have left strings in a consistent state. =back -- 1.7.9.5 ```

p5pRT commented 7 years ago

From @tux

On Sat, 01 Jul 2017 09:03:18 -0700, (via RT) perlbug-followup@perl.org wrote:

> # New Ticket Created by
> # Please include the string: [perl #131685]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=131685 >
>
> Hi!
>
> This is continuation from original discussion about renaming utf8::is_utf8() to utf8::is_upgraded() which can be found at: https://www.nntp.perl.org/group/perl.perl5.porters/2017/02/msg243068.html
>
> Problem is that in more perl modules is used this incorrect code pattern:
>
> use utf8;
>
> my $value = func(); if (utf8::is_utf8($value)) { utf8::encode($value); }
>
> In most cases module developers think that utf8::is_utf8() returns true when it is needed to manually encode argument into UTF-8 bytes. Which is of course wrong.
>
> Reason for this is poor name of function utf8::is_utf8() and also poor documentation about this function.
>
> Functions utf8::is_utf8(), utf8::upgrade() and utf8::downgrade() changes internal string representation, which is fully invisible for pure perl code, and therefore I think all those functions should be in Internals namespace.
>
> I'm proposing following rename of functions:
>
> utf8::is_utf8() --> Internals::uses_string_wide_storage()
> utf8::upgrade() --> Internals::upgrade_string_to_wide_storage()
> utf8::downgrade() --> Internals::downgrade_string_from_wide_storage()

I am still objecting, as this will also break code that uses those functions as intended and correctly.

As these are not XS, Devel::PPPort won't help (assuming authors use D::P on XS modules to guarantee backward compat).

I'd be loath to change/fix every occurrence of code that uses any of these three correctly, as that code is brittle to start with and probably hard to fix when broken.

> Plus adding backward compatible aliases to make existing code works like before.

Then why add new functions in the first place?

> As all those functions should be used only for debugging purposes (e.g. test cases for XS code) or when dealing with buggy XS module, I'm proposing starting to throw warning (e.g. since v5.28.0) when those functions are called. For those who are dealing with internals, can turn warning off by no warnings 'experimental::internal';

No, please. Most correct uses will be in dark distant corners, hidden in modules you don't want to touch anyway.

> I'm attaching patches which:
>
> * Add new warning category 'experimental::internal'
> * Rename utf8 functions
> * Update perldoc utf8 documentation

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented 7 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 7 years ago

From @Leont

On Sat, Jul 1, 2017 at 6:03 PM, via RT perlbug-followup@perl.org wrote:

> Hi!
>
> This is continuation from original discussion about renaming utf8::is_utf8() to utf8::is_upgraded() which can be found at: https://www.nntp.perl.org/group/perl.perl5.porters/2017/02/msg243068.html
>
> Problem is that in more perl modules is used this incorrect code pattern:
>
> use utf8;
>
> my $value = func(); if (utf8::is_utf8($value)) { utf8::encode($value); }
>
> In most cases module developers think that utf8::is_utf8() returns true when it is needed to manually encode argument into UTF-8 bytes. Which is of course wrong.
>
> Reason for this is poor name of function utf8::is_utf8() and also poor documentation about this function.
>
> Functions utf8::is_utf8(), utf8::upgrade() and utf8::downgrade() changes internal string representation, which is fully invisible for pure perl code, and therefore I think all those functions should be in Internals namespace.
>
> I'm proposing following rename of functions:
>
> utf8::is_utf8() --> Internals::uses_string_wide_storage()
> utf8::upgrade() --> Internals::upgrade_string_to_wide_storage()
> utf8::downgrade() --> Internals::downgrade_string_from_wide_storage()
>
> Plus adding backward compatible aliases to make existing code works like before.
>
> As all those functions should be used only for debugging purposes (e.g. test cases for XS code) or when dealing with buggy XS module, I'm proposing starting to throw warning (e.g. since v5.28.0) when those functions are called. For those who are dealing with internals, can turn warning off by no warnings 'experimental::internal';
>
> I'm attaching patches which:
>
> * Add new warning category 'experimental::internal'
> * Rename utf8 functions
> * Update perldoc utf8 documentation

I don't see how this is an option. I'll grant you that something like this would have been a better option back then but you're 15 years too late. "This would have been better" is no excuse to break a decade and a half of software.

Leon

p5pRT commented 7 years ago

From @pali

On Saturday 01 July 2017 19:13:30 you wrote:

> to break a decade and a half of software.

Hm? What do you mean by "break"? Existing functions would still work; there would just also be new functions under new names. Usage of the old functions is just removed from the documentation.

p5pRT commented 7 years ago

From @pali

On Saturday 01 July 2017 18:54:24 you wrote:

> > Plus adding backward compatible aliases to make existing code works like before.
>
> Then why add new functions in the first place?

From the discussion it was clear that the current name utf8::is_utf8() is poor and is the reason why it is used incorrectly.

p5pRT commented 7 years ago

From @Leont

On Sat, Jul 1, 2017 at 7:45 PM, pali@cpan.org wrote:

> On Saturday 01 July 2017 19:13:30 you wrote:
>
> > to break a decade and a half of software.
>
> Hm? What you mean with to break? Existing functions would still work, just there are also new functions under new names. Usage of old functions is just removed from documentation.

Then I misunderstood your proposal, "rename" suggested to me that the old ones disappear. In that case I'm not sure I see the benefit of your proposal. Why would anyone want to use an interface that won't work on perls older than 5.28, and could disappear in a future version of perl (since that's the point of Internals::)? This isn't making sense to me.

Leon
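The portability concern can be illustrated with a small runtime fallback. This is only a sketch: `Internals::uses_string_wide_storage()` is the name proposed in this ticket, and it is an assumption that any given perl provides it, so the code falls back to the long-standing `utf8::is_utf8()`.

```perl
use strict;
use warnings;

# Look the proposed function up by name at run time. On a perl that does not
# provide it, can() returns a false value and the existing name is used.
my $uses_wide_storage = Internals->can('uses_string_wide_storage')
    || \&utf8::is_utf8;

my $string = "abc";
print $uses_wide_storage->($string) ? "wide\n" : "native\n";

utf8::upgrade($string);
print $uses_wide_storage->($string) ? "wide\n" : "native\n";
```

Code written this way runs on old and new perls alike, but every caller has to carry the shim, which is the cost being questioned here.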

p5pRT commented 7 years ago

From @xsawyerx

On 07/01/2017 01:52 PM, Leon Timmermans wrote:

> On Sat, Jul 1, 2017 at 7:45 PM, <pali@cpan.org> wrote:
>
> > On Saturday 01 July 2017 19:13:30 you wrote:
> >
> > > to break a decade and a half of software.
> >
> > Hm? What you mean with to break? Existing functions would still work, just there are also new functions under new names. Usage of old functions is just removed from documentation.
>
> Then I misunderstood your proposal, "rename" suggested to me that the old ones disappear. In that case I'm not sure I see the benefit of your proposal. Why would anyone want to use an interface that won't work on perls older than 5.28, and could disappear in a future version of perl (since that's the point of Internals::)? This isn't making sense to me.

You could support it with Devel::PPPort. It's a simple addition.

However, the problem remains that if someone were to use these new functions without PPPort, their code would not work on older versions. I can't see a way around that.

p5pRT commented 7 years ago

From @tonycoz

On Mon, Jul 03, 2017 at 01:03:37PM -0400, Sawyer X wrote:

> On 07/01/2017 01:52 PM, Leon Timmermans wrote:
>
> > On Sat, Jul 1, 2017 at 7:45 PM, <pali@cpan.org> wrote:
> >
> > > On Saturday 01 July 2017 19:13:30 you wrote:
> > >
> > > > to break a decade and a half of software.
> > >
> > > Hm? What you mean with to break? Existing functions would still work, just there are also new functions under new names. Usage of old functions is just removed from documentation.
> >
> > Then I misunderstood your proposal, "rename" suggested to me that the old ones disappear. In that case I'm not sure I see the benefit of your proposal. Why would anyone want to use an interface that won't work on perls older than 5.28, and could disappear in a future version of perl (since that's the point of Internals::)? This isn't making sense to me.
>
> You could support it with Devel::PPPort. It's a simple addition.
>
> However, the problem remains that if someone were to use these new functions without PPPort, their code would not work on older versions. I can't see a way around that.

These are perl functions (as documented in utf8.pm), not C functions, Devel::PPPort does nothing for us.

The patch retains the old names, so that isn't an issue.

But it does deprecate the old names, which is an issue, I can't imagine us removing these functions.

As a side note, the original thread refers to:

https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive-Tar/lib/Archive/Tar.pm#L1501

which I could see as correct because of the way perl's unicode support (fails to) deal with filenames.

Tony

p5pRT commented 7 years ago

From @grinnz

On Mon, Jul 3, 2017 at 8:38 PM, Tony Cook tony@develop-help.com wrote:

> As a side note, the original thread refers to:
>
> https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive-Tar/lib/Archive/Tar.pm#L1501
>
> which I could see as correct because of the way perl's unicode support (fails to) deal with filenames.
>
> Tony

Not entirely correct IMO. If the intent is that filenames be encoded to UTF-8, this will fail to encode downgraded names with non-ascii characters.

-Dan

p5pRT commented 7 years ago

From @tonycoz

On Mon, Jul 03, 2017 at 09:35:06PM -0400, Dan Book wrote:

> On Mon, Jul 3, 2017 at 8:38 PM, Tony Cook tony@develop-help.com wrote:
>
> > As a side note, the original thread refers to:
> >
> > https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive-Tar/lib/Archive/Tar.pm#L1501
> >
> > which I could see as correct because of the way perl's unicode support (fails to) deal with filenames.
> >
> > Tony
>
> Not entirely correct IMO. If the intent is that filenames be encoded to UTF-8, this will fail to encode downgraded names with non-ascii characters.

If the caller creates a file using the name they pass in, encoding the name (which might not be utf-8 marked) may make the later -e or -l check fail.

Perl functions such as open and stat currently ignore the UTF-8 flag, which makes this pretty messy.

The code in Archive::Tar seems a reasonable workaround to me; I don't think the author had much choice.

Tony

p5pRT commented 7 years ago

From @pali

On Monday 03 July 2017 21:35:06 Dan Book wrote:

> On Mon, Jul 3, 2017 at 8:38 PM, Tony Cook tony@develop-help.com wrote:
>
> > As a side note, the original thread refers to:
> >
> > https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive-Tar/lib/Archive/Tar.pm#L1501
> >
> > which I could see as correct because of the way perl's unicode support (fails to) deal with filenames.
> >
> > Tony
>
> Not entirely correct IMO. If the intent is that filenames be encoded to UTF-8, this will fail to encode downgraded names with non-ascii characters.
>
> -Dan

See bug: https://rt.perl.org/Public/Bug/Display.html?id=130831

p5pRT commented 7 years ago

From @pali

On Tuesday 04 July 2017 10:38:26 Tony Cook wrote:

> But it does deprecate the old names, which is an issue, I can't imagine us removing these functions.

The warning can be removed from the patch; that is just a question of what you decide. The functions also stay there, but we can instruct people via documentation to use the new functions in new code... Again, it is a question of whether you call it deprecation or aliasing. In any case the functions are not going to be deleted, so in the end it does not matter for old code.

And for old code this function can be defined easily:

    *new_name = *old_name;

The reasons for this patch series are:

* document those utf8:: functions
* allow developers to call those functions via non-cryptic names
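As a rough illustration of that aliasing idea, descriptive names can already be layered over the existing functions in pure Perl. The package `My::StringStorage` and the three sub names below are hypothetical, chosen only for this sketch. It assigns code references to the globs rather than whole globs, so only the sub slot is aliased.

```perl
use strict;
use warnings;

# Hypothetical package and sub names, chosen only for this sketch.
package My::StringStorage;

# Each new name refers to the very same built-in function, so in-place
# modification of the caller's argument keeps working.
*uses_wide_storage           = \&utf8::is_utf8;
*upgrade_to_wide_storage     = \&utf8::upgrade;
*downgrade_from_wide_storage = \&utf8::downgrade;

package main;

my $string = "caf\xE9";
My::StringStorage::upgrade_to_wide_storage($string);
print "wide after upgrade\n"
    if My::StringStorage::uses_wide_storage($string);

My::StringStorage::downgrade_from_wide_storage($string);
print "native after downgrade\n"
    unless My::StringStorage::uses_wide_storage($string);
```

Nothing here needs a new perl release, which is why the thread treats the naming question separately from the deprecation warning.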

p5pRT commented 7 years ago

From @demerphq

On 4 July 2017 at 09:19, pali@cpan.org wrote:

> On Tuesday 04 July 2017 10:38:26 Tony Cook wrote:
>
> > But it does deprecate the old names, which is an issue, I can't imagine us removing these functions.
>
> Warning can be removed from patch. It is just question how you decide. Also functions stay there, but we can instruct people via documentation to use new functions for a new code... Again it is question if you call it deprecation or aliasing. In any case functions are not going to be deleted, so in final case it does not matter for old code.
>
> And for old code can be defined this function easily:
>
>     *new_name = *old_name;
>
> Reason for this patch series is:
>
> * document those utf8:: functions
> * allow developers to call those functions via non-cryptic names

I don't mind adding new aliases for these functions; I object to your proposal to put them in Internals, however. I think that they should go in 'scalar', which we decided at the last PerlQA is the designated place for functions that operate on scalars.

    scalar::is_unicode_string()
    scalar::is_binary_string()

I don't like the wide-storage thing (although I admit I think it better than "is_utf8"): a latin1 string in utf8 does not use wide storage, and the unicode flag has significance beyond the storage format; utf8-on strings get unicode semantics in case insensitive operations.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 7 years ago

From @pali

On Tuesday 04 July 2017 01:52:29 yves orton via RT wrote:

On 4 July 2017 at 09:19\, \pali@cpan\.org wrote:

On Tuesday 04 July 2017 10:38:26 Tony Cook wrote:

But it does deprecate the old names\, which is an issue\, I can't imagine us removing these functions.

Warning can be removed from patch. It is just question how you decide. Also functions stay there\, but we can instruct people via documentation to use new functions for a new code... Again it is question if you call it deprecation or aliasing. In any case functions are not going to be deleted\, so in final case it does not matter for old code.

And for old code can be defined this function easily:

*new_name = *old_name;

Reason for this patch series is: * document those utf8:: functions * allow developers to call those functions via non-cryptic names

I dont mind adding new aliases for these functions\, I object to your proposal to put them in Internals however; I think that they should go in 'scalar'\, which we decided at the last PerlQA is the designated place for functions that operate on scalars.

I proposed Internals\, because that flag is internal for perl and invisible for pure perl code. But if more people are happy with scalar namespace\, I'm fine with it.

scalar::is_unicode_string() scalar::is_binary_string()

But this is wrong! SVf_UTF8 does not tell whether a scalar string is Unicode or binary. It just tells the type of internal storage.

The name is_binary_string is misleading in the same way as the current name is_utf8.

If you say that a binary string is one with codes only in the range 0x00-0xFF, then you can have that binary string with the SVf_UTF8 flag set as well, and your function "is_binary_string" would return false for your binary string. Such a name would lead to further problems.
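A small sketch of that situation, using only the existing `utf8::upgrade()` and `utf8::is_utf8()` functions (the variable names are just for illustration): the same four "binary" code points are held once with the flag off and once with it on, and string comparison and length do not see the difference.

```perl
use strict;
use warnings;

my $bytes    = "\x00\x7F\x80\xFF";   # four code points, all in 0x00-0xFF
my $upgraded = $bytes;
utf8::upgrade($upgraded);            # changes only the internal representation

print "flag on original: ", (utf8::is_utf8($bytes)    ? "on"  : "off"), "\n";
print "flag on copy:     ", (utf8::is_utf8($upgraded) ? "on"  : "off"), "\n";
print "strings equal:    ", ($bytes eq $upgraded      ? "yes" : "no"),  "\n";
print "lengths:          ", length($bytes), " and ", length($upgraded), "\n";
```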

I don't like the wide-storage thing\, (although I admit i think it better than "is_utf8")\, a latin1 string in utf8 does not use wide-storage\,

Of course it can. Unicode code points 0x80 .. 0xFF (the Latin-1 extension of ASCII) take two bytes each when encoded in UTF-8 and therefore are wide in UTF-8 too.
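The byte counts can be checked with a short sketch; `utf8_octets` is a hypothetical helper name, and it relies only on `utf8::encode()`, which converts a character string to its UTF-8 octets in place.

```perl
use strict;
use warnings;

# Hypothetical helper: how many octets does one code point need in UTF-8?
sub utf8_octets {
    my ($code_point) = @_;
    my $s = chr($code_point);
    utf8::encode($s);        # in place: characters -> UTF-8 octets
    return length($s);
}

for my $cp (0x41, 0x7F, 0x80, 0xDF, 0xFF) {
    printf "U+%04X -> %d octet(s)\n", $cp, utf8_octets($cp);
}
```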

and the unicode flag has significance beyond the storage format; utf8-on strings get unicode semantics in case insensitive operations.

cheers\, Yves

p5pRT commented 7 years ago

From @demerphq

On 4 July 2017 at 11:03\, \pali@cpan\.org wrote:

On Tuesday 04 July 2017 01:52:29 yves orton via RT wrote:

On 4 July 2017 at 09:19\, \pali@cpan\.org wrote:

On Tuesday 04 July 2017 10:38:26 Tony Cook wrote:

But it does deprecate the old names\, which is an issue\, I can't imagine us removing these functions.

Warning can be removed from patch. It is just question how you decide. Also functions stay there\, but we can instruct people via documentation to use new functions for a new code... Again it is question if you call it deprecation or aliasing. In any case functions are not going to be deleted\, so in final case it does not matter for old code.

And for old code can be defined this function easily:

*new_name = *old_name;

Reason for this patch series is: * document those utf8:: functions * allow developers to call those functions via non-cryptic names

I dont mind adding new aliases for these functions\, I object to your proposal to put them in Internals however; I think that they should go in 'scalar'\, which we decided at the last PerlQA is the designated place for functions that operate on scalars.

I proposed Internals\, because that flag is internal for perl and invisible for pure perl code. But if more people are happy with scalar namespace\, I'm fine with it.

scalar::is_unicode_string() scalar::is_binary_string()

But this is wrong! SVf_UTF8 does not tell if scalar string is unicode or binary. It just tell type of internal storage.

No. This is a myth. Plain and simply a myth.

People have a hard time accepting it\, but the utf8 flag tells parts of the internals to use different rules for certain operations\, when set those rules are Unicode. When the flag is not set the default rules are derived from ASCII.

You can see the difference in the following:

"ba\x{DF}"=~/ss/i;

"ba\N{U+DF}"=~/ss/i;

The latter matches because \N{U+DF} produces the unicode code point DF, and the former does not match, because \x{DF} produces the ASCII octet DF instead. The former is an ASCII string, and the latter is a Unicode string.
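A sketch of the comparison described here, written as a script so that no feature bundle is enabled (this assumes a perl new enough to support full case folding of U+00DF in regex matches, and that nothing such as `use v5.12` or `use feature 'unicode_strings'` is in effect):

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use v5.12" or later.

my $native  = "ba\x{DF}";     # \x{DF} below 0x100: flag stays off
my $unicode = "ba\N{U+DF}";   # \N{U+...}: string is produced with the flag on

my $native_matches  = $native  =~ /ss/i ? 1 : 0;
my $unicode_matches = $unicode =~ /ss/i ? 1 : 0;

printf "flags:   native=%d unicode=%d\n",
    (utf8::is_utf8($native) ? 1 : 0), (utf8::is_utf8($unicode) ? 1 : 0);
printf "equal:   %s\n", ($native eq $unicode ? "yes" : "no");
printf "matches: native=%d unicode=%d\n", $native_matches, $unicode_matches;
```

The two strings compare equal with `eq`, yet the default (`/d`) matching rules treat them differently, which is the behaviour being argued about in this thread.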

Name is_binary_string is misleading in same way as current name is_utf8.

Erf\, maybe. We need a term for "not-unicode"\, and "binary" is as good as any. I don't mind other proposals.

If you say that binary string is one with codes only in range 0x00-0xFF then you can have that binary string also with SVf_UTF8 flag and your function name "is_binary_string" would return false for your binary string. Such name would lead to another problems.

The SVf_UTF8 flag being off means the string should be treated as ASCII when doing case-insensitive operations\, and as binary for other purposes\, and that the data is encoded as a series of discrete octets. It is not uncommon for people on this list to use the terms unicode and binary for this reason.

I don't like the wide-storage thing\, (although I admit i think it better than "is_utf8")\, a latin1 string in utf8 does not use wide-storage\,

Of course it can. Unicode code points 0x80 .. 0xFF (which are Latin1 extension from ASCII) contains two bytes when encoded in UTF-8 and therefore are wide in UTF-8 too.

I spoke imprecisely\, I should have said ASCII\, not latin-1.

cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 7 years ago

From @pali

On Tuesday 04 July 2017 11:22:42 demerphq wrote:

No. This is a myth. Plain and simply a myth.

People have a hard time accepting it\, but the utf8 flag tells parts of the internals to use different rules for certain operations\, when set those rules are Unicode. When the flag is not set the default rules are derived from ASCII.

You can see the difference in the following:

"ba\x{DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;' matched

"ba\N{U+DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;' matched

The latter matches because \N{U+DF} produces the unicode code point DF\, and the former does not match\, because \x{DF} produces the ASCII octet DF instead. The former is an ASCII string\, and the later is a Unicode string.

No\, both were matched under Perl 5.24.1.

p5pRT commented 7 years ago

From @demerphq

On 4 July 2017 at 12:04\, \pali@cpan\.org wrote:

On Tuesday 04 July 2017 11:22:42 demerphq wrote:

No. This is a myth. Plain and simply a myth.

People have a hard time accepting it\, but the utf8 flag tells parts of the internals to use different rules for certain operations\, when set those rules are Unicode. When the flag is not set the default rules are derived from ASCII.

You can see the difference in the following:

"ba\x{DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;' matched

"ba\N{U+DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;' matched

-E is not -e.

-E is enabling a pragma which changes the default behavior.

However it is *PRAGMA*. It is NOT the normal behavior of Perl.

The latter matches because \N{U+DF} produces the unicode code point DF\, and the former does not match\, because \x{DF} produces the ASCII octet DF instead. The former is an ASCII string\, and the later is a Unicode string.

No\, both were matched under Perl 5.24.1.

No\, they did not. If \x{DF} magically started matching 'ss' it would be a *MASSIVE* regression.

cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 7 years ago

From @pali

On Tuesday 04 July 2017 03:12:19 yves orton via RT wrote:

On 4 July 2017 at 12:04\, \pali@cpan\.org wrote:

On Tuesday 04 July 2017 11:22:42 demerphq wrote:

No. This is a myth. Plain and simply a myth.

People have a hard time accepting it\, but the utf8 flag tells parts of the internals to use different rules for certain operations\, when set those rules are Unicode. When the flag is not set the default rules are derived from ASCII.

You can see the difference in the following:

"ba\x{DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;' matched

"ba\N{U+DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;' matched

-E is not -e.

-E is enabling a pragma which changes the default behavior.

However it is *PRAGMA*. It is NOT the normal behavior of Perl.

Ah\, right. I forgot that -E enables feature unicode_strings which basically means that both examples were equivalent.
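The effect of the feature can be reproduced inside a script by enabling it in a limited scope. This is only a sketch; it assumes a perl where `unicode_strings` also affects regex matching (5.14 or later).

```perl
use strict;
use warnings;

my $s = "ba\x{DF}";    # flag off

# Default rules: a non-upgraded \xDF is not folded to "ss".
my $default = $s =~ /ss/i ? 1 : 0;

my $with_feature;
{
    # Roughly what -E switches on for one-liners, limited to this block.
    use feature 'unicode_strings';
    $with_feature = $s =~ /ss/i ? 1 : 0;
}

print "default rules:   $default\n";
print "unicode_strings: $with_feature\n";
```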

Default behavior is a bit unpredictable, as it is affected by the infamous Unicode Bug.

my $str1 = "\x{DF}"; my $str2 = "\N{U+DF}"; my $str3 = "\x{100}";

"ba$str1" =~ /ss/i; "ba$str2" =~ /ss/i;

"ba$str1$str3" =~ /ss/i;

To make it predictable, either the /aa or the /u modifier should be used... That will prevent problems.

"ba$str1" =~ /ss/aai; "ba$str2" =~ /ss/aai; "ba$str1$str3" =~ /ss/aai;

"ba$str1" =~ /ss/ui; "ba$str2" =~ /ss/ui; "ba$str1$str3" =~ /ss/ui;

p5pRT commented 7 years ago

From @demerphq

On 4 July 2017 at 13:14\, \pali@cpan\.org wrote:

On Tuesday 04 July 2017 03:12:19 yves orton via RT wrote:

On 4 July 2017 at 12:04\, \pali@cpan\.org wrote:

On Tuesday 04 July 2017 11:22:42 demerphq wrote:

No. This is a myth. Plain and simply a myth.

People have a hard time accepting it\, but the utf8 flag tells parts of the internals to use different rules for certain operations\, when set those rules are Unicode. When the flag is not set the default rules are derived from ASCII.

You can see the difference in the following:

"ba\x{DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;' matched

"ba\N{U+DF}"=~/ss/i;

$ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;' matched

-E is not -e.

-E is enabling a pragma which changes the default behavior.

However it is *PRAGMA*. It is NOT the normal behavior of Perl.

Ah\, right. I forgot that -E enables feature unicode_strings which basically means that both examples were equivalent.

Default behavior is a bit unpredicable as it is affected by the infamous Unicode Bug.

It is only unpredictable if your model of strings is broken. I happen to be very familiar with the internals\, and do not find the actual rules to be that difficult to deal with.

cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 7 years ago

From @pali

On Tuesday 04 July 2017 13:32:26 demerphq wrote:

It is only unpredictable if your model of strings is broken.

I do not know what you mean by a broken model of strings, but once you start receiving strings from other modules, user input or any other external source, and you start combining/concatenating those strings, you will hit the Unicode bug. Therefore the safe way is to use the /aa or /u modifier in regex matching, according to how you want the matching done.
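As a sketch of why the explicit modifiers help: the same text is matched once with the flag off and once with it on. Under the default rules the answer depends on the flag, while `/u` and `/aa` are expected to give the same answer for both representations (the hash layout and labels are only illustrative).

```perl
use strict;
use warnings;

my $native   = "ba\x{DF}";   # flag off
my $upgraded = $native;
utf8::upgrade($upgraded);    # same characters, flag on

my %result;
for my $case ([native => $native], [upgraded => $upgraded]) {
    my ($label, $str) = @$case;
    $result{$label} = {
        default => ($str =~ /ss/i   ? 1 : 0),   # depends on the flag
        u       => ($str =~ /ss/ui  ? 1 : 0),   # Unicode rules always
        aa      => ($str =~ /ss/aai ? 1 : 0),   # no ASCII/non-ASCII folds
    };
    printf "%-8s /i:%d /ui:%d /aai:%d\n", $label,
        @{ $result{$label} }{qw(default u aa)};
}
```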

I happen to be very familiar with the internals\, and do not find the actual rules to be that difficult to deal with.

I think this discussion has drifted from the original request, which is for better documentation of utf8.pm and a better name for the utf8::is_utf8() function.

p5pRT commented 7 years ago

From @xsawyerx

On 07/04/2017 07:38 AM\, pali@cpan.org wrote:

On Tuesday 04 July 2017 13:32:26 demerphq wrote:

It is only unpredictable if your model of strings is broken.

I do not know what you mean if model of strings is broken\,

It is "broken" in that sense for probably more people than we would like. Do we have any documentation that clarifies this entire issue? (I know I trip on this frequently and never fully understood this issue myself.)

[...]

I happen to be very familiar with the internals\, and do not find the actual rules to be that difficult to deal with.

I think this discussion is out of original request\, which is for better documentation of utf8.pm and better name for utf8::is_utf8() function.

Agree.

For now we seem to have three points we agree on:

* We want to document these functions
* We want to give them better names
* We want the old behavior to work

As long as the second clause does not break the third\, I think we should seek to move forward.

Yves mentioned that the "Internals" namespace is an undesirable place for it (which was discussed at P5H, the last core hackathon) and I agree. "scalar" was the most popular one, IIRC.

Does anyone have any comments on this? Tony\, Dave\, Zefram? *Karl*? :)

Thanks!

p5pRT commented 7 years ago

From zefram@fysh.org

demerphq wrote:

People have a hard time accepting it\, but the utf8 flag tells parts of the internals to use different rules for certain operations\,

Those are bugs. In some cases they are bugs that we've decided we can't just fix because of backcompat\, so we add a flag to enable non-buggy semantics and the bug lives on as default behaviour.

If a flag to distinguish between character strings and binary strings were an intentional semantic feature\, we'd need some rules to say how the flag is to be set by operations that generate string outputs. We've never done that.

-zefram

p5pRT commented 7 years ago

From zefram@fysh.org

Sawyer X wrote:

Does anyone have any comments on this? Tony\, Dave\, Zefram? *Karl*? :)

I didn't want to add to a mostly bikeshedding discussion\, but OK. I concur that the existing names are poor\, but I'm not much happier with the names that have been suggested on this thread. I reckon the best terminology we have for this flag\, at the user level\, is "upgraded"\, and so the name "is_utf8" would be better as "is_upgraded". The existing names "upgrade" and "downgrade" for the transforming operations are OK\, and the only change I'd potentially like to make to them would be to add something that explicates their rather unusual in-place side-effecting nature.
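To illustrate the in-place, side-effecting nature mentioned here, a short sketch using only the existing `utf8::upgrade()` and `utf8::downgrade()` functions. The second argument to `utf8::downgrade()` is the documented FAIL_OK flag, which makes it return false instead of dying when the string cannot be downgraded.

```perl
use strict;
use warnings;

my $s = "caf\x{E9}";

# Both calls modify $s itself; nothing is returned as a new string.
my $octets = utf8::upgrade($s);
my $flag_after_upgrade = utf8::is_utf8($s) ? 1 : 0;

my $ok = utf8::downgrade($s);
my $flag_after_downgrade = utf8::is_utf8($s) ? 1 : 0;

# A code point above 0xFF cannot be downgraded; with FAIL_OK set the
# call returns false and leaves the string untouched.
my $wide   = "\x{100}";
my $failed = utf8::downgrade($wide, 1) ? 0 : 1;

print "upgrade returned: $octets\n";
print "flag after upgrade: $flag_after_upgrade, after downgrade: $flag_after_downgrade\n";
print "downgrade of wide string failed: $failed\n";
```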

In fact you can see all my preferred names in my CPAN module Scalar::String. This module essentially attempts to be the sane version of utf8.pm\, attempting to impart the right mental model through its function names and documentation. (The "sclstr_" prefix on all the function names may be omitted if desired; the important part of the name is that which distinguishes these functions from each other.)

I think the names for these functions should be reasonably concise\, and in particular we should have a single-word adjective for "having the SvUTF8 flag on" if possible. We should also try to reuse existing terminology\, rather than invent anything new. We should also avoid any term that implies anything beyond the storage\, such as any reference to characters or Unicode\, because such implications are largely inaccurate\, and anywhere they are accurate is a bug. All of this leads me to prefer "upgraded" over "utf8"\, "unicode"\, "uses_wide_storage"\, and the like.

I don't have any strong opinion about which package any new names for these functions should appear in. I think on balance we should not remove the old names\, because the trouble that arises from maintaining them is small compared to the hassle that would arise from requiring existing correct programs to change. Not removing them implies that we wouldn't even be deprecating them\, as currently defined\, but we can fairly discourage the use of the old names in documentation.

-zefram

p5pRT commented 7 years ago

From @khwilliamson

On 07/10/2017 02:13 PM\, Zefram wrote:

Sawyer X wrote:

Does anyone have any comments on this? Tony\, Dave\, Zefram? *Karl*? :)

I didn't want to add to a mostly bikeshedding discussion\, but OK. I concur that the existing names are poor\, but I'm not much happier with the names that have been suggested on this thread. I reckon the best terminology we have for this flag\, at the user level\, is "upgraded"\, and so the name "is_utf8" would be better as "is_upgraded". The existing names "upgrade" and "downgrade" for the transforming operations are OK\, and the only change I'd potentially like to make to them would be to add something that explicates their rather unusual in-place side-effecting nature.

In fact you can see all my preferred names in my CPAN module Scalar::String. This module essentially attempts to be the sane version of utf8.pm\, attempting to impart the right mental model through its function names and documentation. (The "sclstr_" prefix on all the function names may be omitted if desired; the important part of the name is that which distinguishes these functions from each other.)

I think the names for these functions should be reasonably concise\, and in particular we should have a single-word adjective for "having the SvUTF8 flag on" if possible. We should also try to reuse existing terminology\, rather than invent anything new. We should also avoid any term that implies anything beyond the storage\, such as any reference to characters or Unicode\, because such implications are largely inaccurate\, and anywhere they are accurate is a bug. All of this leads me to prefer "upgraded" over "utf8"\, "unicode"\, "uses_wide_storage"\, and the like.

I don't have any strong opinion about which package any new names for these functions should appear in. I think on balance we should not remove the old names\, because the trouble that arises from maintaining them is small compared to the hassle that would arise from requiring existing correct programs to change. Not removing them implies that we wouldn't even be deprecating them\, as currently defined\, but we can fairly discourage the use of the old names in documentation.

-zefram

My view is that the current names could be improved\, and that there should be no technical nor social problem in creating new names while retaining the old ones\, but changing the docs to stress the new ones. I've done that a lot.

I don't know what namespace is best. At first blush Internals seems good to me\, for this and other things that people currently have hacks for\, like

$foo & ""

which tries to find out if $foo is a string or just a number. I don't fully understand the objection to 'Internals'.
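For readers who have not met that hack, a rough sketch follows. The helper name `stored_as_number` is hypothetical, and the sketch assumes the `bitwise` feature is not enabled (so no `use v5.28` or later), since that feature makes `&` always numeric.

```perl
use strict;
use warnings;

# Hypothetical helper built on the "$foo & ''" trick: the bitwise "&"
# works on strings only if neither operand carries a numeric value.
sub stored_as_number {
    my ($value) = @_;
    no warnings 'numeric';    # "" is not numeric; silence that warning
    my $empty = "";
    # string & "" gives "" (length 0); number & "" gives 0 (length 1)
    return length($value & $empty) ? 1 : 0;
}

print "5   -> ", stored_as_number(5),   "\n";
print "'5' -> ", stored_as_number("5"), "\n";
```

The result reflects internal flags, not the value: a string that has been used in numeric context may start to look like a number to this test, which is why it is a hack.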

I have never liked upgrade and downgrade. When you upgrade something you are supposed to get something better\, like more legroom. I have never seen why a PV is better than a number\, or a UTF-8 string better than a non-one (it's far slower\, for example\, which is a downgrade in my estimation). The use of upgrade and downgrade is jargon based on the attitudes of the implementers\, which should be avoided. Maybe it's too baked in to change\, but I regret that it's there. UTF-8 itself is an implementation detail that should never have been exposed to the outside\, but 'use utf8' pretty much does that.

p5pRT commented 7 years ago

From @cpansprout

On Mon\, 10 Jul 2017 19:53:42 -0700\, public@khwilliamson.com wrote:

I don't know what namespace is best. At first blush Internals seems good to me\, for this and other things that people currently have hacks for\, like
$foo & ""
which trying to find out if $foo is a string or just a number. I don't fully understand the objection to 'Internals'

Adding new public functions to the Internals namespace would completely change its meaning. It contains functions that exist mainly for perl’s own functionality (for built-in modules like Hash::Util to use) and for testing perl itself. Users are not supposed to know about them. That the cat is out of the bag and we cannot remove them is unfortunate.

Since we already use ‘utf8’ to refer to Perl’s Unicode support\, why not continue to use that namespace?

--

Father Chrysostomos

p5pRT commented 7 years ago

From @cpansprout

On Mon\, 10 Jul 2017 19:53:42 -0700\, public@khwilliamson.com wrote:

I have never liked upgrade and downgrade. When you upgrade something you are supposed to get something better\, like more legroom.

Well\, er\, that is exactly what you get. You can stretch your legs beyond CLV.*

I have never seen why a PV is better than a number\, or a UTF-8 string better than a non-one (it's far slower\, for example\,

I think that is one of the best arguments in favour of ‘upgrade’. It is just like upgrading most commercial software!

which is a downgrade in my estimation). The use of upgrade and downgrade is jargon based on the attitudes of the implementers\, which should be avoided. Maybe it's too baked in to change\, but I regret that it's there. UTF-8 itself is an implementation detail that should never have been exposed to the outside\, but 'use utf8' pretty much does that.

* That is a Roman numeral.

--

Father Chrysostomos

p5pRT commented 7 years ago

From @iabyn

On Mon\, Jul 10\, 2017 at 12:45:48PM -0400\, Sawyer X wrote:

Does anyone have any comments on this? Tony\, Dave\, Zefram? *Karl*? :)

My opinion on this sort of proposal (and it's an opinion which has gotten stronger over time (*)) is rarely/never to add a new alias name to an existing function.

Alias names just increase the cognitive load. If the old names were confusing\, having more names will just increase the confusion.

Before\, you would have to remember that a particular function foo() is badly named and doesn't do what you might expect it to do\, based solely on the name.

Afterwards, you have to remember that there are two functions foo() and bar(), one is deprecated (which one?), one is badly named (which one?), but they both do the same thing (Or do they? Sigh. Let's check the documentation one more time).

Life is now harder.

(*) My opinion firmed over AvFILL(). It was a weird name\, but I was used to it. Now I can never remember what the new alias is called (just looked it up - av_top_index()). In hindsight\, I would have voted against adding av_top_index.

-- All wight. I will give you one more chance. This time\, I want to hear no Wubens. No Weginalds. No Wudolf the wed-nosed weindeers. -- Life of Brian

p5pRT commented 7 years ago

From @cpansprout

On Tue\, 11 Jul 2017 00:55:51 -0700\, davem wrote:

On Mon\, Jul 10\, 2017 at 12:45:48PM -0400\, Sawyer X wrote:

Does anyone have any comments on this? Tony\, Dave\, Zefram? *Karl*? :)

My opinion on this sort of proposal (and it's an opinion which has gotten stronger over time (*)) is rarely/never to add a new alias name to an existing function.

Alias names just increase the cognitive load. If the old names were confusing\, having more names will just increase the confusion.

Before\, you would have to remember that a particular function foo() is badly named and doesn't do what you might expect it to do\, based solely on the name.

Afterwards\, you have to remember that that are two functions foo() and bar()\, one is deprecated (which one?)\, one is badly named (which one?)\, but they both do the same thing (Or do they? Sigh. Let's check the documentation one more time).

Life is now harder.

(*) My opinion firmed over AvFILL(). It was a weird name\, but I was used to it. Now I can never remember what the new alias is called (just looked it up - av_top_index()). In hindsight\, I would have voted against adding av_top_index.

I agree with everything you have said. I brought up the same objection when this proposal was first put forward\, but I thought I had lost the debate. Well\, at least there are two of us now. :-)

--

Father Chrysostomos

p5pRT commented 7 years ago

From @tonycoz

On Mon\, 10 Jul 2017 09:46:48 -0700\, xsawyerx@gmail.com wrote:

Does anyone have any comments on this? Tony\, Dave\, Zefram? *Karl*? :)

I haven't seen names I prefer over the current names\, certainly none that are improved enough that it's worth having two names for the same thing.

Tony

p5pRT commented 7 years ago

From @tux

On Tue\, 11 Jul 2017 10:41:37 -0700\, "Father Chrysostomos via RT" \perlbug\-followup@perl\.org wrote:

On Tue\, 11 Jul 2017 00:55:51 -0700\, davem wrote:

On Mon\, Jul 10\, 2017 at 12:45:48PM -0400\, Sawyer X wrote:

Does anyone have any comments on this? Tony\, Dave\, Zefram? *Karl*? :)

My opinion on this sort of proposal (and it's an opinion which has gotten stronger over time (*)) is rarely/never to add a new alias name to an existing function.

Alias names just increase the cognitive load. If the old names were confusing\, having more names will just increase the confusion.

Before\, you would have to remember that a particular function foo() is badly named and doesn't do what you might expect it to do\, based solely on the name.

Afterwards\, you have to remember that that are two functions foo() and bar()\, one is deprecated (which one?)\, one is badly named (which one?)\, but they both do the same thing (Or do they? Sigh. Let's check the documentation one more time).

Life is now harder.

(*) My opinion firmed over AvFILL(). It was a weird name\, but I was used to it. Now I can never remember what the new alias is called (just looked it up - av_top_index()). In hindsight\, I would have voted against adding av_top_index.

I agree with everything you have said. I brought up the same objection when this proposal was first put forward\, but I thought I had lost the debate. Well\, at least there are two of us now. :-)

Count me in: three. I like the way Dave has written down my feelings :)

-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.27 porting perl5 on HP-UX\, AIX\, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented 7 years ago

From @khwilliamson

On 07/12/2017 12:36 AM\, H.Merijn Brand wrote:

On Tue\, 11 Jul 2017 10:41:37 -0700\, "Father Chrysostomos via RT" \perlbug\-followup@perl\.org wrote:

On Tue\, 11 Jul 2017 00:55:51 -0700\, davem wrote:

On Mon\, Jul 10\, 2017 at 12:45:48PM -0400\, Sawyer X wrote:

Does anyone have any comments on this? Tony\, Dave\, Zefram? *Karl*? :)

My opinion on this sort of proposal (and it's an opinion which has gotten stronger over time (*)) is rarely/never to add a new alias name to an existing function.

Alias names just increase the cognitive load. If the old names were confusing\, having more names will just increase the confusion.

Before\, you would have to remember that a particular function foo() is badly named and doesn't do what you might expect it to do\, based solely on the name.

Afterwards\, you have to remember that that are two functions foo() and bar()\, one is deprecated (which one?)\, one is badly named (which one?)\, but they both do the same thing (Or do they? Sigh. Let's check the documentation one more time).

Life is now harder.

(*) My opinion firmed over AvFILL(). It was a weird name\, but I was used to it. Now I can never remember what the new alias is called (just looked it up - av_top_index()). In hindsight\, I would have voted against adding av_top_index.

I agree with everything you have said. I brought up the same objection when this proposal was first put forward\, but I thought I had lost the debate. Well\, at least there are two of us now. :-)

Count me in: three. I like the way Dave has written down my feelings :)

I guess we have a fundamental disagreement about language design and the direction Perl should go\, which makes me sad.

The point of adding synonyms for deceptively-named functions and macros is to make life easier overall. Forbidding new better-named synonyms for problematically named things forces everyone who comes along to deal with the gotchas and cognitive load that those people already here have had to deal with. By creating better named things\, those people can largely avoid these problems. This allows them to work more efficiently\, avoiding traps\, and with less cursing Perl.

Unless Perl is close to death\, the number of people who are going to come along before it does die dwarfs the number who are already expert. Some people are knowledgeable in parts of Perl\, but not all. They also gain if gotchas get removed before they have to deal with them.

Specifically about av_top_index\, I don't believe that it is so poorly named that you have to keep consulting the documentation as to what it does.

It came about not because of AvFILL\, but because of the already-existing synonym\, the evilly named "av_len". This name implies it gives a length\, but in fact it is one-off from that. av_top_index\, though cumbersome\, accurately indicates what it returns.

Using av_len is a bug waiting to happen. It is a foreseeable problem. I believe that it would be unethical to not create a non-deceptive alternative. It's kind of like a safety recall.

Writing code using deceptively named things or with poor API's is slower and more error prone. Every time you use one\, you have to get out of your mental pipeline and recall that this is a gotcha\, and have to figure out how it is a gotcha and how you have to compensate. You are effectively flushing your mental instruction cache. In the case of av_len\, you have to remember which way is the off-by-one problem here.
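The off-by-one in question has a direct Perl-level analogue, shown here only as an illustration (av_len itself is a C-level function and is not called below): `$#array` is the highest index, which is what av_len and av_top_index report, while the element count is one more than that.

```perl
use strict;
use warnings;

my @items = qw(a b c);
my @empty;

my $top_index = $#items;          # highest index, the av_top_index view
my $count     = scalar @items;    # number of elements, what "len" suggests

printf "top index: %d, count: %d\n", $top_index, $count;
printf "empty array: top index %d, count %d\n", $#empty, scalar @empty;
```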

Code reviews also are affected. It is just too easy to read the thing and forget that it doesn't do what you would want.

In researching the issue back when av_top_index was created\, I found published modules that used av_len\, as its name implies\, as a length. Others undoubtedly had caught the problem earlier\, say through their unit testing.

But all this could be avoided by the code using a non-deceptive name. Hopefully\, the coder won't even be aware that there exist deceptive ones for hysterical reasons.

It is foreseeable that av_len is going to cause problems. It would be irresponsible of us to not create a non-deceptive synonym when it is so easy to do.

No one was really happy with "av_top_index" as a name. So AvFILL was retained in the core. All occurrences of av_len were removed. If we could have come up with a short\, pithy synonym\, we would have replaced AvFILL as well\, and then people looking at the core would have seen that and gotten used to it\, and over time the memory of the less well-named versions would have faded.

Writing good APIs is hard. I have flattered myself at times into thinking I'm good at it. Maybe I am actually good\, but if so\, I'm still not good enough. And few\, if any\, are. If we have a poor API in some area\, we should not tie our hands and say tough to all those people who come along later\, and give them more reason to use some other language

p5pRT commented 7 years ago

From @tux

On Wed\, 12 Jul 2017 22:53:57 -0600\, Karl Williamson \public@khwilliamson\.com wrote:

It came about not because of AvFILL\, but because of the already-existing synonym\, the evilly named "av_len". This name implies it gives a length\, but in fact it is one-off from that. av_top_index\, though cumbersome\, accurately indicates what it returns.

The problem with av_top_index is that it has not (yet) been ported to Devel::PPPort, so I cannot change any XS code that uses av_len into using the new function if that XS is to support 5.16.0 or older.

$ ack av_top_index ppport.h 1225:av_top_index||5.017009| $

I know I didn't quote all of your message and I understand your motivation\, but the problem for these misnamed functions is much wider than the scope of av_top_index\, which is *only* available to XS\, and XS is more or less easy to fix by adding stuff to Devel::PPPort

For the utf8 functions\, the scope is WAY wider: it is used from pure-perl\, and renaming them (with or without aliases) would cause major brain damage for all authors that use these functions (correct or incorrect) when their code has to work on a wide range of perl versions.

To be honest\, I do not see an easy way out of that dilemma. If you have one\, I'm open to change for the better.

-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.27 porting perl5 on HP-UX\, AIX\, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented 7 years ago

From @pali

On Wednesday 12 July 2017 23:44:39 H. Merijn Brand via RT wrote:

On Wed\, 12 Jul 2017 22:53:57 -0600\, Karl Williamson \public@khwilliamson\.com wrote:

It came about not because of AvFILL\, but because of the already-existing synonym\, the evilly named "av_len". This name implies it gives a length\, but in fact it is one-off from that. av_top_index\, though cumbersome\, accurately indicates what it returns.

The problem with av_top_index is that it has not (yet) been ported to Devel::PPPort\,

Devel::PPPort is probably unmaintained... It has had a couple of open bugs since 2015 without any comments\, and pull requests have not been processed since 2016\, not even security-related ones like this: https://github.com/mhx/Devel-PPPort/pull/47

Because of those problems\, I have no motivation to prepare any other patch for Devel::PPPort. For dead/unmaintained modules it is useless.

so I cannot change any XS code that uses av_len into using the new function if that XS is to support 5.16.0 or older

$ ack av_top_index ppport.h 1225:av_top_index||5.017009| $

I know I didn't quote all of your message and I understand your motivation\, but the problem for these misnamed functions is much wider than the scope of av_top_index\, which is *only* available to XS\, and XS is more or less easy to fix by adding stuff to Devel::PPPort

For the utf8 functions\, the scope is WAY wider: it is used from pure-perl\, and renaming them (with or without aliases) would cause major brain damage for all authors that use these functions (correct or incorrect) when their code has to work on a wide range of perl versions.

To be honest\, I do not see an easy way out of that dilemma. If you have one\, I'm open to change for the better.

The problem is that people very often use the construct which I wrote in the first comment. Or they read "is_utf8" as meaning that the string is UTF-8 encoded and conclude that they need to call utf8::decode() on it.

And all this happens just because of a wrong name\, from which many people deduce what the function should do -- without reading the documentation at all...

If we do not add better aliases\, broken code will continue to be produced on CPAN.

As utf8::is_utf8() is not needed too often\, backward compatibility can be achieved by:

*NEW_NAME = *utf8::is_utf8;

I think this is a good compromise. If you think that the upgrade and downgrade function names are fine\, OK\, but at least please add a better name for is_utf8(). In the original email I suggested is_upgraded()\, so the name would be tied to the "upgrade()" function\, because it really checks whether upgrade() was called or not.
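The aliasing idea can be sketched in pure Perl. This is only an illustration under assumptions: the package name My::Compat is hypothetical, and is_upgraded is merely the name floated in this thread, not an existing core function. The sketch also shows that the flag reports internal storage only; the string's characters and length are the same before and after utf8::upgrade().

```perl
use strict;
use warnings;

# Hypothetical alias (name taken from this thread, not part of core).
# The utf8:: functions are always defined, even without "use utf8".
BEGIN { *My::Compat::is_upgraded = \&utf8::is_utf8; }

my $s = "caf\xe9";                                  # four characters
my $before = My::Compat::is_upgraded($s) ? 1 : 0;   # storage is downgraded

utf8::upgrade($s);                                  # changes storage only
my $after = My::Compat::is_upgraded($s) ? 1 : 0;    # storage is now upgraded

# The string itself is unchanged from Perl's point of view.
my $unchanged = ($s eq "caf\xe9" && length($s) == 4) ? 1 : 0;

print "before=$before after=$after unchanged=$unchanged\n";
```

Code that must run on older perls could install such an alias itself, which is the backward-compatibility route described above.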

p5pRT commented 7 years ago

From @cpansprout

On Wed\, 12 Jul 2017 21:55:03 -0700\, public@khwilliamson.com wrote:

I guess we have a fundamental disagreement about language design and the direction Perl should go\, which makes me sad.

I agree the disagreement is unfortunate.

The point of adding synonyms for deceptively-named functions and macros is to make life easier overall. Forbidding new better-named synonyms for problematically named things forces everyone who comes along to deal with the gotchas and cognitive load that those people already here have had to deal with. By creating better named things\, those people can largely avoid these problems. This allows them to work more efficiently\, avoiding traps\, and with less cursing Perl.

When you first put forward this argument (specifically with regard to av_len)\, it made sense to me\, and I had no objection to it. Later\, people wrote to p5p complaining that the new situation was more confusing; in addition\, *I* started to get confused. That was when I started to have second thoughts.

I think Damian Conway was right when he wrote in PBP that one should not use English (the module). Since other people use punctuation variables\, you are going to have to learn them anyway. Using the English names just forces others reading your code to look up the names that you are using. It just creates more cognitive burden.

I think the same applies even to poorly named functions. You just have to learn the gotcha once\, and then you can use the function and read code that uses it. (And if you use functions without reading either the documentation or the source\, then you are coming close to what I would call autopodotoxy.)

Unless Perl is close to death\, the number of people who are going to come along before it does die dwarfs the number who are already expert. Some people are knowledgeable in parts of Perl\, but not all. They also gain if gotchas get removed before they have to deal with them.

But the gotchas never get removed. You just end up with a larger pile of functions for people to sift through. Not to mention a lot of existing (and correct) code that they cannot read without learning the discouraged parlance. So they have to learn the different forms anyway.

My personal experience is that what you are arguing for\, while it sounds good\, does not work in practice.

--

Father Chrysostomos

p5pRT commented 7 years ago

From @demerphq

On 14 July 2017 at 04:28\, Father Chrysostomos via RT \perlbug\-followup@perl\.org wrote:

On Wed\, 12 Jul 2017 21:55:03 -0700\, public@khwilliamson.com wrote:

I guess we have a fundamental disagreement about language design and the direction Perl should go\, which makes me sad.

I agree the disagreement is unfortunate.

The point of adding synonyms for deceptively-named functions and macros is to make life easier overall. Forbidding new better-named synonyms for problematically named things forces everyone who comes along to deal with the gotchas and cognitive load that those people already here have had to deal with. By creating better named things\, those people can largely avoid these problems. This allows them to work more efficiently\, avoiding traps\, and with less cursing Perl.

When you first put forward this argument (specifically with regard to av_len)\, it made sense to me\, and I had no objection to it. Later\, people wrote to p5p complaining that the new situation was more confusing; in addition\, *I* started to get confused. That was when I started to have second thoughts.

I think Damian Conway was right when he wrote in PBP that one should not use English (the module). Since other people use punctuation variables\, you are going to have to learn them anyway. Using the English names just forces others reading your code to look up the names that you are using. It just creates more cognitive burden.

I think the same applies even to poorly named functions. You just have to learn the gotcha once\, and then you can use the function and read code that uses it. (And if you use functions without reading either the documentation or the source\, then you are coming close to what I would call autopodotoxy.)

Unless Perl is close to death\, the number of people who are going to come along before it does die dwarfs the number who are already expert. Some people are knowledgeable in parts of Perl\, but not all. They also gain if gotchas get removed before they have to deal with them.

But the gotchas never get removed. You just end up with a larger pile of functions for people to sift through. Not to mention a lot of existing (and correct) code that they cannot read without learning the discouraged parlance. So they have to learn the different forms anyway.

My personal experience is that what you are arguing for\, while it sounds good\, does not work in practice.

I think the reason that it sounds good is because it does make sense at a micro level. If you are working on company code for instance\, or a small code-base\, renaming poorly named things means that the old name is *gone*\, and cognitive burden is reduced.

But with something like Perl we can't just get rid of things\, if we want to rename we have to do something for all the older code out there. So we have to support both in some ways. Which means the cognitive burden is increased.

Despite this I think sometimes these things *can* be justified and managed\, but we have to be extremely careful about the choices we make\, and have real plans in place to deprecate the older use cases in some kind of way. So for instance if we were going to get rid of Internals then we can rename things it contained\, and then bundle an Internals.pm which does the right thing\, people needing back compat can add 'use Internals' and get the back-compat. So I could see us considering the ideas in this thread in the context of the proposed introduction of 'array'\, 'scalar'\, etc.

yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 7 years ago

From @xsawyerx

[Top-posted]

I have mixed thoughts about this.

I'm sympathetic to both considerations: Having properly-named functions to reduce confusion for future developers (we hope to have some\, right?) but not introduce additional cognitive load for existing developers.

A few ways to make such a situation easier:

* Document utf8::is_utf8() to prevent this confusion: This is by far the first thing that should be done. I have double checked the wording for utf8::is_utf8() from my blead (978b185):

(Since Perl 5.8.1) Test whether $string is marked internally as encoded in UTF-8. Functionally the same as "Encode::is_utf8()".

This is confusing\, to say the least. "Marked internally" are the words core hackers are looking for and recognize\, but "UTF-8" is what non-core hackers (those without the cognitive bias in core terms) see and understand. If we head over to Encode::is_utf8() we see:

[INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/. If /CHECK/ is true\, also checks whether /STRING/ contains well-formed UTF-8. Returns true if successful\, false otherwise.

As of Perl 5.8.1\, utf8 <https://metacpan.org/pod/utf8> also has the |utf8::is_utf8| function.

I like this wording better for several reasons: It is under the title "Messing with Perl's Internals"; it notes the "UTF8" flag\, and it adds that it checks for well-formed UTF-8 only if that flag is true. There are improvements to be made here too. We can note what the flag means (subtle\, complicated\, bike-shed-able) or at the very least add a nice "this isn't the flag you're looking for" warning. We can also suggest when to use and when not to use the function (otherwise it's left to the reader\, who can easily get it wrong\, which is why we're here).

If the documentation for both were better\, then we could have possibly left this as unfortunate naming errors we're carrying with us (along with "wantarray" for noting whether the context is scalar\, list\, or void).

* Provide different functions and document all functions in all other functions

If we decide to have better named functions\, we will have additional cognitive load for both experienced core developers and new developers. For core developers\, it is a muscle memory to undo and two different sets of code to deal with - those with the old name and those with the new name. For new developers\, it will be simple at first\, until you come in contact with the old name. It is likely this will also happen early\, so you need to learn two names anyway. However\, their muscle memory will be geared towards using a more descriptive name.

I mix those when it comes to English.pm. I use $_\, $@\, $!\, $#\, $/\, $^X\, $0 and a few more\, but I use English.pm for $\<\, $>\, $(\, $)\, $"\, and a few more. The reasoning is simple: $_\, $@\, $!\, and $# are so common it will be built into every muscle memory. On the other hand\, for many developers\, if they see $\<\, they will need to look it up in perlvar anyway. However\, $UID or $REAL_USER_ID is readable right away and no need to look it up.

One additional point about English is that\, unlike what we're suggesting here\, the punctuation variable names are the right names; they're just not descriptive. The problem with is_utf8() is not that it is non-descriptive but that it is misleading. It is a misnomer\, which makes it an undesired pitfall.

I see value in adding proper names\, but then we would need to take care of at least making all possible names available in the documentation of all other names. If you're reading utf8.pm\, you need to find "is_upgraded" in "is_utf8" and "is_utf8" in "is_upgraded"[1]. This makes it easy to quickly find what they mean and differentiate when we see different names.

* Move all known usages in core to new functions

Another way to improve this new cognitive load is by reducing it in the codebase. Removing as many instances of the old name will reduce the mixture of names\, thus helping us move towards the new name. This is a much more intrusive change but has a high potential of helping seasoned developers to deal with the new name.

* Automated policies for improving CPAN code quality

This is beyond the scope of core\, but I think it's worthwhile taking into account the perspective of the community. Realizing the misused "is_utf8" brings with it a question of whether and how we could reduce this problem's scope outside the core\, and this could have been done with a kwalitee check (CPANTS[2]) that checked for "is_utf8" and recommends reviewing its use. This is far more complicated since there is a legitimate (but narrow) use for it\, and you might get false positives. I believe only a human could find the situations in which it's valuable.

Still\, it is worthwhile keeping in mind.

Overall\, I'm still undecided. Maybe we could start with improving the existing documentation?

[1] Using "is_upgraded" as an example different name. [2] http://cpants.cpanauthors.org/

On 07/13/2017 06:53 AM\, Karl Williamson wrote:

On 07/12/2017 12:36 AM\, H.Merijn Brand wrote:

On Tue\, 11 Jul 2017 10:41:37 -0700\, "Father Chrysostomos via RT" \perlbug\-followup@perl\.org wrote:

On Tue\, 11 Jul 2017 00:55:51 -0700\, davem wrote:

On Mon\, Jul 10\, 2017 at 12:45:48PM -0400\, Sawyer X wrote:

Does anyone have any comments on this? Tony\, Dave\, Zefram? *Karl*? :)

My opinion on this sort of proposal (and it's an opinion which has gotten stronger over time (*)) is rarely/never to add a new alias name to an existing function.

Alias names just increase the cognitive load. If the old names were confusing\, having more names will just increase the confusion.

Before\, you would have to remember that a particular function foo() is badly named and doesn't do what you might expect it to do\, based solely on the name.

Afterwards\, you have to remember that there are two functions foo() and bar()\, one is deprecated (which one?)\, one is badly named (which one?)\, but they both do the same thing (Or do they? Sigh. Let's check the documentation one more time).

Life is now harder.

(*) My opinion firmed over AvFILL(). It was a weird name\, but I was used to it. Now I can never remember what the new alias is called (just looked it up - av_top_index()). In hindsight\, I would have voted against adding av_top_index.

I agree with everything you have said. I brought up the same objection when this proposal was first put forward\, but I thought I had lost the debate. Well\, at least there are two of us now. :-)

Count me in: three. I like the way Dave has written down my feelings :)

I guess we have a fundamental disagreement about language design and the direction Perl should go\, which makes me sad.

The point of adding synonyms for deceptively-named functions and macros is to make life easier overall. Forbidding new better-named synonyms for problematically named things forces everyone who comes along to deal with the gotchas and cognitive load that those people already here have had to deal with. By creating better-named things\, those people can largely avoid these problems. This allows them to work more efficiently\, avoiding traps\, and with less cursing Perl.

Unless Perl is close to death\, the number of people who are going to come along before it does die dwarfs the number who are already expert. Some people are knowledgeable in parts of Perl\, but not all. They also gain if gotchas get removed before they have to deal with them.

Specifically about av_top_index\, I don't believe that it is so poorly named that you have to keep consulting the documentation as to what it does.

It came about not because of AvFILL\, but because of the already-existing synonym\, the evilly named "av_len". This name implies it gives a length\, but in fact it is one-off from that. av_top_index\, though cumbersome\, accurately indicates what it returns.

Using av_len is a bug waiting to happen. It is a foreseeable problem. I believe that it would be unethical to not create a non-deceptive alternative. It's kind of like a safety recall.

Writing code using deceptively named things or with poor APIs is slower and more error prone. Every time you use one\, you have to get out of your mental pipeline and recall that this is a gotcha\, and have to figure out how it is a gotcha and how you have to compensate. You are effectively flushing your mental instruction cache. In the case of av_len\, you have to remember which way is the off-by-one problem here.

Code reviews also are affected. It is just too easy to read the thing and forget that it doesn't do what you would want.

In researching the issue back when av_top_index was created\, I found published modules that used av_len\, as its name implies\, as a length. Others undoubtedly had caught the problem earlier\, say through their unit testing.

But all this could be avoided by the code using a non-deceptive name. Hopefully\, the coder won't even be aware that there exist deceptive ones for hysterical reasons.

It is foreseeable that av_len is going to cause problems. It would be irresponsible of us to not create a non-deceptive synonym when it is so easy to do.

No one was really happy with "av_top_index" as a name. So AvFILL was retained in the core. All occurrences of av_len were removed. If we could have come up with a short\, pithy synonym\, we would have replaced AvFILL as well\, and then people looking at the core would have seen that and gotten used to it\, and over time the memory of the less well-named versions would have faded.

Writing good APIs is hard. I have flattered myself at times into thinking I'm good at it. Maybe I am actually good\, but if so\, I'm still not good enough. And few\, if any\, are. If we have a poor API in some area\, we should not tie our hands and say tough to all those people who come along later\, and give them more reason to use some other language

p5pRT commented 7 years ago

From @hvds

On Mon\, 17 Jul 2017 01:47:32 -0700\, xsawyerx@gmail.com wrote:

I have mixed thoughts about this.

Me too.

If we decide to have better named functions [...] For new developers\, it will be simple at first\, until you come in contact with the old name. It is likely this will also happen early\, so you need to learn two names anyway. [...]

I just want to reach a bit deeper into "likely this will also happen early" - in my limited experience\, most people working in a perl shop tend to read lots of code in their local codebase\, but rarely read code outside of it (not even for the CPAN modules they're using). So wherever the local codebase gets cleaned up\, it may not happen that early for a good proportion of new developers.

Maybe you had in mind primarily historical threads googled up from perlmonks\, stackoverflow and the like; I agree those are less likely to get cleaned up. I cannot guess what proportion of newish developers would hit those; I can't even guess what proportion of me would hit those had I been born 30 years later.

Hugo

p5pRT commented 7 years ago

From @tonycoz

On Mon\, Jul 17\, 2017 at 10:46:59AM +0200\, Sawyer X wrote:

[Top-posted]

I have mixed thoughts about this.

I'm sympathetic to both considerations: Having properly-named functions to reduce confusion for future developers (we hope to have some\, right?) but not introduce additional cognitive load for existing developers.

A few ways to make such a situation easier:

* Document utf8::is_utf8() to prevent this confusion: This is by far the first thing that should be done. I have double checked the wording for utf8::is_utf8() from my blead (978b185):

(Since Perl 5.8.1) Test whether $string is marked internally as encoded in UTF-8. Functionally the same as "Encode::is_utf8".

This is confusing\, to say the least. "Marked internally" are the words core hackers are looking for and recognize\, but "UTF-8" is what non-core hackers (those without the cognitive bias in core terms) see and understand. If we head over to Encode::is_utf8() we see:

[INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/. If /CHECK/ is true\, also checks whether /STRING/ contains well-formed UTF-8. Returns true if successful\, false otherwise.

As of Perl 5.8.1\, utf8 <https://metacpan.org/pod/utf8> also has the |utf8::is_utf8| function.

I like this wording better for several reasons: It is under the title "Messing with Perl's Internals"; it notes the "UTF8" flag\, and it adds that it checks for well-formed UTF-8 only if that flag is true. There are improvements to be made here too. We can note what the flag means (subtle\, complicated\, bike-shed-able) or at the very least add a nice "this isn't the flag you're looking for" warning. We can also suggest when to use and when not to use the function (otherwise it's left to the reader\, who can easily get it wrong\, which is why we're here).

utf8::is_utf8() doesn't accept the second parameter and does no validity checks (we have utf8::valid() for that)\, despite the note in utf8.pm.

If the document on both was better\, then we could have possibly left this as unfortunate naming errors we're carrying with us (along with "wantarray" for noting whether the context is scalar\, list\, or void). ... Overall\, I'm still undecided. Maybe we could start with improving the existing documentation?

Perhaps something like:

=item * C\<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1) Test whether I\<$string> is marked internally as encoded in UTF-8. Functionally the same as C\<Encode::is_utf8($string)>. Typically only necessary for debugging.

If you need to force Unicode semantics for code that needs to be compatible with perls older than 5.12\, call C\<utf8::upgrade($string)> unconditionally.

Using this flag to decide whether a string should be treated as already encoded bytes or characters is wrong\, this should be decided as part of the interface of your function.

If you're accepting bytes:

    utf8::downgrade($string); # throws an exception if code point over 0xFF

    utf8::downgrade($string, 1) # our own error handling
        or die "\$string must be representable as bytes";

or if you're accepting characters and need encoded bytes:

utf8::encode($string); # unconditionally

The only exception is if you're dealing with filenames\, since perl uses the internal representation of the string for system calls.

\<\<

Are there any other cases someone might be tempted to call utf8::is_utf8()?
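As a rough sketch of "decide it as part of the interface", the following shows the two cases side by side. The helper names chars_to_utf8_bytes and require_bytes are hypothetical and exist only for this illustration; only the utf8::* calls come from the proposed text above.

```perl
use strict;
use warnings;

# Hypothetical helper: the interface says "the argument is characters".
sub chars_to_utf8_bytes {
    my ($chars) = @_;
    my $bytes = $chars;        # copy, so the caller's string is untouched
    utf8::encode($bytes);      # unconditional, whatever the internal storage
    return $bytes;
}

# Hypothetical helper: the interface says "the argument is bytes".
sub require_bytes {
    my ($bytes) = @_;
    utf8::downgrade($bytes, 1) # FAIL_OK: returns false instead of dying
        or die "argument must be representable as bytes\n";
    return $bytes;
}

my $plain    = "caf\xe9";      # stored downgraded
my $upgraded = "caf\xe9";
utf8::upgrade($upgraded);      # same characters, different internal storage

# Both storage forms encode to the same UTF-8 bytes.
my $same = chars_to_utf8_bytes($plain) eq chars_to_utf8_bytes($upgraded) ? 1 : 0;

# A code point above 0xFF cannot be treated as bytes.
my $rejected = eval { require_bytes("\x{263A}"); 1 } ? 0 : 1;

print "same=$same rejected=$rejected\n";
```

Neither helper consults utf8::is_utf8(); the caller's contract alone determines whether the data is bytes or characters.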

Tony

p5pRT commented 7 years ago

From @tux

On Tue\, 18 Jul 2017 10:53:53 +1000\, Tony Cook \tony@develop\-help\.com wrote:

On Mon\, Jul 17\, 2017 at 10:46:59AM +0200\, Sawyer X wrote:
[Top-posted]

I have mixed thoughts about this.

I'm sympathetic to both considerations: Having properly-named functions to reduce confusion for future developers (we hope to have some\, right?) but not introduce additional cognitive load for existing developers.

A few ways to make such a situation easier:

* Document utf8::is_utf8() to prevent this confusion: This is by far the first thing that should be done. I have double checked the wording for utf8::is_utf8() from my blead (978b185):

(Since Perl 5.8.1) Test whether $string is marked internally as encoded in UTF-8. Functionally the same as "Encode::is_utf8".

This is confusing\, to say the least. "Marked internally" are the words core hackers are looking for and recognize\, but "UTF-8" is what non-core hackers (those without the cognitive bias in core terms) see and understand. If we head over to Encode::is_utf8() we see:

[INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/. If /CHECK/ is true\, also checks whether /STRING/ contains well-formed UTF-8. Returns true if successful\, false otherwise.

As of Perl 5.8.1\, utf8 <https://metacpan.org/pod/utf8> also has the |utf8::is_utf8| function.

I like this wording better for several reasons: It is under the title "Messing with Perl's Internals"; it notes the "UTF8" flag\, and it adds that it checks for well-formed UTF-8 only if that flag is true. There are improvements to be made here too. We can note what the flag means (subtle\, complicated\, bike-shed-able) or at the very least add a nice "this isn't the flag you're looking for" warning. We can also suggest when to use and when not to use the function (otherwise it's left to the reader\, who can easily get it wrong\, which is why we're here).
utf8::is_utf8() doesn't accept the second parameter and does no validity checks (we have utf8::valid() for that)\, despite the note in utf8.pm.

If the document on both was better\, then we could have possibly left this as unfortunate naming errors we're carrying with us (along with "wantarray" for noting whether the context is scalar\, list\, or void).
... Overall\, I'm still undecided. Maybe we could start with improving the existing documentation?

Perhaps something like:

=item * C\<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1) Test whether I\<$string> is marked internally as encoded in UTF-8. Functionally the same as C\<Encode::is_utf8($string)>. Typically only necessary for debugging.

If you need to force Unicode semantics for code that needs to be compatible with perls older than 5.12\, call C\<utf8::upgrade($string)> unconditionally.

Using this flag to decide whether a string should be treated as already encoded bytes or characters is wrong\, this should be decided as part of the interface of your function.

If you're accepting bytes:

    utf8::downgrade($string); # throws an exception if code point over 0xFF

    utf8::downgrade($string, 1) # our own error handling
        or die "\$string must be representable as bytes";

or if you're accepting characters and need encoded bytes:

utf8::encode($string); # unconditionally

The only exception is if you're dealing with filenames\, since perl uses the internal representation of the string for system calls.

\<\<

Are there any other cases someone might be tempted to call utf8::is_utf8()?

Tony

I like this. What I miss here is a small example of how to guarantee preventing double encoding/decoding\, as I think that is what this function is most often (erroneously) used for.

-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.27 porting perl5 on HP-UX\, AIX\, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented 7 years ago

From @grinnz

On Tue\, Jul 18\, 2017 at 3:04 AM\, H.Merijn Brand \h\.m\.brand@xs4all\.nl wrote:

I like this. What I miss here is a small example of how to guarantee preventing double encoding/decoding\, as I think that is what this function is most often (erroneously) used for.

This isn't something that you can guarantee. It always depends on knowing how you get your input. When people don't understand this they look for the magic bullet that is_utf8 appears to be\, but it is not.
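A small illustration of why the flag is no magic bullet, assuming a character string that happens to be stored without the UTF8 flag: the conditional pattern from the first comment silently skips the encoding step, and an accidental second encoding leaves no trace that the flag could reveal.

```perl
use strict;
use warnings;

my $chars = "caf\xe9";                      # characters, stored without the flag

# The conditional pattern from the first comment: nothing is encoded here,
# because the flag is off even though the string holds characters.
my $wrong = $chars;
utf8::encode($wrong) if utf8::is_utf8($wrong);

# Encoding because the interface says "characters":
my $right = $chars;
utf8::encode($right);

# Encoding a second time by mistake:
my $twice = $right;
utf8::encode($twice);

# Both results are plain byte strings; the flag is off on each of them.
my $flags = join ",", map { utf8::is_utf8($_) ? 1 : 0 } $right, $twice;

printf "wrong=%d right=%d twice=%d flags=%s\n",
    length($wrong), length($right), length($twice), $flags;
```

The byte lengths differ, but nothing in the strings themselves says which one is the correct encoding; only knowledge of where the input came from does.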

p5pRT commented 7 years ago

From @tux

On Tue\, 18 Jul 2017 03:13:40 -0400\, Dan Book \grinnz@gmail\.com wrote:

On Tue\, Jul 18\, 2017 at 3:04 AM\, H.Merijn Brand \h\.m\.brand@xs4all\.nl wrote:

I like this. What I miss here is a small example of how to guarantee preventing double encoding/decoding\, as I think that is what this function is most often (erroneously) used for.

This isn't something that you can guarantee. It always depends on knowing how you get your input. When people don't understand this they look for the magic bullet that is_utf8 appears to be\, but it is not.

My point exactly. Just have a piece of text that tells the user why it isn't and what the best alternative *could* be.

-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.27 porting perl5 on HP-UX\, AIX\, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented 7 years ago

From @xsawyerx

On 07/17/2017 10:09 PM\, Hugo van der Sanden via RT wrote:

On Mon\, 17 Jul 2017 01:47:32 -0700\, xsawyerx@gmail.com wrote:

I have mixed thoughts about this.

Me too.

If we decide to have better named functions [...] For new developers\, it will be simple at first\, until you come in contact with the old name. It is likely this will also happen early\, so you need to learn two names anyway. [...]

I just want to reach a bit deeper into "likely this will also happen early" - in my limited experience\, most people working in a perl shop tend to read lots of code in their local codebase\, but rarely read code outside of it (not even for the CPAN modules they're using). So wherever the local codebase gets cleaned up\, it may not happen that early for a good proportion of new developers.

Maybe you had in mind primarily historical threads googled up from perlmonks\, stackoverflow and the like; I agree those are less likely to get cleaned up. I cannot guess what proportion of newish developers would hit those; I can't even guess what proportion of me would hit those had I been born 30 years later.

I meant people who will start hacking on Perl core.

p5pRT commented 7 years ago

From @tonycoz

On Tue\, Jul 18\, 2017 at 10:53:53AM +1000\, Tony Cook wrote:

On Mon\, Jul 17\, 2017 at 10:46:59AM +0200\, Sawyer X wrote:
[Top-posted]

I have mixed thoughts about this.

I'm sympathetic to both considerations: Having properly-named functions to reduce confusion for future developers (we hope to have some\, right?) but not introduce additional cognitive load for existing developers.

A few ways to make such a situation easier:

* Document utf8::is_utf8() to prevent this confusion: This is by far the first thing that should be done. I have double checked the wording for utf8::is_utf8() from my blead (978b185):

(Since Perl 5.8.1) Test whether $string is marked internally as encoded in UTF-8. Functionally the same as "Encode::is_utf8".

This is confusing\, to say the least. "Marked internally" are the words core hackers are looking for and recognize\, but "UTF-8" is what non-core hackers (those without the cognitive bias in core terms) see and understand. If we head over to Encode::is_utf8() we see:

[INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/. If /CHECK/ is true\, also checks whether /STRING/ contains well-formed UTF-8. Returns true if successful\, false otherwise.

As of Perl 5.8.1\, utf8 <https://metacpan.org/pod/utf8> also has the |utf8::is_utf8| function.

I like this wording better for several reasons: It is under the title "Messing with Perl's Internals"; it notes the "UTF8" flag\, and it adds that it checks for well-formed UTF-8 only if that flag is true. There are improvements to be made here too. We can note what the flag means (subtle\, complicated\, bike-shed-able) or at the very least add a nice "this isn't the flag you're looking for" warning. We can also suggest when to use and when not to use the function (otherwise it's left to the reader\, who can easily get it wrong\, which is why we're here).
utf8::is_utf8() doesn't accept the second parameter and does no validity checks (we have utf8::valid() for that)\, despite the note in utf8.pm.

If the document on both was better\, then we could have possibly left this as unfortunate naming errors we're carrying with us (along with "wantarray" for noting whether the context is scalar\, list\, or void). ... Overall\, I'm still undecided. Maybe we could start with improving the existing documentation?

Perhaps something like:

=item * C<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in UTF-8. Functionally the same as C<Encode::is_utf8($string)>. Typically only necessary for debugging.

If you need to force Unicode semantics for code that needs to be compatible with perls older than 5.12, call C<utf8::upgrade($string)> unconditionally.

Using this flag to decide whether a string should be treated as already encoded bytes or characters is wrong, this should be decided as part of the interface of your function.

If you're accepting bytes:

  utf8::downgrade($string); # throws an exception if code point over 0xFF

  utf8::downgrade($string, 1) # our own error handling
    or die "\$string must be representable as bytes";

or if you're accepting characters and need encoded bytes:

  utf8::encode($string); # unconditionally

The only exception is if you're dealing with filenames, since perl uses the internal representation of the string for system calls.

<<

Are there any other cases someone might be tempted to call utf8::is_utf8()?

Thinking about it further, I'm pretty sure this doesn't all belong here.

L<perlunifaq/What is "the UTF8 flag"?> provides a good description of the flag is_utf8() returns, and the whole of perlunifaq covers some of the things the above tries to cover.
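To make the point about the flag concrete, here is a minimal sketch (the variable names are illustrative only, nothing here comes from perlunifaq). It stores the same one-character string both ways and shows that the two copies still compare equal and encode identically, so the flag reports storage and not whether the content is characters or encoded bytes:

```perl
use strict;
use warnings;

# Two copies of the same one-character string (code point 0xE9).
my $native = "\xE9";
my $wide   = "\xE9";
utf8::upgrade($wide);    # changes only the internal storage

printf "native: is_utf8=%d length=%d\n",
    (utf8::is_utf8($native) ? 1 : 0), length($native);
printf "wide:   is_utf8=%d length=%d\n",
    (utf8::is_utf8($wide) ? 1 : 0), length($wide);
print "strings are ", ($native eq $wide ? "equal" : "different"), "\n";

# Encoding to UTF-8 octets gives the same result for both copies,
# which is why an is_utf8() check before utf8::encode() is wrong.
my ($enc_native, $enc_wide) = ($native, $wide);
utf8::encode($enc_native);
utf8::encode($enc_wide);
print "encoded forms are ",
    ($enc_native eq $enc_wide ? "equal" : "different"), "\n";
```

Only the is_utf8 column differs between the two copies; everything visible to pure-Perl code is the same.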

perlunicook largely works at a higher level than the functions in utf8::* work at.

One thing from the above that doesn't seem to be discussed well[1] is what I tried to cover briefly in:

Using this flag to decide whether a string should be treated as already encoded bytes or characters is wrong, this should be decided as part of the interface of your function.

which could perhaps use some expansion in perlunicode.
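As a rough illustration of "decided as part of the interface", a sketch with two hypothetical functions (octet_count and utf8_octets_of are made-up names, not existing API). Each one states in its contract whether it takes octets or characters and never consults utf8::is_utf8():

```perl
use strict;
use warnings;

# Hypothetical function whose interface says: "I take octets".
sub octet_count {
    my ($octets) = @_;                  # work on a copy
    utf8::downgrade($octets, 1)
        or die "argument must be representable as octets\n";
    return length($octets);
}

# Hypothetical function whose interface says: "I take characters
# and return their UTF-8 encoding".
sub utf8_octets_of {
    my ($chars) = @_;                   # work on a copy
    utf8::encode($chars);               # unconditionally, no is_utf8() check
    return $chars;
}

my $e_acute  = "\xE9";
my $upgraded = $e_acute;
utf8::upgrade($upgraded);               # same string, different storage

# The answers do not depend on the internal storage.
printf "%d %d\n", octet_count($e_acute), octet_count($upgraded);
printf "%d %d\n", length(utf8_octets_of($e_acute)),
                  length(utf8_octets_of($upgraded));

# Code points above 0xFF cannot be octets, so the octet interface rejects them.
print "rejected\n" unless eval { octet_count("\x{263A}"); 1 };
```

Both functions behave the same for the upgraded and the non-upgraded copy, which is the property the is_utf8() pattern breaks.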

I'm not sure where the cheat sheet following belongs, though perlunifaq covers some of it (though using Encode instead of utf8::*).

Tony

[1] perlunifaq briefly mentions some of the issues under "What about binary data, like image?" and more detail in "What if I don't decode?"

p5pRT commented 7 years ago

From @xsawyerx

On 07/19/2017 08:58 AM, Tony Cook wrote:

On Tue, Jul 18, 2017 at 10:53:53AM +1000, Tony Cook wrote:
On Mon, Jul 17, 2017 at 10:46:59AM +0200, Sawyer X wrote:
[Top-posted]

I have mixed thoughts about this.

I'm sympathetic to both considerations: Having properly-named functions to reduce confusion for future developers (we hope to have some, right?) but not introduce additional cognitive load for existing developers.

A few ways to make such a situation easier:

* Document utf8::is_utf8() to prevent this confusion: This is by far the first thing that should be done. I have double checked the wording for utf8::is_utf8() from my blead (978b185):
    (Since Perl 5.8.1) Test whether $string is marked internally as
    encoded in UTF-8. Functionally the same as "Encode::is_utf8".
This is confusing, to say the least. "Marked internally" is the words core hackers are looking for and recognize, but "UTF-8" is what non-core hackers (those without the cognitive bias in core terms) see and understand. If we head over to Encode::is_utf8() we see:
    [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/.
    If /CHECK/ is true, also checks whether /STRING/ contains
    well-formed UTF-8. Returns true if successful, false otherwise.

    As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the
    |utf8::is_utf8| function.
I like this wording better for several reasons: It is under the title "Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds that it checks for well-formed UTF-8 only if that flag is true. There are improvements to be made here too. We can note what the flag means (subtle, complicated, bike-shed-able) or at the very least add a nice "this isn't the flag you're looking for" warning. We can also suggest when to use and when not to use the function (otherwise it's left to the reader, who can easily get it wrong, which is why we're here). utf8::is_utf8() doesn't accept the second parameter and does no validity checks (we have utf8::valid() for that), despite the note in utf8.pm.

If the document on both was better, then we could have possibly left this as unfortunate naming errors we're carrying with us (along with "wantarray" for noting whether the context is scalar, list, or void). ... Overall, I'm still undecided. Maybe we could start with improving the existing documentation? Perhaps something like:
=item * C<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in UTF-8. Functionally the same as C<Encode::is_utf8($string)>. Typically only necessary for debugging.

If you need to force Unicode semantics for code that needs to be compatible with perls older than 5.12, call C<utf8::upgrade($string)> unconditionally.

Using this flag to decide whether a string should be treated as already encoded bytes or characters is wrong, this should be decided as part of the interface of your function.

If you're accepting bytes:

  utf8::downgrade($string); # throws an exception if code point over 0xFF

  utf8::downgrade($string, 1) # our own error handling
    or die "\$string must be representable as bytes"

or if you're accepting characters and need encoded bytes:

  utf8::encode($string); # unconditionally

The only exception is if you're dealing with filenames, since perl uses the internal representation of the string for system calls.

<<

Are there any other cases someone might be tempted to call utf8::is_utf8()? Thinking about it further, I'm pretty sure this doesn't all belong here.
L<perlunifaq/What is "the UTF8 flag"?> provides a good description of the flag is_utf8() returns, and the whole of perlunifaq covers some of the things the above tries to cover.

perlunicook largely works at a higher level than the functions in utf8::* work at.

+1 on the suggested text.

I think this addition is useful, even if it is also covered in more documents. We could also link to those documents for further learning.

p5pRT commented 7 years ago

From @tonycoz

On Tue, 18 Jul 2017 23:58:39 -0700, tonyc wrote:

which could perhaps use some expansion in perlunicode.

perlunitut covers this reasonably well.

I'm not sure where the cheat sheet following belongs, though perlunifaq covers some of it (though using Encode instead of utf8::*).

Attached is a series of patches (as a single file); the first three fix some minor problems with the Unicode documentation I found when going through it.

The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.

Tony

p5pRT commented 7 years ago

From @tonycoz

131685-various-changes.patch

```diff
From bb94b5c97eb772aabac478a997537696cf953b39 Mon Sep 17 00:00:00 2001
From: Tony Cook
Date: Wed, 19 Jul 2017 10:30:56 +1000
Subject: use utf8; doesn't force unicode semantics on all strings in scope

eg.

  $ perl -Mutf8 -le 'chr(0xdf) =~ /ss/i and print "match" or print "no match"'
  no match

perhaps this should be removed, or completely re-worded, it's worded
similarly to the next point which behaves differently.
---
 pod/perlunicode.pod | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ef02b0a..d3ccf44 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -233,7 +233,7 @@ Unicode:
 Within the scope of S>
 
 If the whole program is Unicode (signified by using 8-bit Bnicode
-Bransformation Bormat), then all strings within it must be
+Bransformation Bormat), then all literal strings within it must be
 Unicode.
 
 =item *
-- 
2.1.4

From b8e048092606e8ab230e0915896cd44a1c900597 Mon Sep 17 00:00:00 2001
From: Tony Cook
Date: Wed, 19 Jul 2017 10:45:33 +1000
Subject: encoding.pm no longer works

---
 pod/perlunicode.pod | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d3ccf44..24102bf 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -60,10 +60,11 @@ filenames.
 Use the C<:encoding(...)> layer to read from and write to
 filehandles using the specified encoding. (See L.)
 
-=item You should convert your non-ASCII, non-UTF-8 Perl scripts to be
+=item You must convert your non-ASCII, non-UTF-8 Perl scripts to be
 UTF-8.
 
-See L.
+The L module has been deprecated since perl 5.18 and the
+perl internals it requires have been removed with perl 5.26.
 
 =item C still needed to enable L in scripts
 
-- 
2.1.4

From b997306c58fa50d12a10a92b73ecc075100c8518 Mon Sep 17 00:00:00 2001
From: Tony Cook
Date: Wed, 19 Jul 2017 15:42:18 +1000
Subject: unfortunately sysread() tries to read characters

---
 pod/perluniintro.pod | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0ad9dda..5e263b4 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -473,8 +473,11 @@ standardisation organisations are recognised; for a more detailed
 list see L.
 
 C reads characters and returns the number of characters.
-C and C operate on byte counts, as do C
-and C.
+C and C operate on byte counts, as does C.
+
+C and C should not be used on file handles with
+character encoding layers, they behave badly, and that behaviour has
+been deprecated since perl 5.24.
 
 Notice that because of the default behaviour of not doing any
 conversion upon input if there is no default layer,
-- 
2.1.4

From fb22d08dd9f174ddc4007c8ca6ef0e379fe34874 Mon Sep 17 00:00:00 2001
From: Tony Cook
Date: Thu, 20 Jul 2017 15:44:49 +1000
Subject: (perl #131685) improve utf8::* function documentation

Splits the little cheat sheet I posted as a comment into pieces
and puts them closer to where they belong

- better document why you'd want to use utf8::upgrade()

- similarly for utf8::downgrade()

- try hard to convince people not to use utf8::is_utf8()

- no, utf8::is_utf8() isn't what you want instead of utf8::valid()
---
 lib/utf8.pm | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 52 insertions(+), 9 deletions(-)

diff --git a/lib/utf8.pm b/lib/utf8.pm
index 324cb87..9abbd06 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -2,7 +2,7 @@ package utf8;
 
 $utf8::hint_bits = 0x00800000;
 
-our $VERSION = '1.19';
+our $VERSION = '1.20';
 
 sub import {
     $^H |= $utf8::hint_bits;
@@ -109,11 +109,26 @@ you should not say that unless you really want to have UTF-8 source code.
 Converts in-place the internal representation of the string from an octet
 sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
 logical character sequence itself is unchanged. If I<$string> is already
-stored as UTF-8, then this is a no-op. Returns the
-number of octets necessary to represent the string as UTF-8. Can be
-used to make sure that the UTF-8 flag is on, so that C<\w> or C
-work as Unicode on strings containing non-ASCII characters whose code points
-are below 256.
+upgraded, then this is a no-op. Returns the
+number of octets necessary to represent the string as UTF-8.
+
+If your code needs to be compatible with versions of perl without
+C, you can force Unicode semantics on
+a given string:
+
+  # force unicode semantics for $string without the
+  # "unicode_strings" feature
+  utf8::upgrade($string);
+
+For example:
+
+  # without explicit or implicit use feature 'unicode_strings'
+  my $x = "\xDF"; # LATIN SMALL LETTER SHARP S
+  /ss/i; # won't match
+  my $y = uc($x); # won't comvert
+  utf8::upgrade($x);
+  /ss/i; # matches
+  my $z = uc($x); # converts to "SS"
 
 B;
 use L instead.
@@ -136,6 +151,15 @@ true, returns false.
 
 Returns true on success.
 
+If your code expects an octet sequence this can be used to validate
+that you've received one:
+
+  # throw an exception if not representable as octets
+  utf8::downgrade($string)
+
+  # or do your own error handling
+  utf8::downgrade($string, 1) or die "string must be octets";
+
 B;
 use L instead.
 
@@ -153,6 +177,11 @@ Returns nothing.
     # ASCII platforms) 0xc4 and 0x80. On EBCDIC
     # 1047, this would instead be 0x8C and 0x41.
 
+Similar to:
+
+  use Encode;
+  $a = Encode::encode("utf8", $a);
+
 B;
 use L instead.
 
@@ -208,7 +237,22 @@ platforms, so there is no performance hit in using it there.
 =item * C<$flag = utf8::is_utf8($string)>
 
 (Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in
-UTF-8. Functionally the same as C.
+UTF-8. Functionally the same as C.
+
+Typically only necessary for debugging and testing, if you need to
+dump the internals of an SV, L Dump()
+provides more detail in a compact form.
+
+If you still think you need this outside of debugging, testing or
+dealing with filenames, you should probably read L and
+L.
+
+Don't use this flag as a marker to distinguish character and binary
+data, that should be decided for each variable when you write your
+code.
+
+To force unicode semantics in code portable to perl 5.8 and 5.10, call
+C unconditionally.
 
 =item * C<$flag = utf8::valid($string)>
 
@@ -216,8 +260,7 @@ UTF-8. Functionally the same as C.
 UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8 flag
 on B if I<$string> is held as bytes (both these states are 'consistent').
 Main reason for this routine is to allow Perl's test suite to check
-that operations have left strings in a consistent state. You most
-probably want to use C instead.
+that operations have left strings in a consistent state.
 
 =back
 
-- 
2.1.4
```
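The utf8::upgrade() example in the last patch can be turned into a standalone script. This is only a sketch: it assumes the script runs without the unicode_strings feature (so no "use v5.12" or later and no "use feature 'unicode_strings'"), it matches against $x explicitly rather than $_, and the variable names other than $x are illustrative.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use v5.12" or later.

my $x = "\xDF";                 # LATIN SMALL LETTER SHARP S, stored natively

# Native storage and no unicode_strings: byte semantics apply.
my $matched_before = ($x =~ /ss/i) ? 1 : 0;
my $uc_before      = uc($x);

utf8::upgrade($x);              # same character, Unicode semantics now apply

my $matched_after = ($x =~ /ss/i) ? 1 : 0;
my $uc_after      = uc($x);

print "before upgrade: match=$matched_before, uc changed=",
    ($uc_before ne "\xDF" ? 1 : 0), "\n";
print "after upgrade:  match=$matched_after, uc=$uc_after\n";
```

The string holds the same single character throughout; only the semantics applied by the regex engine and uc() change.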
