Rename utf8::is_utf8() (and other functions)

Migrated from rt.perl.org#131685 (status was 'open')

Searchable as RT131685$

From @xsawyerx

On 07/20/2017 07:50 AM, Tony Cook via RT wrote:

On Tue, 18 Jul 2017 23:58:39 -0700, tonyc wrote:

which could perhaps use some expansion in perlunicode. perlunitut covers this reasonably well.

I'm not sure where the cheat sheet following belongs, though perlunifaq covers some of it (though using Encode instead of utf8::*). Attached is a series of patches (as a single file), the first three fix some minor problems with the unicode documentation I found when going through it.

The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.

Thank you, Tony.

I have only two small nit-pickings on the patch: There's a typo for "convert" (says "comvert") and it uses "$a" in one of the examples which I think should be "$x" or some unreserved variable name, to avoid confusion.

From @xsawyerx

On 07/20/2017 09:23 AM, Sawyer X wrote:

I have only two small nit-pickings on the patch: There's a typo for "convert" (says "comvert") and it uses "$a" in one of the examples which I think should be "$x" or some unreserved variable name, to avoid confusion.

For what it's worth, this received an offline +1 from rgs. :)

From @tonycoz

On Thu, Jul 20, 2017 at 09:23:44AM +0200, Sawyer X wrote:

I have only two small nit-pickings on the patch: There's a typo for "convert" (says "comvert") and it uses "$a" in one of the examples which I think should be "$x" or some unreserved variable name, to avoid confusion.

Updated patch attached.

Any opinions on whether the reference to C<use utf8;> modified by the first patch should be removed?

It's still misleading ("abc" in the scope of use utf8; isn't SVf_UTF8 marked), which isn't a big deal, until we do "abc\xDF" which also isn't marked.

Tony
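The point about literals under `use utf8;` can be checked with a short script. This is a minimal sketch, assuming an ASCII platform and a perl run without the `unicode_strings` feature; the variable names are only illustrative, and `utf8::is_utf8()` is used here strictly as a debugging aid.

```perl
use strict;
use warnings;
use utf8;   # only declares that this source file is encoded as UTF-8

my $ascii  = "abc";       # ASCII-only literal
my $latin1 = "abc\xDF";   # sharp s written as an escape, not as a literal character

# Debugging only: inspect the internal flag of each literal.
my $ascii_flag  = utf8::is_utf8($ascii)  ? "on" : "off";
my $latin1_flag = utf8::is_utf8($latin1) ? "on" : "off";
print "ascii literal:  UTF8 flag $ascii_flag\n";
print "latin1 literal: UTF8 flag $latin1_flag\n";

# Without 'unicode_strings' the sharp s is not case-folded to "ss",
# even though the literal sits inside the scope of "use utf8;".
my $matched = ($latin1 =~ /ss/i) ? 1 : 0;
print "matches /ss/i: ", ($matched ? "yes" : "no"), "\n";
```

If the flags come out as described in the message above, both literals report the flag as off and the match fails, which is why the "all strings must be Unicode" wording is misleading.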

From @tonycoz

131685-various-changes.patch

```diff
From bb94b5c97eb772aabac478a997537696cf953b39 Mon Sep 17 00:00:00 2001
From: Tony Cook
Date: Wed, 19 Jul 2017 10:30:56 +1000
Subject: use utf8; doesn't force unicode semantics on all strings in scope

eg.

$ perl -Mutf8 -le 'chr(0xdf) =~ /ss/i and print "match" or print "no match"'
no match

perhaps this should be removed, or completely re-worded, it's worded
similarly to the next point which behaves differently.
---
 pod/perlunicode.pod | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ef02b0a..d3ccf44 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -233,7 +233,7 @@ Unicode:
 Within the scope of S>

 If the whole program is Unicode (signified by using 8-bit Bnicode
-Bransformation Bormat), then all strings within it must be
+Bransformation Bormat), then all literal strings within it must be
 Unicode.

 =item *

--
2.1.4

From b8e048092606e8ab230e0915896cd44a1c900597 Mon Sep 17 00:00:00 2001
From: Tony Cook
Date: Wed, 19 Jul 2017 10:45:33 +1000
Subject: encoding.pm no longer works

---
 pod/perlunicode.pod | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d3ccf44..24102bf 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -60,10 +60,11 @@ filenames.
 Use the C<:encoding(...)> layer to read from and write to
 filehandles using the specified encoding. (See L.)

-=item You should convert your non-ASCII, non-UTF-8 Perl scripts to be
+=item You must convert your non-ASCII, non-UTF-8 Perl scripts to be
 UTF-8.

-See L.
+The L module has been deprecated since perl 5.18 and the
+perl internals it requires have been removed with perl 5.26.

 =item C still needed to enable L in scripts

--
2.1.4

From b997306c58fa50d12a10a92b73ecc075100c8518 Mon Sep 17 00:00:00 2001
From: Tony Cook
Date: Wed, 19 Jul 2017 15:42:18 +1000
Subject: unfortunately sysread() tries to read characters

---
 pod/perluniintro.pod | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0ad9dda..5e263b4 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -473,8 +473,11 @@ standardisation organisations are recognised; for a more detailed
 list see L.

 C reads characters and returns the number of characters.
-C and C operate on byte counts, as do C
-and C.
+C and C operate on byte counts, as does C.
+
+C and C should not be used on file handles with
+character encoding layers, they behave badly, and that behaviour has
+been deprecated since perl 5.24.

 Notice that because of the default behaviour of not doing any
 conversion upon input if there is no default layer,
--
2.1.4

From bba883b879024faf30095f9f19b52ec5ce4d8aac Mon Sep 17 00:00:00 2001
From: Tony Cook
Date: Fri, 21 Jul 2017 11:29:39 +1000
Subject: (perl #131685) improve utf8::* function documentation

Splits the little cheat sheet I posted as a comment into pieces
and puts them closer to where they belong

- better document why you'd want to use utf8::upgrade()

- similarly for utf8::downgrade()

- try hard to convince people not to use utf8::is_utf8()

- no, utf8::is_utf8() isn't what you want instead of utf8::valid()

- change some examples to use $x instead of the sort reserved $a
---
 lib/utf8.pm | 69 +++++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 56 insertions(+), 13 deletions(-)

diff --git a/lib/utf8.pm b/lib/utf8.pm
index 324cb87..50a5b20 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -2,7 +2,7 @@ package utf8;

 $utf8::hint_bits = 0x00800000;

-our $VERSION = '1.19';
+our $VERSION = '1.20';

 sub import {
     $^H |= $utf8::hint_bits;
@@ -109,11 +109,26 @@ you should not say that unless you really want to have UTF-8 source code.
 Converts in-place the internal representation of the string from an octet
 sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
 logical character sequence itself is unchanged. If I<$string> is already
-stored as UTF-8, then this is a no-op. Returns the
-number of octets necessary to represent the string as UTF-8. Can be
-used to make sure that the UTF-8 flag is on, so that C<\w> or C
-work as Unicode on strings containing non-ASCII characters whose code points
-are below 256.
+upgraded, then this is a no-op. Returns the
+number of octets necessary to represent the string as UTF-8.
+
+If your code needs to be compatible with versions of perl without
+C, you can force Unicode semantics on
+a given string:
+
+  # force unicode semantics for $string without the
+  # "unicode_strings" feature
+  utf8::upgrade($string);
+
+For example:
+
+  # without explicit or implicit use feature 'unicode_strings'
+  my $x = "\xDF"; # LATIN SMALL LETTER SHARP S
+  /ss/i; # won't match
+  my $y = uc($x); # won't convert
+  utf8::upgrade($x);
+  /ss/i; # matches
+  my $z = uc($x); # converts to "SS"

 B;
 use L instead.
@@ -136,6 +151,15 @@ true, returns false.

 Returns true on success.

+If your code expects an octet sequence this can be used to validate
+that you've received one:
+
+  # throw an exception if not representable as octets
+  utf8::downgrade($string)
+
+  # or do your own error handling
+  utf8::downgrade($string, 1) or die "string must be octets";
+
 B;
 use L instead.

@@ -148,11 +172,16 @@ replaced with a sequence of one or more characters that represent the
 individual UTF-8 bytes of the character. The UTF8 flag is turned off.
 Returns nothing.

-    my $a = "\x{100}"; # $a contains one character, with ord 0x100
-    utf8::encode($a); # $a contains two characters, with ords (on
+    my $x = "\x{100}"; # $a contains one character, with ord 0x100
+    utf8::encode($x); # $a contains two characters, with ords (on
                        # ASCII platforms) 0xc4 and 0x80. On EBCDIC
                        # 1047, this would instead be 0x8C and 0x41.

+Similar to:
+
+  use Encode;
+  $x = Encode::encode("utf8", $x);
+
 B;
 use L instead.

@@ -167,9 +196,9 @@ turned on only if the source string contains multiple-byte UTF-8
 characters. If I<$string> is invalid as UTF-8, returns false;
 otherwise returns true.

-    my $a = "\xc4\x80"; # $a contains two characters, with ords
+    my $x = "\xc4\x80"; # $a contains two characters, with ords
                         # 0xc4 and 0x80
-    utf8::decode($a); # On ASCII platforms, $a contains one char,
+    utf8::decode($x); # On ASCII platforms, $a contains one char,
                         # with ord 0x100. Since these bytes aren't
                         # legal UTF-EBCDIC, on EBCDIC platforms, $a is
                         # unchanged and the function returns FALSE.
@@ -208,7 +237,22 @@ platforms, so there is no performance hit in using it there.
 =item * C<$flag = utf8::is_utf8($string)>

 (Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in
-UTF-8. Functionally the same as C.
+UTF-8. Functionally the same as C.
+
+Typically only necessary for debugging and testing, if you need to
+dump the internals of an SV, L Dump()
+provides more detail in a compact form.
+
+If you still think you need this outside of debugging, testing or
+dealing with filenames, you should probably read L and
+L.
+
+Don't use this flag as a marker to distinguish character and binary
+data, that should be decided for each variable when you write your
+code.
+
+To force unicode semantics in code portable to perl 5.8 and 5.10, call
+C unconditionally.

 =item * C<$flag = utf8::valid($string)>

@@ -216,8 +260,7 @@ UTF-8. Functionally the same as C.
 UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8 flag
 on B if I<$string> is held as bytes (both these states are 'consistent').
 Main reason for this routine is to allow Perl's test suite to check
-that operations have left strings in a consistent state. You most
-probably want to use C instead.
+that operations have left strings in a consistent state.

 =back

--
2.1.4
```
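The `utf8::upgrade()` and `utf8::downgrade()` examples added by the fourth patch can be exercised as a standalone script. The following is a sketch, assuming an ASCII platform and no explicit or implicit `use feature 'unicode_strings'` (so no `use v5.12` or later); the match is written out as `$x =~ /ss/i`, and the variable names are illustrative only.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use v5.12" or later.

my $x = "\xDF";                            # LATIN SMALL LETTER SHARP S, stored as one byte

my $before_match = ($x =~ /ss/i) ? 1 : 0;  # byte semantics: no case folding to "ss"
my $before_uc    = uc($x);                 # byte semantics: left unchanged

utf8::upgrade($x);                         # same character, UTF-8 internal storage

my $after_match = ($x =~ /ss/i) ? 1 : 0;   # Unicode semantics now apply
my $after_uc    = uc($x);                  # Unicode uppercase of sharp s

# utf8::downgrade() with a true second argument returns false instead of dying.
my $narrow = "\xE9";
utf8::upgrade($narrow);
my $narrow_ok = utf8::downgrade($narrow, 1) ? 1 : 0;   # every character fits in a byte

my $wide    = "\x{100}";
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;       # U+0100 cannot be stored as one byte

print "before: match=$before_match, after: match=$after_match\n";
print "downgrade narrow=$narrow_ok wide=$wide_ok\n";
```

The upgrade changes only the internal storage; `$x` compares equal to `"\xDF"` both before and after the call.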

From @xsawyerx

(Except "$a" still appears in the comments next to the lines that now say "$x". Sorry.)

On 07/21/2017 03:40 AM, Tony Cook wrote:

Updated patch attached.

Any opinions on whether the reference to C<use utf8;> modified by the first patch should be removed?

It's still misleading ("abc" in the scope of use utf8; isn't SVf_UTF8 marked), which isn't a big deal, until we do "abc\xDF" which also isn't marked.

Tony

From @tonycoz

On Fri, 21 Jul 2017 02:02:08 -0700, xsawyerx@gmail.com wrote:

+1

(Except "$a" still appears in the comments next to the lines that now say "$x". Sorry.)

Fixed and applied as e423fa83496ce7d83b137bd7f0852864b6073b36, 01c3fbbc0d1b54bb0dd6fdc0abed7854e62c6717, ee329aefb9c0bfcee0e6cc41dcd6eb8b03206f30 and 0397beb0d12565d70e168bfea7376e2612a6748a.

Is there anything else we should do to avoid mis-use of these functions?

I previously said:

Using this flag to decide whether a string should be treated as already encoded bytes or characters is wrong, this should be decided as part of the interface of your function.

which could perhaps use some expansion in perlunicode. perlunitut covers this reasonably well.

I'm referring to "I/O flow (the actual 5 minute tutorial)", should this be expanded elsewhere?

I don't think it should be expanded in perlunitut.

Tony
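To illustrate deciding bytes versus characters in the interface rather than from the flag, here is a small sketch. `chars_to_utf8_octets` is a hypothetical helper, not part of any module; its contract is simply "characters in, UTF-8 octets out", and it never consults `utf8::is_utf8()`. An ASCII platform is assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: the interface states that the argument is a
# character string and the return value is UTF-8 encoded octets.
sub chars_to_utf8_octets {
    my ($chars) = @_;
    my $octets = $chars;      # work on a copy, leave the caller's string alone
    utf8::encode($octets);    # characters -> UTF-8 octets, whatever the storage
    return $octets;
}

# Two strings holding the same characters with different internal storage.
my $native   = "caf\xE9";
my $upgraded = "caf\xE9";
utf8::upgrade($upgraded);

my $same_chars  = ($native eq $upgraded) ? 1 : 0;
my $same_octets = (chars_to_utf8_octets($native) eq chars_to_utf8_octets($upgraded)) ? 1 : 0;

print "same characters: $same_chars, same octets: $same_octets\n";
print "octet length: ", length(chars_to_utf8_octets($native)), "\n";
```

Because the helper treats its input as characters by contract, both strings produce identical output; branching on the internal flag would treat them differently even though they are equal.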

From @pali

On Sunday 23 July 2017 18:57:43 Tony Cook via RT wrote:

Fixed and applied as e423fa83496ce7d83b137bd7f0852864b6073b36, 01c3fbbc0d1b54bb0dd6fdc0abed7854e62c6717, ee329aefb9c0bfcee0e6cc41dcd6eb8b03206f30 and 0397beb0d12565d70e168bfea7376e2612a6748a.

Just one note:

    +Similar to:
    +
    +  use Encode;
    +  $x = Encode::encode("utf8", $x);
    +

Maybe instead of "utf8" we should show "UTF-8" to users/developers in examples. So if they are using Encode::encode they would get "correct" UTF-8 output and not perl's extended utf8.

Commit 8e179dd8df306c5088bf6c15b494826d48278928 replaced the use of Encode's "utf8" with "UTF-8", as that is better for people who copy and paste without context.
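The difference between the two Encode names can be seen by asking Encode which encoding object each one resolves to. A minimal sketch, assuming an ASCII platform; for ordinary text both names give the same octets, and they differ only in how strictly the code points are validated.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# "UTF-8" resolves to the strict encoding, "utf8" to perl's lax variant.
my $strict_name = find_encoding("UTF-8")->name;
my $lax_name    = find_encoding("utf8")->name;
print "UTF-8 -> $strict_name\n";
print "utf8  -> $lax_name\n";

# For an ordinary character the resulting octets are the same.
my $x            = "\x{100}";
my $strict_bytes = encode("UTF-8", $x);
my $lax_bytes    = encode("utf8",  $x);
print "same octets: ", ($strict_bytes eq $lax_bytes ? "yes" : "no"), "\n";
```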

From @pali

On Sunday 23 July 2017 18:57:43 Tony Cook via RT wrote:

Is there anything else we should do to avoid mis-use of these functions?

The most useful and legitimate functions are these: utf8::encode, utf8::decode, utf8::native_to_unicode and utf8::unicode_to_native.

What about moving them higher up in the synopsis and also in the description? That way we first show users the functions they probably want to use in their code, and only afterwards describe upgrade/downgrade/is_utf8...

Adding an "[INTERNAL]" marker to the description, as is done for utf8::valid, would probably help too.
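For reference, the functions named above can be used as follows. This is a sketch assuming an ASCII platform (on EBCDIC the byte values and code point mappings differ); the variable names are illustrative.

```perl
use strict;
use warnings;

# utf8::decode() works in place and reports whether the octets were valid.
my $octets     = "\xC4\x80";                      # UTF-8 encoding of U+0100
my $decoded_ok = utf8::decode($octets) ? 1 : 0;
my $code_point = ord($octets);                    # now a single character

# Malformed input: a lone continuation byte is rejected.
my $bad    = "\x80";
my $bad_ok = utf8::decode($bad) ? 1 : 0;

# Code point conversion between the native character set and Unicode.
my $unicode = utf8::native_to_unicode(ord("A"));  # Unicode code point of "A"
my $native  = utf8::unicode_to_native(0x41);      # native code point for U+0041

printf "decoded ok=%d, code point=0x%X, bad ok=%d\n", $decoded_ok, $code_point, $bad_ok;
printf "unicode=%d, native char=%s\n", $unicode, chr($native);
```

Checking the return value of `utf8::decode()` is the part that is easy to forget: on failure the string is left unchanged rather than an exception being thrown.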

From @khwilliamson

On 07/13/2017 08:28 PM, Father Chrysostomos via RT wrote:

On Wed, 12 Jul 2017 21:55:03 -0700, public@khwilliamson.com wrote:

I guess we have a fundamental disagreement about language design and the direction Perl should go, which makes me sad.

I agree the disagreement is unfortunate.

The point of adding synonyms for deceptively-named functions and macros is to make life easier overall. Forbidding new better-named synonyms for problematically named things forces everyone who comes along to deal with the gotchas and cognitive load that those people already here have had to deal with. By creating better named things, those people can largely avoid these problems. This allows them to work more efficiently, avoiding traps, and with less cursing Perl.

When you first put forward this argument (specifically with regard to av_len), it made sense to me, and I had no objection to it. Later, people wrote to p5p complaining that the new situation was more confusing; in addition, *I* started to get confused. That was when I started to have second thoughts.

I searched the archives of p5p for occurrences of av_top_index and av_tindex. There were two complaints I saw before the recent spate. One was Marc Lehmann; the other, more recent was Dave Mitchell saying av_tindex didn't seem natural to him.

I myself am confused by the previous names, and this helps *me*. There are times when I want to refer to the highest element. And there are times when the length is the more natural concept. I would like something for these occasions like 'av_true_len'. Again, if I see av_len, I realize it's problematic and I have to slow down to think about how it is. Life is more difficult.

I think Damian Conway was right when he wrote in PBP that one should not use English (the module). Since other people use punctuation variables, you are going to have to learn them anyway. Using the English names just forces others reading your code to look up the names that you are using. It just creates more cognitive burden.

That tells me that the names were not chosen well enough. It is an art, and few coders are good at it. I still have learned only a few of the punctuation variables.

I think the same applies even to poorly named functions. You just have to learn the gotcha once, and then you can use the function and read code that uses it. (And if you use functions without reading either the documentation or the source, then you are coming close to what I would call autopodotoxy.)

Unless Perl is close to death, the number of people who are going to come along before it does die dwarfs the number who are already expert. Some people are knowledgeable in parts of Perl, but not all. They also gain if gotchas get removed before they have to deal with them.

But the gotchas never get removed. You just end up with a larger pile of functions for people to sift through. Not to mention a lot of existing (and correct) code that they cannot read without learning the discouraged parlance. So they have to learn the different forms anyway.

My personal experience is that what you are arguing for, while it sounds good, does not work in practice.

If you assume that new Perl XS programmers are mostly going to be reading old code that uses these constructs, yes they will have to learn them at some point. And, encountering those constructs will likely slow them down each time. But my hope is that there will be plenty of new Perl programmers programming Perl and XS on new projects, and they shouldn't have to be burdened by the past.

My father was good at double-clutching. He used that, the story goes, to save a tourist bus whose brakes had failed that he was driving down a steep slope. He tried to teach me that art, and I did it a few times, but transmissions had gotten better, and I never had to do it, and couldn't do it now. Nowadays most people don't even know what it is, nor should they have to be burdened by a skill that technology has made essentially obsolete.
