Open p5pRT opened 7 years ago
Hi!
This is a continuation of the original discussion about renaming utf8::is_utf8() to utf8::is_upgraded(), which can be found at: https://www.nntp.perl.org/group/perl.perl5.porters/2017/02/msg243068.html
The problem is that many Perl modules use this incorrect code pattern:
    use utf8;
    my $value = func();
    if (utf8::is_utf8($value)) {
        utf8::encode($value);
    }
In most cases module developers think that utf8::is_utf8() returns true when the argument needs to be manually encoded into UTF-8 bytes, which is of course wrong.
The reason for this is the poor name of the function utf8::is_utf8() and also the poor documentation of this function.
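To make the failure mode concrete, here is a minimal sketch. The helper names encode_if_flagged and encode_always are made up for this illustration and are not part of any module. It compares the flawed pattern with unconditional encoding on two copies of the same one-character string that differ only in internal storage.

```perl
use strict;
use warnings;

# Two copies of the same one-character string (U+00E9).
my $down = "\x{E9}";     # downgraded storage: utf8::is_utf8() is false
my $up   = "\x{E9}";
utf8::upgrade($up);      # same character, upgraded storage

# The flawed pattern: encode only when utf8::is_utf8() is true.
# (Hypothetical helper, for illustration only.)
sub encode_if_flagged {
    my ($value) = @_;
    utf8::encode($value) if utf8::is_utf8($value);
    return $value;
}

# Encode unconditionally, treating the input as characters.
# (Hypothetical helper, for illustration only.)
sub encode_always {
    my ($value) = @_;
    utf8::encode($value);
    return $value;
}

for my $case ([ downgraded => $down ], [ upgraded => $up ]) {
    my ($label, $str) = @$case;
    printf "%-10s flawed: %d byte(s), unconditional: %d byte(s)\n",
        $label,
        length(encode_if_flagged($str)),
        length(encode_always($str));
}
```

The two inputs compare equal with eq, yet the flawed pattern leaves the downgraded copy as a single byte 0xE9, which is not valid UTF-8, while unconditional encoding yields the same two bytes for both.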
The functions utf8::is_utf8(), utf8::upgrade() and utf8::downgrade() inspect or change the internal string representation, which is fully invisible to pure Perl code, and therefore I think all those functions should be in the Internals namespace.
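As a small sketch of what "invisible to pure Perl code" means here (the variable names are arbitrary): upgrading a copy of a string changes what utf8::is_utf8() reports, but not the string's value, length, or code points.

```perl
use strict;
use warnings;

my $original = "na\x{EF}ve";   # downgraded storage
my $copy     = $original;
utf8::upgrade($copy);          # changes the storage only, in place

printf "flag on original: %s\n", utf8::is_utf8($original) ? "on" : "off";
printf "flag on copy:     %s\n", utf8::is_utf8($copy)     ? "on" : "off";
printf "same string:      %s\n", $original eq $copy ? "yes" : "no";
printf "same length:      %s\n", length($original) == length($copy) ? "yes" : "no";
printf "same code point:  %s\n",
    ord(substr($original, 2, 1)) == ord(substr($copy, 2, 1)) ? "yes" : "no";

# Downgrading is possible again because every character is below 0x100.
utf8::downgrade($copy);
printf "flag after downgrade: %s\n", utf8::is_utf8($copy) ? "on" : "off";
```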
I'm proposing the following rename of functions:
    utf8::is_utf8()   --> Internals::uses_string_wide_storage()
    utf8::upgrade()   --> Internals::upgrade_string_to_wide_storage()
    utf8::downgrade() --> Internals::downgrade_string_from_wide_storage()
Plus adding backward-compatible aliases so that existing code works as before.
As all those functions should be used only for debugging purposes (e.g. test cases for XS code) or when dealing with a buggy XS module, I'm proposing to start throwing a warning (e.g. from v5.28.0) when those functions are called. Those who are dealing with internals can turn the warning off with no warnings 'experimental::internal';
I'm attaching patches which:
* Add new warning category 'experimental::internal'
* Rename utf8 functions
* Update perldoc utf8 documentation
On Sat, 01 Jul 2017 09:03:18 -0700, (via RT) perlbug-followup@perl.org wrote:
> # New Ticket Created by
> # Please include the string: [perl #131685]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=131685 >
> Hi!
> I'm proposing following rename of functions:
> utf8::is_utf8() --> Internals::uses_string_wide_storage()
> utf8::upgrade() --> Internals::upgrade_string_to_wide_storage()
> utf8::downgrade() --> Internals::downgrade_string_from_wide_storage()
I am still objecting, as this will also break code that uses those functions as intended and correctly.
As these are not XS, Devel::PPPort won't help (assuming authors use D::P on XS modules to guarantee backward compat).
I'd be loath to change/fix every occurrence of code that uses any of these three correctly, as that code is brittle to start with and probably hard to fix when broken.
> Plus adding backward compatible aliases to make existing code works like before.
Then why add new functions in the first place?
> As all those functions should be used only for debugging purposes (e.g. test cases for XS code) or when dealing with buggy XS module, I'm proposing starting to throw warning (e.g. since v5.28.0) when those functions are called. For those who are dealing with internals, can turn warning off by no warnings 'experimental::internal';
No, please. Most correct uses will be in dark distant corners, hidden in modules you don't want to touch anyway.
> I'm attaching patches which:
> * Add new warning category 'experimental::internal'
> * Rename utf8 functions
> * Update perldoc utf8 documentation
-- 
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
The RT System itself - Status changed from 'new' to 'open'
On Sat, Jul 1, 2017 at 6:03 PM, via RT perlbug-followup@perl.org wrote:
> I'm proposing following rename of functions:
> utf8::is_utf8() --> Internals::uses_string_wide_storage()
> utf8::upgrade() --> Internals::upgrade_string_to_wide_storage()
> utf8::downgrade() --> Internals::downgrade_string_from_wide_storage()
> Plus adding backward compatible aliases to make existing code works like before.
I don't see how this is an option. I'll grant you that something like this would have been a better option back then but you're 15 years too late. "This would have been better" is no excuse to break a decade and a half of software.
Leon
On Saturday 01 July 2017 19:13:30 you wrote:
> to break a decade and a half of software.
Hm? What do you mean by "break"? The existing functions would still work; there would just also be new functions under new names. Usage of the old functions is simply removed from the documentation.
On Saturday 01 July 2017 18:54:24 you wrote:
> > Plus adding backward compatible aliases to make existing code works like before.
> Then why add new functions in the first place?
From the discussion it was clear that the current name utf8::is_utf8() is poor and is the reason why it is used incorrectly.
On Sat, Jul 1, 2017 at 7:45 PM, pali@cpan.org wrote:
> Hm? What you mean with to break? Existing functions would still work, just there are also new functions under new names. Usage of old functions is just removed from documentation.
Then I misunderstood your proposal, "rename" suggested to me that the old ones disappear. In that case I'm not sure I see the benefit of your proposal. Why would anyone want to use an interface that won't work on perls older than 5.28, and could disappear in a future version of perl (since that's the point of Internals::)? This isn't making sense to me.
Leon
On 07/01/2017 01:52 PM, Leon Timmermans wrote:
> Then I misunderstood your proposal, "rename" suggested to me that the old ones disappear. In that case I'm not sure I see the benefit of your proposal. Why would anyone want to use an interface that won't work on perls older than 5.28, and could disappear in a future version of perl (since that's the point of Internals::)? This isn't making sense to me.
You could support it with Devel::PPPort. It's a simple addition.
However, the problem remains that if someone were to use these new functions without PPPort, their code would not work on older versions. I can't see a way around that.
On Mon, Jul 03, 2017 at 01:03:37PM -0400, Sawyer X wrote:
> You could support it with Devel::PPPort. It's a simple addition.
> However, the problem remains that if someone were to use these new functions without PPPort, their code would not work on older versions. I can't see a way around that.
These are perl functions (as documented in utf8.pm), not C functions; Devel::PPPort does nothing for us.
The patch retains the old names, so that isn't an issue.
But it does deprecate the old names, which is an issue; I can't imagine us removing these functions.
As a side note, the original thread refers to:
https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive-Tar/lib/Archive/Tar.pm#L1501
which I could see as correct because of the way perl's unicode support (fails to) deal with filenames.
Tony
On Mon, Jul 3, 2017 at 8:38 PM, Tony Cook tony@develop-help.com wrote:
> As a side note, the original thread refers to:
> https://metacpan.org/source/SHAY/perl-5.24.1/cpan/Archive-Tar/lib/Archive/Tar.pm#L1501
> which I could see as correct because of the way perl's unicode support (fails to) deal with filenames.
Not entirely correct IMO. If the intent is that filenames be encoded to UTF-8, this will fail to encode downgraded names with non-ascii characters.
-Dan
On Mon, Jul 03, 2017 at 09:35:06PM -0400, Dan Book wrote:
> Not entirely correct IMO. If the intent is that filenames be encoded to UTF-8, this will fail to encode downgraded names with non-ascii characters.
If the caller creates a file using the name they pass in, encoding the name (which might not be utf-8 marked) may make the later -e or -l check fail.
Perl functions such as open and stat currently ignore the UTF-8 flag, which makes this pretty messy.
The code in Archive::Tar seems a reasonable workaround to me; I don't think the author had much choice.
Tony
On Monday 03 July 2017 21:35:06 Dan Book wrote:
> Not entirely correct IMO. If the intent is that filenames be encoded to UTF-8, this will fail to encode downgraded names with non-ascii characters.
See bug: https://rt.perl.org/Public/Bug/Display.html?id=130831
On Tuesday 04 July 2017 10:38:26 Tony Cook wrote:
> But it does deprecate the old names, which is an issue, I can't imagine us removing these functions.
The warning can be removed from the patch; it is just a question of how you decide. The functions also stay there, but we can instruct people via documentation to use the new functions for new code... Again, it is a question of whether you call it deprecation or aliasing. In any case the functions are not going to be deleted, so in the end it does not matter for old code.
And for old code this function can be defined easily:
    *new_name = *old_name;
The reasons for this patch series are:
* document those utf8:: functions
* allow developers to call those functions via non-cryptic names
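The aliasing idea can be sketched in user code like this. The package name My::StringStorage and the alias names are hypothetical, chosen only for this illustration; they are not what the attached patches install.

```perl
use strict;
use warnings;

# Hypothetical package and alias names, for illustration only.
package My::StringStorage;

# Assigning a code reference to a glob installs the same function under a
# second name. The original utf8:: functions stay untouched.
*is_upgraded = \&utf8::is_utf8;
*upgrade     = \&utf8::upgrade;
*downgrade   = \&utf8::downgrade;

package main;

my $s = "caf\x{E9}";
My::StringStorage::upgrade($s);       # modifies $s in place, like utf8::upgrade
print My::StringStorage::is_upgraded($s) ? "upgraded\n" : "downgraded\n";
My::StringStorage::downgrade($s);
print My::StringStorage::is_upgraded($s) ? "upgraded\n" : "downgraded\n";
```

Because the alias is the very same function, old and new names always agree. The drawback raised in the thread still applies: code that relies on a new name only works where something has installed that name.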
On 4 July 2017 at 09:19, pali@cpan.org wrote:
> Reason for this patch series is:
> * document those utf8:: functions
> * allow developers to call those functions via non-cryptic names
I don't mind adding new aliases for these functions. I object to your proposal to put them in Internals, however; I think that they should go in 'scalar', which we decided at the last PerlQA is the designated place for functions that operate on scalars.
    scalar::is_unicode_string()
    scalar::is_binary_string()
I don't like the wide-storage thing (although I admit I think it better than "is_utf8"): a latin1 string in utf8 does not use wide-storage, and the unicode flag has significance beyond the storage format; utf8-on strings get unicode semantics in case insensitive operations.
cheers, Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
On Tuesday 04 July 2017 01:52:29 yves orton via RT wrote:
> I dont mind adding new aliases for these functions, I object to your proposal to put them in Internals however; I think that they should go in 'scalar', which we decided at the last PerlQA is the designated place for functions that operate on scalars.
I proposed Internals because that flag is internal to perl and invisible to pure Perl code. But if more people are happy with the scalar namespace, I'm fine with it.
> scalar::is_unicode_string() scalar::is_binary_string()
But this is wrong! SVf_UTF8 does not tell whether a scalar string is unicode or binary. It just tells the type of internal storage.
The name is_binary_string is misleading in the same way as the current name is_utf8.
If you say that a binary string is one with codes only in the range 0x00-0xFF, then you can have that binary string also with the SVf_UTF8 flag, and your function "is_binary_string" would return false for your binary string. Such a name would lead to other problems.
> I don't like the wide-storage thing, (although I admit i think it better than "is_utf8"), a latin1 string in utf8 does not use wide-storage,
Of course it can. Unicode code points 0x80 .. 0xFF (the Latin1 extension of ASCII) take two bytes when encoded in UTF-8 and therefore are wide in UTF-8 too.
> and the unicode flag has significance beyond the storage format; utf8-on strings get unicode semantics in case insensitive operations.
> cheers, Yves
On 4 July 2017 at 11:03, pali@cpan.org wrote:
> > scalar::is_unicode_string() scalar::is_binary_string()
> But this is wrong! SVf_UTF8 does not tell if scalar string is unicode or binary. It just tell type of internal storage.
No. This is a myth. Plain and simply a myth.
People have a hard time accepting it, but the utf8 flag tells parts of the internals to use different rules for certain operations; when set, those rules are Unicode. When the flag is not set, the default rules are derived from ASCII.
You can see the difference in the following:
    "ba\x{DF}"=~/ss/i;
    "ba\N{U+DF}"=~/ss/i;
The latter matches because \N{U+DF} produces the unicode code point DF, and the former does not match, because \x{DF} produces the ASCII octet DF instead. The former is an ASCII string, and the latter is a Unicode string.
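A runnable sketch of the comparison above, assuming a script that enables neither the unicode_strings feature nor a "use v5.12" (or later) bundle; the variable names are arbitrary. It reports the flag and the match result for each literal.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use v5.12" or
# later: either would apply Unicode rules to both strings, as perl -E does.

my $octets  = "ba\x{DF}";      # stored without the UTF-8 flag
my $unicode = "ba\N{U+DF}";    # \N{U+...} yields a string with the flag set

for my $case ([ '\x{DF}', $octets ], [ '\N{U+DF}', $unicode ]) {
    my ($label, $str) = @$case;
    printf "%-9s flag=%d match=%s\n",
        $label,
        utf8::is_utf8($str) ? 1 : 0,
        $str =~ /ss/i ? "yes" : "no";
}
```

Both strings contain the same code points and compare equal with eq; only the case-insensitive match differs.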
> Name is_binary_string is misleading in same way as current name is_utf8.
Erf, maybe. We need a term for "not-unicode", and "binary" is as good as any. I don't mind other proposals.
> If you say that binary string is one with codes only in range 0x00-0xFF then you can have that binary string also with SVf_UTF8 flag and your function name "is_binary_string" would return false for your binary string. Such name would lead to another problems.
The SVf_UTF8 flag being off means the string should be treated as ASCII when doing case-insensitive operations, and as binary for other purposes, and that the data is encoded as a series of discrete octets. It is not uncommon for people on this list to use the terms unicode and binary for this reason.
> > I don't like the wide-storage thing, (although I admit i think it better than "is_utf8"), a latin1 string in utf8 does not use wide-storage,
> Of course it can. Unicode code points 0x80 .. 0xFF (which are Latin1 extension from ASCII) contains two bytes when encoded in UTF-8 and therefore are wide in UTF-8 too.
I spoke imprecisely, I should have said ASCII, not latin-1.
cheers, Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
On Tuesday 04 July 2017 11:22:42 demerphq wrote:
> No. This is a myth. Plain and simply a myth.
> People have a hard time accepting it, but the utf8 flag tells parts of the internals to use different rules for certain operations, when set those rules are Unicode. When the flag is not set the default rules are derived from ASCII.
> You can see the difference in the following:
> "ba\x{DF}"=~/ss/i;
    $ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;'
    matched
> "ba\N{U+DF}"=~/ss/i;
    $ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;'
    matched
> The latter matches because \N{U+DF} produces the unicode code point DF, and the former does not match, because \x{DF} produces the ASCII octet DF instead. The former is an ASCII string, and the later is a Unicode string.
No, both were matched under Perl 5.24.1.
On 4 July 2017 at 12:04, pali@cpan.org wrote:
> $ perl -E 'say "matched" if "ba\x{DF}"=~/ss/i;'
> matched
> $ perl -E 'say "matched" if "ba\N{U+DF}"=~/ss/i;'
> matched
-E is not -e.
-E is enabling a pragma which changes the default behavior.
However it is *PRAGMA*. It is NOT the normal behavior of Perl.
> No, both were matched under Perl 5.24.1.
No, they did not. If \x{DF} magically started matching 'ss' it would be a *MASSIVE* regression.
cheers, Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
On Tuesday 04 July 2017 03:12:19 yves orton via RT wrote:
> -E is not -e.
> -E is enabling a pragma which changes the default behavior.
> However it is *PRAGMA*. It is NOT the normal behavior of Perl.
Ah, right. I forgot that -E enables the unicode_strings feature, which basically means that both examples were equivalent.
The default behavior is a bit unpredictable, as it is affected by the infamous Unicode Bug.
    my $str1 = "\x{DF}";
    my $str2 = "\N{U+DF}";
    my $str3 = "\x{100}";
    "ba$str1" =~ /ss/i;
    "ba$str2" =~ /ss/i;
    "ba$str1$str3" =~ /ss/i;
To make it predictable, either the /aa or the /u modifier should be used. That prevents these problems:
    "ba$str1" =~ /ss/aai;
    "ba$str2" =~ /ss/aai;
    "ba$str1$str3" =~ /ss/aai;
    "ba$str1" =~ /ss/ui;
    "ba$str2" =~ /ss/ui;
    "ba$str1$str3" =~ /ss/ui;
On 4 July 2017 at 13:14, pali@cpan.org wrote:
> Default behavior is a bit unpredicable as it is affected by the infamous Unicode Bug.
It is only unpredictable if your model of strings is broken. I happen to be very familiar with the internals, and do not find the actual rules to be that difficult to deal with.
cheers\, Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
On Tuesday 04 July 2017 13:32:26 demerphq wrote:
> It is only unpredictable if your model of strings is broken.
I do not know what you mean by a broken model of strings, but once you start receiving strings from other modules, user input or any other external source, and you start combining or concatenating those strings, you will hit the Unicode Bug. Therefore the safe way is to use the /aa or /u modifier in regex matching, according to how you want matching to behave.
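The concatenation effect can be sketched as follows, again assuming a script without the unicode_strings feature and reusing the $str1..$str3 names from the earlier message. Appending a character above 0xFF upgrades the whole string, which changes how a plain /i match treats the \x{DF} that was already there, while /aa and /u give the same answer for every storage form.

```perl
use strict;
use warnings;
# No unicode_strings feature here, so plain /i depends on string storage.

my $str1 = "\x{DF}";      # downgraded storage
my $str2 = "\N{U+DF}";    # upgraded storage
my $str3 = "\x{100}";     # above 0xFF, so concatenation upgrades the result

my @targets = (
    [ 'ba$str1',      "ba$str1" ],
    [ 'ba$str2',      "ba$str2" ],
    [ 'ba$str1$str3', "ba$str1$str3" ],
);

for my $t (@targets) {
    my ($label, $s) = @$t;
    printf "%-13s /i:%-3s /aai:%-3s /ui:%-3s\n",
        $label,
        $s =~ /ss/i   ? "yes" : "no",
        $s =~ /ss/aai ? "yes" : "no",
        $s =~ /ss/ui  ? "yes" : "no";
}
```

With /aa the sharp s never matches "ss", with /u it always does; only the unmodified /i column varies from row to row.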
> I happen to be very familiar with the internals, and do not find the actual rules to be that difficult to deal with.
I think this discussion has drifted from the original request, which is for better documentation of utf8.pm and a better name for the utf8::is_utf8() function.
On 07/04/2017 07:38 AM, pali@cpan.org wrote:
> On Tuesday 04 July 2017 13:32:26 demerphq wrote:
> > It is only unpredictable if your model of strings is broken.
> I do not know what you mean if model of strings is broken,
It is "broken" in that sense for probably more people than we would like. Do we have any documentation that clarifies this entire issue? (I know I trip on this frequently and never fully understood this issue myself.)
[...]
> > I happen to be very familiar with the internals, and do not find the actual rules to be that difficult to deal with.
> I think this discussion is out of original request, which is for better documentation of utf8.pm and better name for utf8::is_utf8() function.
Agree.
For now we seem to have three points we agree on:
* We want to document these functions
* We want to give them better names
* We want the old behavior to work
As long as the second clause does not break the third, I think we should seek to move forward.
Yves mentioned that the "Internals" namespace is an undesirable place for it (which was discussed at P5H, the last core hackathon) and I agree. "scalar" was the most popular one, IIRC.
Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)
Thanks!
demerphq wrote:
> People have a hard time accepting it, but the utf8 flag tells parts of the internals to use different rules for certain operations,
Those are bugs. In some cases they are bugs that we've decided we can't just fix because of backcompat, so we add a flag to enable non-buggy semantics and the bug lives on as default behaviour.
If a flag to distinguish between character strings and binary strings were an intentional semantic feature\, we'd need some rules to say how the flag is to be set by operations that generate string outputs. We've never done that.
-zefram
Sawyer X wrote:
> Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)
I didn't want to add to a mostly bikeshedding discussion, but OK. I concur that the existing names are poor, but I'm not much happier with the names that have been suggested on this thread. I reckon the best terminology we have for this flag, at the user level, is "upgraded", and so the name "is_utf8" would be better as "is_upgraded". The existing names "upgrade" and "downgrade" for the transforming operations are OK, and the only change I'd potentially like to make to them would be to add something that explicates their rather unusual in-place side-effecting nature.
In fact you can see all my preferred names in my CPAN module Scalar::String. This module essentially attempts to be the sane version of utf8.pm, attempting to impart the right mental model through its function names and documentation. (The "sclstr_" prefix on all the function names may be omitted if desired; the important part of the name is that which distinguishes these functions from each other.)
I think the names for these functions should be reasonably concise, and in particular we should have a single-word adjective for "having the SvUTF8 flag on" if possible. We should also try to reuse existing terminology, rather than invent anything new. We should also avoid any term that implies anything beyond the storage, such as any reference to characters or Unicode, because such implications are largely inaccurate, and anywhere they are accurate is a bug. All of this leads me to prefer "upgraded" over "utf8", "unicode", "uses_wide_storage", and the like.
I don't have any strong opinion about which package any new names for these functions should appear in. I think on balance we should not remove the old names, because the trouble that arises from maintaining them is small compared to the hassle that would arise from requiring existing correct programs to change. Not removing them implies that we wouldn't even be deprecating them, as currently defined, but we can fairly discourage the use of the old names in documentation.
-zefram
On 07/10/2017 02:13 PM, Zefram wrote:
> I don't have any strong opinion about which package any new names for these functions should appear in. I think on balance we should not remove the old names, because the trouble that arises from maintaining them is small compared to the hassle that would arise from requiring existing correct programs to change. Not removing them implies that we wouldn't even be deprecating them, as currently defined, but we can fairly discourage the use of the old names in documentation.
My view is that the current names could be improved, and that there should be no technical nor social problem in creating new names while retaining the old ones, but changing the docs to stress the new ones. I've done that a lot.
I don't know what namespace is best. At first blush Internals seems good to me, for this and other things that people currently have hacks for, like
    $foo & ""
which tries to find out if $foo is a string or just a number. I don't fully understand the objection to 'Internals'.
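For readers who have not met this hack, a minimal sketch, assuming the classic behaviour of the & operator (that is, without the "bitwise" feature, which a "use v5.28" or later bundle turns on). The function name is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper name, for illustration only.
# With the classic & operator, two string operands give a string result
# ("" here, because the result is as short as the shorter operand), while a
# numeric operand forces a numeric AND with 0, which stringifies as "0".
sub has_numeric_value {
    my ($value) = @_;
    no warnings 'numeric';   # "" is not a number; silence that warning
    return length($value & "") ? 1 : 0;
}

for my $case ([ 'number 42'   => 42 ],
              [ 'string "42"' => "42" ],
              [ 'number 3.14' => 3.14 ]) {
    my ($label, $value) = @$case;
    printf "%-12s => %s\n", $label,
        has_numeric_value($value) ? "numeric" : "string";
}
```

The answer reflects internal flags rather than the value itself: a string that has previously been used in numeric context may be reported as numeric, which is part of why this counts as a hack.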
I have never liked upgrade and downgrade. When you upgrade something you are supposed to get something better, like more legroom. I have never seen why a PV is better than a number, or a UTF-8 string better than a non-UTF-8 one (it's far slower, for example, which is a downgrade in my estimation). The use of upgrade and downgrade is jargon based on the attitudes of the implementers, which should be avoided. Maybe it's too baked in to change, but I regret that it's there. UTF-8 itself is an implementation detail that should never have been exposed to the outside, but 'use utf8' pretty much does that.
On Mon\, 10 Jul 2017 19:53:42 -0700\, public@khwilliamson.com wrote:
I don't know what namespace is best. At first blush Internals seems good to me\, for this and other things that people currently have hacks for\, like
$foo & ""
which is trying to find out if $foo is a string or just a number. I don't fully understand the objection to 'Internals'.
Adding new public functions to the Internals namespace would completely change its meaning. It contains functions that exist mainly for perl’s own functionality (for built-in modules like Hash::Util to use) and for testing perl itself. Users are not supposed to know about them. That the cat is out of the bag and we cannot remove them is unfortunate.
Since we already use ‘utf8’ to refer to Perl’s Unicode support\, why not continue to use that namespace?
--
Father Chrysostomos
On Mon\, 10 Jul 2017 19:53:42 -0700\, public@khwilliamson.com wrote:
I have never liked upgrade and downgrade. When you upgrade something you are supposed to get something better\, like more legroom.
Well\, er\, that is exactly what you get. You can stretch your legs beyond CLV.*
I have never seen why a PV is better than a number\, or a UTF-8 string better than a non-one (it's far slower\, for example\,
I think that is one of the best arguments in favour of ‘upgrade’. It is just like upgrading most commercial software!
which is a downgrade in my estimation). The use of upgrade and downgrade is jargon based on the attitudes of the implementers\, which should be avoided. Maybe it's too baked in to change\, but I regret that it's there. UTF-8 itself is an implementation detail that should never have been exposed to the outside\, but 'use utf8' pretty much does that.
* That is a Roman numeral.
--
Father Chrysostomos
On Mon\, Jul 10\, 2017 at 12:45:48PM -0400\, Sawyer X wrote:
Does anyone have any comments on this? Tony\, Dave\, Zefram? *Karl*? :)
My opinion on this sort of proposal (and it's an opinion which has gotten stronger over time (*)) is rarely/never to add a new alias name to an existing function.
Alias names just increase the cognitive load. If the old names were confusing\, having more names will just increase the confusion.
Before\, you would have to remember that a particular function foo() is badly named and doesn't do what you might expect it to do\, based solely on the name.
Afterwards, you have to remember that there are two functions foo() and bar(), one is deprecated (which one?), one is badly named (which one?), but they both do the same thing (Or do they? Sigh. Let's check the documentation one more time).
Life is now harder.
(*) My opinion firmed over AvFILL(). It was a weird name\, but I was used to it. Now I can never remember what the new alias is called (just looked it up - av_top_index()). In hindsight\, I would have voted against adding av_top_index.
-- All wight. I will give you one more chance. This time\, I want to hear no Wubens. No Weginalds. No Wudolf the wed-nosed weindeers. -- Life of Brian
I agree with everything you have said. I brought up the same objection when this proposal was first put forward\, but I thought I had lost the debate. Well\, at least there are two of us now. :-)
--
Father Chrysostomos
On Mon\, 10 Jul 2017 09:46:48 -0700\, xsawyerx@gmail.com wrote:
Does anyone have any comments on this? Tony\, Dave\, Zefram? *Karl*? :)
I haven't seen names I prefer over the current names\, certainly none that are improved enough that it's worth having two names for the same thing.
Tony
On Tue, 11 Jul 2017 10:41:37 -0700, "Father Chrysostomos via RT" <perlbug-followup@perl.org> wrote:
I agree with everything you have said. I brought up the same objection when this proposal was first put forward\, but I thought I had lost the debate. Well\, at least there are two of us now. :-)
Count me in: three. I like the way Dave has written down my feelings :)
-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.27 porting perl5 on HP-UX\, AIX\, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
On 07/12/2017 12:36 AM\, H.Merijn Brand wrote:
Count me in: three. I like the way Dave has written down my feelings :)
I guess we have a fundamental disagreement about language design and the direction Perl should go\, which makes me sad.
The point of adding synonyms for deceptively-named functions and macros is to make life easier overall. Forbidding new better-named synonyms for problematically named things forces everyone who comes along to deal with the gotchas and cognitive load that those people already here have had to deal with. By creating better named things\, those people can largely avoid these problems. This allows them to work more efficiently\, avoiding traps\, and with less cursing Perl.
Unless Perl is close to death\, the number of people who are going to come along before it does die dwarfs the number who are already expert. Some people are knowledgeable in parts of Perl\, but not all. They also gain if gotchas get removed before they have to deal with them.
Specifically about av_top_index\, I don't believe that it is so poorly named that you have to keep consulting the documentation as to what it does.
It came about not because of AvFILL\, but because of the already-existing synonym\, the evilly named "av_len". This name implies it gives a length\, but in fact it is one-off from that. av_top_index\, though cumbersome\, accurately indicates what it returns.
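The same off-by-one distinction is visible at the Perl level, which may help readers who do not write XS. This is only an analogy, sketched on the understanding that `av_len` / `av_top_index` report the highest index, which corresponds to `$#array` rather than to the element count.

```perl
use strict;
use warnings;

my @items = qw(a b c);

my $top_index = $#items;          # highest index, like av_len()/av_top_index()
my $count     = scalar @items;    # number of elements, what "len" suggests

print "top index: $top_index, count: $count\n";
# prints "top index: 2, count: 3"

my @empty;
my $empty_top = $#empty;          # -1 for an empty array, not 0
print "empty top index: ", $empty_top, "\n";
```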
Using av_len is a bug waiting to happen. It is a foreseeable problem. I believe that it would be unethical to not create a non-deceptive alternative. It's kind of like a safety recall.
Writing code using deceptively named things or with poor APIs is slower and more error prone. Every time you use one, you have to get out of your mental pipeline and recall that this is a gotcha, and have to figure out how it is a gotcha and how you have to compensate. You are effectively flushing your mental instruction cache. In the case of av_len, you have to remember which way is the off-by-one problem here.
Code reviews also are affected. It is just too easy to read the thing and forget that it doesn't do what you would want.
In researching the issue back when av_top_index was created\, I found published modules that used av_len\, as its name implies\, as a length. Others undoubtedly had caught the problem earlier\, say through their unit testing.
But all this could be avoided by the code using a non-deceptive name. Hopefully\, the coder won't even be aware that there exist deceptive ones for hysterical reasons.
It is foreseeable that av_len is going to cause problems. It would be irresponsible of us to not create a non-deceptive synonym when it is so easy to do.
No one was really happy with "av_top_index" as a name. So AvFILL was retained in the core. All occurrences of av_len were removed. If we could have come up with a short\, pithy synonym\, we would have replaced AvFILL as well\, and then people looking at the core would have seen that and gotten used to it\, and over time the memory of the less well-named versions would have faded.
Writing good APIs is hard. I have flattered myself at times into thinking I'm good at it. Maybe I am actually good, but if so, I'm still not good enough. And few, if any, are. If we have a poor API in some area, we should not tie our hands and say tough to all those people who come along later, and give them more reason to use some other language.
On Wed, 12 Jul 2017 22:53:57 -0600, Karl Williamson <public@khwilliamson.com> wrote:
It came about not because of AvFILL\, but because of the already-existing synonym\, the evilly named "av_len". This name implies it gives a length\, but in fact it is one-off from that. av_top_index\, though cumbersome\, accurately indicates what it returns.
The problem with av_top_index is that it has not (yet) been ported to Devel::PPPort, so I cannot change any XS code that uses av_len into using the new function if that XS is to support 5.16.0 or older.
$ ack av_top_index ppport.h
1225:av_top_index||5.017009|
$
I know I didn't quote all of your message and I understand your motivation\, but the problem for these misnamed functions is much wider than the scope of av_top_index\, which is *only* available to XS\, and XS is more or less easy to fix by adding stuff to Devel::PPPort
For the utf8 functions\, the scope is WAY wider: it is used from pure-perl\, and renaming them (with or without aliases) would cause major brain damage for all authors that use these functions (correct or incorrect) when their code has to work on a wide range of perl versions.
To be honest\, I do not see an easy way out of that dilemma. If you have one\, I'm open to change for the better.
-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.27 porting perl5 on HP-UX\, AIX\, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
On Wednesday 12 July 2017 23:44:39 H. Merijn Brand via RT wrote:
On Wed, 12 Jul 2017 22:53:57 -0600, Karl Williamson <public@khwilliamson.com> wrote:
It came about not because of AvFILL\, but because of the already-existing synonym\, the evilly named "av_len". This name implies it gives a length\, but in fact it is one-off from that. av_top_index\, though cumbersome\, accurately indicates what it returns.
The problem with av_top_index is that it has not (yet) been ported to Devel::PPPort,
Devel::PPPort is probably unmaintained... It has had a couple of open bugs since 2015 without any comments, and pull requests have not been processed since 2016, not even security-related ones like this: https://github.com/mhx/Devel-PPPort/pull/47
Because of those problems\, I have no motivation to prepare any other patch for Devel::PPPort. For dead/unmaintained modules it is useless.
so I cannot change any XS code that uses av_len into using the new function if that XS is to support 5.16.0 or older
$ ack av_top_index ppport.h
1225:av_top_index||5.017009|
$
I know I didn't quote all of your message and I understand your motivation\, but the problem for these misnamed functions is much wider than the scope of av_top_index\, which is *only* available to XS\, and XS is more or less easy to fix by adding stuff to Devel::PPPort
For the utf8 functions\, the scope is WAY wider: it is used from pure-perl\, and renaming them (with or without aliases) would cause major brain damage for all authors that use these functions (correct or incorrect) when their code has to work on a wide range of perl versions.
To be honest\, I do not see an easy way out of that dilemma. If you have one\, I'm open to change for the better.
The problem is that people very often use the construct I showed in my first comment. Or they read "is_utf8" as meaning that the string is UTF-8 encoded, and conclude that they need to call utf8::decode() on it.
And all this happens just because of a wrong name, from which most people deduce what the function should do -- without reading *any* documentation...
If we do not add better aliases, broken code will keep being produced on CPAN.
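A small sketch of why the guarded pattern goes wrong, assuming a one-character string below 256 that happens to be stored without the flag. `hex_of` is a hypothetical helper used only for display.

```perl
use strict;
use warnings;

# Hypothetical display helper: code points of a string as hex.
sub hex_of { return join " ", map { sprintf "%02X", ord } split //, $_[0] }

my $char = "\xE9";    # one character (U+00E9), stored without the UTF8 flag

# The pattern from the ticket: encode only when the flag is set.
my $guarded = $char;
utf8::encode($guarded) if utf8::is_utf8($guarded);    # skipped, flag is off

# Unconditional encoding of a character string.
my $encoded = $char;
utf8::encode($encoded);

print "guarded: ", hex_of($guarded), "\n";    # guarded: E9
print "encoded: ", hex_of($encoded), "\n";    # encoded: C3 A9
```

The guarded version emits a lone E9 byte, which is not valid UTF-8, even though the caller believed the output was encoded.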
As utf8::is_utf8() is not needed very often, backward compatibility can be achieved by:
*NEW_NAME = *utf8::is_utf8;
I think this is a good compromise. If you think that the upgrade and downgrade function names are fine, OK, but at least please add a better name for is_utf8(). In the original email I suggested is_upgraded(), so that the name would be tied to the "upgrade()" function, because it really checks whether upgrade() was called or not.
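The glob assignment above could be fleshed out roughly as follows for code that must run on both old and new perls. This is only a sketch: `utf8::is_upgraded` is the name proposed in this thread and does not exist in any released perl that I am sure of, so the code falls back to `utf8::is_utf8`.

```perl
use strict;
use warnings;

BEGIN {
    # utf8::is_upgraded is hypothetical (the name proposed in this thread).
    # Use it if this perl provides it, otherwise alias the existing function.
    *is_upgraded = defined &utf8::is_upgraded
        ? \&utf8::is_upgraded
        : \&utf8::is_utf8;
}

my $s = "abc";
my $before = is_upgraded($s) ? 1 : 0;    # 0: plain literal, flag off
utf8::upgrade($s);
my $after  = is_upgraded($s) ? 1 : 0;    # 1: flag on after upgrade

print "before: $before, after: $after\n";
```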
On Wed\, 12 Jul 2017 21:55:03 -0700\, public@khwilliamson.com wrote:
I guess we have a fundamental disagreement about language design and the direction Perl should go\, which makes me sad.
I agree the disagreement is unfortunate.
The point of adding synonyms for deceptively-named functions and macros is to make life easier overall. Forbidding new better-named synonyms for problematically named things forces everyone who comes along to deal with the gotchas and cognitive load that those people already here have had to deal with. By creating better named things\, those people can largely avoid these problems. This allows them to work more efficiently\, avoiding traps\, and with less cursing Perl.
When you first put forward this argument (specifically with regard to av_len)\, it made sense to me\, and I had no objection to it. Later\, people wrote to p5p complaining that the new situation was more confusing; in addition\, *I* started to get confused. That was when I started to have second thoughts.
I think Damian Conway was right when he wrote in PBP that one should not use English (the module). Since other people use punctuation variables\, you are going to have to learn them anyway. Using the English names just forces others reading your code to look up the names that you are using. It just creates more cognitive burden.
I think the same applies even to poorly named functions. You just have to learn the gotcha once\, and then you can use the function and read code that uses it. (And if you use functions without reading either the documentation or the source\, then you are coming close to what I would call autopodotoxy.)
Unless Perl is close to death\, the number of people who are going to come along before it does die dwarfs the number who are already expert. Some people are knowledgeable in parts of Perl\, but not all. They also gain if gotchas get removed before they have to deal with them.
But the gotchas never get removed. You just end up with a larger pile of functions for people to sift through. Not to mention a lot of existing (and correct) code that they cannot read without learning the discouraged parlance. So they have to learn the different forms anyway.
My personal experience is that what you are arguing for\, while it sounds good\, does not work in practice.
--
Father Chrysostomos
On 14 July 2017 at 04:28, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:
My personal experience is that what you are arguing for\, while it sounds good\, does not work in practice.
I think the reason that it sounds good is because it does make sense at a micro level. If you are working on company code for instance\, or a small code-base\, renaming poorly named things means that the old name is *gone*\, and cognitive burden is reduced.
But with something like Perl we can't just get rid of things; if we want to rename, we have to do something for all the older code out there. So we have to support both in some way, which means the cognitive burden is increased.
Despite this I think sometimes these things *can* be justified and managed, but we have to be extremely careful about the choices we make, and have real plans in place to deprecate the older use cases in some kind of way. So for instance if we were going to get rid of Internals then we can rename the things it contained, and then bundle an Internals.pm which does the right thing; people needing back compat can add 'use Internals' and get it. So I could see us considering the ideas in this thread in the context of the proposed introduction of 'array', 'scalar', etc.
yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
[Top-posted]
I have mixed thoughts about this.
I'm sympathetic to both considerations: Having properly-named functions to reduce confusion for future developers (we hope to have some\, right?) but not introduce additional cognitive load for existing developers.
A few ways to make such a situation easier:
* Document utf8::is_utf8() to prevent this confusion: This is by far the first thing that should be done. I have double checked the wording for utf8::is_utf8() from my blead (978b185):
(Since Perl 5.8.1) Test whether $string is marked internally as encoded in UTF-8. Functionally the same as "Encode::is_utf8()".
This is confusing, to say the least. "Marked internally" are the words core hackers are looking for and recognize, but "UTF-8" is what non-core hackers (those without the cognitive bias in core terms) see and understand. If we head over to Encode::is_utf8() we see:
[INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/. If /CHECK/ is true\, also checks whether /STRING/ contains well-formed UTF-8. Returns true if successful\, false otherwise.
As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the utf8::is_utf8 function.
I like this wording better for several reasons: It is under the title "Messing with Perl's Internals"; it notes the "UTF8" flag\, and it adds that it checks for well-formed UTF-8 only if that flag is true. There are improvements to be made here too. We can note what the flag means (subtle\, complicated\, bike-shed-able) or at the very least add a nice "this isn't the flag you're looking for" warning. We can also suggest when to use and when not to use the function (otherwise it's left to the reader\, who can easily get it wrong\, which is why we're here).
If the documentation on both was better, then we could have possibly left this as unfortunate naming errors we're carrying with us (along with "wantarray" for noting whether the context is scalar, list, or void).
* Provide different functions and document all functions in all other functions
If we decide to have better named functions\, we will have additional cognitive load for both experienced core developers and new developers. For core developers\, it is a muscle memory to undo and two different sets of code to deal with - those with the old name and those with the new name. For new developers\, it will be simple at first\, until you come in contact with the old name. It is likely this will also happen early\, so you need to learn two names anyway. However\, their muscle memory will be geared towards using a more descriptive name.
I mix those when it comes to English.pm. I use $_, $@, $!, $#, $/, $^X, $0 and a few more, but I use English.pm for $<, $>, $(, $), $", and a few more. The reasoning is simple: $_, $@, $!, and $# are so common it will be built into every muscle memory. On the other hand, for many developers, if they see $<, they will need to look it up in perlvar anyway. However, $UID or $REAL_USER_ID is readable right away and no need to look it up.
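As a quick illustration of that mix, a minimal sketch using the core English module. The `-no_match_vars` import is my addition (a common way to load it), not something from the message above.

```perl
use strict;
use warnings;
use English qw(-no_match_vars);

# $UID and $REAL_USER_ID are readable aliases for $<
my $uid_matches  = ($UID == $< && $REAL_USER_ID == $<) ? 1 : 0;

# $PROGRAM_NAME is the readable alias for $0
my $name_matches = ($PROGRAM_NAME eq $0) ? 1 : 0;

print "uid alias ok: $uid_matches, program name alias ok: $name_matches\n";
```

Both names refer to the same variable, so a reader still has to recognise either spelling when reading other people's code.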
One additional point about English is that, unlike what we're suggesting here, the punctuation variable names are the right names, they're just not descriptive. is_utf8() is not merely undescriptive, it is misleading. It is a misnomer, which makes it an undesirable pitfall.
I see value in adding proper names\, but then we would need to take care of at least making all possible names available in the documentation of all other names. If you're reading utf8.pm\, you need to find "is_upgraded" in "is_utf8" and "is_utf8" in "is_upgraded"[1]. This makes it easy to quickly find what they mean and differentiate when we see different names.
* Move all known usages in core to new functions
Another way to improve this new cognitive load is by reducing it in the codebase. Removing as many instances of the old name as possible will reduce the mixture of names, thus helping us move towards the new name. This is a much more intrusive change but has a high potential of helping seasoned developers to deal with the new name.
* Automated policies for improving CPAN code quality
This is beyond the scope of core, but I think it's worthwhile taking into account the perspective of the community. Realizing that "is_utf8" is misused raises the question of whether and how we could reduce this problem's scope outside the core. This could be done with a kwalitee check (CPANTS[2]) that looks for "is_utf8" and recommends reviewing its use. This is far more complicated since there is a legitimate (but narrow) use for it, and you might get false positives. I believe only a human could find the situations in which it's valuable.
Still\, it is worthwhile keeping in mind.
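A very crude sketch of what such a check might look like is below. It is not a CPANTS metric or a Perl::Critic policy, just a regex scan. `find_is_utf8_calls` is a hypothetical helper, and it will produce false positives and negatives for exactly the reasons mentioned above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the line numbers in $source that call
# is_utf8(), whether written as utf8::is_utf8, Encode::is_utf8 or bare.
sub find_is_utf8_calls {
    my ($source) = @_;
    my @hits;
    my $line_no = 0;
    for my $line (split /\n/, $source) {
        $line_no++;
        next if $line =~ /^\s*#/;    # skip comment-only lines
        push @hits, $line_no if $line =~ /\bis_utf8\s*\(/;
    }
    return @hits;
}

my $sample = <<'END';
my $value = func();
if (utf8::is_utf8($value)) { utf8::encode($value); }
# is_utf8( in a comment is ignored
print length $value;
END

my @found = find_is_utf8_calls($sample);
print "review lines: @found\n";    # review lines: 2
```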
Overall\, I'm still undecided. Maybe we could start with improving the existing documentation?
[1] Using "is_upgraded" as an example different name. [2] http://cpants.cpanauthors.org/
On Mon\, 17 Jul 2017 01:47:32 -0700\, xsawyerx@gmail.com wrote:
I have mixed thoughts about this.
Me too.
If we decide to have better named functions [...] For new developers\, it will be simple at first\, until you come in contact with the old name. It is likely this will also happen early\, so you need to learn two names anyway. [...]
I just want to reach a bit deeper into "likely this will also happen early" - in my limited experience\, most people working in a perl shop tend to read lots of code in their local codebase\, but rarely read code outside of it (not even for the CPAN modules they're using). So wherever the local codebase gets cleaned up\, it may not happen that early for a good proportion of new developers.
Maybe you had in mind primarily historical threads googled up from perlmonks\, stackoverflow and the like; I agree those are less likely to get cleaned up. I cannot guess what proportion of newish developers would hit those; I can't even guess what proportion of me would hit those had I been born 30 years later.
Hugo
On Mon, Jul 17, 2017 at 10:46:59AM +0200, Sawyer X wrote:
[Top-posted]
I have mixed thoughts about this.
I'm sympathetic to both considerations: Having properly-named functions to reduce confusion for future developers (we hope to have some, right?) but not introduce additional cognitive load for existing developers.
A few ways to make such a situation easier:
* Document utf8::is_utf8() to prevent this confusion: This is by far the first thing that should be done. I have double checked the wording for utf8::is_utf8() from my blead (978b185):
(Since Perl 5.8.1) Test whether $string is marked internally as encoded in UTF-8. Functionally the same as "Encode::is_utf8()".
This is confusing, to say the least. "Marked internally" are the words core hackers are looking for and recognize, but "UTF-8" is what non-core hackers (those without the cognitive bias in core terms) see and understand. If we head over to Encode::is_utf8() we see:
[INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/. If /CHECK/ is true, also checks whether /STRING/ contains well-formed UTF-8. Returns true if successful, false otherwise. As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the |utf8::is_utf8| function.
I like this wording better for several reasons: It is under the title "Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds that it checks for well-formed UTF-8 only if that flag is true. There are improvements to be made here too. We can note what the flag means (subtle, complicated, bike-shed-able) or at the very least add a nice "this isn't the flag you're looking for" warning. We can also suggest when to use and when not to use the function (otherwise it's left to the reader, who can easily get it wrong, which is why we're here).
utf8::is_utf8() doesn't accept the second parameter and does no validity checks (we have utf8::valid() for that), despite the note in utf8.pm.
If the documentation for both were better, then we could have possibly left this as unfortunate naming errors we're carrying with us (along with "wantarray" for noting whether the context is scalar, list, or void). ... Overall, I'm still undecided. Maybe we could start with improving the existing documentation?
Perhaps something like:
=item * C<$flag = utf8::is_utf8($string)>
(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in UTF-8. Functionally the same as C<Encode::is_utf8($string)>. Typically only necessary for debugging.
If you need to force Unicode semantics for code that needs to be compatible with perls older than 5.12, call C<utf8::upgrade($string)> unconditionally.
Using this flag to decide whether a string should be treated as already encoded bytes or characters is wrong; this should be decided as part of the interface of your function.
If you're accepting bytes:
  utf8::downgrade($string); # throws an exception if code point over 0xFF
  utf8::downgrade($string, 1) # our own error handling
      or die "\$string must be representable as bytes";
or if you're accepting characters and need encoded bytes:
  utf8::encode($string); # unconditionally
The only exception is if you're dealing with filenames, since perl uses the internal representation of the string for system calls.
<<
Are there any other cases someone might be tempted to call utf8::is_utf8()?
Tony
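To make the proposed wording concrete, here is a minimal sketch, assuming an ASCII platform and perl 5.8.1 or later. The helper names takes_bytes() and chars_to_utf8() are hypothetical, invented only for illustration; they are not part of any patch in this ticket. It shows that the flag differs between two strings that are equal as far as Perl code can tell, and that the byte/character decision sits in each function's interface instead.

```perl
use strict;
use warnings;

# "\xE9" is the single character U+00E9, however perl stores it.
my $native = "\xE9";
my $wide   = "\xE9";
utf8::upgrade($wide);    # changes the internal storage only

# The two strings are indistinguishable to ordinary Perl code ...
print "equal\n"       if $native eq $wide;
print "same length\n" if length($native) == length($wide);

# ... yet the internal flag differs, so it says nothing about the content.
printf "flag: native=%d wide=%d\n",
    (utf8::is_utf8($native) ? 1 : 0),
    (utf8::is_utf8($wide)   ? 1 : 0);    # flag: native=0 wide=1

# Hypothetical function whose interface says "I accept bytes".
sub takes_bytes {
    my ($bytes) = @_;
    utf8::downgrade($bytes, 1)
        or die "argument must be representable as bytes\n";
    return length $bytes;
}

# Hypothetical function whose interface says
# "I accept characters and return UTF-8 encoded bytes".
sub chars_to_utf8 {
    my ($chars) = @_;
    utf8::encode($chars);    # unconditionally, no flag check
    return $chars;
}
```

Both helpers give the same answer for $native and $wide, which is the property a check on utf8::is_utf8() cannot provide.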
On Tue, 18 Jul 2017 10:53:53 +1000, Tony Cook <tony@develop-help.com> wrote:
[...]
I like this. What I miss here is a small example of how to guarantee preventing double encoding/decoding, as I think that is what this function is most often (erroneously) used for.
-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.27 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
On Tue, Jul 18, 2017 at 3:04 AM, H.Merijn Brand <h.m.brand@xs4all.nl> wrote:
I like this. What I miss here is a small example of how to guarantee preventing double encoding/decoding, as I think that is what this function is most often (erroneously) used for.
This isn't something that you can guarantee. It always depends on knowing how you get your input. When people don't understand this they look for the magic bullet that is_utf8 appears to be, but it is not.
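A rough sketch of why the flag is not that magic bullet, assuming an ASCII platform and the core Encode module. The names buggy_encode() and hex_of() are made up for this illustration, and the input bytes are a stand-in for data read from a file or socket without a decoding layer. The conditional pattern from the original report fails in both directions, while decoding once on input and encoding once on output does not need the flag at all.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Show a string as space-separated hex ordinals.
sub hex_of { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# The pattern from the ticket: encode only when the flag is set.
sub buggy_encode {
    my ($value) = @_;
    utf8::encode($value) if utf8::is_utf8($value);
    return $value;
}

# Stand-in input: the UTF-8 bytes for U+00E9, not yet decoded.
my $bytes = "\xC3\xA9";

# Something upstream upgraded the storage; the content is still two bytes.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# Failure 1: already-encoded bytes with the flag on get encoded again.
print hex_of(buggy_encode($upgraded)), "\n";    # C3 83 C2 A9

# Failure 2: a real character with the flag off is never encoded.
print hex_of(buggy_encode("\xE9")), "\n";       # E9

# Knowing where the input comes from: decode once, encode once.
my $chars = decode('UTF-8', $bytes);    # one character, U+00E9
my $out   = encode('UTF-8', $chars);    # back to the original two bytes
print hex_of($out), "\n";               # C3 A9
```

Nothing in the sketch can tell from $upgraded alone whether it holds two Latin-1 characters or one undecoded UTF-8 sequence; only the knowledge of where the data came from settles that.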
On Tue, 18 Jul 2017 03:13:40 -0400, Dan Book <grinnz@gmail.com> wrote:
This isn't something that you can guarantee. It always depends on knowing how you get your input. When people don't understand this they look for the magic bullet that is_utf8 appears to be, but it is not.
My point exactly. Just have a piece of text that tells the user why it isn't and what the best alternative *could* be.
-- H.Merijn Brand
On 07/17/2017 10:09 PM, Hugo van der Sanden via RT wrote:
On Mon, 17 Jul 2017 01:47:32 -0700, xsawyerx@gmail.com wrote:
I have mixed thoughts about this.
Me too.
If we decide to have better named functions [...] For new developers, it will be simple at first, until you come in contact with the old name. It is likely this will also happen early, so you need to learn two names anyway. [...]
I just want to reach a bit deeper into "likely this will also happen early" - in my limited experience, most people working in a perl shop tend to read lots of code in their local codebase, but rarely read code outside of it (not even for the CPAN modules they're using). So wherever the local codebase gets cleaned up, it may not happen that early for a good proportion of new developers.
Maybe you had in mind primarily historical threads googled up from perlmonks, stackoverflow and the like; I agree those are less likely to get cleaned up. I cannot guess what proportion of newish developers would hit those; I can't even guess what proportion of me would hit those had I been born 30 years later.
I meant people who will start hacking on Perl core.
On Tue, Jul 18, 2017 at 10:53:53AM +1000, Tony Cook wrote:
[...]
Thinking about it further, I'm pretty sure this doesn't all belong here.
L<perlunifaq/What is "the UTF8 flag"?> provides a good description of the flag is_utf8() returns, and the whole of perlunifaq covers some of the things the above tries to cover.
perlunicook largely works at a higher level than the functions in utf8::* work at.
One thing from the above that doesn't seem to be discussed well[1] is what I tried to cover briefly in:
Using this flag to decide whether a string should be treated as already encoded bytes or characters is wrong; this should be decided as part of the interface of your function.
which could perhaps use some expansion in perlunicode.
I'm not sure where the cheat sheet following belongs, though perlunifaq covers some of it (though using Encode instead of utf8::*).
Tony
[1] perlunifaq briefly mentions some of the issues under "What about binary data, like images?" and goes into more detail in "What if I don't decode?"
On 07/19/2017 08:58 AM, Tony Cook wrote:
[...]
+1 on the suggested text.
I think this addition is useful, even if it is also covered in more documents. We could also link to those documents for further learning.
On Tue, 18 Jul 2017 23:58:39 -0700, tonyc wrote:
which could perhaps use some expansion in perlunicode.
perlunitut covers this reasonably well.
I'm not sure where the cheat sheet following belongs, though perlunifaq covers some of it (though using Encode instead of utf8::*).
Attached is a series of patches (as a single file); the first three fix some minor problems with the Unicode documentation I found when going through it.
The fourth re-works the documentation in utf8.pm, taking bits from my little cheat sheet and hopefully putting them in the right places.
Tony
Migrated from rt.perl.org#131685 (status was 'open')
Searchable as RT131685$