Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

PATCH: Use Unicode 6.0 #10720

Closed p5pRT closed 13 years ago

p5pRT commented 13 years ago

Migrated from rt.perl.org#78354 (status was 'resolved')

Searchable as RT78354$

p5pRT commented 13 years ago

From @khwilliamson

This series of commits delivers the Unicode 6.0 db, and upgrades Perl to use it. There may still be some work to do in Unicode::UCD to support the new characters (which I'll investigate), but the rest of the Perl core should fully support it.

The few code changes are attached to this email, but the bulk of the changes (along with the attachments here), too large to email, are located at git://github.com/khwilliamson/perl.git branch mktables

Those changes are essentially entirely official Unicode data, except for the MANIFEST, perldelta, version, and a couple data changes in UCD.t

p5pRT commented 13 years ago

From @khwilliamson

0001-Fix-typos-in-comments.patch

```diff
From 01b05d0fecfac0088df9b2739fa6f6a994c2588a Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Tue, 12 Oct 2010 16:24:59 -0600
Subject: [PATCH] Fix typos in comments

---
 lib/unicore/mktables | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index cd83210..c82d2e4 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -1172,7 +1172,7 @@ my %map_table_formats = (
     $HEX_FORMAT => 'positive hex whole number; a code point',
     $RATIONAL_FORMAT => 'rational: an integer or a fraction',
     $STRING_FORMAT => 'string',
-    $DECOMP_STRING_FORMAT => 'Perl\'s internal (Normalize.pm) decompostion mapping',
+    $DECOMP_STRING_FORMAT => 'Perl\'s internal (Normalize.pm) decomposition mapping',
 );
 
 # Unicode didn't put such derived files in a separate directory at first.
@@ -8970,7 +8970,7 @@ sub output_perl_charnames_line ($$) {
     # 0374 ; NFD_QC; N
     # 003C..003E ; Math
     #
-    # the fields are: "codepoint range ; property; map"
+    # the fields are: "codepoint-range ; property; map"
     #
     # meaning the codepoints in the range all have the value 'map' under
     # 'property'.
-- 
1.5.6.3
```

p5pRT commented 13 years ago

From @khwilliamson

0002-mktables-Upgrade-to-handle-new-Unicode-6.0-tables.patch

```diff
From 8055a791c3d0cfc1ee090344082cb5ef84de8850 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Tue, 12 Oct 2010 17:58:13 -0600
Subject: [PATCH] mktables: Upgrade to handle new Unicode 6.0 tables

---
 lib/unicore/mktables | 60 +++++++++++++++++++++++++++----------------------
 1 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index c82d2e4..fc0d8c9 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -50,7 +50,7 @@ sub DEBUG () { 0 } # Set to 0 for production; 1 for development
 # the small actual loop to process the input files and finish up; then
 # a __DATA__ section, for the .t tests
 #
-# This program works on all releases of Unicode through at least 5.2. The
+# This program works on all releases of Unicode through at least 6.0. The
 # outputs have been scrutinized most intently for release 5.1. The others
 # have been checked for somewhat more than just sanity. It can handle all
 # existing Unicode character properties in those releases.
@@ -183,9 +183,9 @@ my $unicode_reference_url = 'http://www.unicode.org/reports/tr44/';
 # More information on Unicode version glitches is further down in these
 # introductory comments.
 #
-# This program works on all properties as of 5.2, though the files for some
-# are suppressed from apparent lack of demand for them. You can change which
-# are output by changing lists in this program.
+# This program works on all non-provisional properties as of 6.0, though the
+# files for some are suppressed from apparent lack of demand for them. You
+# can change which are output by changing lists in this program.
 #
 # The old version of mktables emphasized the term "Fuzzy" to mean Unocde's
 # loose matchings rules (from Unicode TR18):
@@ -418,7 +418,7 @@ my $unicode_reference_url = 'http://www.unicode.org/reports/tr44/';
 # Unicode_Radical_Stroke was listed in those files, so if the Unihan database
 # is present in the directory, a table will be generated for that property.
 # In 5.2, several more properties were added. For your convenience, the two
-# arrays are initialized with all the 5.2 listed properties that are also in
+# arrays are initialized with all the 6.0 listed properties that are also in
 # earlier releases. But these are commented out. You can just uncomment the
 # ones you want, or use them as a template for adding entries for other
 # properties.
@@ -805,7 +805,7 @@ if ($v_version gt v3.2.0) {
     'Canonical_Combining_Class=Attached_Below_Left'
 }
 
-# These are listed in the Property aliases file in 5.2, but Unihan is ignored
+# These are listed in the Property aliases file in 6.0, but Unihan is ignored
 # unless explicitly added.
 if ($v_version ge v5.2.0) {
     my $unihan = 'Unihan; remove from list if using Unihan';
@@ -848,10 +848,10 @@ my %why_obsolete; # Documentation only
 
     my $other_properties = 'other properties';
     my $contributory = "Used by Unicode internally for generating $other_properties and not intended to be used stand-alone";
-    my $why_no_expand = "Easily computed, and yet doesn't cover the common encoding forms (UTF-16/8)",
+    my $why_no_expand = "Deprecated by Unicode: less useful than UTF-specific calculations",
 
     %why_deprecated = (
-        'Grapheme_Link' => 'Deprecated by Unicode. Use ccc=vr (Canonical_Combining_Class=Virama) instead',
+        'Grapheme_Link' => 'Deprecated by Unicode: Duplicates ccc=vr (Canonical_Combining_Class=Virama)',
         'Jamo_Short_Name' => $contributory,
         'Line_Break=Surrogate' => 'Deprecated by Unicode because surrogates should never appear in well-formed text, and therefore shouldn\'t be the basis for line breaking',
         'Other_Alphabetic' => $contributory,
@@ -865,7 +865,7 @@ my %why_obsolete; # Documentation only
     );
 
     %why_suppressed = (
-        # There is a lib/unicore/Decomposition.pl (used by normalize.pm) which
+        # There is a lib/unicore/Decomposition.pl (used by Normalize.pm) which
         # contains the same information, but without the algorithmically
         # determinable Hangul syllables'. This file is not published, so it's
         # existence is not noted in the comment.
@@ -882,10 +882,7 @@ my %why_obsolete; # Documentation only
         'Name' => "Accessible via 'use charnames;'",
         'Name_Alias' => "Accessible via 'use charnames;'",
 
-        # These are sort of jumping the gun; deprecation is proposed for
-        # Unicode version 6.0, but they have never been exposed by Perl, and
-        # likely are soon to be deprecated, so best not to expose them.
-        FC_NFKC_Closure => 'Use NFKC_Casefold instead',
+        FC_NFKC_Closure => 'Supplanted in usage by NFKC_Casefold; otherwise not useful',
         Expands_On_NFC => $why_no_expand,
         Expands_On_NFD => $why_no_expand,
         Expands_On_NFKC => $why_no_expand,
@@ -907,9 +904,15 @@ my %why_obsolete; # Documentation only
 
 if ($v_version ge 4.0.0) {
     $why_stabilized{'Hyphen'} = 'Use the Line_Break property instead; see www.unicode.org/reports/tr14';
+    if ($v_version ge 6.0.0) {
+        $why_deprecated{'Hyphen'} = 'Supplanted by Line_Break property values; see www.unicode.org/reports/tr14';
+    }
 }
-if ($v_version ge 5.2.0) {
+if ($v_version ge 5.2.0 && $v_version lt 6.0.0) {
     $why_obsolete{'ISO_Comment'} = 'Code points for it have been removed';
+    if ($v_version ge 6.0.0) {
+        $why_deprecated{'ISO_Comment'} = 'No longer needed for chart generation; otherwise not useful, and code points for it have been removed';
+    }
 }
 
 # Probably obsolete forever
@@ -928,7 +931,7 @@ END
 
 # If you are using the Unihan database, you need to add the properties that
 # you want to extract from it to this table. For your convenience, the
-# properties in the 5.2 PropertyAliases.txt file are listed, commented out
+# properties in the 6.0 PropertyAliases.txt file are listed, commented out
 my @cjk_properties = split "\n", <<'END';
 #cjkAccountingNumeric; kAccountingNumeric
 #cjkOtherNumeric; kOtherNumeric
@@ -947,7 +950,7 @@ my @cjk_properties = split "\n", <<'END';
 END
 
 # Similarly for the property values. For your convenience, the lines in the
-# 5.2 PropertyAliases.txt file are listed. Just remove the first BUT NOT both
+# 6.0 PropertyAliases.txt file are listed. Just remove the first BUT NOT both
 # '#' marks
 my @cjk_property_values = split "\n", <<'END';
 ## @missing: 0000..10FFFF; cjkAccountingNumeric; NaN
@@ -1030,6 +1033,10 @@ my %ignored_files = (
     'ReadMe.txt' => 'Just comments',
     'README.TXT' => 'Just comments',
     'StandardizedVariants.txt' => 'Only for glyph changes, not a Unicode character property. Does not fit into current scheme where one code point is mapped',
+    'EmojiSources.txt' => 'Not of general utility: for Japanese legacy cell-phone applications',
+    'IndicMatraCategory.txt' => 'Provisional',
+    'IndicSyllabicCategory.txt' => 'Provisional',
+    'ScriptExtensions.txt' => 'Provisional',
 );
 
 ### End of externally interesting definitions, except for @input_file_objects
@@ -8218,7 +8225,7 @@ sub finish_property_setup {
         }
     }
 
-    # This entry is still missing as of 5.2, perhaps because no short name for
+    # This entry is still missing as of 6.0, perhaps because no short name for
     # it.
     if (-e 'NameAliases.txt') {
         my $aliases = property_ref('Name_Alias');
@@ -10297,7 +10304,7 @@ sub filter_special_casing_line {
     # implemented, it would be by hard-coding in the casing functions in the
     # Perl core, not through tables. But if there is a new condition we don't
     # know about, output a warning. We know about all the conditions through
-    # 5.2
+    # 6.0
     if ($fields[4] ne "") {
         my @conditions = split ' ', $fields[4];
         if ($conditions[0] ne 'tr' # We know that these languages have
@@ -12889,22 +12896,21 @@ several varieties of obsolesence:
 =item Obsolete
 
 Properties marked with $a_bold_obsolete in the table are considered
-obsolete. At the time of this writing (Unicode version 5.2) there is no
-information in the Unicode standard about the implications of a property being
 obsolete.
 
 =item Stabilized
 
-Obsolete properties may be stabilized. This means that they are not actively
-maintained by Unicode, and will not be extended as new characters are added to
-the standard. Such properties are marked with $a_bold_stabilized in the
-table. At the time of this writing (Unicode version 5.2) there is no further
-information in the Unicode standard about the implications of a property being
-stabilized.
+Obsolete properties may be stabilized. Such a determination does not indicate
+that the property should or should not be used; instead it is a declaration
+that the property will not be maintained nor extended for newly encoded
+characters. Such properties are marked with $a_bold_stabilized in the
+table.
 
 =item Deprecated
 
-Obsolete properties may be deprecated. This means that their use is strongly
+An obsolete property may be deprecated, perhaps because its original intent
+has been replaced by another property or because its specification was somehow
+defective. This means that its use is strongly
 discouraged, so much so that a warning will be issued if used, unless the
 regular expression is in the scope of a C>
 statement. $A_bold_deprecated flags each such entry in the table, and
-- 
1.5.6.3
```

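The version gates in this patch compare `$v_version` against v-string literals using the string operators `ge` and `lt`. For readers unfamiliar with that idiom, here is a minimal sketch of how such comparisons behave. `to_vstring` and `iso_comment_flags` are hypothetical helpers written for this illustration; they are not mktables code, and the gate is copied only in shape from the ISO_Comment hunk.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "6.0.0" into the same string that the literal
# v6.0.0 denotes (one character per dotted component).
sub to_vstring {
    my ($text) = @_;
    return join '', map { chr($_) } split /\./, $text;
}

# Mirrors the shape of the ISO_Comment gate in the hunk, for illustration only.
sub iso_comment_flags {
    my ($v_version) = @_;
    my %flags;
    if ($v_version ge v5.2.0 && $v_version lt v6.0.0) {
        $flags{obsolete} = 1;
        if ($v_version ge v6.0.0) {
            $flags{deprecated} = 1;
        }
    }
    return \%flags;
}

for my $release (qw(5.1.0 5.2.0 6.0.0)) {
    my $flags = iso_comment_flags(to_vstring($release));
    printf "%s: obsolete=%d deprecated=%d\n", $release,
        $flags->{obsolete} ? 1 : 0, $flags->{deprecated} ? 1 : 0;
}
```

With the conditions as they read here, 6.0.0 sets neither flag, because the inner test sits inside a branch that already requires a version below 6.0.0. Whether that is intended is not something this thread settles.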
p5pRT commented 13 years ago

From @tux

On Tue, 12 Oct 2010 21:56:17 -0700, karl williamson (via RT) <perlbug-followup@perl.org> wrote:

> This series of commits delivers the Unicode 6.0 db, and upgrades Perl to use it. There may still be some work to do in Unicode::UCD to support the new characters (which I'll investigate), but the rest of the Perl core should fully support it.

Wow.

Don't forget to also update Module::CoreList

    $ corelist -a Unicode

    Unicode was first released with perl v5.6.2
      v5.6.2     3.0.1
      v5.8.0     3.2.0
      v5.8.1     4.0.0
      v5.8.2     4.0.0
      v5.8.3     4.0.0
      v5.8.4     4.0.1
      v5.8.5     4.0.1
      v5.8.6     4.0.1
      v5.8.7     4.1.0
      v5.8.8     4.1.0
      v5.8.9     5.1.0
      v5.9.0     4.0.0
      v5.9.1     4.0.0
      v5.9.2     4.0.1
      v5.9.3     4.1.0
      v5.9.4     4.1.0
      v5.9.5     5.0.0
      v5.10.0    5.0.0
      v5.10.1    5.1.0
      v5.11.0    5.1.0
      v5.11.1    5.1.0
      v5.11.2    5.1.0
      v5.11.3    5.2.0
      v5.11.4    5.2.0
      v5.11.5    5.2.0
      v5.12.0    5.2.0
      v5.12.1    5.2.0
      v5.12.2    5.2.0
      v5.13.0    5.2.0
      v5.13.1    5.2.0
      v5.13.2    5.2.0
      v5.13.3    5.2.0
      v5.13.4    5.2.0

-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using 5.00307 through 5.12 and porting perl5.13.x on HP-UX 10.20, 11.00, 11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.3 and AIX 5.2 and 5.3. http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented 13 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 13 years ago

From @khwilliamson

I added a little more detail in perldelta, and attached is some extra wording on steps to take to deliver a new Unicode DB, including changing pod.lst to the new number. Everything is pushed to git://github.com/khwilliamson/perl.git branch mktables

H.Merijn Brand wrote:

> Don't forget to also update Module::CoreList

That looks to me to be part of the monthly release manager's job. The attached patch mentions the need to update it. Thanks for pointing it out.

Also, I misspoke when I said,

> but the rest of the Perl core should fully support it.

We do not fully support the Unicode standard; what I meant was 6.0 is supported as well as 5.2.

p5pRT commented 13 years ago

From @khwilliamson

0004-More-updates-to-point-to-Unicode-6.0.patch

```diff
From a8a76aef504db9744e577d6f46e3dcb715fe0d04 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Wed, 13 Oct 2010 09:27:38 -0600
Subject: [PATCH] More updates to point to Unicode 6.0

---
 lib/unicore/README.perl | 5 +++++
 pod.lst | 2 +-
 2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/lib/unicore/README.perl b/lib/unicore/README.perl
index bbfcc3d..6656daf 100644
--- a/lib/unicore/README.perl
+++ b/lib/unicore/README.perl
@@ -105,6 +105,11 @@ current one is
 mktables has many checks to warn you if there are unexpected or novel things
 that it doesn't know how to handle.
 
+pod.lst should be changed so that it gives the new name (which includes the
+Unicode release number) for perluniprops.pod
+
+Module::CoreList should be changed to include the new release
+
 Finally:
 
     p4 submit
diff --git a/pod.lst b/pod.lst
index 3bcbd55..cc8bbdb 100644
--- a/pod.lst
+++ b/pod.lst
@@ -83,7 +83,7 @@ h Reference Manual
   perluniintro Perl Unicode introduction
   perlunicode Perl Unicode support
   perlunifaq Perl Unicode FAQ
-g perluniprops Index of Unicode Version 5.2.0 properties in Perl
+g perluniprops Index of Unicode Version 6.0.0 properties in Perl
   perlunitut Perl Unicode tutorial
   perlebcdic Considerations for running Perl on EBCDIC platforms
 
-- 
1.5.6.3
```

p5pRT commented 13 years ago

From @tux

On Wed, 13 Oct 2010 09:48:48 -0600, karl williamson <public@khwilliamson.com> wrote:

> H.Merijn Brand wrote:
>
> > Don't forget to also update Module::CoreList
>
> That looks to me to be part of the monthly release manager's job.

Yes, but as Unicode is a property of the implementation, and not a separate module, it is most likely not auto-detected.

> The attached patch mentions the need to update it. Thanks for pointing it out.
>
> Also, I misspoke when I said,
>
> > but the rest of the Perl core should fully support it.
>
> We do not fully support the Unicode standard; what I meant was 6.0 is supported as well as 5.2.


p5pRT commented 13 years ago

From @cpansprout

I've applied the first patch as 92f9d56c66. With the Unicode 6 database I get a test failure:

    $ curl http://github.com/khwilliamson/perl/commit/35e84e1c3151243.patch | git am
    [...]
    $ cd t
    $ ./perl harness -v ../lib/charnames.t
    [...]
    not ok 17078 - Verify string_vianame("BELL") is chr(0x1F514)
    # Failed at ../lib/charnames.t line 105
    #      got "\a"
    # expected "\x{1f514}"

p5pRT commented 13 years ago

From @khwilliamson

I'm afraid this is what I consider to be a flaw in the new standard, though they wouldn't. I regret that I did not find it before it was too late; your tests are the first place it surfaced. I'm not sure Unicode would have listened to me anyway, but we would have known about this earlier.

Your tests showed the problem and mine didn't because my tests sample at random (it would take too long to go through all million possible code points each time), and they just hadn't tried that combination yet.

I'm not sure what to do; suggestions welcome.

The problem stems from the fact that the Standard does not give names to the control characters, such as ACK and BEL. It did in version 1.0, and it still publishes those names as the "Unicode_1_Name" property. That name for character 0x07, known by the acronym BEL, is "BELL". What Perl does is to use the Unicode 1 name when there is no current one. All was fine until 6.0 came along and re-used BELL for a different character.
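The fallback described above can be sketched with a toy table. This is an illustration only: `%names`, `%unicode_1_names`, and `build_lookup` are hypothetical, and the real tables are generated by mktables from the Unicode data files rather than written by hand.

```perl
use strict;
use warnings;

# Toy excerpts, keyed by code point. Control characters have no current name.
my %names = (
    0x0041  => 'LATIN CAPITAL LETTER A',
    0x1F514 => 'BELL',                    # new in Unicode 6.0
);
my %unicode_1_names = (
    0x0006 => 'ACKNOWLEDGE',
    0x0007 => 'BELL',
);

# Build name => code point, falling back to the Unicode 1 name when a code
# point has no current name. Returns the lookup plus a list of name clashes.
sub build_lookup {
    my ($current, $legacy) = @_;
    my (%by_name, @clashes);
    my %effective = (%$legacy, %$current);   # current name wins per code point
    for my $cp (sort { $a <=> $b } keys %effective) {
        my $name = $effective{$cp};
        if (exists $by_name{$name}) {
            push @clashes,
                sprintf '%s: U+%04X vs U+%04X', $name, $by_name{$name}, $cp;
            next;                            # first claimant keeps the name
        }
        $by_name{$name} = $cp;
    }
    return (\%by_name, \@clashes);
}

my ($by_name, $clashes) = build_lookup(\%names, \%unicode_1_names);
print "$_\n" for @$clashes;
```

In this sketch the older claimant keeps the name, which corresponds to the reported failure where `string_vianame("BELL")` returned `"\a"` rather than U+1F514.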

But as far as Unicode is concerned, there isn't a problem, as BEL has no official name. It is Perl that has persisted in using this old name. I don't know why Unicode removed the names, and it seems eminently reasonable to give them names, but here we are.

The only option I can think of that doesn't violate our stability policies is to, in 5.14, keep the old BELL meaning but deprecate it, saying to use BEL instead, which was added in 5.13 as a synonym for it. This means that in 5.14 we don't accept that one new Unicode character, except by ordinal value. In 5.16, we convert to Unicode's usage.

In the meantime, I will propose that Unicode adopt a policy of not doing this again, and perhaps an alias that gives a somewhat different name, just to clear up future confusion.

p5pRT commented 13 years ago

From @khwilliamson


The attached patches work around this problem by deprecating \N{BELL} for 5.14, and giving the new name \N{ALERT} to it. The new character with that name will be unnamed. This means that Perl 5.14 doesn't quite support Unicode 6.0.

The patches are also available at git://github.com/khwilliamson/perl.git branch uni6, which includes the entire series of Unicode 6 patches.
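While the name is in transition, code can sidestep the question by not using the name BELL at all. A minimal sketch, assuming an ASCII platform and a perl that accepts `\N{U+...}` without loading charnames; the variable names are only for illustration:

```perl
use strict;
use warnings;

# Spellings that do not depend on which character the name BELL denotes.
my $alert_escape  = "\a";            # the traditional escape for U+0007
my $alert_ordinal = "\N{U+0007}";    # the same character, by code point
my $bell_symbol   = "\N{U+1F514}";   # the Unicode 6.0 character, by code point

printf "alert:  U+%04X\n", ord $alert_escape;
printf "symbol: U+%04X\n", ord $bell_symbol;
```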

p5pRT commented 13 years ago

From @khwilliamson

0004-charnames.t-PERL_RUN_SLOW_TESTS-runs-more-tests.patch

```diff
From d48e5272393d74374e4c7527cf778d0546dbd4e8 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Tue, 16 Nov 2010 18:21:44 -0700
Subject: [PATCH] charnames.t: PERL_RUN_SLOW_TESTS runs more tests

This patch makes this .t look for this environment variable, and if set
run more tests. There are two levels of setting, as explained in the
comments
---
 lib/charnames.t | 16 +++++++++++++++-
 1 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/lib/charnames.t b/lib/charnames.t
index 883740e..4271b58 100644
--- a/lib/charnames.t
+++ b/lib/charnames.t
@@ -1,6 +1,16 @@
 #!./perl
 use strict;
 
+# Test charnames.pm. If $ENV{PERL_RUN_SLOW_TESTS} is unset or 0, a random
+# selection of names is tested, a higher percentage of regular names is tested
+# than algorithmically-determined names.
+
+my $RUN_SLOW_TESTS_EVERY_CODE_POINT = 100;
+
+# If $ENV{PERL_RUN_SLOW_TESTS} is at least 1 and less than the number above,
+# all code points with names are tested. If it is at least that number, all
+# 1,114,112 Unicode code points are tested.
+
 # Because \N{} is compile time, any warnings will get generated before
 # execution, so have to have an array, and arrange things so no warning
 # is generated twice to verify that in fact a warning did happen
@@ -848,6 +858,8 @@ is("\N{U+1D0C5}", "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}");
         $seed = srand;
     }
 
+    my $run_slow_tests = $ENV{PERL_RUN_SLOW_TESTS} || 0;
+
     # We will look at the data grouped in "blocks" of the following
     # size.
     my $block_size_bits = 7; # above 16 is not sensible
@@ -859,7 +871,7 @@ is("\N{U+1D0C5}", "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}");
     # of the character. The percentage of each type to test is
     # fuzzily independently settable. This breaks down when the block size is
     # 1 or is large enough that both types of names occur in the same block
-    my $percentage_of_regular_names = 25;
+    my $percentage_of_regular_names = ($run_slow_tests) ? 100 : 25;
     my $percentage_of_algorithmic_names = (100 / $block_size); # 1 test/block
 
     # If wants everything tested, do so by changing the block size to 1 so
@@ -1002,6 +1014,7 @@ is("\N{U+1D0C5}", "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}");
         my $end_block = $block;
         if ($test_count == 0) {
             $test_count = 1;
+            if ($run_slow_tests < $RUN_SLOW_TESTS_EVERY_CODE_POINT) {
             $end_block++;
 
             # Keep coalescing until find a block that has something in
@@ -1015,6 +1028,7 @@ is("\N{U+1D0C5}", "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}");
                 $end_block++;
             }
             $end_block--; # Back-off to a block that has no defined names
+            }
         }
 
         # Calculated how many tests. Do them
-- 
1.5.6.3
```

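The two levels described in the patch comments can be summarized as a small function. This is a reader's sketch, not code from charnames.t: `coverage_for` is a hypothetical name, and only the threshold variable is taken from the patch.

```perl
use strict;
use warnings;

# Threshold name and value taken from the patch; the function is hypothetical.
my $RUN_SLOW_TESTS_EVERY_CODE_POINT = 100;

# Map a PERL_RUN_SLOW_TESTS setting to the coverage the patch comments describe.
sub coverage_for {
    my ($setting) = @_;
    $setting ||= 0;    # unset, empty, and 0 all mean the default
    return 'random sample of names'     if $setting < 1;
    return 'all code points with names' if $setting < $RUN_SLOW_TESTS_EVERY_CODE_POINT;
    return 'all 1,114,112 code points';
}

for my $setting (undef, 0, 1, 99, 100) {
    printf "%-5s => %s\n", defined $setting ? $setting : 'unset',
        coverage_for($setting);
}
```

In practice this would be driven from the shell by setting the variable before running the test, for example `PERL_RUN_SLOW_TESTS=1` for the named-code-point level.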
p5pRT commented 13 years ago

From @khwilliamson

0005-charnames.t-indent-newly-formed-block.patch

```diff
From ca1f0c03ba1dbc027ede9a027fec66a01624fa53 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Tue, 16 Nov 2010 18:24:55 -0700
Subject: [PATCH] charnames.t: indent newly formed block

This is a white-space only patch to indent the code that was put into an
if block by the previous commit
---
 lib/charnames.t | 25 +++++++++++++------------
 1 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/lib/charnames.t b/lib/charnames.t
index 4271b58..46f206a 100644
--- a/lib/charnames.t
+++ b/lib/charnames.t
@@ -1015,19 +1015,20 @@ is("\N{U+1D0C5}", "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}");
         if ($test_count == 0) {
             $test_count = 1;
             if ($run_slow_tests < $RUN_SLOW_TESTS_EVERY_CODE_POINT) {
-            $end_block++;
-
-            # Keep coalescing until find a block that has something in
-            # it. But don't cross plane boundaries (the 16 bits below),
-            # so there is at least one test for every plane.
-            while ($end_block < $block_count
-                   && $end_block >> (16 - $block_size_bits) == $block >> (16 - $block_size_bits)
-                   && ! $algorithmic_names_count[$end_block]
-                   && ! $regular_names_count[$end_block])
-            {
                 $end_block++;
-            }
-            $end_block--; # Back-off to a block that has no defined names
+
+                # Keep coalescing until find a block that has something in
+                # it. But don't cross plane boundaries (the 16 bits below),
+                # so there is at least one test for every plane.
+                while ($end_block < $block_count
+                       && $end_block >> (16 - $block_size_bits)
+                          == $block >> (16 - $block_size_bits)
+                       && ! $algorithmic_names_count[$end_block]
+                       && ! $regular_names_count[$end_block])
+                {
+                    $end_block++;
+                }
+                $end_block--; # Back-off to a block that has no defined names
             }
         }
 
-- 
1.5.6.3
```

p5pRT commented 13 years ago

From @khwilliamson

0006-Work-around-Uni-6.0-issues-with-BELL.patch

```diff
From 0717138c94e886e324e8fb8b0a30e5bed6a43db0 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Tue, 16 Nov 2010 18:29:07 -0700
Subject: [PATCH] Work-around Uni 6.0 issues with 'BELL'

Unicode version 6.0 has co-opted the name BELL for a different character
than traditionally used in Perl. This patch works around that by adding
ALERT as a synonym for BELL, and causing a deprecated warning for uses
of the old name.

The new Unicode character will be nameless in Perl 5.14, unless I can
(unlikely) get Unicode to grant a synonym that they will support.
---
 lib/charnames.pm | 7 +++++--
 lib/charnames.t | 11 +++++++++++
 lib/unicore/mktables | 29 ++++++++++++++++++++++++++++-
 pod/perldelta.pod | 25 +++++++++++++++++++++++--
 4 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/lib/charnames.pm b/lib/charnames.pm
index 677edfc..750b1cf 100644
--- a/lib/charnames.pm
+++ b/lib/charnames.pm
@@ -2,7 +2,7 @@ package charnames;
 use strict;
 use warnings;
 use File::Spec;
-our $VERSION = '1.16';
+our $VERSION = '1.17';
 
 use bytes (); # for $bytes::hint_bits
 
@@ -35,7 +35,7 @@ my %system_aliases = (
     'EOT' => pack("U", 0x04), # END OF TRANSMISSION
     'ENQ' => pack("U", 0x05), # ENQUIRY
     'ACK' => pack("U", 0x06), # ACKNOWLEDGE
-    'BEL' => pack("U", 0x07), # BELL
+    'BEL' => pack("U", 0x07), # ALERT; formerly BELL
     'BS' => pack("U", 0x08), # BACKSPACE
     'HT' => pack("U", 0x09), # HORIZONTAL TABULATION
     'LF' => pack("U", 0x0A), # LINE FEED (LF)
@@ -401,6 +401,9 @@ my %deprecated_aliases = (
     'PARTIAL LINE UP' => pack("U", 0x8C), # PARTIAL LINE BACKWARD
     'VERTICAL TABULATION SET' => pack("U", 0x8A), # LINE TABULATION SET
     'REVERSE INDEX' => pack("U", 0x8D), # REVERSE LINE FEED
+
+    # Unicode 6.0 co-opted this for U+1F514, so deprecate it for now.
+    'BELL' => pack("U", 0x07),
 );
 
 
diff --git a/lib/charnames.t b/lib/charnames.t
index 46f206a..f44c805 100644
--- a/lib/charnames.t
+++ b/lib/charnames.t
@@ -249,6 +249,11 @@ is("\N{BOM}", chr(0xFEFF));
 
     ok(grep { /"HORIZONTAL TABULATION" is deprecated.*CHARACTER TABULATION/ } @WARN);
 
+    # XXX These tests should be changed for 5.16, when we convert BELL to the
+    # Unicode version.
+    is("\N{BELL}", "\a");
+    ok((grep{ /"BELL" is deprecated.*ALERT/ } @WARN), 'BELL is deprecated');
+
     no warnings 'deprecated';
 
     is("\N{VERTICAL TABULATION}", "\013");
@@ -914,6 +919,12 @@ is("\N{U+1D0C5}", "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}");
         # marked
         $name = $u1name if $name eq "";
 
+        $name = 'ALERT' if $decimal == 7;
+
+        # XXX This test should be changed for 5.16 when we convert to use
+        # Unicode's BELL
+        $name = "" if $decimal == 0x1F514;
+
         # Some don't have names, leave those array elements undefined
         next unless $name;
 
diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index f584882..6c13acd 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -10145,6 +10145,28 @@ END
         }
         return;
     }
+
+    sub filter_v6_ucd {
+
+        # Unicode 6.0 co-opted the name BELL for U+1F514, so change the input
+        # to pretend that U+0007 is ALERT instead, and for Perl 5.14, don't
+        # allow the BELL name for U+1F514, so that the old usage can be
+        # deprecated for one cycle.
+
+        return if $_ !~ /^(?:0007|1F514);/;
+
+        my ($code_point, @fields) = split /\s*;\s*/, $_, -1;
+        if ($code_point eq '0007') {
+            $fields[$UNICODE_1_NAME] = "ALERT";
+        }
+        elsif ($^V lt v5.15.0) { # For 5.16 will convert to use Unicode's name
+            $fields[$CHARNAME] = "";
+        }
+
+        $_ = join ';', $code_point, @fields;
+
+        return;
+    }
 } # End closure for UnicodeData
 
 sub process_GCB_test {
@@ -14072,7 +14094,12 @@ my @input_file_objects = (
                     ? \&filter_v1_ucd
                     : ($v_version eq v2.1.5)
                         ? \&filter_v2_1_5_ucd
-                        : undef),
+
+                        # And for 5.14 Perls with 6.0,
+                        # have to also make changes
+                        : ($v_version ge v6.0.0)
+                            ? \&filter_v6_ucd
+                            : undef),
 
                     # And the main filter
                     \&filter_UnicodeData_line,
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 76f97f7..7c7e56b 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -140,14 +140,29 @@ introspection of the current phase of the perl interpreter.
 It's explained in detail in L and L.
 
-=head2 Unicode Version 6.0 is now supported.
+=head2 Unicode Version 6.0 is now supported (mostly)
 
-Perl comes with the Unicode 6.0 data base.
+Perl comes with the Unicode 6.0 data base, with one exception noted
+below.
 See L for details on the new release.
 
 Perl does not support any Unicode provisional properties, including the new
 ones for this release, but their database files are packaged with Perl.
 
+Unicode 6.0 has chosen to use the name C<BELL> for the character at U+1F514,
+which is a symbol that looks like a bell, and used in Japanese cell
+phones. This conflicts with the long-standing Perl usage of having
+C<BELL> mean the ASCII C<BEL> character, U+0007. In Perl 5.14,
+C<\N{BELL}> will continue to mean U+0007, but its use will generate a
+deprecated warning message, unless such warnings are turned off. The
+new name for U+0007 in Perl will be C<ALERT>, which corresponds nicely
+with the existing shorthand sequence for it, C<"\a">. C<\N{BEL}> will
+mean U+0007, with no warning given. The character at U+1F514 will not
+have a name in 5.14, but can be referred to by C<\N{U+1F514}>. The plan
+is that in Perl 5.16, C<\N{BELL}> will refer to U+1F514, and so all code
+that uses C<\N{BELL}> should convert by then to using C<\N{ALERT}>,
+C<\N{BEL}>, or C<"\a"> instead.
+
 =head1 Security
 
 XXX Any security-related notices go here. In particular, any security
@@ -229,6 +244,12 @@ listed as an updated module in the L section.
 
 [ List each deprecation as a =head2 entry ]
 
+=head2 C<\N{BELL}> is deprecated
+
+This is because Unicode is using that name for a different character.
+See L for more
+explanation.
+
 =head1 Performance Enhancements
 
 XXX Changes which enhance performance without changing behaviour go here. There
-- 
1.5.6.3
```

p5pRT commented 13 years ago

From @demerphq

On 13 October 2010 19:16, karl williamson <public@khwilliamson.com> wrote:

In the meantime, I will propose that Unicode adopt a policy of not doing this again, and perhaps an alias that gives a somewhat different name, just to clear up future confusion.

And register a *very* *strong* note of protest at the current action.

They need to look at their standard in the same way we do in terms of backwards compatibility.

Once published, it can't be removed for two major releases, and it must go through a deprecation cycle.

I think that you need to be very aggressive about the last point.

cheers, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 13 years ago

From @cpansprout

On Tue Nov 16 17:44:57 2010, public@khwilliamson.com wrote:

karl williamson wrote:

Father Chrysostomos via RT wrote:

On Tue Oct 12 21:56:17 2010, public@khwilliamson.com wrote:

This series of commits delivers the Unicode 6.0 db, and upgrades Perl to use it. There may still be some work to do in Unicode::UCD to support the new characters (which I'll investigate), but the rest of the Perl core should fully support it.

The few code changes are attached to this email, but the bulk of the changes (along with the attachments here), too large to email, are located at git://github.com/khwilliamson/perl.git branch mktables

Those changes are essentially entirely official Unicode data, except for the MANIFEST, perldelta, version, and a couple data changes in UCD.t

I’ve applied the first patch as 92f9d56c66. With the Unicode 6 database I get a test failure:

    $ curl http://github.com/khwilliamson/perl/commit/35e84e1c3151243.patch | git am
    [...]
    $ cd t
    $ ./perl harness -v ../lib/charnames.t
    [...]
    not ok 17078 - Verify string_vianame("BELL") is chr(0x1F514)
    # Failed at ../lib/charnames.t line 105
    # got "\a"
    # expected "\x{1f514}"

I'm afraid this is what I consider to be a flaw in the new standard, though they wouldn't. I regret that I did not find it before it was too late; your tests are the first place it surfaced. I'm not sure Unicode would have listened to me anyway, but we would have known about this earlier.

Your tests showed the problem and mine didn't because my tests sample code points at random (it would take too long to go through all of the roughly one million possible code points each time), and they just hadn't tried that combination yet.
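The sampling approach described above can be sketched roughly as follows. This is only an illustration, not the code in lib/charnames.t: the helper name `sample_name_roundtrip` and its parameters are invented for this example, and it assumes a perl where `charnames::viacode` and `charnames::vianame` behave as documented in the charnames module.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper (not from charnames.t): pick $samples random code
# points in 0..$max and report any whose name does not map back to it.
sub sample_name_roundtrip {
    my ($samples, $max) = @_;
    my @mismatches;
    for (1 .. $samples) {
        my $cp   = int rand($max + 1);
        my $name = charnames::viacode($cp);
        next unless defined $name && length $name;   # unnamed or unassigned
        my $back = charnames::vianame($name);
        push @mismatches, sprintf('U+%04X (%s)', $cp, $name)
            unless defined $back && $back == $cp;
    }
    return @mismatches;
}

# Check a small random sample of the whole Unicode code space.
my @bad = sample_name_roundtrip(1_000, 0x10FFFF);
print "mismatch: $_\n" for @bad;
printf "%d mismatches in this sample\n", scalar @bad;
```

Because each run touches only a small fraction of the code points, a collision on a single name such as BELL can go unnoticed for many runs, which matches what happened here.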

I'm not sure what to do; suggestions welcome.

The problem stems from the fact that the Standard does not give names to the control characters, such as ACK and BEL. It did in version 1.0, and it still publishes those names as the "Unicode_1_Name" property. That name for character 0x07, known by the acronym BEL, is "BELL". What Perl does is to use the Unicode 1 names when there is no current name. All was fine until 6.0 came along and re-used BELL for a different character.

But as far as Unicode is concerned, there isn't a problem, as BEL has no official name. It is Perl that has persisted in using this old name. I don't know why Unicode removed the names, and it seems eminently reasonable to give them names, but here we are.

The only option I can think of that doesn't violate our stability policies is to, in 5.14, keep the old BELL meaning, but deprecate it, saying to use BEL instead, which was added in 5.13 as a synonym for it. This means that in 5.14 we don't accept that one new Unicode character, except by ordinal value. In 5.16, we convert to use Unicode.

In the meantime, I will propose that Unicode adopt a policy of not doing this again, and perhaps an alias that gives a somewhat different name, just to clear up future confusion.

The attached patches work around this problem by deprecating \N{BELL} for 5.14, and giving the new name \N{ALERT} to it. The new character with that name will be unnamed. This means that Perl 5.14 doesn't quite support Unicode 6.0.
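As a rough illustration of the workaround, the snippet below sticks to spellings that should mean the same thing whichever character currently owns the name BELL. It assumes a perl (5.14 or later) in which ALERT and BEL are accepted as names for U+0007, as the patch describes; the variable names are only for this example.

```perl
use strict;
use warnings;
use charnames ();

# Names for U+0007 that do not depend on who owns "BELL".
my $alert = charnames::vianame('ALERT');   # name introduced by this patch
my $bel   = charnames::vianame('BEL');     # synonym added in 5.13
printf "ALERT => U+%04X, BEL => U+%04X, \"\\a\" => U+%04X\n",
    $alert, $bel, ord "\a";

# The new Unicode 6.0 character can always be written by code point.
my $new_bell = "\N{U+1F514}";
printf "new character => U+%04X\n", ord $new_bell;

# What BELL itself resolves to depends on the perl version, so only report it.
my $bell = charnames::vianame('BELL');
printf "BELL => U+%04X on this perl\n", $bell if defined $bell;
```

Code written this way does not need to change when \N{BELL} is eventually switched over to the Unicode meaning.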

The patches are also available at: git://github.com/khwilliamson/perl.git branch uni6

which includes the entire series of unicode 6 patches.

Thank you. All applied.

(Why did you not use ALARM?)

p5pRT commented 13 years ago

@cpansprout - Status changed from 'open' to 'resolved'

p5pRT commented 13 years ago

From @khwilliamson

Father Chrysostomos via RT wrote:

Thank you. All applied.

(Why did you not use ALARM?)

Thanks for applying these.

Alert is a semi-standard name for it, used in K&R, for example. I have never heard of "ALARM".

p5pRT commented 13 years ago

From @cpansprout

On Thu Nov 18 14:29:39 2010, public@khwilliamson.com wrote:

Alert is a semi-standard name for it, used in K&R, for example. I have never heard of "ALARM".

    $ ack -i '\\a.*alarm' pod
    pod/perlop.pod
    1029: \a alarm (bell) (BEL)

    pod/perlre.pod
    230: \a alarm (bell) (BEL)

    pod/perlrebackslash.pod
    67: \a Alarm or bell.
    120: \a 7 07 BEL \cG alarm or bell

    pod/perlreref.pod
    86: \a Alarm (beep)

So I thought ‘alarm’ was standard. But ‘alert’ makes more sense.