Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

DOC PATCH: various fixes to pods #10345

Closed p5pRT closed 14 years ago

p5pRT commented 14 years ago

Migrated from rt.perl.org#74642 (status was 'resolved')

Searchable as RT74642$

p5pRT commented 14 years ago

From @khwilliamson

These are mostly about regex and Unicode things, and correcting a couple of broken links. Details are in the commit messages.

p5pRT commented 14 years ago

From @khwilliamson

0001-Remove-false-statement-about-Unicode-strings.patch
```diff
From 826f0b2bdc48047deb46c635b14080117c46eb69 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Sat, 24 Apr 2010 10:23:08 -0600
Subject: [PATCH] Remove false statement about Unicode strings

It is simply not true that all text strings are Unicode strings in Perl.
---
 pod/perlunitut.pod | 3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/pod/perlunitut.pod b/pod/perlunitut.pod
index 9c4f307..fc352d5 100644
--- a/pod/perlunitut.pod
+++ b/pod/perlunitut.pod
@@ -66,9 +66,6 @@ B, or B are made of characters.
 Bytes are irrelevant here, and so are encodings. Each character is just that: the character.
 
-Text strings are also called B, because in Perl, every text
-string is a Unicode string.
-
 On a text string, you would do things like:
 
  $text =~ s/foo/bar/;
--
1.5.6.3
```
p5pRT commented 14 years ago

From @khwilliamson

0002-Nits-in-perluniintro.pod.patch ```diff From dc5c3e806c55a5200dbbb434f6969da7179905db Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 11:03:48 -0600 Subject: [PATCH] Nits in perluniintro.pod Make accurate the advice about eighth-bit set characters, and a few editing improvements. --- pod/perluniintro.pod | 33 +++++++++++++++++---------------- 1 files changed, 17 insertions(+), 16 deletions(-) diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod index 6c82efd..bee286f 100644 --- a/pod/perluniintro.pod +++ b/pod/perluniintro.pod @@ -553,19 +553,19 @@ L Character Ranges and Classes -Character ranges in regular expression character classes (C) -and in the C (also known as C) operator are not magically -Unicode-aware. What this means is that C<[A-Za-z]> will not magically start -to mean "all alphabetic letters"; not that it does mean that even for -8-bit characters, you should be using C in that case. - -For specifying character classes like that in regular expressions, -you can use the various Unicode properties--C<\pL>, or perhaps -C<\p{Alphabetic}>, in this particular case. You can use Unicode -code points as the end points of character ranges, but there is no -magic associated with specifying a certain range. For further -information--there are dozens of Unicode character classes--see -L. +Character ranges in regular expression bracketed character classes ( e.g., +C) and in the C (also known as C) operator are not +magically Unicode-aware. What this means is that C<[A-Za-z]> will not +magically start to mean "all alphabetic letters" (not that it does mean that +even for 8-bit characters; for those, if you are using locales (L), +use C; and if not, use the 8-bit-aware property C<\p{alpha}>). + +All the properties that begin with C<\p> (and its inverse C<\P>) are actually +character classes that are Unicode-aware. There are dozens of them, see +L. 
+ +You can use Unicode code points as the end points of character ranges, and the +range will include all Unicode code points that lie between those end points. =item * @@ -607,7 +607,7 @@ Unicode; for that, see the earlier I/O discussion. How Do I Know Whether My String Is In Unicode? You shouldn't have to care. But you may, because currently the semantics of the -characters whose ordinals are in the range 128 to 255 is different depending on +characters whose ordinals are in the range 128 to 255 are different depending on whether the string they are contained within is in Unicode or not. (See L.) @@ -622,8 +622,8 @@ string has any characters at all. All the C does is to return the value of the internal "utf8ness" flag attached to the C<$string>. If the flag is off, the bytes in the scalar are interpreted as a single byte encoding. If the flag is on, the bytes in the scalar -are interpreted as the (multi-byte, variable-length) UTF-8 encoded code -points of the characters. Bytes added to a UTF-8 encoded string are +are interpreted as the (variable-length, potentially multi-byte) UTF-8 encoded +code points of the characters. Bytes added to a UTF-8 encoded string are automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars are merged (double-quoted interpolation, explicit concatenation, and printf/sprintf parameter substitution), the result will be UTF-8 encoded @@ -648,6 +648,7 @@ the C function: use bytes; print length($unicode), "\n"; # will also print 2 # (the 0xC4 0x80 of the UTF-8) + no bytes; =item * -- 1.5.6.3 ```
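A minimal sketch, not part of the patch, of the behaviour the revised perluniintro text describes: character length versus byte length, the internal "utf8ness" flag, and a bracketed range that is not Unicode-aware next to a `\p` property that is. The variable names are illustrative only.

```perl
use strict;
use warnings;

# chr(0x100) is LATIN CAPITAL LETTER A WITH MACRON: one character,
# stored internally as the two UTF-8 bytes 0xC4 0x80.
my $unicode = chr(0x100);

my $char_len = length($unicode);      # counted in characters
my $byte_len;
{
    use bytes;                        # lexically scoped to this block
    $byte_len = length($unicode);     # counted in bytes
}

# The internal flag that the patched paragraph talks about.
my $flagged = utf8::is_utf8($unicode) ? 1 : 0;

# [A-Za-z] does not start to mean "all letters"; \pL does.
my $alpha       = chr(0x3B1);         # GREEK SMALL LETTER ALPHA
my $in_range    = $alpha =~ /[A-Za-z]/ ? 1 : 0;
my $in_property = $alpha =~ /\pL/      ? 1 : 0;

print "chars=$char_len bytes=$byte_len flagged=$flagged ",
      "range=$in_range property=$in_property\n";
```

Putting `use bytes` inside a bare block has the same effect as the `no bytes;` line the patch adds: byte semantics stop at a known point instead of leaking into the rest of the file.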
p5pRT commented 14 years ago

From @khwilliamson

0003-Nits-in-perlunifaq.pod.patch
```diff
From cae8ce7efb40de7f6216e16967b0a6e2801a8360 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Sat, 24 Apr 2010 11:15:33 -0600
Subject: [PATCH] Nits in perlunifaq.pod

---
 pod/perlunifaq.pod | 8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/pod/perlunifaq.pod b/pod/perlunifaq.pod
index 89cbad3..ab42ff1 100644
--- a/pod/perlunifaq.pod
+++ b/pod/perlunifaq.pod
@@ -25,7 +25,7 @@ To find out which character encodings your Perl supports, run:
 =head2 Which version of perl should I use?
 
 Well, if you can, upgrade to the most recent, but certainly C<5.8.1> or newer.
-The tutorial and FAQ are based on the status quo as of C<5.8.8>.
+The tutorial and FAQ assume the latest release.
 
 You should also check your modules, and upgrade them if necessary. For example, HTML::Entities requires version >= 1.32 to function correctly, even though the
@@ -227,9 +227,9 @@ use C, C<_utf8_on> or C<_utf8_off> at all.
 The UTF8 flag, also called SvUTF8, is an internal flag that indicates that the current internal representation is UTF-8. Without the flag, it is assumed to be
-ISO-8859-1. Perl converts between these automatically. (Actually Perl assumes
-the representation is ASCII; see L above.)
+ISO-8859-1. Perl converts between these automatically. (Actually Perl usually
+assumes the representation is ASCII; see L above.)
 
 One of Perl's internal formats happens to be UTF-8. Unfortunately, Perl can't keep a secret, so everyone knows about this. That is the source of much
--
1.5.6.3
```
p5pRT commented 14 years ago

From @khwilliamson

0004-Clarify-c-usage-in-perlrebackslash.pod.patch ```diff From 7513b4b906b2c99e1640af182925c98a2a2e71d4 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 11:21:24 -0600 Subject: [PATCH] Clarify \c usage in perlrebackslash.pod --- pod/perlrebackslash.pod | 26 +++++++++++++++++--------- 1 files changed, 17 insertions(+), 9 deletions(-) diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod index 5ff2601..461ebd9 100644 --- a/pod/perlrebackslash.pod +++ b/pod/perlrebackslash.pod @@ -16,7 +16,6 @@ Most sequences are described in detail in different documents; the primary purpose of this document is to have a quick reference guide describing all backslash and escape sequences. - =head2 The backslash In a regular expression, the backslash can perform one of two tasks: @@ -69,7 +68,7 @@ as C \A Beginning of string. Not in []. \b Word/non-word boundary. (Backspace in []). \B Not a word/non-word boundary. Not in []. - \cX Control-X (X can be any ASCII character). + \cX Control-X \C Single octet, even under UTF-8. Not in []. \d Character class for digits. \D Character class for non-digits. @@ -112,9 +111,10 @@ as C A handful of characters have a dedicated I. The following table shows them, along with their ASCII code points (in decimal and hex), -their ASCII name, the control escape (see below) and a short description. +their ASCII name, the control escape on ASCII platforms and a short +description. (For EBCDIC platforms, see L.) - Seq. Code Point ASCII Cntr Description. + Seq. Code Point ASCII Cntrl Description. Dec Hex \a 7 07 BEL \cG alarm or bell \b 8 08 BS \cH backspace [1] @@ -145,10 +145,18 @@ OSses native newline character when reading from or writing to text files. =head3 Control characters C<\c> is used to denote a control character; the character following C<\c> -is the name of the control character. For instance, C matches the -character I (a carriage return, code point 13). 
The case of the -character following C<\c> doesn't matter: C<\cM> and C<\cm> match the same -character. +determines the value of the construct. For example the value of C<\cA> is +C, and the value of C<\cb> is C, etc. +The gory details are in L. A complete +list of what C, etc. means for ASCII and EBCDIC platforms is in +L. + +Note that C<\c\> alone at the end of a regular expression (or doubled-quoted +string) is not valid. The backslash must be followed by another character. +That is, C<\c\I> means C'> for all characters I. + +To write platform-independent code, you must use C<\N{I}> instead, like +C<\N{ESCAPE}> or C<\N{U+001B}>, see L. Mnemonic: Iontrol character. @@ -335,7 +343,7 @@ match a character that matches the given Unicode property; properties include things like "letter", or "thai character". Capitalizing the sequence to C<\PP> and C<\P{Property}> make the sequence match a character that doesn't match the given Unicode property. For more details, see -L and +L and L. Mnemonic: I<p>roperty. -- 1.5.6.3 ```
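A minimal sketch, not part of the patch, of the `\c` behaviour the new perlrebackslash wording describes. It assumes an ASCII platform (the code points differ on EBCDIC, as the patch notes) and a perl recent enough to accept `\N{U+...}` in a double-quoted string without loading charnames explicitly.

```perl
use strict;
use warnings;

# On ASCII platforms \cX is the control character paired with X.
my $bel_ok  = "\cG" eq chr(7)  ? 1 : 0;   # \cG is BEL, same as \a
my $cr_ok   = "\cM" eq chr(13) ? 1 : 0;   # \cM is carriage return
my $case_ok = "\cm" eq "\cM"   ? 1 : 0;   # case of the letter is ignored

# The escape means the same thing inside a pattern.
my $re_ok = "\a" =~ /\A\cG\z/ ? 1 : 0;

# Naming ESC by code point, the platform-independent spelling
# recommended in the patch. The comparison with chr(0x1B) is only
# meaningful on ASCII platforms.
my $esc_ok = "\N{U+001B}" eq chr(0x1B) ? 1 : 0;

print "bel=$bel_ok cr=$cr_ok case=$case_ok regex=$re_ok esc=$esc_ok\n";
```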

p5pRT commented 14 years ago

From @khwilliamson

0005-Nits-in-perlunicode.pod.patch ```diff From 8c5f9e69708fb7fb232c5f279f93ca8c6a48caac Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 12:14:27 -0600 Subject: [PATCH] Nits in perlunicode.pod --- pod/perlunicode.pod | 62 ++++++++++++++++++++++++++++---------------------- 1 files changed, 35 insertions(+), 27 deletions(-) diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 1f4be43..140d134 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -11,9 +11,12 @@ implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. People who want to learn to use Unicode in Perl, should probably read -L, before reading +the L, before reading this reference document. +Also, the use of Unicode may present security issues that aren't obvious. +Read L. + =over 4 =item Input and Output Layers @@ -99,8 +102,8 @@ The C pragma will always, regardless of platform, force byte semantics in a particular lexical scope. See L. The C pragma is intended to always, regardless -of platform, force Unicode semantics in a particular lexical scope. In -release 5.12, it is partially implemented, applying only to case changes. +of platform, force character (Unicode) semantics in a particular lexical scope. +In release 5.12, it is partially implemented, applying only to case changes. See L below. The C pragma is primarily a compatibility device that enables @@ -180,15 +183,15 @@ a character instead of a byte. =item * -Character classes in regular expressions match characters instead of +Bracketed character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. C<\w> can be used to match a Japanese ideograph, for instance. 
=item * -Named Unicode properties, scripts, and block ranges may be used like -character classes via the C<\p{}> "matches property" construct and +Named Unicode properties, scripts, and block ranges may be used (like bracketed +character classes) by using the C<\p{}> "matches property" construct and the C<\P{}> negation, "doesn't match property". See L for more details. @@ -261,8 +264,9 @@ complement B the full character-wide bit complement. =item * -You can define your own mappings to be used in lc(), -lcfirst(), uc(), and ucfirst() (or their string-inlined versions). +You can define your own mappings to be used in C, +C, C, and C (or their double-quoted string inlined +versions such as C<\U>). See L for more details. =back @@ -278,25 +282,30 @@ And finally, C reverses by character rather than by byte. =head2 Unicode Character Properties Most Unicode character properties are accessible by using regular expressions. -They are used like character classes via the C<\p{}> "matches property" -construct and the C<\P{}> negation, "doesn't match property". +They are used (like bracketed character classes) by using the C<\p{}> "matches +property" construct and the C<\P{}> negation, "doesn't match property". + +Note that the only time that Perl considers a sequence of individual code +points as a single logical character is in the C<\X> construct, already +mentioned above. Therefore "character" in this discussion means a single +Unicode code point. -For instance, C<\p{Uppercase}> matches any character with the Unicode +For instance, C<\p{Uppercase}> matches any single character with the Unicode "Uppercase" property, while C<\p{L}> matches any character with a General_Category of "L" (letter) property. Brackets are not -required for single letter properties, so C<\p{L}> is equivalent to C<\pL>. +required for single letter property names, so C<\p{L}> is equivalent to C<\pL>. 
-More formally, C<\p{Uppercase}> matches any character whose Unicode Uppercase -property value is True, and C<\P{Uppercase}> matches any character whose -Uppercase property value is False, and they could have been written as -C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively +More formally, C<\p{Uppercase}> matches any single character whose Unicode +Uppercase property value is True, and C<\P{Uppercase}> matches any character +whose Uppercase property value is False, and they could have been written as +C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively. This formality is needed when properties are not binary, that is if they can take on more values than just True and False. For example, the Bidi_Class (see L below), can take on a number of different values, such as Left, Right, Whitespace, and others. To match these, one needs to specify the property name (Bidi_Class), and the value being matched against -(Left, Right, I). This is done, as in the examples above, by having the +(Left, Right, etc.). This is done, as in the examples above, by having the two components separated by an equal sign (or interchangeably, a colon), like C<\p{Bidi_Class: Left}>. @@ -403,8 +412,7 @@ Here are the short and long forms of the General Category properties: Single-letter properties match all characters in any of the two-letter sub-properties starting with the same letter. -C and C are special cases, which are aliases for the set of -C, C, and C. +C and C are special cases, which are both aliases for the set consisting of everything matched by C, C, and C. Because Perl hides the need for the user to understand the internal representation of Unicode characters, there is no need to implement @@ -413,8 +421,8 @@ supported. 
=head3 B -Because scripts differ in their directionality--Hebrew is -written right to left, for example--Unicode supplies these properties in +Because scripts differ in their directionality (Hebrew is +written right to left, for example) Unicode supplies these properties in the Bidi_Class class: Property Meaning @@ -451,10 +459,10 @@ written in Cyrllic, and Greek is written in, well, Greek; Japanese mainly in Hiragana or Katakana. There are many more. The Unicode Script property gives what script a given character is in, -and can be matched with the compound form like C<\p{Script=Hebrew}> (short: -C<\p{sc=hebr}>). Perl furnishes shortcuts for all script names. You can omit -everything up through the equals (or colon), and simply write C<\p{Latin}> or -C<\P{Cyrillic}>. +and the property can be specified with the compound form like +C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>). Perl furnishes shortcuts for all +script names. You can omit everything up through the equals (or colon), and +simply write C<\p{Latin}> or C<\P{Cyrillic}>. A complete list of scripts and their shortcuts is in L. @@ -475,7 +483,7 @@ characters with consecutive ordinal values. For example, the "Basic Latin" block is all characters whose ordinals are between 0 and 127, inclusive, in other words, the ASCII characters. The "Latin" script contains some letters from this block as well as several more, like "Latin-1 Supplement", -"Latin Extended-A", I, but it does not contain all the characters from +"Latin Extended-A", etc., but it does not contain all the characters from those blocks. It does not, for example, contain digits, because digits are shared across many scripts. Digits and similar groups, like punctuation, are in the script called C. There is also a script called C for @@ -571,7 +579,7 @@ To understand the use of this rarely used property=value combination, it is necessary to know some basics about decomposition. Consider a character, say H. 
It could appear with various marks around it, such as an acute accent, or a circumflex, or various hooks, circles, arrows, -I, above, below, to one side and/or the other, I There are many +I, above, below, to one side and/or the other, etc. There are many possibilities among the world's languages. The number of combinations is astronomical, and if there were a character for each combination, it would soon exhaust Unicode's more than a million possible characters. So Unicode -- 1.5.6.3 ```
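A minimal sketch, not part of the patch, of the property forms discussed in the perlunicode changes: the single form, the compound `property=value` form, the brace-less single-letter form, and the script shortcuts. The hash keys are illustrative labels, and the sample characters are my own choice.

```perl
use strict;
use warnings;

my $alef = chr(0x5D0);    # HEBREW LETTER ALEF

my %check = (
    upper_single    => ( 'A'   =~ /\p{Uppercase}/      ? 1 : 0 ),
    upper_compound  => ( 'A'   =~ /\p{Uppercase=True}/ ? 1 : 0 ),
    not_upper       => ( 'a'   =~ /\P{Uppercase}/      ? 1 : 0 ),
    letter_braces   => ( 'a'   =~ /\p{L}/              ? 1 : 0 ),
    letter_bare     => ( 'a'   =~ /\pL/                ? 1 : 0 ),
    hebrew_compound => ( $alef =~ /\p{Script=Hebrew}/  ? 1 : 0 ),
    hebrew_shortcut => ( $alef =~ /\p{Hebrew}/         ? 1 : 0 ),
    latin_letter    => ( 'z'   =~ /\p{Latin}/          ? 1 : 0 ),
    # Digits are shared across scripts, so they are not in Latin.
    latin_digit     => ( '1'   =~ /\p{Latin}/          ? 1 : 0 ),
);

printf "%-16s %d\n", $_, $check{$_} for sort keys %check;
```

Every entry except `latin_digit` should come out as 1, which matches the patch's point that the "Latin" script does not contain all the characters of the "Basic Latin" block.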
p5pRT commented 14 years ago

From @khwilliamson

0006-perlfunc.pod-case-change-cleanup-mention-packtut.patch ```diff From 6af92d53131d3827a2496075c6b9ef0e95c19cb9 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 12:27:01 -0600 Subject: [PATCH] perlfunc.pod: case-change cleanup; mention packtut Specifies completely the behavior of the case-changing functions, and mentions in the existence of the pack tutorial for the packing ones. --- pod/perlfunc.pod | 88 ++++++++++++++++++++++++++++++++++++++++++++---------- 1 files changed, 72 insertions(+), 16 deletions(-) diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod index 3fabeb0..1989f11 100644 --- a/pod/perlfunc.pod +++ b/pod/perlfunc.pod @@ -2712,12 +2712,61 @@ X X =item lc Returns a lowercased version of EXPR. This is the internal function -implementing the C<\L> escape in double-quoted strings. Respects -current LC_CTYPE locale if C in force. See L -and L for more details about locale and Unicode support. +implementing the C<\L> escape in double-quoted strings. If EXPR is omitted, uses C<$_>. +What gets returned depends on several factors: + +=over + +=item If C is in effect: + +=over + +=item On EBCDIC platforms + +The results are what the C language system call C returns. + +=item On ASCII platforms + +The results follow ASCII semantics. Only characters C change, to C +respectively. + +=back + +=item Otherwise, If EXPR has the UTF8 flag set + +If the current package has a subroutine named C, it will be used to +change the case (See L.) +Otherwise Unicode semantics are used for the case change. + +=item Otherwise, if C is in effect + +Respects current LC_CTYPE locale. See L. + +=item Otherwise, if C is in effect: + +Unicode semantics are used for the case change. Any subroutine named +C will not be used. + +=item Otherwise: + +=over + +=item On EBCDIC platforms + +The results are what the C language system call C returns. + +=item On ASCII platforms + +ASCII semantics are used for the case change. 
The lowercase of any character +outside the ASCII range is the character itself. + +=back + +=back + =item lcfirst EXPR X X @@ -2725,12 +2774,13 @@ X X Returns the value of EXPR with the first character lowercased. This is the internal function implementing the C<\l> escape in -double-quoted strings. Respects current LC_CTYPE locale if C in force. See L and L for more -details about locale and Unicode support. +double-quoted strings. If EXPR is omitted, uses C<$_>. +This function behaves the same way under various pragma, such as in a locale, +as L does. + =item length EXPR X X @@ -3603,8 +3653,10 @@ Takes a LIST of values and converts it into a string using the rules given by the TEMPLATE. The resulting string is the concatenation of the converted values. Typically, each converted value looks like its machine-level representation. For example, on 32-bit machines -an integer may be represented by a sequence of 4 bytes, which will in -Perl be presented as a string that's 4 characters long. +an integer may be represented by a sequence of 4 bytes, which will in +Perl be presented as a string that's 4 characters long. + +See L for an introduction to this function. The TEMPLATE is a sequence of characters that give the order and type of values, as follows: @@ -6869,14 +6921,15 @@ X X X =item uc Returns an uppercased version of EXPR. This is the internal function -implementing the C<\U> escape in double-quoted strings. Respects -current LC_CTYPE locale if C in force. See L -and L for more details about locale and Unicode support. +implementing the C<\U> escape in double-quoted strings. It does not attempt to do titlecase mapping on initial letters. See -C for that. +L for that. If EXPR is omitted, uses C<$_>. +This function behaves the same way under various pragma, such as in a locale, +as L does. + =item ucfirst EXPR X X @@ -6884,12 +6937,13 @@ X X Returns the value of EXPR with the first character in uppercase (titlecase in Unicode). 
This is the internal function implementing -the C<\u> escape in double-quoted strings. Respects current LC_CTYPE -locale if C in force. See L and L -for more details about locale and Unicode support. +the C<\u> escape in double-quoted strings. If EXPR is omitted, uses C<$_>. +This function behaves the same way under various pragma, such as in a locale, +as L does. + =item umask EXPR X @@ -6993,7 +7047,9 @@ C does the reverse of C: it takes a string and expands it out into a list of values. (In scalar context, it returns merely the first value produced.) -If EXPR is omitted, unpacks the C<$_> string. +If EXPR is omitted, unpacks the C<$_> string. for an introduction to this function. + +See L for an introduction to this function. The string is broken into chunks described by the TEMPLATE. Each chunk is converted separately to a value. Typically, either the string is a result -- 1.5.6.3 ```
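A minimal sketch, not part of the patch, of the decision list the new `lc` entry spells out. It assumes an ASCII platform, no locale in effect, and perl 5.12 or later (for the `unicode_strings` feature); the variable names are illustrative.

```perl
use strict;
use warnings;

my $native = "\xC0";       # A WITH GRAVE as a one-byte string, UTF8 flag off
my $wide   = chr(0x100);   # A WITH MACRON, necessarily UTF8-flagged

my ( $plain, $with_feature );
{
    no feature 'unicode_strings';
    $plain = lc($native);            # ASCII semantics: left unchanged
}
{
    use feature 'unicode_strings';
    $with_feature = lc($native);     # Unicode semantics: "\xE0"
}

my $wide_lc = lc($wide);             # UTF8 flag set: Unicode semantics

# The double-quoted escapes are implemented by the same functions.
my $escaped_lower = "\LABC\E";
my $escaped_upper = "\Uabc\E";
my $first_upper   = ucfirst("perl");

printf "plain=%02X feature=%02X wide=%04X %s %s %s\n",
    ord($plain), ord($with_feature), ord($wide_lc),
    $escaped_lower, $escaped_upper, $first_upper;
```

The two bare blocks make the pragma state explicit, so the result does not depend on how the script is invoked.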
p5pRT commented 14 years ago

From @khwilliamson

0007-Fix-broken-links.patch
```diff
From eee7b9a30dac58e30fbceb06d7a2857b43c09a15 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Sat, 24 Apr 2010 12:32:42 -0600
Subject: [PATCH] Fix broken links

---
 pod/perl5111delta.pod | 2 +-
 pod/perl5120delta.pod | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/pod/perl5111delta.pod b/pod/perl5111delta.pod
index 87fb9df..4717374 100644
--- a/pod/perl5111delta.pod
+++ b/pod/perl5111delta.pod
@@ -260,7 +260,7 @@ Perl now defaults to issuing a warning if a deprecated language feature
 is used. To disable this feature in a given lexical scope, you should use C For information about which language features are deprecated and explanations of various deprecation warnings, please
-see L
+see L
 
 =back
 
diff --git a/pod/perl5120delta.pod b/pod/perl5120delta.pod
index 35fab9a..5d5b401 100644
--- a/pod/perl5120delta.pod
+++ b/pod/perl5120delta.pod
@@ -251,7 +251,7 @@ C file for that release.
 To disable this feature in a given lexical scope, you should use C For information about which language features are deprecated and explanations of various deprecation warnings, please
-see L. See L below for the list of features
+see L. See L below for the list of features
 and modules Perl's developers have deprecated as part of this release.
 
 =head2 Version number formats
--
1.5.6.3
```
p5pRT commented 14 years ago

From @khwilliamson

0008-Nits-in-perlre.pod-x-referencing-broken-links.patch ```diff From 57fb01a46585f16a0200fd57c9c010ec88c1bcd7 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 12:37:19 -0600 Subject: [PATCH] Nits in perlre.pod, x-referencing, broken links --- pod/perlre.pod | 163 +++++++++++++++++++++++++------------------------------ 1 files changed, 74 insertions(+), 89 deletions(-) diff --git a/pod/perlre.pod b/pod/perlre.pod index 48ca403..40e6c28 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -98,14 +98,14 @@ the C-comment deletion code in L. Also note that anything inside a C<\Q...\E> stays unaffected by C. And note that C doesn't affect whether space interpretation within a single multi-character construct. For example in C<\x{...}>, regardless of the C modifier, there can be no -spaces. Same for a L such as C<{3}> or +spaces. Same for a L such as C<{3}> or C<{5,}>. Similarly, C<(?:...)> can't have a space between the C and C<:>, but can between the C<(> and C. Within any delimiters for such a construct, allowed spaces are not affected by C, and depend on the construct. For example, C<\x{...}> can't have spaces because hexadecimal numbers don't have spaces in them. But, Unicode properties can have spaces, so in C<\p{...}> there can be spaces that follow the Unicode rules, for which see -L. +L. 
X =head2 Regular Expressions @@ -130,7 +130,7 @@ X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]> $ Match the end of the line (or before newline at the end) | Alternation () Grouping - [] Character class + [] Bracketed Character class By default, the "^" character is guaranteed to match only the beginning of the string, the "$" character only the end (or before the @@ -222,8 +222,6 @@ instance the above example could also be written as follows: Because patterns are processed as double quoted strings, the following also work: -X<\t> X<\n> X<\r> X<\f> X<\e> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q> -X<\0> X<\c> X<\N{}> X<\x> \t tab (HT, TAB) \n newline (LF, NL) @@ -241,101 +239,88 @@ X<\0> X<\c> X<\N{}> X<\x> \u uppercase next char (think vi) \L lowercase till \E (think vi) \U uppercase till \E (think vi) - \E end case modification (think vi) \Q quote (disable) pattern metacharacters till \E + \E end either case modification or quoted section (think vi) -If C is in effect, the case map used by C<\l>, C<\L>, C<\u> -and C<\U> is taken from the current locale. See L. For -documentation of C<\N{name}>, see L. - -You cannot include a literal C<$> or C<@> within a C<\Q> sequence. -An unescaped C<$> or C<@> interpolates the corresponding variable, -while escaping will cause the literal string C<\$> to be matched. -You'll need to write something like C. +Details are in L. =head3 Character Classes and other Special Escapes In addition, Perl defines the following: X<\g> X<\k> X<\K> X - \w Match a "word" character (alphanumeric plus "_") - \W Match a non-"word" character - \s Match a whitespace character - \S Match a non-whitespace character - \d Match a digit character - \D Match a non-digit character - \pP Match P, named property. Use \p{Prop} for longer names. - \PP Match non-P - \X Match Unicode "eXtended grapheme cluster" - \C Match a single C char (octet) even under Unicode. 
- NOTE: breaks up characters into their UTF-8 bytes, - so you may end up with malformed pieces of UTF-8. - Unsupported in lookbehind. - \1 Backreference to a specific group. - '1' may actually be any positive integer. - \g1 Backreference to a specific or previous group, - \g{-1} number may be negative indicating a previous buffer and may - optionally be wrapped in curly brackets for safer parsing. - \g{name} Named backreference - \k Named backreference - \K Keep the stuff left of the \K, don't include it in $& - \N Any character but \n (experimental) - \v Vertical whitespace - \V Not vertical whitespace - \h Horizontal whitespace - \H Not horizontal whitespace - \R Linebreak - -See L for details on -C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, C<\D>, C<\p>, C<\P>, C<\N>, C<\v>, C<\V>, -C<\h>, and C<\H>. -See L for details on C<\R> and C<\X>. + Sequence Note Description + [...] [1] Match a character according to the rules of the bracketed + character class defined by the "...". Example: [a-z] + matches "a" or "b" or "c" ... or "z" + [[:...:]] [2] Match a character according to the rules of the POSIX + character class "..." within the outer bracketed character + class. Example: [[:upper:]] matches any uppercase + character. + \w [3] Match a "word" character (alphanumeric plus "_") + \W [3] Match a non-"word" character + \s [3] Match a whitespace character + \S [3] Match a non-whitespace character + \d [3] Match a decimal digit character + \D [3] Match a non-digit character + \pP [3] Match P, named property. Use \p{Prop} for longer names. + \PP [3] Match non-P + \X [4] Match Unicode "eXtended grapheme cluster" + \C Match a single C-language char (octet) even if that is part + of a larger UTF-8 character. Thus it breaks up characters + into their UTF-8 bytes, so you may end up with malformed + pieces of UTF-8. Unsupported in lookbehind. + \1 [5] Backreference to a specific capture buffer or group. + '1' may actually be any positive integer. 
+ \g1 [5] Backreference to a specific or previous group, + \g{-1} [5] The number may be negative indicating a relative previous + buffer and may optionally be wrapped in curly brackets for + safer parsing. + \g{name} [5] Named backreference + \k [5] Named backreference + \K [6] Keep the stuff left of the \K, don't include it in $& + \N [7] Any character but \n (experimental). Not affected by /s + modifier + \v [3] Vertical whitespace + \V [3] Not vertical whitespace + \h [3] Horizontal whitespace + \H [3] Not horizontal whitespace + \R [4] Linebreak -Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the -character whose name is C; and similarly when of the form -C<\N{U+I}>, it matches the character whose Unicode ordinal is -I. Otherwise it matches any character but C<\n>. +=over 4 + +=item [1] + +See L for details. -The POSIX character class syntax -X +=item [2] - [:class:] +See L for details. -is also available. Note that the C<[> and C<]> brackets are I; -they must always be used within a character class expression. +=item [3] - # this is correct: - $string =~ /[[:alpha:]]/; +See L for details. - # this is not, and will generate a warning: - $string =~ /[:alpha:]/; +=item [4] -The following Posix-style character classes are available: +See L for details. - [[:alpha:]] Any alphabetical character. - [[:alnum:]] Any alphanumerical character. - [[:ascii:]] Any character in the ASCII character set. - [[:blank:]] A GNU extension, equal to a space or a horizontal tab - [[:cntrl:]] Any control character. - [[:digit:]] Any decimal digit, equivalent to "\d". - [[:graph:]] Any printable character, excluding a space. - [[:lower:]] Any lowercase character. - [[:print:]] Any printable character, including a space. - [[:punct:]] Any graphical character excluding "word" characters. - [[:space:]] Any whitespace character. "\s" plus vertical tab ("\cK"). - [[:upper:]] Any uppercase character. - [[:word:]] A Perl extension, equivalent to "\w". 
- [[:xdigit:]] Any hexadecimal digit. +=item [5] -You can negate the [::] character classes by prefixing the class name -with a '^'. This is a Perl extension. +See L below for details. -The POSIX character classes -[.cc.] and [=cc=] are recognized but B supported and trying to -use them will cause an error. +=item [6] -Details on POSIX character classes are in -L. +See L below for details. + +=item [7] + +Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the +character whose name is C; and similarly when of the form +C<\N{U+I}>, it matches the character whose Unicode ordinal is +I. Otherwise it matches any character but C<\n>. + +=back =head3 Assertions @@ -345,12 +330,12 @@ X X X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G> - \b Match a word boundary - \B Match except at a word boundary - \A Match only at beginning of string - \Z Match only at end of string, or before newline at the end - \z Match only at end of string - \G Match only at pos() (e.g. at the end-of-match position + \b Match a word boundary + \B Match except at a word boundary + \A Match only at beginning of string + \Z Match only at end of string, or before newline at the end + \z Match only at end of string + \G Match only at pos() (e.g. at the end-of-match position of prior m//g) A word boundary (C<\b>) is a spot between two characters @@ -866,7 +851,7 @@ For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C pragma has been used (see L), or the variables contain results of C operator (see -L). +Lmsixpo">). This restriction is due to the wide-spread and remarkably convenient custom of using run-time determined strings as patterns. For example: @@ -937,7 +922,7 @@ For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C pragma has been used (see L), or the variables contain results of C operator (see -L). 
+LSTRINGEmsixpo">). Because perl's regex engine is not currently re-entrant, delayed code may not invoke the regex engine either directly with C or C), -- 1.5.6.3 ```
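
The assertions table touched by this patch separates `\Z` (end of string, or just before a final newline) from `\z` (strictly the end of string). A minimal sketch of that difference follows; it is not part of the patch, and the helper name `end_anchor_report` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report which end-of-string assertions match.
sub end_anchor_report {
    my ($string) = @_;
    return {
        # \Z matches at the end of string, or before a newline at the end.
        Z => ( $string =~ /foo\Z/ ? 1 : 0 ),
        # \z matches only at the very end of string.
        z => ( $string =~ /foo\z/ ? 1 : 0 ),
    };
}

for my $case ( [ "with trailing newline", "foo\n" ], [ "without", "foo" ] ) {
    my ( $label, $string ) = @$case;
    my $r = end_anchor_report($string);
    print "$label: \\Z=$r->{Z} \\z=$r->{z}\n";
}
```

A string ending in a newline satisfies `\Z` but not `\z`, which is why `\z` is usually the safer choice when validating untrusted input.
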
p5pRT commented 14 years ago

From @khwilliamson

0009-Edits-to-perlrecharclass.pod.patch ```diff From 080d3a3888e53704540f96e1a616c86787d34864 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 13:35:34 -0600 Subject: [PATCH] Edits to perlrecharclass.pod A number of clarification and wording edits have been made, fixing some broken links, and details especially on \d in the Unicode range. Fixed an incorrect character ordinal --- pod/perlrecharclass.pod | 241 ++++++++++++++++++++++++++++------------------- 1 files changed, 143 insertions(+), 98 deletions(-) diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod index 7c92008..047915b 100644 --- a/pod/perlrecharclass.pod +++ b/pod/perlrecharclass.pod @@ -9,27 +9,29 @@ The top level documentation about Perl regular expressions is found in L. This manual page discusses the syntax and use of character -classes in Perl Regular Expressions. +classes in Perl regular expressions. -A character class is a way of denoting a set of characters, +A character class is a way of denoting a set of characters in such a way that one character of the set is matched. -It's important to remember that matching a character class +It's important to remember that: matching a character class consumes exactly one character in the source string. (The source string is the string the regular expression is matched against.) There are three types of character classes in Perl regular -expressions: the dot, backslashed sequences, and the form enclosed in square +expressions: the dot, backslash sequences, and the form enclosed in square brackets. Keep in mind, though, that often the term "character class" is used -to mean just the bracketed form. This is true in other Perl documentation. +to mean just the bracketed form. Certainly, most Perl documentation does that. =head2 The dot The dot (or period), C<.> is probably the most used, and certainly the most well-known character class. By default, a dot matches any character, except for the newline. 
The default can be changed to -add matching the newline with the I modifier: either -for the entire regular expression using the C modifier, or -locally using C<(?s)>. +add matching the newline by using the I modifier: either +for the entire regular expression with the C modifier, or +locally with C<(?s)>. (The experimental C<\N> backslash sequence, described +below, matches any character except newline without regard to the +I modifier.) Here are some examples: @@ -41,53 +43,80 @@ Here are some examples: "\n" =~ /(?s:.)/ # Match (local 'single line' modifier) "ab" =~ /^.$/ # No match (dot matches one character) -=head2 Backslashed sequences +=head2 Backslash sequences X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P> X<\N> X<\v> X<\V> X<\h> X<\H> X X -Perl regular expressions contain many backslashed sequences that -constitute a character class. That is, they will match a single -character, if that character belongs to a specific set of characters -(defined by the sequence). A backslashed sequence is a sequence of -characters starting with a backslash. Not all backslashed sequences -are character classes; for a full list, see L. +A backslash sequence is a sequence of characters, the first one of which is a +backslash. Perl ascribes special meaning to many such sequences, and some of +these are character classes. That is, they match a single character each, +provided that the character belongs to the specific set of characters defined +by the sequence. -Here's a list of the backslashed sequences that are character classes. They -are discussed in more detail below. +Here's a list of the backslash sequences that are character classes. They +are discussed in more detail below. (For the backslash sequences that aren't +character classes, see L.) - \d Match a digit character. - \D Match a non-digit character. + \d Match a decimal digit character. + \D Match a non-decimal-digit character. \w Match a "word" character. \W Match a non-"word" character. 
\s Match a whitespace character. \S Match a non-whitespace character. \h Match a horizontal whitespace character. \H Match a character that isn't horizontal whitespace. - \N Match a character that isn't newline. Experimental. \v Match a vertical whitespace character. \V Match a character that isn't vertical whitespace. - \pP, \p{Prop} Match a character matching a Unicode property. - \PP, \P{Prop} Match a character that doesn't match a Unicode property. + \N Match a character that isn't a newline. Experimental. + \pP, \p{Prop} Match a character that has the given Unicode property. + \PP, \P{Prop} Match a character that doesn't have the given Unicode property =head3 Digits -C<\d> matches a single character that is considered to be a I. What is -considered a digit depends on the internal encoding of the source string and -the locale that is in effect. If the source string is in UTF-8 format, C<\d> -not only matches the digits '0' - '9', but also Arabic, Devanagari and digits -from other languages. Otherwise, if there is a locale in effect, it will match -whatever characters the locale considers digits. Without a locale, C<\d> -matches the digits '0' to '9'. See L. +C<\d> matches a single character that is considered to be a decimal I. +What is considered a decimal digit depends on the internal encoding of the +source string and the locale that is in effect. If the source string is in +UTF-8 format, C<\d> not only matches the digits '0' - '9', but also Arabic, +Devanagari and digits from other languages. Otherwise, if there is a locale in +effect, it will match whatever characters the locale considers decimal digits. +Without a locale, C<\d> matches just the digits '0' to '9'. +See L. + +Unicode digits may cause some confusion, and some security issues. In UTF-8 +strings, C<\d> matches the same characters matched by +C<\p{General_Category=Decimal_Number}>, or synonymously, +C<\p{General_Category=Digit}>. 
Starting with Unicode version 4.1, this is the +same set of characters matched by C<\p{Numeric_Type=Decimal}>. + +But Unicode also has a different property with a similar name, +C<\p{Numeric_Type=Digit}>, which matches a completely different set of +characters. These characters are things such as subscripts. + +The design intent is for C<\d> to match all the digits (and no other characters) +that can be used with "normal" big-endian positional decimal syntax, whereby a +sequence of such digits {N0, N1, N2, ...Nn} has the numeric value (...(N0 * 10 ++ N1) * 10 + N2) * 10 ... + Nn). In Unicode 5.2, the Tamil digits (U+0BE6 - +U+0BEF) can also legally be used in old-style Tamil numbers in which they would +appear no more than one in a row, separated by characters that mean "times 10", +"times 100", etc. (See L.) + +Some of the non-European digits that C<\d> matches look like European ones, but +have different values. For example, BENGALI DIGIT FOUR (U+09A) looks very much +like an ASCII DIGIT EIGHT (U+0038). + +It may be useful for security purposes for an application to require that all +digits in a row be from the same script. See L. Any character that isn't matched by C<\d> will be matched by C<\D>. =head3 Word characters A C<\w> matches a single alphanumeric character (an alphabetic character, or a -decimal digit) or an underscore (C<_>), not a whole word. Use C<\w+> to match -a string of Perl-identifier characters (which isn't the same as matching an -English word). What is considered a word character depends on the internal +decimal digit) or an underscore (C<_>), not a whole word. To match a whole +word, use C<\w+>. This isn't the same thing as matching an English word, but +is the same as a string of Perl-identifier characters. What is considered a +word character depends on the internal encoding of the string and the locale or EBCDIC code page that is in effect. 
If it's in UTF-8 format, C<\w> matches those characters that are considered word characters in the Unicode database. That is, it not only matches ASCII letters, @@ -97,48 +126,43 @@ the current locale or EBCDIC code page. Without a locale or EBCDIC code page, C<\w> matches the ASCII letters, digits and the underscore. See L. +There are a number of security issues with the full Unicode list of word +characters. See L. + +Also, for a somewhat finer-grained set of characters that are in programming +language identifiers beyond the ASCII range, you may wish to instead use the +more customized Unicode properties, "ID_Start", ID_Continue", "XID_Start", and +"XID_Continue". See L. + Any character that isn't matched by C<\w> will be matched by C<\W>. =head3 Whitespace -C<\s> matches any single character that is considered whitespace. In the ASCII -range, C<\s> matches the horizontal tab (C<\t>), the new line (C<\n>), the form -feed (C<\f>), the carriage return (C<\r>), and the space. (The vertical tab, -C<\cK> is not matched by C<\s>.) The exact set of characters matched by C<\s> -depends on whether the source string is in UTF-8 format and the locale or -EBCDIC code page that is in effect. If it's in UTF-8 format, C<\s> matches what -is considered whitespace in the Unicode database; the complete list is in the -table below. Otherwise, if there is a locale or EBCDIC code page in effect, -C<\s> matches whatever is considered whitespace by the current locale or EBCDIC -code page. Without a locale or EBCDIC code page, C<\s> matches the five -characters mentioned in the beginning of this paragraph. Perhaps the most -notable possible surprise is that C<\s> matches a non-breaking space only if -the non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC -code page that is in effect has that character. +C<\s> matches any single character that is considered whitespace. 
The exact +set of characters matched by C<\s> depends on whether the source string is in +UTF-8 format and the locale or EBCDIC code page that is in effect. If it's in +UTF-8 format, C<\s> matches what is considered whitespace in the Unicode +database; the complete list is in the table below. Otherwise, if there is a +locale or EBCDIC code page in effect, C<\s> matches whatever is considered +whitespace by the current locale or EBCDIC code page. Without a locale or +EBCDIC code page, C<\s> matches the horizontal tab (C<\t>), the newline +(C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the space. +(Note that it doesn't match the vertical tab, C<\cK>.) Perhaps the most notable +possible surprise is that C<\s> matches a non-breaking space only if the +non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC code +page that is in effect has that character. See L. Any character that isn't matched by C<\s> will be matched by C<\S>. C<\h> will match any character that is considered horizontal whitespace; -this includes the space and the tab characters and 17 other characters that are -listed in the table below. C<\H> will match any character +this includes the space and the tab characters and a number other characters, +all of which are listed in the table below. C<\H> will match any character that is not considered horizontal whitespace. -C<\N> is new in 5.12, and is experimental. It, like the dot, will match any -character that is not a newline. The difference is that C<\N> will not be -influenced by the single line C regular expression modifier. Note that -there is a second meaning of C<\N> when of the form C<\N{...}>. This form is -for named characters. See L for those. If C<\N> is followed by an -opening brace and something that is not a quantifier, perl will assume that a -character name is coming, and not this meaning of C<\N>. 
For example, C<\N{3}> -means to match 3 non-newlines; C<\N{5,}> means to match 5 or more non-newlines, -but C<\N{4F}> and C<\N{F4}> are not legal quantifiers, and will cause perl to -look for characters named C<4F> or C, respectively (and won't find them, -thus raising an error, unless they have been defined using custom names). - C<\v> will match any character that is considered vertical whitespace; -this includes the carriage return and line feed characters (newline) plus 5 -other characters listed in the table below. +this includes the carriage return and line feed characters (newline) plus several +other characters, all listed in the table below. C<\V> will match any character that is not considered vertical whitespace. C<\R> matches anything that can be considered a newline under Unicode @@ -156,10 +180,10 @@ One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. The vertical tab (C<"\x0b">) is not matched by C<\s>, it is however considered vertical whitespace. Furthermore, if the source string is not in UTF-8 format, and any locale or EBCDIC code page that is in effect doesn't include them, the -next line (C<"\x85">) and the no-break space (C<"\xA0">) characters are not -matched by C<\s>, but are by C<\v> and C<\h> respectively. If the source -string is in UTF-8 format, both the next line and the no-break space are -matched by C<\s>. +next line (ASCII-platform C<"\x85">) and the no-break space (ASCII-platform +C<"\xA0">) characters are not matched by C<\s>, but are by C<\v> and C<\h> +respectively. If the source string is in UTF-8 format, both the next line and +the no-break space are matched by C<\s>. The following table is a complete listing of characters matched by C<\s>, C<\h> and C<\v> as of Unicode 5.2. @@ -209,6 +233,19 @@ It is worth noting that C<\d>, C<\w>, etc, match single characters, not complete numbers or words. To match a number (that consists of integers), use C<\d+>; to match a word, use C<\w+>. 
+=head3 \N + +C<\N> is new in 5.12, and is experimental. It, like the dot, will match any +character that is not a newline. The difference is that C<\N> is not influenced +by the I regular expression modifier (see L above). Note +that the form C<\N{...}> may mean something completely different. When the +C<{...}> is a L, it means to match a non-newline +character that many times. For example, C<\N{3}> means to match 3 +non-newlines; C<\N{5,}> means to match 5 or more non-newlines. But if C<{...}> +is not a legal quantifier, it is presumed to be a named character. See +L for those. For example, none of C<\N{COLON}>, C<\N{4F}>, and +C<\N{F4}> contain legal quantifiers, so Perl will try to find characters whose +names are, respectively, C, C<4F>, and C. =head3 Unicode Properties @@ -263,13 +300,13 @@ L. =head2 Bracketed Character Classes The third form of character class you can use in Perl regular expressions -is the bracketed form. In its simplest form, it lists the characters +is the bracketed character class. In its simplest form, it lists the characters that may be matched, surrounded by square brackets, like this: C<[aeiou]>. This matches one of C, C, C, C or C. Like the other character classes, exactly one character will be matched. To match a longer string consisting of characters mentioned in the character -class, follow the character class with a quantifier. For instance, -C<[aeiou]+> matches a string of one or more lowercase ASCII vowels. +class, follow the character class with a L. For +instance, C<[aeiou]+> matches a string of one or more lowercase English vowels. Repeating a character in a character class has no effect; it's considered to be in the set only once. @@ -297,7 +334,7 @@ escaped with a backslash, although this is sometimes not needed, in which case the backslash may be omitted. The sequence C<\b> is special inside a bracketed character class. 
While -outside the character class C<\b> is an assertion indicating a point +outside the character class, C<\b> is an assertion indicating a point that does not have either two word characters or two non-word characters on either side, inside a bracketed character class, C<\b> matches a backspace character. @@ -320,12 +357,14 @@ class. Also, a backslash followed by two or three octal digits is considered an octal number. -A C<[> is not special inside a character class, unless it's the start -of a POSIX character class (see below). It normally does not need escaping. +A C<[> is not special inside a character class, unless it's the start of a +POSIX character class (see L below). It normally does +not need escaping. -A C<]> is normally either the end of a POSIX character class (see below), or it -signals the end of the bracketed character class. If you want to include a -C<]> in the set of characters, you must generally escape it. +A C<]> is normally either the end of a POSIX character class (see +L below), or it signals the end of the bracketed +character class. If you want to include a C<]> in the set of characters, you +must generally escape it. However, if the C<]> is the I (or the second if the first character is a caret) character of a bracketed character class, it does not denote the end of the class (as you cannot have an empty class) @@ -362,7 +401,7 @@ a platform that uses a different character set, such as EBCDIC. If a hyphen in a character class cannot syntactically be part of a range, for instance because it is the first or the last character of the character class, or if it immediately follows a range, the hyphen isn't special, and will be -considered a character that may be matched literally. You have to escape the +considered a character that is to be matched literally. 
You have to escape the hyphen with a backslash if you want to have a hyphen in your set of characters to be matched, and its position in the class is such that it could be considered part of a range. @@ -403,13 +442,15 @@ Examples: You can put any backslash sequence character class (with the exception of C<\N>) inside a bracketed character class, and it will act just as if you put all the characters matched by the backslash sequence inside the -character class. For instance, C<[a-f\d]> will match any digit, or any of the -lowercase letters between 'a' and 'f' inclusive. +character class. For instance, C<[a-f\d]> will match any decimal digit, or any +of the lowercase letters between 'a' and 'f' inclusive. + +C<\N> within a bracketed character class must be of the forms C<\N{I}> +or C<\N{U+I}>, and NOT be the form that matches non-newlines, +for the same reason that a dot C<.> inside a bracketed character class loses +its special meaning: it matches nearly anything, which generally isn't what you +want to happen. -C<\N> within a bracketed character class must be of the forms C<\N{I}> or -C<\N{U+I}> for the same reason that a dot C<.> inside a -bracketed character class loses its special meaning: it matches nearly -anything, which generally isn't what you want to happen. Examples: @@ -419,19 +460,22 @@ Examples: # character, nor a parenthesis. Backslash sequence character classes cannot form one of the endpoints -of a range. +of a range. Thus, you can't say: + + /[\p{Thai}-\d]/ # Wrong! -=head3 Posix Character Classes +=head3 POSIX Character Classes X X<\p> X<\p{}> X X X X X X X X X X X X X X -Posix character classes have the form C<[:class:]>, where I is -name, and the C<[:> and C<:]> delimiters. Posix character classes only appear +POSIX character classes have the form C<[:class:]>, where I is +name, and the C<[:> and C<:]> delimiters. 
POSIX character classes only appear I bracketed character classes, and are a convenient and descriptive way of listing a group of characters, though they currently suffer from -portability issues (see below and L). Be -careful about the syntax, +portability issues (see below and L). + +Be careful about the syntax, # Correct: $string =~ /[[:alpha:]]/ @@ -441,7 +485,7 @@ careful about the syntax, The latter pattern would be a character class consisting of a colon, and the letters C, C, C

and C. -These character classes can be part of a larger bracketed character class. For +POSIX character classes can be part of a larger bracketed character class. For example, [01[:alpha:]%] @@ -471,8 +515,7 @@ derived from official Unicode properties.) The table below shows the relation between POSIX character classes and these counterparts. One counterpart, in the column labelled "ASCII-range Unicode" in -the table will only match characters in the ASCII range. (On EBCDIC platforms, -they match those characters which have ASCII equivalents.) +the table, will only match characters in the ASCII character set. The other counterpart, in the column labelled "Full-range Unicode", matches any appropriate characters in the full Unicode character set. For example, @@ -490,10 +533,12 @@ Both the C<\p> forms are unaffected by any locale that is in effect, or whether the string is in UTF-8 format or not, or whether the platform is EBCDIC or not. In contrast, the POSIX character classes are affected. If the source string is in UTF-8 format, the POSIX classes (with the exception of C<[[:punct:]]>, see -Note [5]) behave like their "Full-range" Unicode counterparts. If the source -string is not in UTF-8 format, and no locale is in effect, and the platform is -not EBCDIC, all the POSIX classes behave like their ASCII-range counterparts. -Otherwise, they behave based on the rules of the locale or EBCDIC code page. +Note [5] below) behave like their "Full-range" Unicode counterparts. If the +source string is not in UTF-8 format, and no locale is in effect, and the +platform is not EBCDIC, all the POSIX classes behave like their ASCII-range +counterparts. Otherwise, they behave based on the rules of the locale or +EBCDIC code page. + It is proposed to change this behavior in a future release of Perl so that the the UTF8ness of the source string will be irrelevant to the behavior of the POSIX character classes. 
This means they will always behave in strict @@ -537,7 +582,7 @@ plus 127 (C) are control characters. On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]> to be the EBCDIC equivalents of the ASCII controls, plus the controls -that in Unicode have ordinals from 128 through 139. +that in Unicode have ordinals from 128 through 159. =item [3] @@ -624,7 +669,7 @@ The rule is that if the source string is in UTF-8 format, the character classes match according to the Unicode properties. If the source string isn't, then the character classes match according to whatever locale or EBCDIC code page is in effect. If there is no locale nor EBCDIC, they match the ASCII -defaults (52 letters, 10 digits and underscore for C<\w>; 0 to 9 for C<\d>; +defaults (0 to 9 for C<\d>; 52 letters, 10 digits and underscore for C<\w>; etc.). This usually means that if you are matching against characters whose C @@ -632,7 +677,7 @@ values are between 128 and 255 inclusive, your character class may match or not depending on the current locale or EBCDIC code page, and whether the source string is in UTF-8 format. The string will be in UTF-8 format if it contains characters whose C value exceeds 255. But a string may be in -UTF-8 format without it having such characters. See L. For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s> -- 1.5.6.3 ```
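
The `\d` discussion in this patch can be sketched in a few lines. This is an illustration under stated assumptions, not part of the patch: the helper names are hypothetical, and the code points are written numerically (U+0969 DEVANAGARI DIGIT THREE, U+09EA BENGALI DIGIT FOUR, U+2082 SUBSCRIPT TWO). Strings holding code points above 255 are in UTF-8 format, so `\d` uses Unicode rules for them.

```perl
use strict;
use warnings;

# Hypothetical helpers, each testing a single character.
sub is_decimal_digit       { return $_[0] =~ /\A\d\z/                      ? 1 : 0 }
sub is_ascii_digit         { return $_[0] =~ /\A[0-9]\z/                   ? 1 : 0 }
sub has_numeric_type_digit { return $_[0] =~ /\A\p{Numeric_Type=Digit}\z/  ? 1 : 0 }

my %samples = (
    'U+0969 DEVANAGARI DIGIT THREE' => "\x{0969}",
    'U+09EA BENGALI DIGIT FOUR'     => "\x{09EA}",
    'U+2082 SUBSCRIPT TWO'          => "\x{2082}",
    'U+0038 DIGIT EIGHT'            => "8",
);

for my $name ( sort keys %samples ) {
    my $char = $samples{$name};
    printf "%-32s \\d=%d [0-9]=%d Numeric_Type=Digit=%d\n",
        $name,
        is_decimal_digit($char),
        is_ascii_digit($char),
        has_numeric_type_digit($char);
}
```

An application that only wants ASCII digits can use `[0-9]` instead of `\d`, which sidesteps both the look-alike issue and the mixed-script issue the patch mentions.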

p5pRT commented 14 years ago

From @khwilliamson

0010-Clarify-c-in-perlop.pod.patch

```diff
From 063686b7cb0d39dc7e8c10c416b8dd3c847f04ae Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Sat, 24 Apr 2010 13:44:30 -0600
Subject: [PATCH] Clarify \c in perlop.pod.

And structure the table containing \c better.
---
 pod/perlop.pod |   85 ++++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 61 insertions(+), 24 deletions(-)

diff --git a/pod/perlop.pod b/pod/perlop.pod
index ebe32fb..fc78326 100644
--- a/pod/perlop.pod
+++ b/pod/perlop.pod
@@ -1011,33 +1011,70 @@ from the next line. This allows you to write:
 
 The following escape sequences are available in constructs that interpolate
 and in transliterations.
-X<\t> X<\n> X<\r> X<\f> X<\b> X<\a> X<\e> X<\x> X<\0> X<\c> X<\N>
-
-    \t          tab             (HT, TAB)
-    \n          newline         (NL)
-    \r          return          (CR)
-    \f          form feed       (FF)
-    \b          backspace       (BS)
-    \a          alarm (bell)    (BEL)
-    \e          escape          (ESC)
-    \033        octal char      (example: ESC)
-    \x1b        hex char        (example: ESC)
-    \x{263a}    wide hex char   (example: SMILEY)
-    \c[         control char    (example: ESC)
-    \N{name}    named Unicode character
-    \N{U+263D}  Unicode character (example: FIRST QUARTER MOON)
-
-The character following C<\c> is mapped to some other character by
-converting letters to upper case and then (on ASCII systems) by inverting
-the 7th bit (0x40). The most interesting range is from '@' to '_'
-(0x40 through 0x5F), resulting in a control character from 0x00
-through 0x1F. A '?' maps to the DEL character. On EBCDIC systems only
-'@', the letters, '[', '\', ']', '^', '_' and '?' will work, resulting
-in 0x00 through 0x1F and 0x7F.
+X<\t> X<\n> X<\r> X<\f> X<\b> X<\a> X<\e> X<\x> X<\0> X<\c> X<\N> X<\N{}>
+
+    Sequence    Note  Description
+    \t                tab               (HT, TAB)
+    \n                newline           (NL)
+    \r                return            (CR)
+    \f                form feed         (FF)
+    \b                backspace         (BS)
+    \a                alarm (bell)      (BEL)
+    \e                escape            (ESC)
+    \033              octal char        (example: ESC)
+    \x1b              hex char          (example: ESC)
+    \x{263a}          wide hex char     (example: SMILEY)
+    \c[         [1]   control char      (example: chr(27))
+    \N{name}    [2]   named Unicode character
+    \N{U+263D}  [3]   Unicode character (example: FIRST QUARTER MOON)
+
+=over 4
+
+=item [1]
+
+The character following C<\c> is mapped to some other character as shown in the
+table:
+
+    Sequence   Value
+    \c@        chr(0)
+    \cA        chr(1)
+    \ca        chr(1)
+    \cB        chr(2)
+    \cb        chr(2)
+    ...
+    \cZ        chr(26)
+    \cz        chr(26)
+    \c[        chr(27)
+    \c]        chr(29)
+    \c^        chr(30)
+    \c?        chr(127)
+
+Also, C<\c\I> yields C< chr(28) . "I"> for any I, but cannot come at the
+end of a string, because the backslash would be parsed as escaping the end
+quote.
+
+On ASCII platforms, the resulting characters from the list above are the
+complete set of ASCII controls. This isn't the case on EBCDIC platforms; see
+L for the complete list of what these
+sequences mean on both ASCII and EBCDIC platforms.
+
+Use of any other character following the "c" besides those listed above is
+prohibited on EBCDIC platforms, and discouraged (and may become deprecated or
+forbidden) on ASCII ones. What happens for those other characters currently
+though, is that the value is derived by inverting the 7th bit (0x40).
+
+To get platform independent controls, you can use C<\N{...}>.
+
+=item [2]
+
+For documentation of C<\N{name}>, see L.
+
+=item [3]
 
 C<\N{U+I}> means the Unicode character whose Unicode ordinal
 number is I.
-For documentation of C<\N{name}>, see L.
+
+=back
 
 B: Unlike C and other languages, Perl has no C<\v> escape sequence for
 the vertical tab (VT - ASCII 11), but you may use C<\ck> or C<\x0b>. (C<\v>
-- 
1.5.6.3
```
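
The `\c` mapping table in this patch can be checked directly with `ord`. The sketch below is not part of the patch; `control_codes` is a hypothetical helper, and the values in the comments are those for ASCII platforms only (the patch notes that EBCDIC platforms differ).

```perl
use strict;
use warnings;

# Hypothetical helper: map a few \c escapes to their ordinals.
# Commented values assume an ASCII platform.
sub control_codes {
    return (
        'cA' => ord("\cA"),    # 1
        'ca' => ord("\ca"),    # 1, the letter's case does not matter
        'cZ' => ord("\cZ"),    # 26
        'c[' => ord("\c["),    # 27, ESC
        'c?' => ord("\c?"),    # 127, DEL
    );
}

my %codes = control_codes();
printf "\\%s => %d\n", $_, $codes{$_} for sort keys %codes;
```

For code that must also run on EBCDIC platforms, the patch's own advice applies: prefer `\N{...}` over `\c` escapes.
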
p5pRT commented 14 years ago

From @khwilliamson

0011-perlebcdic.pod-nits-plus-improve-controls-docs.patch ```diff From ac84412f2584c338535f177e87bb4667892cbaf4 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 14:24:25 -0600 Subject: [PATCH] perlebcdic.pod nits plus improve controls docs The controls all now have names, and the part about \c\ has been corrected. The table widths have been changed; all recipes have been tested on the new tables. --- pod/perlebcdic.pod | 665 ++++++++++++++++++++++++++-------------------------- 1 files changed, 329 insertions(+), 336 deletions(-) diff --git a/pod/perlebcdic.pod b/pod/perlebcdic.pod index 28d47b9..f178912 100644 --- a/pod/perlebcdic.pod +++ b/pod/perlebcdic.pod @@ -70,9 +70,7 @@ characters [a-z] and [A-Z], but there were gaps within each Latin alphabet range. Some IBM EBCDIC character sets may be known by character code set -identification numbers (CCSID numbers) or code page numbers. Leading -zero digits in CCSID numbers within this document are insignificant. -E.g. CCSID 0037 may be referred to as 37 in places. +identification numbers (CCSID numbers) or code page numbers. Perl can be compiled on platforms that run any of three commonly used EBCDIC character sets, listed below. @@ -97,7 +95,7 @@ They are: Character code set ID 0037 is a mapping of the ASCII plus Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set. 0037 is used in North American English locales on the OS/400 operating system -that runs on AS/400 computers. CCSID 37 differs from ISO 8859-1 +that runs on AS/400 computers. CCSID 0037 differs from ISO 8859-1 in 237 places, in other words they agree on only 19 code point values. =head2 1047 @@ -216,7 +214,7 @@ you to use different encodings per IO channel. 
For example you may use open($f, ">:encoding(utf8)", "test.utf8"); print $f "Hello World!\n"; -to get four files containing "Hello World!\n" in ASCII, CP 37 EBCDIC, +to get four files containing "Hello World!\n" in ASCII, CP 0037 EBCDIC, ISO 8859-1 (Latin-1) (in this example identical to ASCII since only ASCII characters were printed), and UTF-EBCDIC (in this example identical to normal EBCDIC since only characters @@ -236,10 +234,11 @@ extensions to ASCII have been labelled with character names roughly corresponding to I albeit with substitutions such as s/LATIN// and s/VULGAR// in all cases, s/CAPITAL LETTER// in some cases, and s/SMALL LETTER ([A-Z])/\l$1/ -in some other cases (the C pragma names unfortunately do -not list explicit names for the C0 or C1 control characters). The -"names" of the C1 control set (128..159 in ISO 8859-1) listed here are -somewhat arbitrary. The differences between the 0037 and 1047 sets are +in some other cases. The "names" of the controls listed here are +the Unicode Version 1 names, except for the few that don't have names, in which +case the names in the Wikipedia article were used +(L. +The differences between the 0037 and 1047 sets are flagged with ***. The differences between the 1047 and POSIX-BC sets are flagged with ###. All ord() numbers listed are decimal. 
If you would rather see this table listing octal values then run the table @@ -252,7 +251,7 @@ work with a pod2_other_format translation) through: =back - perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ + perl -ne 'if(/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ -e '{printf("%s%-9o%-9o%-9o%o\n",$1,$2,$3,$4,$5)}' perlebcdic.pod If you want to retain the UTF-x code points then in script form you @@ -266,7 +265,7 @@ might want to write: open(FH,") { - if (/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/) { + if (/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/) { if ($7 ne '' && $9 ne '') { printf("%s%-9o%-9o%-9o%-9o%-3o.%-5o%-3o.%o\n",$1,$2,$3,$4,$5,$6,$7,$8,$9); } @@ -288,7 +287,7 @@ run the table through: =back - perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ + perl -ne 'if(/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ -e '{printf("%s%-9X%-9X%-9X%X\n",$1,$2,$3,$4,$5)}' perlebcdic.pod Or, in order to retain the UTF-x code points in hexadecimal: @@ -301,7 +300,7 @@ Or, in order to retain the UTF-x code points in hexadecimal: open(FH,") { - if (/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/) { + if (/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/) { if ($7 ne '' && $9 ne '') { printf("%s%-9X%-9X%-9X%-9X%-2X.%-6X%-2X.%X\n",$1,$2,$3,$4,$5,$6,$7,$8,$9); } @@ -315,266 +314,265 @@ Or, in order to retain the UTF-x code points in hexadecimal: } - incomp- incomp- - 8859-1 lete lete - chr 0819 0037 1047 POSIX-BC UTF-8 UTF-EBCDIC - ------------------------------------------------------------------------------------ - 0 0 0 0 0 0 - 1 1 1 1 1 1 - 2 2 2 2 2 2 - 3 3 3 3 3 3 - 4 55 55 55 4 55 - 5 45 45 45 5 45 - 6 46 46 46 6 46 - 7 47 47 47 7 47 - 8 22 22 22 8 22 - 9 5 5 5 9 5 - 10 37 21 21 10 21 *** - 11 11 11 11 11 11 -
12 12 12 12 12 12 - 13 13 13 13 13 13 - 14 14 14 14 14 14 - 15 15 15 15 15 15 - 16 16 16 16 16 16 - 17 17 17 17 17 17 - 18 18 18 18 18 18 - 19 19 19 19 19 19 - 20 60 60 60 20 60 - 21 61 61 61 21 61 - 22 50 50 50 22 50 - 23 38 38 38 23 38 - 24 24 24 24 24 24 - 25 25 25 25 25 25 - 26 63 63 63 26 63 - 27 39 39 39 27 39 - 28 28 28 28 28 28 - 29 29 29 29 29 29 - 30 30 30 30 30 30 - 31 31 31 31 31 31 - 32 64 64 64 32 64 - ! 33 90 90 90 33 90 - " 34 127 127 127 34 127 - # 35 123 123 123 35 123 - $ 36 91 91 91 36 91 - % 37 108 108 108 37 108 - & 38 80 80 80 38 80 - ' 39 125 125 125 39 125 - ( 40 77 77 77 40 77 - ) 41 93 93 93 41 93 - * 42 92 92 92 42 92 - + 43 78 78 78 43 78 - , 44 107 107 107 44 107 - - 45 96 96 96 45 96 - . 46 75 75 75 46 75 - / 47 97 97 97 47 97 - 0 48 240 240 240 48 240 - 1 49 241 241 241 49 241 - 2 50 242 242 242 50 242 - 3 51 243 243 243 51 243 - 4 52 244 244 244 52 244 - 5 53 245 245 245 53 245 - 6 54 246 246 246 54 246 - 7 55 247 247 247 55 247 - 8 56 248 248 248 56 248 - 9 57 249 249 249 57 249 - : 58 122 122 122 58 122 - ; 59 94 94 94 59 94 - < 60 76 76 76 60 76 - = 61 126 126 126 61 126 - > 62 110 110 110 62 110 - ? 
63 111 111 111 63 111 - @ 64 124 124 124 64 124 - A 65 193 193 193 65 193 - B 66 194 194 194 66 194 - C 67 195 195 195 67 195 - D 68 196 196 196 68 196 - E 69 197 197 197 69 197 - F 70 198 198 198 70 198 - G 71 199 199 199 71 199 - H 72 200 200 200 72 200 - I 73 201 201 201 73 201 - J 74 209 209 209 74 209 - K 75 210 210 210 75 210 - L 76 211 211 211 76 211 - M 77 212 212 212 77 212 - N 78 213 213 213 78 213 - O 79 214 214 214 79 214 - P 80 215 215 215 80 215 - Q 81 216 216 216 81 216 - R 82 217 217 217 82 217 - S 83 226 226 226 83 226 - T 84 227 227 227 84 227 - U 85 228 228 228 85 228 - V 86 229 229 229 86 229 - W 87 230 230 230 87 230 - X 88 231 231 231 88 231 - Y 89 232 232 232 89 232 - Z 90 233 233 233 90 233 - [ 91 186 173 187 91 173 *** ### - \ 92 224 224 188 92 224 ### - ] 93 187 189 189 93 189 *** - ^ 94 176 95 106 94 95 *** ### - _ 95 109 109 109 95 109 - ` 96 121 121 74 96 121 ### - a 97 129 129 129 97 129 - b 98 130 130 130 98 130 - c 99 131 131 131 99 131 - d 100 132 132 132 100 132 - e 101 133 133 133 101 133 - f 102 134 134 134 102 134 - g 103 135 135 135 103 135 - h 104 136 136 136 104 136 - i 105 137 137 137 105 137 - j 106 145 145 145 106 145 - k 107 146 146 146 107 146 - l 108 147 147 147 108 147 - m 109 148 148 148 109 148 - n 110 149 149 149 110 149 - o 111 150 150 150 111 150 - p 112 151 151 151 112 151 - q 113 152 152 152 113 152 - r 114 153 153 153 114 153 - s 115 162 162 162 115 162 - t 116 163 163 163 116 163 - u 117 164 164 164 117 164 - v 118 165 165 165 118 165 - w 119 166 166 166 119 166 - x 120 167 167 167 120 167 - y 121 168 168 168 121 168 - z 122 169 169 169 122 169 - { 123 192 192 251 123 192 ### - | 124 79 79 79 124 79 - } 125 208 208 253 125 208 ### - ~ 126 161 161 255 126 161 ### - 127 7 7 7 127 7 - 128 32 32 32 194.128 32 - 129 33 33 33 194.129 33 - 130 34 34 34 194.130 34 - 131 35 35 35 194.131 35 - 132 36 36 36 194.132 36 - 133 21 37 37 194.133 37 *** - 134 6 6 6 194.134 6 - 135 23 23 23 194.135 23 - 136 40 40 40 194.136 40 
- 137 41 41 41 194.137 41 - 138 42 42 42 194.138 42 - 139 43 43 43 194.139 43 - 140 44 44 44 194.140 44 - 141 9 9 9 194.141 9 - 142 10 10 10 194.142 10 - 143 27 27 27 194.143 27 - 144 48 48 48 194.144 48 - 145 49 49 49 194.145 49 - 146 26 26 26 194.146 26 - 147 51 51 51 194.147 51 - 148 52 52 52 194.148 52 - 149 53 53 53 194.149 53 - 150 54 54 54 194.150 54 - 151 8 8 8 194.151 8 - 152 56 56 56 194.152 56 - 153 57 57 57 194.153 57 - 154 58 58 58 194.154 58 - 155 59 59 59 194.155 59 - 156 4 4 4 194.156 4 - 157 20 20 20 194.157 20 - 158 62 62 62 194.158 62 - 159 255 255 95 194.159 255 ### - 160 65 65 65 194.160 128.65 - 161 170 170 170 194.161 128.66 - 162 74 74 176 194.162 128.67 ### - 163 177 177 177 194.163 128.68 - 164 159 159 159 194.164 128.69 - 165 178 178 178 194.165 128.70 - 166 106 106 208 194.166 128.71 ### -
167 181 181 181 194.167 128.72 - 168 189 187 121 194.168 128.73 *** ### - 169 180 180 180 194.169 128.74 - 170 154 154 154 194.170 128.81 - 171 138 138 138 194.171 128.82 - 172 95 176 186 194.172 128.83 *** ### - 173 202 202 202 194.173 128.84 - 174 175 175 175 194.174 128.85 - 175 188 188 161 194.175 128.86 ### - 176 144 144 144 194.176 128.87 - 177 143 143 143 194.177 128.88 - 178 234 234 234 194.178 128.89 - 179 250 250 250 194.179 128.98 - 180 190 190 190 194.180 128.99 - 181 160 160 160 194.181 128.100 - 182 182 182 182 194.182 128.101 - 183 179 179 179 194.183 128.102 - 184 157 157 157 194.184 128.103 - 185 218 218 218 194.185 128.104 - 186 155 155 155 194.186 128.105 - 187 139 139 139 194.187 128.106 - 188 183 183 183 194.188 128.112 - 189 184 184 184 194.189 128.113 - 190 185 185 185 194.190 128.114 - 191 171 171 171 194.191 128.115 - 192 100 100 100 195.128 138.65 - 193 101 101 101 195.129 138.66 - 194 98 98 98 195.130 138.67 - 195 102 102 102 195.131 138.68 - 196 99 99 99 195.132 138.69 - 197 103 103 103 195.133 138.70 - 198 158 158 158 195.134 138.71 - 199 104 104 104 195.135 138.72 - 200 116 116 116 195.136 138.73 - 201 113 113 113 195.137 138.74 - 202 114 114 114 195.138 138.81 - 203 115 115 115 195.139 138.82 - 204 120 120 120 195.140 138.83 - 205 117 117 117 195.141 138.84 - 206 118 118 118 195.142 138.85 - 207 119 119 119 195.143 138.86 - 208 172 172 172 195.144 138.87 - 209 105 105 105 195.145 138.88 - 210 237 237 237 195.146 138.89 - 211 238 238 238 195.147 138.98 - 212 235 235 235 195.148 138.99 - 213 239 239 239 195.149 138.100 - 214 236 236 236 195.150 138.101 - 215 191 191 191 195.151 138.102 - 216 128 128 128 195.152 138.103 - 217 253 253 224 195.153 138.104 ### - 218 254 254 254 195.154 138.105 - 219 251 251 221 195.155 138.106 ### - 220 252 252 252 195.156 138.112 - 221 173 186 173 195.157 138.113 *** ### - 222 174 174 174 195.158 138.114 - 223 89 89 89 195.159 138.115 - 224 68 68 68 195.160 139.65 - 225 69 69 69 195.161 139.66 - 226 66 66 
66 195.162 139.67 - 227 70 70 70 195.163 139.68 - 228 67 67 67 195.164 139.69 - 229 71 71 71 195.165 139.70 - 230 156 156 156 195.166 139.71 - 231 72 72 72 195.167 139.72 - 232 84 84 84 195.168 139.73 - 233 81 81 81 195.169 139.74 - 234 82 82 82 195.170 139.81 - 235 83 83 83 195.171 139.82 - 236 88 88 88 195.172 139.83 - 237 85 85 85 195.173 139.84 - 238 86 86 86 195.174 139.85 - 239 87 87 87 195.175 139.86 - 240 140 140 140 195.176 139.87 - 241 73 73 73 195.177 139.88 - 242 205 205 205 195.178 139.89 - 243 206 206 206 195.179 139.98 - 244 203 203 203 195.180 139.99 - 245 207 207 207 195.181 139.100 - 246 204 204 204 195.182 139.101 - 247 225 225 225 195.183 139.102 - 248 112 112 112 195.184 139.103 - 249 221 221 192 195.185 139.104 ### - 250 222 222 222 195.186 139.105 - 251 219 219 219 195.187 139.106 - 252 220 220 220 195.188 139.112 - 253 141 141 141 195.189 139.113 - 254 142 142 142 195.190 139.114 - 255 223 223 223 195.191 139.115 + ISO 8859-1 CCSID CCSID CCSID 1047 + chr CCSID 0819 0037 1047 POSIX-BC UTF-8 UTF-EBCDIC + ---------------------------------------------------------------------------------------------- + 0 0 0 0 0 0 + 1 1 1 1 1 1 + 2 2 2 2 2 2 + 3 3 3 3 3 3 + 4 55 55 55 4 55 + 5 45 45 45 5 45 + 6 46 46 46 6 46 + 7 47 47 47 7 47 + 8 22 22 22 8 22 + 9 5 5 5 9 5 + 10 37 21 21 10 21 *** + 11 11 11 11 11 11 + 12 12 12 12 12 12 + 13 13 13 13 13 13 + 14 14 14 14 14 14 + 15 15 15 15 15 15 + 16 16 16 16 16 16 + 17 17 17 17 17 17 + 18 18 18 18 18 18 + 19 19 19 19 19 19 + 20 60 60 60 20 60 + 21 61 61 61 21 61 + 22 50 50 50 22 50 + 23 38 38 38 23 38 + 24 24 24 24 24 24 + 25 25 25 25 25 25 + 26 63 63 63 26 63 + 27 39 39 39 27 39 + 28 28 28 28 28 28 + 29 29 29 29 29 29 + 30 30 30 30 30 30 + 31 31 31 31 31 31 + 32 64 64 64 32 64 + ! 
33 90 90 90 33 90 + " 34 127 127 127 34 127 + # 35 123 123 123 35 123 + $ 36 91 91 91 36 91 + % 37 108 108 108 37 108 + & 38 80 80 80 38 80 + ' 39 125 125 125 39 125 + ( 40 77 77 77 40 77 + ) 41 93 93 93 41 93 + * 42 92 92 92 42 92 + + 43 78 78 78 43 78 + , 44 107 107 107 44 107 + - 45 96 96 96 45 96 + . 46 75 75 75 46 75 + / 47 97 97 97 47 97 + 0 48 240 240 240 48 240 + 1 49 241 241 241 49 241 + 2 50 242 242 242 50 242 + 3 51 243 243 243 51 243 + 4 52 244 244 244 52 244 + 5 53 245 245 245 53 245 + 6 54 246 246 246 54 246 + 7 55 247 247 247 55 247 + 8 56 248 248 248 56 248 + 9 57 249 249 249 57 249 + : 58 122 122 122 58 122 + ; 59 94 94 94 59 94 + < 60 76 76 76 60 76 + = 61 126 126 126 61 126 + > 62 110 110 110 62 110 + ? 63 111 111 111 63 111 + @ 64 124 124 124 64 124 + A 65 193 193 193 65 193 + B 66 194 194 194 66 194 + C 67 195 195 195 67 195 + D 68 196 196 196 68 196 + E 69 197 197 197 69 197 + F 70 198 198 198 70 198 + G 71 199 199 199 71 199 + H 72 200 200 200 72 200 + I 73 201 201 201 73 201 + J 74 209 209 209 74 209 + K 75 210 210 210 75 210 + L 76 211 211 211 76 211 + M 77 212 212 212 77 212 + N 78 213 213 213 78 213 + O 79 214 214 214 79 214 + P 80 215 215 215 80 215 + Q 81 216 216 216 81 216 + R 82 217 217 217 82 217 + S 83 226 226 226 83 226 + T 84 227 227 227 84 227 + U 85 228 228 228 85 228 + V 86 229 229 229 86 229 + W 87 230 230 230 87 230 + X 88 231 231 231 88 231 + Y 89 232 232 232 89 232 + Z 90 233 233 233 90 233 + [ 91 186 173 187 91 173 *** ### + \ 92 224 224 188 92 224 ### + ] 93 187 189 189 93 189 *** + ^ 94 176 95 106 94 95 *** ### + _ 95 109 109 109 95 109 + ` 96 121 121 74 96 121 ### + a 97 129 129 129 97 129 + b 98 130 130 130 98 130 + c 99 131 131 131 99 131 + d 100 132 132 132 100 132 + e 101 133 133 133 101 133 + f 102 134 134 134 102 134 + g 103 135 135 135 103 135 + h 104 136 136 136 104 136 + i 105 137 137 137 105 137 + j 106 145 145 145 106 145 + k 107 146 146 146 107 146 + l 108 147 147 147 108 147 + m 109 148 148 148 109 148 + n 
110 149 149 149 110 149 + o 111 150 150 150 111 150 + p 112 151 151 151 112 151 + q 113 152 152 152 113 152 + r 114 153 153 153 114 153 + s 115 162 162 162 115 162 + t 116 163 163 163 116 163 + u 117 164 164 164 117 164 + v 118 165 165 165 118 165 + w 119 166 166 166 119 166 + x 120 167 167 167 120 167 + y 121 168 168 168 121 168 + z 122 169 169 169 122 169 + { 123 192 192 251 123 192 ### + | 124 79 79 79 124 79 + } 125 208 208 253 125 208 ### + ~ 126 161 161 255 126 161 ### + 127 7 7 7 127 7 + 128 32 32 32 194.128 32 + 129 33 33 33 194.129 33 + 130 34 34 34 194.130 34 + 131 35 35 35 194.131 35 + 132 36 36 36 194.132 36 + 133 21 37 37 194.133 37 *** + 134 6 6 6 194.134 6 + 135 23 23 23 194.135 23 + 136 40 40 40 194.136 40 + 137 41 41 41 194.137 41 + 138 42 42 42 194.138 42 + 139 43 43 43 194.139 43 + 140 44 44 44 194.140 44 + 141 9 9 9 194.141 9 + 142 10 10 10 194.142 10 + 143 27 27 27 194.143 27 + 144 48 48 48 194.144 48 + 145 49 49 49 194.145 49 + 146 26 26 26 194.146 26 + 147 51 51 51 194.147 51 + 148 52 52 52 194.148 52 + 149 53 53 53 194.149 53 + 150 54 54 54 194.150 54 + 151 8 8 8 194.151 8 + 152 56 56 56 194.152 56 + 153 57 57 57 194.153 57 + 154 58 58 58 194.154 58 + 155 59 59 59 194.155 59 + 156 4 4 4 194.156 4 + 157 20 20 20 194.157 20 + 158 62 62 62 194.158 62 + 159 255 255 95 194.159 255 ### + 160 65 65 65 194.160 128.65 + 161 170 170 170 194.161 128.66 + 162 74 74 176 194.162 128.67 ### + 163 177 177 177 194.163 128.68 + 164 159 159 159 194.164 128.69 + 165 178 178 178 194.165 128.70 + 166 106 106 208 194.166 128.71 ### +
167 181 181 181 194.167 128.72 + 168 189 187 121 194.168 128.73 *** ### + 169 180 180 180 194.169 128.74 + 170 154 154 154 194.170 128.81 + 171 138 138 138 194.171 128.82 + 172 95 176 186 194.172 128.83 *** ### + 173 202 202 202 194.173 128.84 + 174 175 175 175 194.174 128.85 + 175 188 188 161 194.175 128.86 ### + 176 144 144 144 194.176 128.87 + 177 143 143 143 194.177 128.88 + 178 234 234 234 194.178 128.89 + 179 250 250 250 194.179 128.98 + 180 190 190 190 194.180 128.99 + 181 160 160 160 194.181 128.100 + 182 182 182 182 194.182 128.101 + 183 179 179 179 194.183 128.102 + 184 157 157 157 194.184 128.103 + 185 218 218 218 194.185 128.104 + 186 155 155 155 194.186 128.105 + 187 139 139 139 194.187 128.106 + 188 183 183 183 194.188 128.112 + 189 184 184 184 194.189 128.113 + 190 185 185 185 194.190 128.114 + 191 171 171 171 194.191 128.115 + 192 100 100 100 195.128 138.65 + 193 101 101 101 195.129 138.66 + 194 98 98 98 195.130 138.67 + 195 102 102 102 195.131 138.68 + 196 99 99 99 195.132 138.69 + 197 103 103 103 195.133 138.70 + 198 158 158 158 195.134 138.71 + 199 104 104 104 195.135 138.72 + 200 116 116 116 195.136 138.73 + 201 113 113 113 195.137 138.74 + 202 114 114 114 195.138 138.81 + 203 115 115 115 195.139 138.82 + 204 120 120 120 195.140 138.83 + 205 117 117 117 195.141 138.84 + 206 118 118 118 195.142 138.85 + 207 119 119 119 195.143 138.86 + 208 172 172 172 195.144 138.87 + 209 105 105 105 195.145 138.88 + 210 237 237 237 195.146 138.89 + 211 238 238 238 195.147 138.98 + 212 235 235 235 195.148 138.99 + 213 239 239 239 195.149 138.100 + 214 236 236 236 195.150 138.101 + 215 191 191 191 195.151 138.102 + 216 128 128 128 195.152 138.103 + 217 253 253 224 195.153 138.104 ### + 218 254 254 254 195.154 138.105 + 219 251 251 221 195.155 138.106 ### + 220 252 252 252 195.156 138.112 + 221 173 186 173 195.157 138.113 *** ### + 222 174 174 174 195.158 138.114 + 223 89 89 89 195.159 138.115 + 224 68 68 68 195.160 139.65 + 225 69 69 69 195.161 139.66 + 226 66 66 
66 195.162 139.67 + 227 70 70 70 195.163 139.68 + 228 67 67 67 195.164 139.69 + 229 71 71 71 195.165 139.70 + 230 156 156 156 195.166 139.71 + 231 72 72 72 195.167 139.72 + 232 84 84 84 195.168 139.73 + 233 81 81 81 195.169 139.74 + 234 82 82 82 195.170 139.81 + 235 83 83 83 195.171 139.82 + 236 88 88 88 195.172 139.83 + 237 85 85 85 195.173 139.84 + 238 86 86 86 195.174 139.85 + 239 87 87 87 195.175 139.86 + 240 140 140 140 195.176 139.87 + 241 73 73 73 195.177 139.88 + 242 205 205 205 195.178 139.89 + 243 206 206 206 195.179 139.98 + 244 203 203 203 195.180 139.99 + 245 207 207 207 195.181 139.100 + 246 204 204 204 195.182 139.101 + 247 225 225 225 195.183 139.102 + 248 112 112 112 195.184 139.103 + 249 221 221 192 195.185 139.104 ### + 250 222 222 222 195.186 139.105 + 251 219 219 219 195.187 139.106 + 252 220 220 220 195.188 139.112 + 253 141 141 141 195.189 139.113 + 254 142 142 142 195.190 139.114 + 255 223 223 223 195.191 139.115 If you would rather see the above table in CCSID 0037 order rather than ASCII + Latin-1 order then run the table through: @@ -585,14 +583,14 @@ ASCII + Latin-1 order then run the table through: =back - perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\ + perl -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\ -e '{push(@l,$_)}' \ -e 'END{print map{$_->[0]}' \ -e ' sort{$a->[1] <=> $b->[1]}' \ - -e ' map{[$_,substr($_,42,3)]}@l;}' perlebcdic.pod + -e ' map{[$_,substr($_,52,3)]}@l;}' perlebcdic.pod -If you would rather see it in CCSID 1047 order then change the digit -42 in the last line to 51, like this: +If you would rather see it in CCSID 1047 order then change the number +52 in the last line to 61, like this: =over 4 @@ -600,14 +598,14 @@ If you would rather see it in CCSID 1047 order then change the digit =back - perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\ + perl -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\ -e '{push(@l,$_)}' \ -e 
'END{print map{$_->[0]}' \ -e ' sort{$a->[1] <=> $b->[1]}' \ - -e ' map{[$_,substr($_,51,3)]}@l;}' perlebcdic.pod + -e ' map{[$_,substr($_,61,3)]}@l;}' perlebcdic.pod -If you would rather see it in POSIX-BC order then change the digit -51 in the last line to 60, like this: +If you would rather see it in POSIX-BC order then change the number +61 in the last line to 70, like this: =over 4 @@ -615,11 +613,11 @@ If you would rather see it in POSIX-BC order then change the digit =back - perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\ + perl -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\ -e '{push(@l,$_)}' \ -e 'END{print map{$_->[0]}' \ -e ' sort{$a->[1] <=> $b->[1]}' \ - -e ' map{[$_,substr($_,60,3)]}@l;}' perlebcdic.pod + -e ' map{[$_,substr($_,70,3)]}@l;}' perlebcdic.pod =head1 IDENTIFYING CHARACTER CODE SETS @@ -758,58 +756,55 @@ an example adapted from the one in L: An interesting property of the 32 C0 control characters in the ASCII table is that they can "literally" be constructed -as control characters in perl, e.g. C<(chr(0) eq "\c@")> -C<(chr(1) eq "\cA")>, and so on. Perl on EBCDIC platforms has been -ported to take "\c@" to chr(0) and "\cA" to chr(1) as well, but the +as control characters in perl, e.g. C<(chr(0) eq C<\c@>)> +C<(chr(1) eq C<\cA>)>, and so on. Perl on EBCDIC platforms has been +ported to take C<\c@> to chr(0) and C<\cA> to chr(1), etc. as well, but the thirty three characters that result depend on which code page you are -using. The table below uses the character names from the previous table -but with substitutions such as s/START OF/S.O./; s/END OF /E.O./; -s/TRANSMISSION/TRANS./; s/TABULATION/TAB./; s/VERTICAL/VERT./; -s/HORIZONTAL/HORIZ./; s/DEVICE CONTROL/D.C./; s/SEPARATOR/SEP./; -s/NEGATIVE ACKNOWLEDGE/NEG. ACK./;. The POSIX-BC and 1047 sets are +using. The table below uses the standard acronyms for the controls. 
+The POSIX-BC and 1047 sets are identical throughout this range and differ from the 0037 set at only one spot (21 decimal). Note that the C character -may be generated by "\cJ" on ASCII platforms but by "\cU" on 1047 or POSIX-BC +may be generated by C<\cJ> on ASCII platforms but by C<\cU> on 1047 or POSIX-BC platforms and cannot be generated as a C<"\c.letter."> control character on -0037 platforms. Note also that "\c\\" maps to two characters -not one. - - chr ord 8859-1 0037 1047 && POSIX-BC - ------------------------------------------------------------------------ - "\c?" 127 " " ***>< - "\c@" 0 ***>< - "\cA" 1 - "\cB" 2 - "\cC" 3 - "\cD" 4 - "\cE" 5 - "\cF" 6 - "\cG" 7 - "\cH" 8 - "\cI" 9 - "\cJ" 10 - "\cK" 11 - "\cL" 12 - "\cM" 13 - "\cN" 14 - "\cO" 15 - "\cP" 16 - "\cQ" 17 - "\cR" 18 - "\cS" 19 - "\cT" 20 - "\cU" 21 *** - "\cV" 22 - "\cW" 23 - "\cX" 24 - "\cY" 25 - "\cZ" 26 - "\c[" 27 - "\c\\" 28 \ \ \ - "\c]" 29 - "\c^" 30 ***>< - "\c_" 31 ***>< - +0037 platforms. Note also that C<\c\> cannot be the final element in a string +or regex, as it will absorb the terminator. But C<\c\I> is a C concatenated with I for all I. + + chr ord 8859-1 0037 1047 && POSIX-BC + ------------------------------------------------------------------------ + \c? 
127 " " + \c@ 0 + \cA 1 + \cB 2 + \cC 3 + \cD 4 + \cE 5 + \cF 6 + \cG 7 + \cH 8 + \cI 9 + \cJ 10 + \cK 11 + \cL 12 + \cM 13 + \cN 14 + \cO 15 + \cP 16 + \cQ 17 + \cR 18 + \cS 19 + \cT 20 + \cU 21 *** + \cV 22 + \cW 23 + \cX 24 + \cY 25 + \cZ 26 + \c[ 27 + \c\X 28 X X X + \c] 29 + \c^ 30 + \c_ 31 =head1 FUNCTION DIFFERENCES @@ -948,7 +943,7 @@ four coded character sets discussed in this document is as follows: if (ord('^')==94) { # ascii return $char =~ /[\000-\037]/; } - if (ord('^')==176) { # 37 + if (ord('^')==176) { # 0037 return $char =~ /[\000-\003\067\055-\057\026\005\045\013-\023\074\075\062\046\030\031\077\047\034-\037]/; } if (ord('^')==95 || ord('^')==106) { # 1047 || posix-bc @@ -976,7 +971,7 @@ four coded character sets discussed in this document is as follows: if (ord('^')==94) { # ascii return $char =~ /[\200-\237]/; } - if (ord('^')==176) { # 37 + if (ord('^')==176) { # 0037 return $char =~ /[\040-\044\025\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\040\024\076\377]/; } if (ord('^')==95) { # 1047 @@ -993,7 +988,7 @@ four coded character sets discussed in this document is as follows: if (ord('^')==94) { # ascii return $char =~ /[\240-\377]/; } - if (ord('^')==176) { # 37 + if (ord('^')==176) { # 0037 return $char =~ /[\101\252\112\261\237\262\152\265\275\264\232\212\137\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/; } @@ -1427,5 +1422,3 @@ Thanks also to Vickie Cooper, Philip Newton, William Raffloer, and Joe Smith. Trademarks, registered trademarks, service marks and registered service marks used in this document are the property of their respective owners. - - -- 1.5.6.3 ```
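For anyone reading the last hunks: the checks there key off `ord('^')`, which is 94, 176, 95 and 106 in the four code sets covered by the table. A minimal sketch of that idea follows. The helper name `code_set_name` is hypothetical and is not part of the patch or of perlebcdic.pod.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the patch): name the code set from ord('^'),
# using the same four values that the patched examples compare against.
sub code_set_name {
    my ($caret) = @_;
    return 'ISO 8859-1 / ASCII' if $caret == 94;
    return 'CCSID 0037'         if $caret == 176;
    return 'CCSID 1047'         if $caret == 95;
    return 'POSIX-BC'           if $caret == 106;
    return 'unknown';
}

# Report the code set of the platform this script runs on.
print code_set_name(ord('^')), "\n";
```

Passing the ordinal in as an argument, rather than calling `ord('^')` inside the helper, keeps the mapping checkable on an ASCII machine.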
p5pRT commented 14 years ago

From @rgs

Thanks, applied to bleadperl.

p5pRT commented 14 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 14 years ago

@rgs - Status changed from 'open' to 'resolved'