DOC PATCH: various fixes to pods

~~From @khwilliamson~~

0009-Edits-to-perlrecharclass.pod.patch
```diff From 080d3a3888e53704540f96e1a616c86787d34864 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 13:35:34 -0600 Subject: [PATCH] Edits to perlrecharclass.pod A number of clarification and wording edits have been made, fixing some broken links, and details especially on \d in the Unicode range. Fixed an incorrect character ordinal --- pod/perlrecharclass.pod | 241 ++++++++++++++++++++++++++++------------------- 1 files changed, 143 insertions(+), 98 deletions(-) diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod index 7c92008..047915b 100644 --- a/pod/perlrecharclass.pod +++ b/pod/perlrecharclass.pod @@ -9,27 +9,29 @@ The top level documentation about Perl regular expressions is found in L. This manual page discusses the syntax and use of character -classes in Perl Regular Expressions. +classes in Perl regular expressions. -A character class is a way of denoting a set of characters, +A character class is a way of denoting a set of characters in such a way that one character of the set is matched. -It's important to remember that matching a character class +It's important to remember that: matching a character class consumes exactly one character in the source string. (The source string is the string the regular expression is matched against.) There are three types of character classes in Perl regular -expressions: the dot, backslashed sequences, and the form enclosed in square +expressions: the dot, backslash sequences, and the form enclosed in square brackets. Keep in mind, though, that often the term "character class" is used -to mean just the bracketed form. This is true in other Perl documentation. +to mean just the bracketed form. Certainly, most Perl documentation does that. =head2 The dot The dot (or period), C<.> is probably the most used, and certainly the most well-known character class. By default, a dot matches any character, except for the newline. 
The default can be changed to -add matching the newline with the I modifier: either -for the entire regular expression using the C modifier, or -locally using C<(?s)>. +add matching the newline by using the I modifier: either +for the entire regular expression with the C modifier, or +locally with C<(?s)>. (The experimental C<\N> backslash sequence, described +below, matches any character except newline without regard to the +I modifier.) Here are some examples: @@ -41,53 +43,80 @@ Here are some examples: "\n" =~ /(?s:.)/ # Match (local 'single line' modifier) "ab" =~ /^.$/ # No match (dot matches one character) -=head2 Backslashed sequences +=head2 Backslash sequences X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P> X<\N> X<\v> X<\V> X<\h> X<\H> X X -Perl regular expressions contain many backslashed sequences that -constitute a character class. That is, they will match a single -character, if that character belongs to a specific set of characters -(defined by the sequence). A backslashed sequence is a sequence of -characters starting with a backslash. Not all backslashed sequences -are character classes; for a full list, see L. +A backslash sequence is a sequence of characters, the first one of which is a +backslash. Perl ascribes special meaning to many such sequences, and some of +these are character classes. That is, they match a single character each, +provided that the character belongs to the specific set of characters defined +by the sequence. -Here's a list of the backslashed sequences that are character classes. They -are discussed in more detail below. +Here's a list of the backslash sequences that are character classes. They +are discussed in more detail below. (For the backslash sequences that aren't +character classes, see L.) - \d Match a digit character. - \D Match a non-digit character. + \d Match a decimal digit character. + \D Match a non-decimal-digit character. \w Match a "word" character. \W Match a non-"word" character. 
\s Match a whitespace character. \S Match a non-whitespace character. \h Match a horizontal whitespace character. \H Match a character that isn't horizontal whitespace. - \N Match a character that isn't newline. Experimental. \v Match a vertical whitespace character. \V Match a character that isn't vertical whitespace. - \pP, \p{Prop} Match a character matching a Unicode property. - \PP, \P{Prop} Match a character that doesn't match a Unicode property. + \N Match a character that isn't a newline. Experimental. + \pP, \p{Prop} Match a character that has the given Unicode property. + \PP, \P{Prop} Match a character that doesn't have the given Unicode property =head3 Digits -C<\d> matches a single character that is considered to be a I. What is -considered a digit depends on the internal encoding of the source string and -the locale that is in effect. If the source string is in UTF-8 format, C<\d> -not only matches the digits '0' - '9', but also Arabic, Devanagari and digits -from other languages. Otherwise, if there is a locale in effect, it will match -whatever characters the locale considers digits. Without a locale, C<\d> -matches the digits '0' to '9'. See L. +C<\d> matches a single character that is considered to be a decimal I. +What is considered a decimal digit depends on the internal encoding of the +source string and the locale that is in effect. If the source string is in +UTF-8 format, C<\d> not only matches the digits '0' - '9', but also Arabic, +Devanagari and digits from other languages. Otherwise, if there is a locale in +effect, it will match whatever characters the locale considers decimal digits. +Without a locale, C<\d> matches just the digits '0' to '9'. +See L. + +Unicode digits may cause some confusion, and some security issues. In UTF-8 +strings, C<\d> matches the same characters matched by +C<\p{General_Category=Decimal_Number}>, or synonymously, +C<\p{General_Category=Digit}>. 
Starting with Unicode version 4.1, this is the +same set of characters matched by C<\p{Numeric_Type=Decimal}>. + +But Unicode also has a different property with a similar name, +C<\p{Numeric_Type=Digit}>, which matches a completely different set of +characters. These characters are things such as subscripts. + +The design intent is for C<\d> to match all the digits (and no other characters) +that can be used with "normal" big-endian positional decimal syntax, whereby a +sequence of such digits {N0, N1, N2, ...Nn} has the numeric value (...(N0 * 10 ++ N1) * 10 + N2) * 10 ... + Nn). In Unicode 5.2, the Tamil digits (U+0BE6 - +U+0BEF) can also legally be used in old-style Tamil numbers in which they would +appear no more than one in a row, separated by characters that mean "times 10", +"times 100", etc. (See L.) + +Some of the non-European digits that C<\d> matches look like European ones, but +have different values. For example, BENGALI DIGIT FOUR (U+09A) looks very much +like an ASCII DIGIT EIGHT (U+0038). + +It may be useful for security purposes for an application to require that all +digits in a row be from the same script. See L. Any character that isn't matched by C<\d> will be matched by C<\D>. =head3 Word characters A C<\w> matches a single alphanumeric character (an alphabetic character, or a -decimal digit) or an underscore (C<_>), not a whole word. Use C<\w+> to match -a string of Perl-identifier characters (which isn't the same as matching an -English word). What is considered a word character depends on the internal +decimal digit) or an underscore (C<_>), not a whole word. To match a whole +word, use C<\w+>. This isn't the same thing as matching an English word, but +is the same as a string of Perl-identifier characters. What is considered a +word character depends on the internal encoding of the string and the locale or EBCDIC code page that is in effect. 
If it's in UTF-8 format, C<\w> matches those characters that are considered word characters in the Unicode database. That is, it not only matches ASCII letters, @@ -97,48 +126,43 @@ the current locale or EBCDIC code page. Without a locale or EBCDIC code page, C<\w> matches the ASCII letters, digits and the underscore. See L. +There are a number of security issues with the full Unicode list of word +characters. See L. + +Also, for a somewhat finer-grained set of characters that are in programming +language identifiers beyond the ASCII range, you may wish to instead use the +more customized Unicode properties, "ID_Start", ID_Continue", "XID_Start", and +"XID_Continue". See L. + Any character that isn't matched by C<\w> will be matched by C<\W>. =head3 Whitespace -C<\s> matches any single character that is considered whitespace. In the ASCII -range, C<\s> matches the horizontal tab (C<\t>), the new line (C<\n>), the form -feed (C<\f>), the carriage return (C<\r>), and the space. (The vertical tab, -C<\cK> is not matched by C<\s>.) The exact set of characters matched by C<\s> -depends on whether the source string is in UTF-8 format and the locale or -EBCDIC code page that is in effect. If it's in UTF-8 format, C<\s> matches what -is considered whitespace in the Unicode database; the complete list is in the -table below. Otherwise, if there is a locale or EBCDIC code page in effect, -C<\s> matches whatever is considered whitespace by the current locale or EBCDIC -code page. Without a locale or EBCDIC code page, C<\s> matches the five -characters mentioned in the beginning of this paragraph. Perhaps the most -notable possible surprise is that C<\s> matches a non-breaking space only if -the non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC -code page that is in effect has that character. +C<\s> matches any single character that is considered whitespace. 
The exact +set of characters matched by C<\s> depends on whether the source string is in +UTF-8 format and the locale or EBCDIC code page that is in effect. If it's in +UTF-8 format, C<\s> matches what is considered whitespace in the Unicode +database; the complete list is in the table below. Otherwise, if there is a +locale or EBCDIC code page in effect, C<\s> matches whatever is considered +whitespace by the current locale or EBCDIC code page. Without a locale or +EBCDIC code page, C<\s> matches the horizontal tab (C<\t>), the newline +(C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the space. +(Note that it doesn't match the vertical tab, C<\cK>.) Perhaps the most notable +possible surprise is that C<\s> matches a non-breaking space only if the +non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC code +page that is in effect has that character. See L. Any character that isn't matched by C<\s> will be matched by C<\S>. C<\h> will match any character that is considered horizontal whitespace; -this includes the space and the tab characters and 17 other characters that are -listed in the table below. C<\H> will match any character +this includes the space and the tab characters and a number other characters, +all of which are listed in the table below. C<\H> will match any character that is not considered horizontal whitespace. -C<\N> is new in 5.12, and is experimental. It, like the dot, will match any -character that is not a newline. The difference is that C<\N> will not be -influenced by the single line C regular expression modifier. Note that -there is a second meaning of C<\N> when of the form C<\N{...}>. This form is -for named characters. See L for those. If C<\N> is followed by an -opening brace and something that is not a quantifier, perl will assume that a -character name is coming, and not this meaning of C<\N>. 
For example, C<\N{3}> -means to match 3 non-newlines; C<\N{5,}> means to match 5 or more non-newlines, -but C<\N{4F}> and C<\N{F4}> are not legal quantifiers, and will cause perl to -look for characters named C<4F> or C, respectively (and won't find them, -thus raising an error, unless they have been defined using custom names). - C<\v> will match any character that is considered vertical whitespace; -this includes the carriage return and line feed characters (newline) plus 5 -other characters listed in the table below. +this includes the carriage return and line feed characters (newline) plus several +other characters, all listed in the table below. C<\V> will match any character that is not considered vertical whitespace. C<\R> matches anything that can be considered a newline under Unicode @@ -156,10 +180,10 @@ One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. The vertical tab (C<"\x0b">) is not matched by C<\s>, it is however considered vertical whitespace. Furthermore, if the source string is not in UTF-8 format, and any locale or EBCDIC code page that is in effect doesn't include them, the -next line (C<"\x85">) and the no-break space (C<"\xA0">) characters are not -matched by C<\s>, but are by C<\v> and C<\h> respectively. If the source -string is in UTF-8 format, both the next line and the no-break space are -matched by C<\s>. +next line (ASCII-platform C<"\x85">) and the no-break space (ASCII-platform +C<"\xA0">) characters are not matched by C<\s>, but are by C<\v> and C<\h> +respectively. If the source string is in UTF-8 format, both the next line and +the no-break space are matched by C<\s>. The following table is a complete listing of characters matched by C<\s>, C<\h> and C<\v> as of Unicode 5.2. @@ -209,6 +233,19 @@ It is worth noting that C<\d>, C<\w>, etc, match single characters, not complete numbers or words. To match a number (that consists of integers), use C<\d+>; to match a word, use C<\w+>. 
+=head3 \N + +C<\N> is new in 5.12, and is experimental. It, like the dot, will match any +character that is not a newline. The difference is that C<\N> is not influenced +by the I regular expression modifier (see L above). Note +that the form C<\N{...}> may mean something completely different. When the +C<{...}> is a L, it means to match a non-newline +character that many times. For example, C<\N{3}> means to match 3 +non-newlines; C<\N{5,}> means to match 5 or more non-newlines. But if C<{...}> +is not a legal quantifier, it is presumed to be a named character. See +L for those. For example, none of C<\N{COLON}>, C<\N{4F}>, and +C<\N{F4}> contain legal quantifiers, so Perl will try to find characters whose +names are, respectively, C, C<4F>, and C. =head3 Unicode Properties @@ -263,13 +300,13 @@ L. =head2 Bracketed Character Classes The third form of character class you can use in Perl regular expressions -is the bracketed form. In its simplest form, it lists the characters +is the bracketed character class. In its simplest form, it lists the characters that may be matched, surrounded by square brackets, like this: C<[aeiou]>. This matches one of C, C, C, C or C. Like the other character classes, exactly one character will be matched. To match a longer string consisting of characters mentioned in the character -class, follow the character class with a quantifier. For instance, -C<[aeiou]+> matches a string of one or more lowercase ASCII vowels. +class, follow the character class with a L. For +instance, C<[aeiou]+> matches a string of one or more lowercase English vowels. Repeating a character in a character class has no effect; it's considered to be in the set only once. @@ -297,7 +334,7 @@ escaped with a backslash, although this is sometimes not needed, in which case the backslash may be omitted. The sequence C<\b> is special inside a bracketed character class. 
While -outside the character class C<\b> is an assertion indicating a point +outside the character class, C<\b> is an assertion indicating a point that does not have either two word characters or two non-word characters on either side, inside a bracketed character class, C<\b> matches a backspace character. @@ -320,12 +357,14 @@ class. Also, a backslash followed by two or three octal digits is considered an octal number. -A C<[> is not special inside a character class, unless it's the start -of a POSIX character class (see below). It normally does not need escaping. +A C<[> is not special inside a character class, unless it's the start of a +POSIX character class (see L below). It normally does +not need escaping. -A C<]> is normally either the end of a POSIX character class (see below), or it -signals the end of the bracketed character class. If you want to include a -C<]> in the set of characters, you must generally escape it. +A C<]> is normally either the end of a POSIX character class (see +L below), or it signals the end of the bracketed +character class. If you want to include a C<]> in the set of characters, you +must generally escape it. However, if the C<]> is the I (or the second if the first character is a caret) character of a bracketed character class, it does not denote the end of the class (as you cannot have an empty class) @@ -362,7 +401,7 @@ a platform that uses a different character set, such as EBCDIC. If a hyphen in a character class cannot syntactically be part of a range, for instance because it is the first or the last character of the character class, or if it immediately follows a range, the hyphen isn't special, and will be -considered a character that may be matched literally. You have to escape the +considered a character that is to be matched literally. 
You have to escape the hyphen with a backslash if you want to have a hyphen in your set of characters to be matched, and its position in the class is such that it could be considered part of a range. @@ -403,13 +442,15 @@ Examples: You can put any backslash sequence character class (with the exception of C<\N>) inside a bracketed character class, and it will act just as if you put all the characters matched by the backslash sequence inside the -character class. For instance, C<[a-f\d]> will match any digit, or any of the -lowercase letters between 'a' and 'f' inclusive. +character class. For instance, C<[a-f\d]> will match any decimal digit, or any +of the lowercase letters between 'a' and 'f' inclusive. + +C<\N> within a bracketed character class must be of the forms C<\N{I}> +or C<\N{U+I}>, and NOT be the form that matches non-newlines, +for the same reason that a dot C<.> inside a bracketed character class loses +its special meaning: it matches nearly anything, which generally isn't what you +want to happen. -C<\N> within a bracketed character class must be of the forms C<\N{I}> or -C<\N{U+I}> for the same reason that a dot C<.> inside a -bracketed character class loses its special meaning: it matches nearly -anything, which generally isn't what you want to happen. Examples: @@ -419,19 +460,22 @@ Examples: # character, nor a parenthesis. Backslash sequence character classes cannot form one of the endpoints -of a range. +of a range. Thus, you can't say: + + /[\p{Thai}-\d]/ # Wrong! -=head3 Posix Character Classes +=head3 POSIX Character Classes X X<\p> X<\p{}> X X X X X X X X X X X X X X -Posix character classes have the form C<[:class:]>, where I is -name, and the C<[:> and C<:]> delimiters. Posix character classes only appear +POSIX character classes have the form C<[:class:]>, where I is +name, and the C<[:> and C<:]> delimiters. 
POSIX character classes only appear I bracketed character classes, and are a convenient and descriptive way of listing a group of characters, though they currently suffer from -portability issues (see below and L). Be -careful about the syntax, +portability issues (see below and L). + +Be careful about the syntax, # Correct: $string =~ /[[:alpha:]]/ @@ -441,7 +485,7 @@ careful about the syntax, The latter pattern would be a character class consisting of a colon, and the letters C, C, C
and C. -These character classes can be part of a larger bracketed character class. For +POSIX character classes can be part of a larger bracketed character class. For example, [01[:alpha:]%] @@ -471,8 +515,7 @@ derived from official Unicode properties.) The table below shows the relation between POSIX character classes and these counterparts. One counterpart, in the column labelled "ASCII-range Unicode" in -the table will only match characters in the ASCII range. (On EBCDIC platforms, -they match those characters which have ASCII equivalents.) +the table, will only match characters in the ASCII character set. The other counterpart, in the column labelled "Full-range Unicode", matches any appropriate characters in the full Unicode character set. For example, @@ -490,10 +533,12 @@ Both the C<\p> forms are unaffected by any locale that is in effect, or whether the string is in UTF-8 format or not, or whether the platform is EBCDIC or not. In contrast, the POSIX character classes are affected. If the source string is in UTF-8 format, the POSIX classes (with the exception of C<[[:punct:]]>, see -Note [5]) behave like their "Full-range" Unicode counterparts. If the source -string is not in UTF-8 format, and no locale is in effect, and the platform is -not EBCDIC, all the POSIX classes behave like their ASCII-range counterparts. -Otherwise, they behave based on the rules of the locale or EBCDIC code page. +Note [5] below) behave like their "Full-range" Unicode counterparts. If the +source string is not in UTF-8 format, and no locale is in effect, and the +platform is not EBCDIC, all the POSIX classes behave like their ASCII-range +counterparts. Otherwise, they behave based on the rules of the locale or +EBCDIC code page. + It is proposed to change this behavior in a future release of Perl so that the the UTF8ness of the source string will be irrelevant to the behavior of the POSIX character classes. 
This means they will always behave in strict @@ -537,7 +582,7 @@ plus 127 (C) are control characters. On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]> to be the EBCDIC equivalents of the ASCII controls, plus the controls -that in Unicode have ordinals from 128 through 139. +that in Unicode have ordinals from 128 through 159. =item [3] @@ -624,7 +669,7 @@ The rule is that if the source string is in UTF-8 format, the character classes match according to the Unicode properties. If the source string isn't, then the character classes match according to whatever locale or EBCDIC code page is in effect. If there is no locale nor EBCDIC, they match the ASCII -defaults (52 letters, 10 digits and underscore for C<\w>; 0 to 9 for C<\d>; +defaults (0 to 9 for C<\d>; 52 letters, 10 digits and underscore for C<\w>; etc.). This usually means that if you are matching against characters whose C @@ -632,7 +677,7 @@ values are between 128 and 255 inclusive, your character class may match or not depending on the current locale or EBCDIC code page, and whether the source string is in UTF-8 format. The string will be in UTF-8 format if it contains characters whose C value exceeds 255. But a string may be in -UTF-8 format without it having such characters. See L. For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s> -- 1.5.6.3 ```
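The `\d` and `\N` behaviour that this patch documents can be checked with a short script. This is a minimal sketch, assuming Perl 5.12 or later; the variable names and the two sample code points (ARABIC-INDIC DIGIT FOUR and SUBSCRIPT TWO) are chosen for illustration and do not come from the patch.

```perl
use strict;
use warnings;

# Sample code points, picked for this sketch only.
my $arabic_four   = "\x{0664}";   # ARABIC-INDIC DIGIT FOUR: a decimal digit
my $subscript_two = "\x{2082}";   # SUBSCRIPT TWO: Numeric_Type=Digit, not decimal

# \d matches decimal digits outside ASCII; an explicit [0-9] range does not.
my $d_matches_arabic     = $arabic_four =~ /\d/    ? 1 : 0;
my $range_matches_arabic = $arabic_four =~ /[0-9]/ ? 1 : 0;

# Subscripts carry Numeric_Type=Digit, which is a different set from \d.
my $d_matches_subscript  = $subscript_two =~ /\d/                     ? 1 : 0;
my $nt_matches_subscript = $subscript_two =~ /\p{Numeric_Type=Digit}/ ? 1 : 0;

# Under /s the dot matches a newline, while \N still refuses to.
my $dot_matches_newline = "\n" =~ /./s  ? 1 : 0;
my $N_matches_newline   = "\n" =~ /\N/s ? 1 : 0;

print "\\d vs ARABIC-INDIC DIGIT FOUR:        $d_matches_arabic\n";
print "[0-9] vs ARABIC-INDIC DIGIT FOUR:     $range_matches_arabic\n";
print "\\d vs SUBSCRIPT TWO:                  $d_matches_subscript\n";
print "Numeric_Type=Digit vs SUBSCRIPT TWO:  $nt_matches_subscript\n";
print "dot under /s vs newline:              $dot_matches_newline\n";
print "\\N under /s vs newline:               $N_matches_newline\n";
```

An application that must accept only ASCII digits could use `[0-9]` as in the second test, which sidesteps the look-alike digit issue the patch mentions.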

Migrated from rt.perl.org#74642 (status was 'resolved')

Searchable as RT74642$

From @khwilliamson

These are mostly about regex and Unicode things, and correcting a couple of broken links. Details are in the commit messages.

From @khwilliamson

0001-Remove-false-statement-about-Unicode-strings.patch

```diff From 826f0b2bdc48047deb46c635b14080117c46eb69 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 10:23:08 -0600 Subject: [PATCH] Remove false statement about Unicode strings It is simply not true that all text strings are Unicode strings in Perl. --- pod/perlunitut.pod | 3 --- 1 files changed, 0 insertions(+), 3 deletions(-) diff --git a/pod/perlunitut.pod b/pod/perlunitut.pod index 9c4f307..fc352d5 100644 --- a/pod/perlunitut.pod +++ b/pod/perlunitut.pod @@ -66,9 +66,6 @@ B, or B are made of characters. Bytes are irrelevant here, and so are encodings. Each character is just that: the character. -Text strings are also called B, because in Perl, every text -string is a Unicode string. - On a text string, you would do things like: $text =~ s/foo/bar/; -- 1.5.6.3 ```

From @khwilliamson

0002-Nits-in-perluniintro.pod.patch

```diff From dc5c3e806c55a5200dbbb434f6969da7179905db Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 11:03:48 -0600 Subject: [PATCH] Nits in perluniintro.pod Make accurate the advice about eighth-bit set characters, and a few editing improvements. --- pod/perluniintro.pod | 33 +++++++++++++++++---------------- 1 files changed, 17 insertions(+), 16 deletions(-) diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod index 6c82efd..bee286f 100644 --- a/pod/perluniintro.pod +++ b/pod/perluniintro.pod @@ -553,19 +553,19 @@ L Character Ranges and Classes -Character ranges in regular expression character classes (C) -and in the C (also known as C) operator are not magically -Unicode-aware. What this means is that C<[A-Za-z]> will not magically start -to mean "all alphabetic letters"; not that it does mean that even for -8-bit characters, you should be using C in that case. - -For specifying character classes like that in regular expressions, -you can use the various Unicode properties--C<\pL>, or perhaps -C<\p{Alphabetic}>, in this particular case. You can use Unicode -code points as the end points of character ranges, but there is no -magic associated with specifying a certain range. For further -information--there are dozens of Unicode character classes--see -L. +Character ranges in regular expression bracketed character classes ( e.g., +C) and in the C (also known as C) operator are not +magically Unicode-aware. What this means is that C<[A-Za-z]> will not +magically start to mean "all alphabetic letters" (not that it does mean that +even for 8-bit characters; for those, if you are using locales (L), +use C; and if not, use the 8-bit-aware property C<\p{alpha}>). + +All the properties that begin with C<\p> (and its inverse C<\P>) are actually +character classes that are Unicode-aware. There are dozens of them, see +L. 
+ +You can use Unicode code points as the end points of character ranges, and the +range will include all Unicode code points that lie between those end points. =item * @@ -607,7 +607,7 @@ Unicode; for that, see the earlier I/O discussion. How Do I Know Whether My String Is In Unicode? You shouldn't have to care. But you may, because currently the semantics of the -characters whose ordinals are in the range 128 to 255 is different depending on +characters whose ordinals are in the range 128 to 255 are different depending on whether the string they are contained within is in Unicode or not. (See L.) @@ -622,8 +622,8 @@ string has any characters at all. All the C does is to return the value of the internal "utf8ness" flag attached to the C<$string>. If the flag is off, the bytes in the scalar are interpreted as a single byte encoding. If the flag is on, the bytes in the scalar -are interpreted as the (multi-byte, variable-length) UTF-8 encoded code -points of the characters. Bytes added to a UTF-8 encoded string are +are interpreted as the (variable-length, potentially multi-byte) UTF-8 encoded +code points of the characters. Bytes added to a UTF-8 encoded string are automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars are merged (double-quoted interpolation, explicit concatenation, and printf/sprintf parameter substitution), the result will be UTF-8 encoded @@ -648,6 +648,7 @@ the C function: use bytes; print length($unicode), "\n"; # will also print 2 # (the 0xC4 0x80 of the UTF-8) + no bytes; =item * -- 1.5.6.3 ```
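The points this patch makes about ranges, `\p` properties and `use bytes` can be tried out as follows. This is a minimal sketch with illustrative variable names; it scopes `use bytes` to a bare block rather than ending it with `no bytes;` as the patch does, which is only a stylistic assumption of this example.

```perl
use strict;
use warnings;

my $a_macron = chr(0x100);   # LATIN CAPITAL LETTER A WITH MACRON
my $beta     = chr(0x3B2);   # GREEK SMALL LETTER BETA

# [A-Za-z] is a plain code point range, so it does not cover this letter...
my $in_ascii_range = $a_macron =~ /[A-Za-z]/ ? 1 : 0;

# ...whereas the Unicode properties do.
my $is_letter     = $a_macron =~ /\pL/            ? 1 : 0;
my $is_alphabetic = $a_macron =~ /\p{Alphabetic}/ ? 1 : 0;

# Code points may be range endpoints; the range is everything in between
# (here GREEK SMALL LETTER ALPHA through GREEK SMALL LETTER OMEGA).
my $in_greek_range = $beta =~ /[\x{3B1}-\x{3C9}]/ ? 1 : 0;

# length() counts characters; under "use bytes" it counts the UTF-8 bytes.
my $char_length = length($a_macron);
my $byte_length;
{
    use bytes;               # lexically scoped to this block
    $byte_length = length($a_macron);
}

print "in [A-Za-z]: $in_ascii_range\n";
print "\\pL: $is_letter, \\p{Alphabetic}: $is_alphabetic\n";
print "in Greek range: $in_greek_range\n";
print "characters: $char_length, bytes: $byte_length\n";
```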

From @khwilliamson

0003-Nits-in-perlunifaq.pod.patch

```diff From cae8ce7efb40de7f6216e16967b0a6e2801a8360 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 11:15:33 -0600 Subject: [PATCH] Nits in perlunifaq.pod --- pod/perlunifaq.pod | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/pod/perlunifaq.pod b/pod/perlunifaq.pod index 89cbad3..ab42ff1 100644 --- a/pod/perlunifaq.pod +++ b/pod/perlunifaq.pod @@ -25,7 +25,7 @@ To find out which character encodings your Perl supports, run: =head2 Which version of perl should I use? Well, if you can, upgrade to the most recent, but certainly C<5.8.1> or newer. -The tutorial and FAQ are based on the status quo as of C<5.8.8>. +The tutorial and FAQ assume the latest release. You should also check your modules, and upgrade them if necessary. For example, HTML::Entities requires version >= 1.32 to function correctly, even though the @@ -227,9 +227,9 @@ use C, C<_utf8_on> or C<_utf8_off> at all. The UTF8 flag, also called SvUTF8, is an internal flag that indicates that the current internal representation is UTF-8. Without the flag, it is assumed to be -ISO-8859-1. Perl converts between these automatically. (Actually Perl assumes -the representation is ASCII; see L above.) +ISO-8859-1. Perl converts between these automatically. (Actually Perl usually +assumes the representation is ASCII; see L above.) One of Perl's internal formats happens to be UTF-8. Unfortunately, Perl can't keep a secret, so everyone knows about this. That is the source of much -- 1.5.6.3 ```

From @khwilliamson

0004-Clarify-c-usage-in-perlrebackslash.pod.patch

```diff From 7513b4b906b2c99e1640af182925c98a2a2e71d4 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 11:21:24 -0600 Subject: [PATCH] Clarify \c usage in perlrebackslash.pod --- pod/perlrebackslash.pod | 26 +++++++++++++++++--------- 1 files changed, 17 insertions(+), 9 deletions(-) diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod index 5ff2601..461ebd9 100644 --- a/pod/perlrebackslash.pod +++ b/pod/perlrebackslash.pod @@ -16,7 +16,6 @@ Most sequences are described in detail in different documents; the primary purpose of this document is to have a quick reference guide describing all backslash and escape sequences. - =head2 The backslash In a regular expression, the backslash can perform one of two tasks: @@ -69,7 +68,7 @@ as C \A Beginning of string. Not in []. \b Word/non-word boundary. (Backspace in []). \B Not a word/non-word boundary. Not in []. - \cX Control-X (X can be any ASCII character). + \cX Control-X \C Single octet, even under UTF-8. Not in []. \d Character class for digits. \D Character class for non-digits. @@ -112,9 +111,10 @@ as C A handful of characters have a dedicated I. The following table shows them, along with their ASCII code points (in decimal and hex), -their ASCII name, the control escape (see below) and a short description. +their ASCII name, the control escape on ASCII platforms and a short +description. (For EBCDIC platforms, see L.) - Seq. Code Point ASCII Cntr Description. + Seq. Code Point ASCII Cntrl Description. Dec Hex \a 7 07 BEL \cG alarm or bell \b 8 08 BS \cH backspace [1] @@ -145,10 +145,18 @@ OSses native newline character when reading from or writing to text files. =head3 Control characters C<\c> is used to denote a control character; the character following C<\c> -is the name of the control character. For instance, C matches the -character I (a carriage return, code point 13). The case of the -character following C<\c> doesn't matter: C<\cM> and C<\cm> match the same -character. 
+determines the value of the construct. For example the value of C<\cA> is +C, and the value of C<\cb> is C, etc. +The gory details are in L. A complete +list of what C, etc. means for ASCII and EBCDIC platforms is in +L. + +Note that C<\c\> alone at the end of a regular expression (or doubled-quoted +string) is not valid. The backslash must be followed by another character. +That is, C<\c\I> means C'> for all characters I. + +To write platform-independent code, you must use C<\N{I}> instead, like +C<\N{ESCAPE}> or C<\N{U+001B}>, see L. Mnemonic: I<c>ontrol character. @@ -335,7 +343,7 @@ match a character that matches the given Unicode property; properties include things like "letter", or "thai character". Capitalizing the sequence to C<\PP> and C<\P{Property}> make the sequence match a character that doesn't match the given Unicode property. For more details, see -L and +L and L. Mnemonic: I<p>roperty. -- 1.5.6.3 ```
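
A minimal sketch of the `\c` behaviour this patch documents, assuming an ASCII platform and Perl 5.12 or later. The variable names are illustrative and not part of the patch. It checks that the case of the character after `\c` does not matter, and that `\N{U+001B}` gives the same character as `\c[` on such a platform:

```perl
use strict;
use warnings;

# Assumption: ASCII platform. On EBCDIC the ordinals below differ.
my $cr_upper = "\cM";
my $cr_lower = "\cm";
my $esc_ctrl = "\c[";
my $esc_name = "\N{U+001B}";    # platform-independent spelling of ESCAPE

printf "\\cM is chr(%d), \\cm is chr(%d)\n", ord($cr_upper), ord($cr_lower);
printf "\\c[ is chr(%d), \\N{U+001B} is chr(%d)\n", ord($esc_ctrl), ord($esc_name);

# Inside a pattern, \cM matches a carriage return.
my $has_cr = "line\r" =~ /\cM\z/ ? 1 : 0;
print "pattern \\cM matched CR: $has_cr\n";
```

The `\N{U+...}` form is used rather than `\N{ESCAPE}` only because it needs no `charnames` import on older perls.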

From @khwilliamson

0005-Nits-in-perlunicode.pod.patch

```diff From 8c5f9e69708fb7fb232c5f279f93ca8c6a48caac Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 12:14:27 -0600 Subject: [PATCH] Nits in perlunicode.pod --- pod/perlunicode.pod | 62 ++++++++++++++++++++++++++++---------------------- 1 files changed, 35 insertions(+), 27 deletions(-) diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 1f4be43..140d134 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -11,9 +11,12 @@ implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. People who want to learn to use Unicode in Perl, should probably read -L, before reading +the L, before reading this reference document. +Also, the use of Unicode may present security issues that aren't obvious. +Read L. + =over 4 =item Input and Output Layers @@ -99,8 +102,8 @@ The C pragma will always, regardless of platform, force byte semantics in a particular lexical scope. See L. The C pragma is intended to always, regardless -of platform, force Unicode semantics in a particular lexical scope. In -release 5.12, it is partially implemented, applying only to case changes. +of platform, force character (Unicode) semantics in a particular lexical scope. +In release 5.12, it is partially implemented, applying only to case changes. See L below. The C pragma is primarily a compatibility device that enables @@ -180,15 +183,15 @@ a character instead of a byte. =item * -Character classes in regular expressions match characters instead of +Bracketed character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. C<\w> can be used to match a Japanese ideograph, for instance. 
=item * -Named Unicode properties, scripts, and block ranges may be used like -character classes via the C<\p{}> "matches property" construct and +Named Unicode properties, scripts, and block ranges may be used (like bracketed +character classes) by using the C<\p{}> "matches property" construct and the C<\P{}> negation, "doesn't match property". See L for more details. @@ -261,8 +264,9 @@ complement B the full character-wide bit complement. =item * -You can define your own mappings to be used in lc(), -lcfirst(), uc(), and ucfirst() (or their string-inlined versions). +You can define your own mappings to be used in C, +C, C, and C (or their double-quoted string inlined +versions such as C<\U>). See L for more details. =back @@ -278,25 +282,30 @@ And finally, C reverses by character rather than by byte. =head2 Unicode Character Properties Most Unicode character properties are accessible by using regular expressions. -They are used like character classes via the C<\p{}> "matches property" -construct and the C<\P{}> negation, "doesn't match property". +They are used (like bracketed character classes) by using the C<\p{}> "matches +property" construct and the C<\P{}> negation, "doesn't match property". + +Note that the only time that Perl considers a sequence of individual code +points as a single logical character is in the C<\X> construct, already +mentioned above. Therefore "character" in this discussion means a single +Unicode code point. -For instance, C<\p{Uppercase}> matches any character with the Unicode +For instance, C<\p{Uppercase}> matches any single character with the Unicode "Uppercase" property, while C<\p{L}> matches any character with a General_Category of "L" (letter) property. Brackets are not -required for single letter properties, so C<\p{L}> is equivalent to C<\pL>. +required for single letter property names, so C<\p{L}> is equivalent to C<\pL>. 
-More formally, C<\p{Uppercase}> matches any character whose Unicode Uppercase -property value is True, and C<\P{Uppercase}> matches any character whose -Uppercase property value is False, and they could have been written as -C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively +More formally, C<\p{Uppercase}> matches any single character whose Unicode +Uppercase property value is True, and C<\P{Uppercase}> matches any character +whose Uppercase property value is False, and they could have been written as +C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively. This formality is needed when properties are not binary, that is if they can take on more values than just True and False. For example, the Bidi_Class (see L below), can take on a number of different values, such as Left, Right, Whitespace, and others. To match these, one needs to specify the property name (Bidi_Class), and the value being matched against -(Left, Right, I). This is done, as in the examples above, by having the +(Left, Right, etc.). This is done, as in the examples above, by having the two components separated by an equal sign (or interchangeably, a colon), like C<\p{Bidi_Class: Left}>. @@ -403,8 +412,7 @@ Here are the short and long forms of the General Category properties: Single-letter properties match all characters in any of the two-letter sub-properties starting with the same letter. -C and C are special cases, which are aliases for the set of -C, C, and C. +C and C are special cases, which are both aliases for the set consisting of everything matched by C, C, and C. Because Perl hides the need for the user to understand the internal representation of Unicode characters, there is no need to implement @@ -413,8 +421,8 @@ supported. 
=head3 B -Because scripts differ in their directionality--Hebrew is -written right to left, for example--Unicode supplies these properties in +Because scripts differ in their directionality (Hebrew is +written right to left, for example) Unicode supplies these properties in the Bidi_Class class: Property Meaning @@ -451,10 +459,10 @@ written in Cyrllic, and Greek is written in, well, Greek; Japanese mainly in Hiragana or Katakana. There are many more. The Unicode Script property gives what script a given character is in, -and can be matched with the compound form like C<\p{Script=Hebrew}> (short: -C<\p{sc=hebr}>). Perl furnishes shortcuts for all script names. You can omit -everything up through the equals (or colon), and simply write C<\p{Latin}> or -C<\P{Cyrillic}>. +and the property can be specified with the compound form like +C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>). Perl furnishes shortcuts for all +script names. You can omit everything up through the equals (or colon), and +simply write C<\p{Latin}> or C<\P{Cyrillic}>. A complete list of scripts and their shortcuts is in L. @@ -475,7 +483,7 @@ characters with consecutive ordinal values. For example, the "Basic Latin" block is all characters whose ordinals are between 0 and 127, inclusive, in other words, the ASCII characters. The "Latin" script contains some letters from this block as well as several more, like "Latin-1 Supplement", -"Latin Extended-A", I, but it does not contain all the characters from +"Latin Extended-A", etc., but it does not contain all the characters from those blocks. It does not, for example, contain digits, because digits are shared across many scripts. Digits and similar groups, like punctuation, are in the script called C. There is also a script called C for @@ -571,7 +579,7 @@ To understand the use of this rarely used property=value combination, it is necessary to know some basics about decomposition. Consider a character, say H. 
It could appear with various marks around it, such as an acute accent, or a circumflex, or various hooks, circles, arrows, -I, above, below, to one side and/or the other, I There are many +I, above, below, to one side and/or the other, etc. There are many possibilities among the world's languages. The number of combinations is astronomical, and if there were a character for each combination, it would soon exhaust Unicode's more than a million possible characters. So Unicode -- 1.5.6.3 ```
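
A short sketch of the `\p{}` / `\P{}` forms discussed in this patch, assuming Perl 5.12 or later. The hash keys and the choice of test characters are illustrative only. Each entry records whether a single-character match succeeds:

```perl
use strict;
use warnings;

my $alef = "\x{05D0}";    # HEBREW LETTER ALEF

# Each value is 1 if the match succeeded, 0 otherwise.
my %result = (
    upper_A         => ( "A"   =~ /\A\p{Uppercase}\z/ ? 1 : 0 ),
    not_upper_a     => ( "a"   =~ /\A\P{Uppercase}\z/ ? 1 : 0 ),
    # Braces are optional for single-letter property names.
    brace_optional  => ( ( "A" =~ /\p{L}/ && "A" =~ /\pL/ ) ? 1 : 0 ),
    # Compound form, its short form, and the script-name shortcut.
    alef_compound   => ( $alef =~ /\p{Script=Hebrew}/ ? 1 : 0 ),
    alef_short      => ( $alef =~ /\p{sc=hebr}/       ? 1 : 0 ),
    alef_shortcut   => ( $alef =~ /\p{Hebrew}/        ? 1 : 0 ),
    # Digits are shared across scripts, so "7" is not in the Latin script.
    digit_not_latin => ( "7"   =~ /\P{Latin}/         ? 1 : 0 ),
);

printf "%-16s %d\n", $_, $result{$_} for sort keys %result;
```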

From @khwilliamson

0006-perlfunc.pod-case-change-cleanup-mention-packtut.patch

```diff From 6af92d53131d3827a2496075c6b9ef0e95c19cb9 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 12:27:01 -0600 Subject: [PATCH] perlfunc.pod: case-change cleanup; mention packtut Specifies completely the behavior of the case-changing functions, and mentions in the existence of the pack tutorial for the packing ones. --- pod/perlfunc.pod | 88 ++++++++++++++++++++++++++++++++++++++++++++---------- 1 files changed, 72 insertions(+), 16 deletions(-) diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod index 3fabeb0..1989f11 100644 --- a/pod/perlfunc.pod +++ b/pod/perlfunc.pod @@ -2712,12 +2712,61 @@ X X =item lc Returns a lowercased version of EXPR. This is the internal function -implementing the C<\L> escape in double-quoted strings. Respects -current LC_CTYPE locale if C in force. See L -and L for more details about locale and Unicode support. +implementing the C<\L> escape in double-quoted strings. If EXPR is omitted, uses C<$_>. +What gets returned depends on several factors: + +=over + +=item If C is in effect: + +=over + +=item On EBCDIC platforms + +The results are what the C language system call C returns. + +=item On ASCII platforms + +The results follow ASCII semantics. Only characters C change, to C +respectively. + +=back + +=item Otherwise, If EXPR has the UTF8 flag set + +If the current package has a subroutine named C, it will be used to +change the case (See L.) +Otherwise Unicode semantics are used for the case change. + +=item Otherwise, if C is in effect + +Respects current LC_CTYPE locale. See L. + +=item Otherwise, if C is in effect: + +Unicode semantics are used for the case change. Any subroutine named +C will not be used. + +=item Otherwise: + +=over + +=item On EBCDIC platforms + +The results are what the C language system call C returns. + +=item On ASCII platforms + +ASCII semantics are used for the case change. The lowercase of any character +outside the ASCII range is the character itself. 
+ +=back + +=back + =item lcfirst EXPR X X @@ -2725,12 +2774,13 @@ X X Returns the value of EXPR with the first character lowercased. This is the internal function implementing the C<\l> escape in -double-quoted strings. Respects current LC_CTYPE locale if C in force. See L and L for more -details about locale and Unicode support. +double-quoted strings. If EXPR is omitted, uses C<$_>. +This function behaves the same way under various pragma, such as in a locale, +as L does. + =item length EXPR X X @@ -3603,8 +3653,10 @@ Takes a LIST of values and converts it into a string using the rules given by the TEMPLATE. The resulting string is the concatenation of the converted values. Typically, each converted value looks like its machine-level representation. For example, on 32-bit machines -an integer may be represented by a sequence of 4 bytes, which will in -Perl be presented as a string that's 4 characters long. +an integer may be represented by a sequence of 4 bytes, which will in +Perl be presented as a string that's 4 characters long. + +See L for an introduction to this function. The TEMPLATE is a sequence of characters that give the order and type of values, as follows: @@ -6869,14 +6921,15 @@ X X X =item uc Returns an uppercased version of EXPR. This is the internal function -implementing the C<\U> escape in double-quoted strings. Respects -current LC_CTYPE locale if C in force. See L -and L for more details about locale and Unicode support. +implementing the C<\U> escape in double-quoted strings. It does not attempt to do titlecase mapping on initial letters. See -C for that. +L for that. If EXPR is omitted, uses C<$_>. +This function behaves the same way under various pragma, such as in a locale, +as L does. + =item ucfirst EXPR X X @@ -6884,12 +6937,13 @@ X X Returns the value of EXPR with the first character in uppercase (titlecase in Unicode). This is the internal function implementing -the C<\u> escape in double-quoted strings. 
Respects current LC_CTYPE -locale if C in force. See L and L -for more details about locale and Unicode support. +the C<\u> escape in double-quoted strings. If EXPR is omitted, uses C<$_>. +This function behaves the same way under various pragma, such as in a locale, +as L does. + =item umask EXPR X @@ -6993,7 +7047,9 @@ C does the reverse of C: it takes a string and expands it out into a list of values. (In scalar context, it returns merely the first value produced.) -If EXPR is omitted, unpacks the C<$_> string. +If EXPR is omitted, unpacks the C<$_> string. for an introduction to this function. + +See L for an introduction to this function. The string is broken into chunks described by the TEMPLATE. Each chunk is converted separately to a value. Typically, either the string is a result -- 1.5.6.3 ```
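
The decision list for `lc` in this patch can be sketched as follows. This is an illustration under stated assumptions (ASCII platform, Perl 5.12 or later, no `use locale`, no `use bytes`), and the variable names are made up for the example. It shows the same character lowercased with the UTF8 flag off, with the flag on, and under `unicode_strings`:

```perl
use strict;
use warnings;

# Assumptions: ASCII platform, no "use locale", no "use bytes".
my $flag_off = "\xC9";       # LATIN CAPITAL LETTER E WITH ACUTE, UTF8 flag off
my $flag_on  = "\xC9";
utf8::upgrade($flag_on);     # same character, UTF8 flag on

my $plain   = lc $flag_off;  # ASCII semantics: character is left alone
my $unicode = lc $flag_on;   # Unicode semantics: becomes \xE9

my $featured;
{
    use feature 'unicode_strings';
    $featured = lc $flag_off;    # Unicode semantics even with the flag off
}

printf "%-20s %02X\n", 'flag off:',        ord($plain);
printf "%-20s %02X\n", 'flag on:',         ord($unicode);
printf "%-20s %02X\n", 'unicode_strings:', ord($featured);
printf "%-20s %s\n",   'ASCII range:',     lc "PERL";
```

Note that `use v5.12` or later would enable `unicode_strings` for the whole file, so the sketch avoids it to keep the first case visible.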

From @khwilliamson

0007-Fix-broken-links.patch

```diff From eee7b9a30dac58e30fbceb06d7a2857b43c09a15 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 12:32:42 -0600 Subject: [PATCH] Fix broken links --- pod/perl5111delta.pod | 2 +- pod/perl5120delta.pod | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/pod/perl5111delta.pod b/pod/perl5111delta.pod index 87fb9df..4717374 100644 --- a/pod/perl5111delta.pod +++ b/pod/perl5111delta.pod @@ -260,7 +260,7 @@ Perl now defaults to issuing a warning if a deprecated language feature is used. To disable this feature in a given lexical scope, you should use C For information about which language features are deprecated and explanations of various deprecation warnings, please -see L +see L =back diff --git a/pod/perl5120delta.pod b/pod/perl5120delta.pod index 35fab9a..5d5b401 100644 --- a/pod/perl5120delta.pod +++ b/pod/perl5120delta.pod @@ -251,7 +251,7 @@ C file for that release. To disable this feature in a given lexical scope, you should use C For information about which language features are deprecated and explanations of various deprecation warnings, please -see L. See L below for the list of features +see L. See L below for the list of features and modules Perl's developers have deprecated as part of this release. =head2 Version number formats -- 1.5.6.3 ```

From @khwilliamson

0008-Nits-in-perlre.pod-x-referencing-broken-links.patch

```diff From 57fb01a46585f16a0200fd57c9c010ec88c1bcd7 Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 12:37:19 -0600 Subject: [PATCH] Nits in perlre.pod, x-referencing, broken links --- pod/perlre.pod | 163 +++++++++++++++++++++++++------------------------------ 1 files changed, 74 insertions(+), 89 deletions(-) diff --git a/pod/perlre.pod b/pod/perlre.pod index 48ca403..40e6c28 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -98,14 +98,14 @@ the C-comment deletion code in L. Also note that anything inside a C<\Q...\E> stays unaffected by C. And note that C doesn't affect whether space interpretation within a single multi-character construct. For example in C<\x{...}>, regardless of the C modifier, there can be no -spaces. Same for a L such as C<{3}> or +spaces. Same for a L such as C<{3}> or C<{5,}>. Similarly, C<(?:...)> can't have a space between the C and C<:>, but can between the C<(> and C. Within any delimiters for such a construct, allowed spaces are not affected by C, and depend on the construct. For example, C<\x{...}> can't have spaces because hexadecimal numbers don't have spaces in them. But, Unicode properties can have spaces, so in C<\p{...}> there can be spaces that follow the Unicode rules, for which see -L. +L. 
X =head2 Regular Expressions @@ -130,7 +130,7 @@ X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]> $ Match the end of the line (or before newline at the end) | Alternation () Grouping - [] Character class + [] Bracketed Character class By default, the "^" character is guaranteed to match only the beginning of the string, the "$" character only the end (or before the @@ -222,8 +222,6 @@ instance the above example could also be written as follows: Because patterns are processed as double quoted strings, the following also work: -X<\t> X<\n> X<\r> X<\f> X<\e> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q> -X<\0> X<\c> X<\N{}> X<\x> \t tab (HT, TAB) \n newline (LF, NL) @@ -241,101 +239,88 @@ X<\0> X<\c> X<\N{}> X<\x> \u uppercase next char (think vi) \L lowercase till \E (think vi) \U uppercase till \E (think vi) - \E end case modification (think vi) \Q quote (disable) pattern metacharacters till \E + \E end either case modification or quoted section (think vi) -If C is in effect, the case map used by C<\l>, C<\L>, C<\u> -and C<\U> is taken from the current locale. See L. For -documentation of C<\N{name}>, see L. - -You cannot include a literal C<$> or C<@> within a C<\Q> sequence. -An unescaped C<$> or C<@> interpolates the corresponding variable, -while escaping will cause the literal string C<\$> to be matched. -You'll need to write something like C. +Details are in L. =head3 Character Classes and other Special Escapes In addition, Perl defines the following: X<\g> X<\k> X<\K> X - \w Match a "word" character (alphanumeric plus "_") - \W Match a non-"word" character - \s Match a whitespace character - \S Match a non-whitespace character - \d Match a digit character - \D Match a non-digit character - \pP Match P, named property. Use \p{Prop} for longer names. - \PP Match non-P - \X Match Unicode "eXtended grapheme cluster" - \C Match a single C char (octet) even under Unicode. 
- NOTE: breaks up characters into their UTF-8 bytes, - so you may end up with malformed pieces of UTF-8. - Unsupported in lookbehind. - \1 Backreference to a specific group. - '1' may actually be any positive integer. - \g1 Backreference to a specific or previous group, - \g{-1} number may be negative indicating a previous buffer and may - optionally be wrapped in curly brackets for safer parsing. - \g{name} Named backreference - \k Named backreference - \K Keep the stuff left of the \K, don't include it in $& - \N Any character but \n (experimental) - \v Vertical whitespace - \V Not vertical whitespace - \h Horizontal whitespace - \H Not horizontal whitespace - \R Linebreak - -See L for details on -C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, C<\D>, C<\p>, C<\P>, C<\N>, C<\v>, C<\V>, -C<\h>, and C<\H>. -See L for details on C<\R> and C<\X>. + Sequence Note Description + [...] [1] Match a character according to the rules of the bracketed + character class defined by the "...". Example: [a-z] + matches "a" or "b" or "c" ... or "z" + [[:...:]] [2] Match a character according to the rules of the POSIX + character class "..." within the outer bracketed character + class. Example: [[:upper:]] matches any uppercase + character. + \w [3] Match a "word" character (alphanumeric plus "_") + \W [3] Match a non-"word" character + \s [3] Match a whitespace character + \S [3] Match a non-whitespace character + \d [3] Match a decimal digit character + \D [3] Match a non-digit character + \pP [3] Match P, named property. Use \p{Prop} for longer names. + \PP [3] Match non-P + \X [4] Match Unicode "eXtended grapheme cluster" + \C Match a single C-language char (octet) even if that is part + of a larger UTF-8 character. Thus it breaks up characters + into their UTF-8 bytes, so you may end up with malformed + pieces of UTF-8. Unsupported in lookbehind. + \1 [5] Backreference to a specific capture buffer or group. + '1' may actually be any positive integer. 
+ \g1 [5] Backreference to a specific or previous group, + \g{-1} [5] The number may be negative indicating a relative previous + buffer and may optionally be wrapped in curly brackets for + safer parsing. + \g{name} [5] Named backreference + \k [5] Named backreference + \K [6] Keep the stuff left of the \K, don't include it in $& + \N [7] Any character but \n (experimental). Not affected by /s + modifier + \v [3] Vertical whitespace + \V [3] Not vertical whitespace + \h [3] Horizontal whitespace + \H [3] Not horizontal whitespace + \R [4] Linebreak -Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the -character whose name is C; and similarly when of the form -C<\N{U+I}>, it matches the character whose Unicode ordinal is -I. Otherwise it matches any character but C<\n>. +=over 4 + +=item [1] + +See L for details. -The POSIX character class syntax -X +=item [2] - [:class:] +See L for details. -is also available. Note that the C<[> and C<]> brackets are I; -they must always be used within a character class expression. +=item [3] - # this is correct: - $string =~ /[[:alpha:]]/; +See L for details. - # this is not, and will generate a warning: - $string =~ /[:alpha:]/; +=item [4] -The following Posix-style character classes are available: +See L for details. - [[:alpha:]] Any alphabetical character. - [[:alnum:]] Any alphanumerical character. - [[:ascii:]] Any character in the ASCII character set. - [[:blank:]] A GNU extension, equal to a space or a horizontal tab - [[:cntrl:]] Any control character. - [[:digit:]] Any decimal digit, equivalent to "\d". - [[:graph:]] Any printable character, excluding a space. - [[:lower:]] Any lowercase character. - [[:print:]] Any printable character, including a space. - [[:punct:]] Any graphical character excluding "word" characters. - [[:space:]] Any whitespace character. "\s" plus vertical tab ("\cK"). - [[:upper:]] Any uppercase character. - [[:word:]] A Perl extension, equivalent to "\w". 
- [[:xdigit:]] Any hexadecimal digit. +=item [5] -You can negate the [::] character classes by prefixing the class name -with a '^'. This is a Perl extension. +See L below for details. -The POSIX character classes -[.cc.] and [=cc=] are recognized but B supported and trying to -use them will cause an error. +=item [6] -Details on POSIX character classes are in -L. +See L below for details. + +=item [7] + +Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the +character whose name is C; and similarly when of the form +C<\N{U+I}>, it matches the character whose Unicode ordinal is +I. Otherwise it matches any character but C<\n>. + +=back =head3 Assertions @@ -345,12 +330,12 @@ X X X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G> - \b Match a word boundary - \B Match except at a word boundary - \A Match only at beginning of string - \Z Match only at end of string, or before newline at the end - \z Match only at end of string - \G Match only at pos() (e.g. at the end-of-match position + \b Match a word boundary + \B Match except at a word boundary + \A Match only at beginning of string + \Z Match only at end of string, or before newline at the end + \z Match only at end of string + \G Match only at pos() (e.g. at the end-of-match position of prior m//g) A word boundary (C<\b>) is a spot between two characters @@ -866,7 +851,7 @@ For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C pragma has been used (see L), or the variables contain results of C operator (see -L). +Lmsixpo">). This restriction is due to the wide-spread and remarkably convenient custom of using run-time determined strings as patterns. For example: @@ -937,7 +922,7 @@ For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C pragma has been used (see L), or the variables contain results of C operator (see -L). 
+LSTRINGEmsixpo">). Because perl's regex engine is not currently re-entrant, delayed code may not invoke the regex engine either directly with C or C), -- 1.5.6.3 ```
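
A few of the escapes in the reworked perlre table, sketched for reference. This assumes Perl 5.12 or later (for the experimental `\N`), and the sample strings and variable names are illustrative, not taken from the patch:

```perl
use strict;
use warnings;

# \g{-1}: relative backreference to the group opened just before it.
my ($doubled) = "committee" =~ /((\w)\g{-1})/;

# \K: text to its left is kept out of the match, so only "red" is replaced.
( my $setting = "color=red" ) =~ s/color=\Kred/blue/;

# \N never matches a newline, even under /s; the dot does under /s.
my $n_matches   = "a\nb" =~ /a\Nb/s ? 1 : 0;
my $dot_matches = "a\nb" =~ /a.b/s  ? 1 : 0;

# A POSIX class is written inside a bracketed character class.
my $alpha = "x" =~ /[[:alpha:]]/ ? 1 : 0;

print "doubled letters: $doubled\n";
print "after \\K substitution: $setting\n";
print "\\N across newline: $n_matches, dot under /s: $dot_matches\n";
print "[[:alpha:]] matched: $alpha\n";
```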

From @khwilliamson

0010-Clarify-c-in-perlop.pod.patch

```diff From 063686b7cb0d39dc7e8c10c416b8dd3c847f04ae Mon Sep 17 00:00:00 2001 From: Karl Williamson Date: Sat, 24 Apr 2010 13:44:30 -0600 Subject: [PATCH] Clarify \c in perlop.pod. And structure the table containing \c better. --- pod/perlop.pod | 85 ++++++++++++++++++++++++++++++++++++++++---------------- 1 files changed, 61 insertions(+), 24 deletions(-) diff --git a/pod/perlop.pod b/pod/perlop.pod index ebe32fb..fc78326 100644 --- a/pod/perlop.pod +++ b/pod/perlop.pod @@ -1011,33 +1011,70 @@ from the next line. This allows you to write: The following escape sequences are available in constructs that interpolate and in transliterations. -X<\t> X<\n> X<\r> X<\f> X<\b> X<\a> X<\e> X<\x> X<\0> X<\c> X<\N> - - \t tab (HT, TAB) - \n newline (NL) - \r return (CR) - \f form feed (FF) - \b backspace (BS) - \a alarm (bell) (BEL) - \e escape (ESC) - \033 octal char (example: ESC) - \x1b hex char (example: ESC) - \x{263a} wide hex char (example: SMILEY) - \c[ control char (example: ESC) - \N{name} named Unicode character - \N{U+263D} Unicode character (example: FIRST QUARTER MOON) - -The character following C<\c> is mapped to some other character by -converting letters to upper case and then (on ASCII systems) by inverting -the 7th bit (0x40). The most interesting range is from '@' to '_' -(0x40 through 0x5F), resulting in a control character from 0x00 -through 0x1F. A '?' maps to the DEL character. On EBCDIC systems only -'@', the letters, '[', '\', ']', '^', '_' and '?' will work, resulting -in 0x00 through 0x1F and 0x7F. 
+X<\t> X<\n> X<\r> X<\f> X<\b> X<\a> X<\e> X<\x> X<\0> X<\c> X<\N> X<\N{}> + + Sequence Note Description + \t tab (HT, TAB) + \n newline (NL) + \r return (CR) + \f form feed (FF) + \b backspace (BS) + \a alarm (bell) (BEL) + \e escape (ESC) + \033 octal char (example: ESC) + \x1b hex char (example: ESC) + \x{263a} wide hex char (example: SMILEY) + \c[ [1] control char (example: chr(27)) + \N{name} [2] named Unicode character + \N{U+263D} [3] Unicode character (example: FIRST QUARTER MOON) + +=over 4 + +=item [1] + +The character following C<\c> is mapped to some other character as shown in the +table: + + Sequence Value + \c@ chr(0) + \cA chr(1) + \ca chr(1) + \cB chr(2) + \cb chr(2) + ... + \cZ chr(26) + \cz chr(26) + \c[ chr(27) + \c] chr(29) + \c^ chr(30) + \c? chr(127) + +Also, C<\c\I> yields C< chr(28) . "I"> for any I, but cannot come at the +end of a string, because the backslash would be parsed as escaping the end +quote. + +On ASCII platforms, the resulting characters from the list above are the +complete set of ASCII controls. This isn't the case on EBCDIC platforms; see +L for the complete list of what these +sequences mean on both ASCII and EBCDIC platforms. + +Use of any other character following the "c" besides those listed above is +prohibited on EBCDIC platforms, and discouraged (and may become deprecated or +forbidden) on ASCII ones. What happens for those other characters currently +though, is that the value is derived by inverting the 7th bit (0x40). + +To get platform independent controls, you can use C<\N{...}>. + +=item [2] + +For documentation of C<\N{name}>, see L. + +=item [3] C<\N{U+I}> means the Unicode character whose Unicode ordinal number is I. -For documentation of C<\N{name}>, see L. + +=back B: Unlike C and other languages, Perl has no C<\v> escape sequence for the vertical tab (VT - ASCII 11), but you may use C<\ck> or C<\x0b>. (C<\v> -- 1.5.6.3 ```
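
As a rough check of the `\c` table in this patch, the sketch below compares a handful of the listed sequences with the old description (uppercase the character, then invert the 0x40 bit). It assumes an ASCII platform; the `%table` hash and the selection of characters are illustrative only:

```perl
use strict;
use warnings;

# Assumption: ASCII platform. Keys are the character written after \c.
my %table = (
    'A' => "\cA",
    'a' => "\ca",
    'Z' => "\cZ",
    '[' => "\c[",
    ']' => "\c]",
    '^' => "\c^",
    '?' => "\c?",
);

for my $c ( sort keys %table ) {
    # Old rule from the removed text: uppercase, then flip the 0x40 bit.
    my $derived = chr( ord( uc $c ) ^ 0x40 );
    printf "\\c%s -> chr(%d) %s\n", $c, ord( $table{$c} ),
        $derived eq $table{$c} ? "(agrees with the 0x40 rule)" : "(differs)";
}
```

For characters outside the table, the patch says the result is discouraged on ASCII platforms and prohibited on EBCDIC ones, so the sketch deliberately stays within the listed sequences.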

From @rgs

Thanks, applied to bleadperl.

The RT System itself - Status changed from 'new' to 'open'

@rgs - Status changed from 'open' to 'resolved'
