Closed p5pRT closed 11 years ago
Had someone file a bug against 'P' because it used a numeric format from 'sprintf' to print out a string matching "\d".... (running w/UTF-8...)
"2.٩" is like writing "2.IX", or 2.9, since sprintf's "%d" (and %f for that matter) are grouped under the heading "universally-known conversions".
By such reasoning, many people would know that '\d' is a universal match for numbers.
Should the universal match '\d' be printable by sprintf using the universal corresponding format of '%d'?
Should it print it "as is" and not convert it to Arabic numerals (though U+0669 is the Arabic-Indic digit nine)?
It *seems* that if the universally known pattern-matching groups are being allowed to match numbers in other scripts, the corresponding universally known format (\d => %d) should be usable to print them out. Otherwise, how might one be ABLE to print the value of the foreign number that "\d" now matches?
Sure, it could complicate formatted printing, but it's not like \d has ever meant anything other than Arabic numerals [0-9]. Seeing how precedent has been set to repurpose the "universally known character specifications" to match Unicode w.r.t. '\d', it only seems logical to continue the trend and do the same with '%d'.
(Note... as I write the above, I wince at the size of such a task, but I ask again -- how would one print it in a helpful way? It seems like just printing the whole thing as a string is *A WAY*, but it also feels a bit like a cop-out.)
Comments, ideas and discourse that would make me feel better about taking the 'low' road (presuming that fixing sprintf's %d to match pattern matching's \d isn't already planned for and near completion). ;-)
On 03.11.2013 21:50, Linda Walsh (via RT) wrote:
> Had someone file a bug against 'P' because it used a numeric format from 'sprintf' to print out a string matching "\d".... (running w/UTF-8...)
> "2.٩" is like writing "2.IX", or 2.9, since sprintf's "%d" (and %f for that matter) are grouped under the heading "universally-known conversions".
> By such reasoning, many people would know that '\d' is a universal match for numbers.
> Should the universal match '\d' be printable by sprintf using the universal corresponding format of '%d'?
This is a category error. Regexes (such as \d) match strings. sprintf %d takes numbers (specifically integers). So no, strings matched by \d should not be printable via %d, because %d takes integers, not strings.
sprintf %d doesn't "correspond" to \d in regex. I don't know what you mean by "universal" here.
The two 'd's don't even refer to the same thing. sprintf %d stands for "decimal" (there's also %o for "octal" and %x for "hexadecimal"), regex \d stands for "digit" (and there's \w for "word character" and \s for "(white)space character").
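The distinction drawn above can be seen in a small sketch. It assumes a Perl with default (non-locale) numeric coercion; the variable names are illustrative only. The regex treats the value as a string of digit characters, while sprintf "%d" first coerces it to a number, and that coercion recognises only ASCII digits.

```perl
use strict;
use warnings;

my $s = "2\x{669}";    # "2" followed by ARABIC-INDIC DIGIT NINE

# The regex sees a string made of two digit characters.
my $matches = $s =~ /\A\d+\z/ ? 1 : 0;

# sprintf "%d" coerces the string to a number first. Coercion stops at the
# first character that is not part of Perl's ASCII numeric syntax, so only
# the leading "2" contributes to the value.
my $formatted = do {
    no warnings 'numeric';    # silence the "isn't numeric" warning for this demo
    sprintf "%d", $s;
};

print "matches \\d+: $matches; sprintf gives: $formatted\n";
```

With warnings left enabled, the sprintf call would also report that the argument isn't numeric, which is the hint that the two features operate on different kinds of data.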
The RT System itself - Status changed from 'new' to 'open'
I agree with Lukas Mai.
The place one would make a change like this would be to change the interpretation of a string-as-a-number to accept all digits, rather than only the ASCII digits.
Simply making the change would be a massive backcompat issue. Adding a lexical pragma would not be much of a help, because to avoid bizarre effects at a distance, the scope would need to be exceedingly controlled.
It is easier, and already possible, to just use Unicode::UCD::num to convert strings of digits (or single numeric characters) into their numeric value, as needed.
-- rjbs
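A minimal sketch of the Unicode::UCD::num approach mentioned above, assuming Perl 5.14 or later (where the function is available). The sample strings are made up for illustration.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

my $nine   = num("\x{669}");           # single ARABIC-INDIC DIGIT NINE
my $twelve = num("\x{661}\x{662}");    # two digits from the same script
my $mixed  = num("1\x{662}");          # ASCII and Arabic-Indic mixed together

# num() returns undef when it does not consider the whole string a safe number,
# which is documented to include digits drawn from different scripts.
printf "%d %d %s\n", $nine, $twelve, defined $mixed ? $mixed : 'undef';
```

Note that a string such as "2.٩" from the original report contains a ".", so it would have to be split into its digit runs before each part is passed to num().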
@rjbs - Status changed from 'open' to 'rejected'
Ricardo SIGNES via RT wrote:
> Adding a lexical pragma would not be much of a help, because to avoid bizarre effects at a distance, the scope would need to be exceedingly controlled.
This is one of those cases where lexical scoping doesn't help at all. Is "2.\x{666}" a numeric scalar, and what is its numeric value? It's normal to pass numeric strings across module boundaries, and this usage breaks entirely if the modules have different string->number coercion semantics. If there's a change in the coercion behaviour, it has to be everywhere at once with no lexical or dynamic option.
I think the OP actually wanted sprintf("%.2f", "2.\x{666}") to yield "2.\x{666}0", or possibly "2.\x{666}\x{660}". That would involve attaching extra script information to numerical values, which would be ridiculous.
The original bug report to which the OP referred was from me. The OP's bug was to assume that /\d/ was a suitable thing to use in a regexp meant to match Perl's numeric syntax for string->number coercion. It's a somewhat common bug. Changing the coercion behaviour would fix those modules that share this bug; or, rather, would fix *this aspect* of them, as this bug tends to go alongside other bugs such as the use of /$/ where /\z/ is required. (The OP's code has the /$/ bug, reported separately.) Changing coercion behaviour (or /\d/) to fix these modules, at the expense of breaking modules that got it right, should be viewed similarly to the idea of changing what /$/ means for the same purpose.
-zefram
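The two pattern bugs described above can be illustrated with a short sketch. The patterns and sample strings are illustrative, not taken from the OP's module; the "strict" pattern here only covers plain unsigned integers, not Perl's full numeric syntax.

```perl
use strict;
use warnings;

my $loose  = qr/^\d+$/;         # \d is Unicode-aware; $ tolerates a trailing newline
my $strict = qr/\A[0-9]+\z/;    # ASCII digits only, anchored at the true end

my %cases = (
    'plain'            => "42",
    'trailing newline' => "42\n",
    'Arabic-Indic six' => "4\x{666}",
);

my %result;
for my $label (sort keys %cases) {
    my $str = $cases{$label};
    $result{$label} = {
        loose  => ($str =~ $loose  ? 1 : 0),
        strict => ($str =~ $strict ? 1 : 0),
    };
    printf "%-17s loose=%d strict=%d\n", $label,
        $result{$label}{loose}, $result{$label}{strict};
}
```

Only the plain "42" passes both patterns; the other two inputs are accepted by the loose pattern even though neither is a clean ASCII integer string.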
On 4 November 2013 17:04, Ricardo SIGNES via RT <perlbug-followup@perl.org> wrote:
> I agree with Lukas Mai.
> The place one would make a change like this would be to change the interpretation of a string-as-a-number to accept all digits, rather than only the ASCII digits.
> Simply making the change would be a massive backcompat issue. Adding a lexical pragma would not be much of a help, because to avoid bizarre effects at a distance, the scope would need to be exceedingly controlled.
> Easier and already possible to just use Unicode::UCD::num to convert strings of digits (or single numeric characters) into their numeric value, as needed.
This subject comes up rather regularly. Unfortunately the Unicode folks included non-ASCII "digits" in their definition of a digit.
One plan that was kicking around for a while was a regex modifier that made \d match only [0-9]. I was very much in favour of this plan.
I'm not sure how possible this is anymore given Karl's new modifiers, and it might even already be covered. I haven't checked.
But anyway, the root point here is that the problem is with \d, not with sprintf.
Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
On Mon, Nov 4, 2013 at 12:01 PM, demerphq <demerphq@gmail.com> wrote:
> One plan that was kicking around for a while was a regex modifier that made \d match only [0-9]. I was very much in favour of this plan.
> I'm not sure how possible this is anymore given Karl's new modifiers, and it might even already be covered. I haven't checked.
/\d/a
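A short sketch of what the /a modifier changes, assuming Perl 5.14 or later (the release that introduced it). The sample string is illustrative.

```perl
use 5.014;    # /a needs Perl 5.14+; this also enables strict and say
use warnings;

my $str = "1\x{666}";    # ASCII "1" followed by ARABIC-INDIC DIGIT SIX

my $unicode_rules = $str =~ /\A\d+\z/  ? 1 : 0;    # \d matches any Decimal_Number
my $ascii_rules   = $str =~ /\A\d+\z/a ? 1 : 0;    # /a restricts \d to [0-9]

say "default: $unicode_rules, /a: $ascii_rules";
```

For code that wants this behaviour throughout a lexical scope rather than per pattern, `use re '/a';` is the corresponding pragma form.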
On Mon Nov 04 09:02:15 2013, demerphq wrote:
> This subject comes up rather regularly. Unfortunately the Unicode folks included non-ASCII "digits" in their definition of a digit.
> One plan that was kicking around for a while was a regex modifier that made \d match only [0-9]. I was very much in favour of this plan.
> I'm not sure how possible this is anymore given Karl's new modifiers, and it might even already be covered. I haven't checked.
> But anyway, the root point here is that the problem is with \d, not with sprintf.
> Yves
The above hits it right on. I wasn't suggesting, *necessarily*, that sprintf change.
To the folks who thought that I was, please note the following quote from the original:
"Note... as I write the above, I wince at the size of such a task, but I ask again -- how would one print it in a helpful way? It seems like just printing the whole thing as a string is *A WAY*, but it also feels a bit like a cop-out.)
Comments, ideas and discourse that would make me feel better about taking the 'low' road (presuming that fixing sprintf's %d to match pattern matching's \d isn't already planned for and near completion). ;-) "
I thought changing sprintf would be a huge task, and wasn't comfortable with the idea.
I do feel there is an inconsistency between interpreting "\d" as a string of numeric digits, and "%d" printing out a signed integer (note, \d doesn't match "\.", so it isn't a floating point number, it would be an integer) -- same goes for including "\." and trying to print with "%f".
It's the inconsistency introduced by expanding the meaning of a digit in pattern matching, yet having the same mnemonically named "%d" fail on the digits that "\d" matches.
I'm not sure that having \d match digits that %d cannot handle is a *great idea*.... in fact, I'd lean toward the opposite.
As for fixing the above to fix anything in my code -- that's ridiculous, as the above *design flaw* (note, that isn't a bug against a spec, but a defect in design -- something I do all the time, and later correct (that's called learning)) has already been worked around in my development code.
I'm not saying, necessarily, that "\d" needs to be changed either -- only that the mismatch was a bad choice that needs either some alternate workaround or a change.
The pattern matching got enhanced with modal charset matching. That being the case, it would be a logical and helpful solution if sprintf got similar treatment.
Rejecting this as not being a problem is called "burying one's head in the sand".
On Tue, Nov 5, 2013 at 7:15 PM, Linda Walsh via RT <perlbug-followup@perl.org> wrote:
> I do feel there is an inconsistency between interpreting "\d" as a string of numeric digits, and "%d" printing out a signed integer
Then you must really hate the difference between %u and \u %x and \x
On Tue Nov 05 20:12:29 2013, ikegami@adaelis.com wrote:
> Then you must really hate the difference between %u and \u %x and \x
None of those have the historical usage.
Perl's usage of \u conflicts with GNU and shell usage (likely POSIX as well, given GNU's and bash's POSIX bent lately), though if hex were my primary counting system, the fact that %x works w/numbers and \x works with characters might bug me more. But since neither is nearly as commonly used as \d & %d, neither is nearly as ripe an area for inconsistent usage.
On 06.11.2013 05:26, Linda Walsh via RT wrote:
> On Tue Nov 05 20:12:29 2013, ikegami@adaelis.com wrote:
>> Then you must really hate the difference between %u and \u %x and \x
> None of those have the historical usage.
> Perl's usage of \u conflicts with gnu and shell usage (likely POSIX as well given gnu's and bash's posix bent lately), though if hex was my primary counting system, that %x works w/numbers and \x works with characters might bug me more. But since neither are nearly as commonly used as \d & %d, neither are nearly the ripe area for inconsistent usage.
What about the difference between %s and \s?
-- Lukas Mai <plokinom@gmail.com>
On Tue, Nov 5, 2013 at 11:26 PM, Linda Walsh via RT <perlbug-followup@perl.org> wrote:
> On Tue Nov 05 20:12:29 2013, ikegami@adaelis.com wrote:
>> Then you must really hate the difference between %u and \u %x and \x
> None of those have the historical usage.
huh? %u and %x are just as old!
    sprintf "%d"   # Converts a number to (d)ecimal
    //d            # (d)efault
    tr///d         # (d)elete
    /\d/           # Matches a digit (has nothing to do with numbers, even if it just matched [0-9]).
And of course there's %s and \s.
    sprintf "%s"   # Interpolates a (s)tring
    //s            # (s)ingle
    s///           # (s)ubstitute
    /\s/           # Matches a white(s)pace character.
Anyway, there's no design flaw. /\d/ doesn't match numbers. There are too many definitions of numbers. Even sprintf's definition varies, and it definitely differs from Perl's.
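To make the point that "/\d/ doesn't match numbers" concrete, here is a small sketch comparing a digits-only pattern with Scalar::Util's looks_like_number, which tests whether Perl itself would treat a string as numeric. The sample inputs and labels are made up for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

my @samples = (
    [ 'exponent',         "1e3"     ],
    [ 'negative',         "-12"     ],
    [ 'leading space',    " 7"      ],
    [ 'Arabic-Indic six', "\x{666}" ],
    [ 'trailing junk',    "12abc"   ],
);

my %seen;
for my $sample (@samples) {
    my ($label, $str) = @$sample;
    $seen{$label} = {
        digits  => ($str =~ /\A\d+\z/       ? 1 : 0),    # string of digit characters?
        numeric => (looks_like_number($str) ? 1 : 0),    # numeric by Perl's own rules?
    };
    printf "%-17s digits=%d numeric=%d\n", $label,
        $seen{$label}{digits}, $seen{$label}{numeric};
}
```

The two tests disagree in both directions: "1e3" is numeric to Perl but is not a run of digits, while the Arabic-Indic six is a digit character that Perl does not treat as a number.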
On Wed Nov 06 06:29:05 2013, ikegami@adaelis.com wrote:
> On Tue, Nov 5, 2013 at 11:26 PM, Linda Walsh via RT <perlbug-followup@perl.org> wrote:
>> On Tue Nov 05 20:12:29 2013, ikegami@adaelis.com wrote:
>>> Then you must really hate the difference between %u and \u %x and \x
>> None of those have the historical usage.
> huh? %u and %x are just as old!
You are missing the "and" between the pairs.
Or are you claiming \u used to present strings that were printable by %u, and that \x returned strings that were printable via %x? If that is your claim, I will stand corrected.
No, it was that anything read by what \d matched was printable by %d (until perl broke it).
\d was created as a short-hand for [0-9] -- not all forms of integers in any format. It didn't match abcdef -- even though in a different encoding, they are hex digits. It didn't match I II III IV, either, as they are in a different locale. So \d wasn't designed to match any "number symbol" -- it never did. It only matched [0-9].
In perl, it has been made worthless in all locales and languages. GREAT JOB GUYS!
Give me any good usage for "\d" in common usage.
(You can't.)
This all comes down to the same problem that contributed to my not being on the devel list --- I wanted to make that work, like Unicode, based on locale settings. I have en_US.utf8 (used to be en_US.UTF-8, but that seems to no longer be in vogue, likely so "-" could be significant) in all entries except collation ("C").
I see this being the same *type* of bug that caused problems when Unicode was first implemented.
Perl knows the difference between different ranges in Unicode, but, unlike most other multi-lingual programs, it refuses to use locale settings at all, by default, and then does it wrong when you do enable them.
My locale clearly says (among other things):
LANG=en_US.utf8
LC_CTYPE=en_US.utf8
LC_NUMERIC=en_US.utf8
Yet this "works"[sic]:
echo "ENV=$PERL5OPTS"
ENV=
perl -CL -we 'use 5.16.0;use utf8;my $num="1٦";
printf "%s\n", $num =~ /^\d+$/?"T":"F";'
T
#or equiv:
perl -CL -we 'use 5.16.0;use utf8;my $num="1\x{666}";
printf "%s\n", $num =~ /^\d+$/?"T":"F";'
I would assert this bug is valid, since the matching code isn't paying attention to the locale as specified.
The Arabic-Indic number "6" is not a number in locale en_US.* (including utf8).
Thank you for the "discussion" that helped me find a bug I'd call important.
I would also point out that this would make \d useful again in every locality.
If someone specifies *no* language or country code but only "UTF8" in LC_NUMERIC, then the current behavior might be correct.
TLDR:
+---------------------------------------------------------------------------+
| This ticket should be closed until Perl's UTS18 Level 3 conformance is a  |
| documented, non-experimental, and supported element of its regex match.   |
+---------------------------------------------------------------------------+
"Linda Walsh via RT" <perlbug-followup@perl.org> wrote on Wed, 06 Nov 2013 15:19:56 PST:
> The Arabic-Indic number "6", is not a number in locale en_US.* (including utf8).
That's not the way it works. Code point U+0666 is a numeric digit by definition, and this has nothing to do with your so-called "locale".
It is a digit because under UAX#44, the Unicode Standard assigns the character property General_Category=Decimal_Number to that code point, and this in turn derives from its being a Numeric_Type=Decimal.
Character properties are not optional. That's part of the real standard, Unicode Standard Annex #44: "Unicode Character Database". This isn't like some optional UTS or something. It's a UAX, which means you *have* to do what it says. This cannot be argued.
What you are hissing over is Annex C: "Compatibility Properties" from Unicode Technical Standard #18: "Unicode Regular Expressions", in which it gives two possible interpretations of \d:
    Property      Standard Recommendation    POSIX Compatible
    =========================================================
    digit (\d)    \p{gc=Decimal_Number}      [0..9]
Perl has always:
(1) Taken UTS 18 as part of the de-facto standard.
(2) Followed the Standard Recommendation.
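The property assignment described above can be inspected from Perl itself. This is a small sketch using Unicode::UCD::charinfo plus the two property classes that correspond to the table's columns; it assumes a Perl recent enough to provide \p{PosixDigit}.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Look up the Unicode Character Database entry for U+0666.
my $info = charinfo(0x0666);

# "Standard Recommendation" column: any Decimal_Number (General_Category Nd).
my $is_nd = "\x{666}" =~ /\A\p{Nd}\z/ ? 1 : 0;

# "POSIX Compatible" column: only the ASCII digits 0 through 9.
my $is_posix = "\x{666}" =~ /\A\p{PosixDigit}\z/ ? 1 : 0;

print "$info->{name}: category=$info->{category}, decimal=$info->{decimal}\n";
print "Nd: $is_nd, PosixDigit: $is_posix\n";
```

A pattern author who wants the POSIX-compatible reading can therefore already spell it out explicitly, independent of what \d means.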
If you want locale-tailoring, then you need something like
\T{<locale_id>}..\E
Where <locale_id> is a CLDR locale, not some pansy-sass POSIX locale, thus admitting this solution:
m{ \T{\
However, that comes from UTS 18's Level 3, a conformance level that Perl has never purported to have anything whatsoever to do with. At all. We barely squeak by through Level 1 (arguably), and have several Level 2 features. But Level 3? No.
It is behaving precisely as documented and indeed as the Standard requires, irrespective of your personal likes or dislikes.
+---------------------------------------------------------------------------+
| This ticket should be closed until Perl's UTS18 Level 3 conformance is a  |
| documented, non-experimental, and supported element of its regex match.   |
+---------------------------------------------------------------------------+
In the meanwhile, you are welcome to support a match bringing Perl up to speed with Level 3. I'm sure many would appreciate that.
--tom
On Wed, Nov 6, 2013 at 6:19 PM, Linda Walsh via RT <perlbug-followup@perl.org> wrote:
> On Wed Nov 06 06:29:05 2013, ikegami@adaelis.com wrote:
>> On Tue, Nov 5, 2013 at 11:26 PM, Linda Walsh via RT <perlbug-followup@perl.org> wrote:
>>> On Tue Nov 05 20:12:29 2013, ikegami@adaelis.com wrote:
>>>> Then you must really hate the difference between %u and \u %x and \x
>>> None of those have the historical usage.
>> huh? %u and %x are just as old!
> You are missing the "and" between the pairs.
No, I'm not. My whole point is that there is no parallel.
> Or are you claiming \u used to present strings that were
I'm "claiming" that \u doesn't match an unsigned integer, just like \d doesn't match a signed integer.
> \d was created as a short-hand for [0-9]
Perhaps, but not for "-1,234" (or can %d only do "-1234"? no matter). You need to do more than limit \d to [0-9] to match your locale's numbers.
> Give me any good usage for "\d" in common usage.
> (You can't.)
Was that supposed to be addressed to me? That has nothing to do with my comments.
> Perl knows the difference between different ranges in Unicode, but, unlike most other multi-lingual programs, it refuses to use locale settings at all, by default, and then does it wrong when you do enable them.
Are you now suggesting \d *should* match something other than [0-9]???
On Wed Nov 06 16:19:13 2013, tom christiansen wrote:
In English: Having to do with locale support, how?
> | This ticket should be closed until Perl's UTS18 Level 3 conformance is a  |
> | documented, non-experimental, and supported element of its regex match.   |
> That's not the way it works. Code point U+0666 is a numeric digit by definition, and this has nothing to do with your so-called "locale".
It's **YOUR** so-called "locale". Go read the 1) perlre page. Specifically:
"Perl continues to support the old locale system, and starting in v5.16, provides a hybrid way to use the Unicode character set, along with the other portions of locales that may not be so problematic."
Except that perl never really supported the locale system -- it doesn't seem to support the language or country codes that are part of the locale system perl claims to support.
> It is a digit because under UAX#44, the Unicode Standard assigns the character property General_Category=Decimal_Number to that code point, and this in turn derives from its being a Numeric_Type=Decimal.
But it is NOT a digit in my locale. It is a digit in someplace that uses Arabic-Indic numbers.
I stated that I asked for matching that was specific to my locale\, which is EN_US.\
> Character properties are not optional.
Neither are they currently being locally appropriate or useful.
If I use the -CL switch -- I see that as conforming to my locale settings. As such, \d should match my locale's definition of digits -- not the whole world's definition.
Otherwise, I will ask you the same Q that Eric dodged. What would be the use case of "\d" outside of anything that is unicode-project specific?
Can I use it to check my input forms? (No.) It no longer serves the purpose for which it was intended.
> Perl has always:
> (1) Taken UTS 18 as part of the de-facto standard.
> (2) Followed the Standard Recommendation.
> In the meanwhile, you are welcome to support a match
I'll inquire about why the standard for CLDR doesn't include regex matching. The current implementation is conceptually broken (even if technically accurate).
I ask ANYONE, how is "\d" still useful in standard day-to-day programming?
Its usage has been appropriated by some pay-to-play standards group that is not open.
Adhering to such standards (the new "prescriptive" POSIX also falls into this category; new=post 2002) mindlessly reduces humans to little more than machine parts...
On 7 November 2013 16:09, Linda Walsh via RT <perlbug-followup@perl.org> wrote:
> It's **YOUR** "so called "locale". Go read the 1) perlre page. Specifically :
> Perl continues to support the old locale system, and starting in v5.16, provides a hybrid way to use the Unicode character set, along with the other portions of locales that may not be so problematic.
Maybe you could clarify your request as follows:
Your request is not that sprintf %d should interpret its parameter specially.
Your request is more that sprintf should render its parameter in a locale-sensitive way.
i.e.:
sprintf "%d", 2.5 # should emit a character string that represents 2.5 in the relevant locale
and by proxy, sprintf "%d", $locale_specific_numeric_string should first decode that numeric string via its intended locale to internal representation, convert it to an integer, and then re-emit it in a locale-sensitive way.
This entirely sidesteps the argument about regexp's \d.
Then the question becomes "Should sprintf do that, or is sprintf intended to be lower level?"
Though for lower-level things, we have pack/unpack where you want machine-level interpretation.
sprintf seems more "human oriented" than machine oriented, so it makes sense to have some locale support.
But as sprintf is a *print formatting tool*, not a binary interface tool, it makes sense that it would be tasked with interpreting values to users in locale-relevant forms.
Either that, or we need a function similar to sprintf tasked with formatting things in locale-sensitive ways.
-- Kent
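As a rough sketch of the "function similar to sprintf" idea above: the helper below is purely hypothetical (format_in_script is not an existing Perl or CPAN function), and it assumes the target script's ten decimal digits occupy consecutive code points starting at its zero, as the Arabic-Indic digits U+0660..U+0669 do. It ignores everything else a real locale would govern, such as grouping separators.

```perl
use strict;
use warnings;

# Hypothetical helper: format an integer with "%d", then map each ASCII digit
# onto the digit of the same value in the script whose zero is at $zero.
sub format_in_script {
    my ($number, $zero) = @_;
    my $text = sprintf "%d", $number;
    $text =~ s/([0-9])/chr($zero + $1)/ge;    # sign characters are left alone
    return $text;
}

my $arabic_indic = format_in_script(-29, 0x0660);

# Print code points rather than the raw characters, to stay encoding-neutral.
print join(' ', map { sprintf 'U+%04X', ord } split //, $arabic_indic), "\n";
```

Keeping this in a separate function, rather than changing sprintf, leaves existing callers of "%d" unaffected, which is the back-compatibility concern raised earlier in the thread.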
On Wed Nov 06 20:07:14 2013, kentfredric@gmail.com wrote:
> Maybe you could clarify your request as follows:
> Your request is not that sprintf %d should interpret its parameter specially.
> Your request is more that, sprintf should render its parameter in a locale sensitive way.
> ....
> But as sprintf is a *print formatting tool*, not a binary interface tool, it makes sense that it would be tasked with interpreting values to users in locale relevant forms.
> Either that, or we need a function similar to sprintf tasked with formatting things in locale-sensitive ways.
I agree with what you are saying, wholeheartedly.
Given that, and given the situation where a user has asked that their regex pattern match according to their locale, then would it make sense to also have "\d" **match** in a locale-sensitive way?
I.e. in Indic-arabia, (?!?) it would match its numbers. In latin1 based locales, it would match with the numbers in the basic latin set.
I don't understand why both are not equally valid needs (I.e. one doesn't obviate or preclude the other).
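What is being asked for here can be approximated today in user code. The sketch below is an assumption-laden illustration, not an existing Perl feature: the %digit_class_for table, its locale names, and is_digit_string are all hypothetical, and a real solution would need locale data (for example from CLDR) rather than a hand-written table.

```perl
use strict;
use warnings;

# Hypothetical mapping from a locale name to the digit class it should accept.
my %digit_class_for = (
    'en_US' => qr/[0-9]/,
    'ar_EG' => qr/[\x{660}-\x{669}]/,    # Arabic-Indic digits
);

# Return 1 if $str consists only of digits accepted for $locale, else 0.
sub is_digit_string {
    my ($str, $locale) = @_;
    my $class = $digit_class_for{$locale} or return 0;    # unknown locale
    return $str =~ /\A$class+\z/ ? 1 : 0;
}

print is_digit_string("16",               'en_US'), "\n";
print is_digit_string("1\x{666}",         'en_US'), "\n";
print is_digit_string("\x{661}\x{666}",   'ar_EG'), "\n";
```

This keeps the choice of digit set explicit at the call site, which avoids the action-at-a-distance problem raised earlier about changing what \d means globally.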
Migrated from rt.perl.org#120448 (status was 'rejected')
Searchable as RT120448$