more documentation lies, plus The Unicode Bug *again*

p5pRT commented 12 years ago

Migrated from rt.perl.org#103492 (status was 'resolved')

Searchable as RT103492

p5pRT commented 12 years ago

From tchrist@perl.com

The documentation on the %n format in s/printf is lying.

%n special: *stores* the number of characters output so far into the next variable in the parameter list

It does no such thing.

This stores it into the next variable:

    % perl -le 'printf "%s aged %d.%n That was %d chars.\n", "Naivete", 16, ($count) x 2'
    Naivete aged 16. That was 16 chars.

The next variable should still be $count here, but it flubs:

    % perl -le 'printf "%s aged %d.%n That was %d chars.\n", "Naivete", 16..18, ($count) x 2'
    Modification of a read-only value attempted at -e line 1.

So it is into the next *argument*, not the next variable (or it would have found $count), which must be a scalar lvalue.

Or something. This it gets wrong:

    % perl -lE '$count = "like FF"; printf "%s aged %d.%n That was %s chars.\n", "\xDF", 16, substr($count,-2), $count'
    ß aged 16. That was like FF chars.

And no, it isn't order of evaluation:

    % perl -lE '$count = "like FF"; printf "%s aged %d.%n", "\xDF", 16, substr($count,-2); printf " That was %s chars.\n", $count'
    ß aged 16. That was like FF chars.

That's just as wrong as:

    % perl -lE '$count = "like FF"; printf "%s aged %d.%n That was %s chars.\n", "\xDF", 16, \$count, $count'
    ß aged 16. That was like FF chars.

    % perl -lE '$count = "like FF"; printf "%s aged %d.%n", "\xDF", 16, \$count; printf " That was %s chars.\n", $count'
    ß aged 16. That was like FF chars.

But it's silent. What gives?

    % perl -lE '$count = "like FF"; printf "%s aged %d.%n", "\xDF", 16, ${ \$count }; printf " That was %s chars.\n", $count'
    ß aged 16. That was 10 chars.

In fact, it doesn't squawk at all:

    % perl -lE '$count = "like FF"; printf "%s aged %d.%n", "\xDF", 16, \@count; printf " That was %s chars.\n", $count'
    ß aged 16. That was like FF chars.

But that's nothing. You probably didn't notice, but it's lying about "the number of characters output" so far.

    1% perl -le 'printf "%s aged %d.%n That was %d chars.\n", "Naivete", 16, ($count) x 2'
    Naivete aged 16. That was 16 chars.

    2% perl -le 'printf "%s aged %d.%n That was %d chars.\n", "Na\xEFvet\xE9", 16, ($count) x 2'
    Naïveté aged 16. That was 16 chars.

    3% perl -le 'printf "%s aged %d.%n That was %d chars.\n", "Nai\x{308}vete\x{301}", 16, ($count) x 2'
    Naïveté aged 16. That was 20 chars.

That's completely untrue. Watch:

    1% perl -le 'printf "%s aged %d.", "Naivete", 16' | uniwc
       Paras    Lines    Words   Graphs    Chars    Bytes File
           0    undef        3       16       16       16 standard input

    2% perl -le 'printf "%s aged %d.", "Na\xEFvet\xE9", 16' | uniwc
       Paras    Lines    Words   Graphs    Chars    Bytes File
           0    undef        3       16       16       18 standard input

    3% perl -le 'printf "%s aged %d.", "Nai\x{308}vete\x{301}", 16' | uniwc
       Paras    Lines    Words   Graphs    Chars    Bytes File
           0    undef        3       16       18       20 standard input

Even if we pretend that "characters" means code points, that's just wrong all over the place. And that's not all; it gets still worse as the Unicode Bug rears its ugly head. First, a baseline for correctness:

    % perl -le 'printf "%s aged %d.%n That was %d chars.\n", "A", 16, ($count) x 2'
    A aged 16. That was 10 chars.

Fine.

    % perl -le 'printf "%s aged %d.%n That was %d chars.\n", "\xDF", 16, ($count) x 2'
    ß aged 16. That was 10 chars.

Still fine.

    % perl -le 'printf "%s aged %d.%n That was %d chars.\n", "\x{3b1}", 16, ($count) x 2'
    α aged 16. That was 11 chars.

Wrong. "A", "\xDF", and "\x{3b1}" are each exactly 1 "character" long. Therefore it should have still been 10 chars, not 11.

    % perl -le 'printf "%s aged %d.%n That was %d chars.\n", "\xDF\x{3b1}", 16, ($count) x 2'
    ßα aged 16. That was 13 chars.

Super wrong! I added one more character to something that was 10 characters. 10 + 1 is 11, not 13. The length of "\xDF\x{3b1}" is 2, not 4.
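The counts expected above (10, 10, 10, and 11) can be reproduced without %n at all by formatting the prefix on its own and taking its `length`. This is only a sketch: `chars_so_far` is a hypothetical helper name, not anything in core Perl, and it counts code points, not graphemes or print columns.

```perl
use strict;
use warnings;

# Hypothetical helper: how many characters (code points) would this
# format have produced so far? Formats the prefix separately and
# measures it with length(), which counts code points.
sub chars_so_far {
    my ($format, @args) = @_;
    return length(sprintf($format, @args));
}

for my $name ("A", "\xDF", "\x{3b1}", "\xDF\x{3b1}") {
    my $n = chars_so_far("%s aged %d.", $name, 16);
    printf "%d code point(s) in name, %d characters so far\n",
        length($name), $n;
}
```

Only numbers are printed, so the sketch does not depend on how the output handle is encoded.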

Count\, damn it! What good is a computer that can't count?

It's all pure nonsense, which is the sweetest way I can put it.

So other than buggery, what is this %n thing *supposed* to be for? I really have no idea, considering that there are exactly zero examples of it in the documentation, at least that I could locate.

Would the person who put this %n gizmo into Perl please speak up, tell us what you thought it was going to be for, and supply proper examples of its "correct" use? Assuming that is possible. It may not be.

As it is, I deem %n far too broken for anyone to be using. It only knows how to lie and to do the wrong thing.

--tom

PS: Assume I'm running with PERL_UNICODE=SA... because I am.

Summary of my perl5 (revision 5 version 14 subversion 0) configuration:
  Platform:
    osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
    uname='openbsd chthon 4.4 generic#0 i386 '
    config_args='-des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib
    libs=-lgdbm -lm -lutil -lc
    perllibs=-lm -lutil -lc
    libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):
  Compile-time options: MYMALLOC PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
                        PERL_PRESERVE_IVUV USE_LARGE_FILES USE_PERLIO
                        USE_PERL_ATOF
  Built under openbsd
  Compiled at Jun 11 2011 11:48:28
  %ENV:
    PERL_UNICODE="SA"
  @INC:
    /usr/local/lib/perl5/site_perl/5.14.0/OpenBSD.i386-openbsd
    /usr/local/lib/perl5/site_perl/5.14.0
    /usr/local/lib/perl5/5.14.0/OpenBSD.i386-openbsd
    /usr/local/lib/perl5/5.14.0
    /usr/local/lib/perl5/site_perl/5.12.3
    /usr/local/lib/perl5/site_perl/5.11.3
    /usr/local/lib/perl5/site_perl/5.10.1
    /usr/local/lib/perl5/site_perl/5.10.0
    /usr/local/lib/perl5/site_perl/5.8.7
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl/5.6.0
    /usr/local/lib/perl5/site_perl/5.005
    /usr/local/lib/perl5/site_perl
    .

p5pRT commented 12 years ago

From tchrist@perl.com

It's quite possible I'm missing something, since if there's one thing even scarier than reading through perlfunc's entry on sprintf formats, it's reading through sv_vcatpvfn in sv.c. Egad, does that ever scream for a massive rewrite!

With that proviso, I have more on the %n mystery.

The %n format for printf/sprintf first appeared in the 5.004 version of perlfunc. Chip, that was your baby: did *you* introduce this? Or do you know anything about it? If so, could you please help shed some light on its origin and purpose? I can't uncover anything at all.

I did find evidence that the %n code must have been updated since 5.004 though, since someone seems to have added the funky native-C types that Jan has said probably nobody should be using outside XS. Who did that and what were they in turn thinking? It turns out that %n is not just %n, but rather the remarkable *AND WHOLLY UNDOCUMENTED*

%(?:hh?|ll?|[Vztjq])?n

So what *is* this lurking bit of exotica for\, *really*? Please?

As 5.004 antedates Unicode integration, this perhaps explains why %n reports only the number of *internal* bytes used, which is absolutely useless. %n does not report the number of Unicode characters (code points), and even its byte reporting is unreliable, because this may be either 1 or 2 for code points 128–255 depending on whether the rest of the string contains code points lying above 255. If not, it "lies", or at least gives erroneous (read: buggy) results.
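The 1-or-2-byte flip for code points 128–255 can be observed directly, without %n. The following is a minimal sketch, assuming an ASCII-based (non-EBCDIC) platform; it uses `bytes::length` purely as a diagnostic for the internal representation and `utf8::upgrade` to force the internal UTF-8 form that a string acquires once it is combined with code points above 255.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

my $s = "\xDF";                          # U+00DF, stored as a single byte
my $chars_before = length($s);           # code points
my $bytes_before = bytes::length($s);    # internal bytes

utf8::upgrade($s);                       # same string, internal UTF-8 form
my $chars_after = length($s);
my $bytes_after = bytes::length($s);

printf "chars: %d -> %d, internal bytes: %d -> %d\n",
    $chars_before, $chars_after, $bytes_before, $bytes_after;
```

The character count stays at 1 while the internal byte count goes from 1 to 2, which is the difference a byte-counting %n exposes.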

Since %n does not reliably report any useful number for anything beyond the ASCII range, I cannot see why %n exists. It seems yet another lurking 7-bit relic that has no place in today’s larger repertoire.

Grepping cpan for 'printf[^\n]*%n\b' turns

http://grep.cpan.me/?q=printf%5B%5E%5Cn%5D*%25n%5Cb

up only three hits, and these are *not* for the built-in %n:

    Chemistry-Mol-0.37/Mol.pm
        $mol->sprintf("%s - %n (%f). %a atoms, %b bonds; "
            . "mass=%m; charge =%q; type=%t; id=%i");

    PerlMol-0.3500/inc/BUNDLES/Chemistry-Mol-0.35/Mol.pm
        $mol->sprintf("%s - %n (%f). %a atoms, %b bonds; "
            . "mass=%m; charge =%q; type=%t; id=%i");

    PerlMol-0.3500/examples/molgrep/molgrep.pl
        # %n is the name, %f the formula, and %S the canonical SMILES
        print "$filename\t" . $mol->sprintf("%n\t%f\t%S\n");
        }
        ITUB/PerlMol-0.3500

That’s because Chemistry::Mol has its very own sprintf:

=item $s = $mol->sprintf($format)

Format interesting molecular information in a concise way, as specified by a printf-like format.

    %n - name
    %f - formula
    %f{formula with format} - (note: right braces within
        the format should be escaped with a backslash)
    %s - SMILES representation
    %S - canonical SMILES representation
    %m - mass
    %8.3m - mass, formatted as %8.3f with core sprintf
    %q - formal charge
    %a - atom count
    %b - bond count
    %t - type
    %i - id
    %% - %

For example, if you want just about everything:

    $mol->sprintf("%s - %n (%f). %a atoms, %b bonds; "
        . "mass=%m; charge =%q; type=%t; id=%i");

So I can’t find anything that even *tries* to use %n.

And yes, I did look harder. In particular, considering this from sv_vcatpvfn:

    case 'n':
        if (vectorize)
            goto unknown;
        i = SvCUR(sv) - origlen;
        if (args) {
            switch (intsize) {
            case 'c':       *(va_arg(*args, char*)) = i; break;
            case 'h':       *(va_arg(*args, short*)) = i; break;
            default:        *(va_arg(*args, int*)) = i; break;
            case 'l':       *(va_arg(*args, long*)) = i; break;
            case 'V':       *(va_arg(*args, IV*)) = i; break;
            case 'z':       *(va_arg(*args, SSize_t*)) = i; break;
            case 't':       *(va_arg(*args, ptrdiff_t*)) = i; break;
    #if HAS_C99
            case 'j':       *(va_arg(*args, intmax_t*)) = i; break;
    #endif
            case 'q':
    #ifdef HAS_QUAD
                            *(va_arg(*args, Quad_t*)) = i; break;
    #else
                            goto unknown;
    #endif
            }
        }
        else
            sv_setuv_mg(argsv, (UV)i);
        continue;       /* not "break" */

I also CPAN-grepped for

printf[^\n]*%(?:hh?|ll?|[Vztjq])?n\b

using

http://grep.cpan.me/?q=printf%5B%5E%5Cn%5D*%25%28%3F%3Ahh%3F%7Cll%3F%7C%5BVztjq%5D%29%3Fn%5Cb

But that turned up no further hits than the simpler grep I showed earlier.

Yet %n has sat there in the perlfunc manpage for 14 long years, all without a single example and, today, without any applicability to trans-ASCII data.

What gives?

Yes, using printf widths as in %6s is also useless for non-ASCII. But at least it doesn’t sometimes count \xE9 as one character and sometimes as two!

    % perl -E 'say sprintf("|%-6s|", "caf\xE9") =~ s/ /./gr'
    |café..|

    % perl -E 'say sprintf("|%-6s|", "\x{10d}af\xE9") =~ s/ /./gr'
    |čafé..|

See? Just two dots no matter whether the internal UTF8 rep takes a single byte for E9 or two of them. %n flipflops here.

Of course, as is well known already, one can’t actually *use* widths in printf anymore, since it doesn’t grok print columns the way GCString does. These are wrong, since they should have two dots, not one:

    % perl -E 'say sprintf("|%-6s|", "cafe\x{301}") =~ s/ /./gr'
    |café.|

    % perl -E 'say sprintf("|%-6s|", "\x{10d}afe\x{301}") =~ s/ /./gr'
    |čafé.|

Whereas U::GCString gets all those correct every time:

    % perl -MUnicode::GCString -E '$s = new Unicode::GCString "cafe\x{301}"; say sprintf("|%s%s|", $s, " " x (6 - $s->columns)) =~ s/ /./gr'
    |café..|

    % perl -MUnicode::GCString -E '$s = new Unicode::GCString "\x{10d}afe\x{301}"; say sprintf("|%s%s|", $s, " " x (6 - $s->columns)) =~ s/ /./gr'
    |čafé..|

    % perl -MUnicode::GCString -E '$s = new Unicode::GCString "\x{10d}af\x{e9}"; say sprintf("|%s%s|", $s, " " x (6 - $s->columns)) =~ s/ /./gr'
    |čafé..|

Though this bug lamentably remains for want of my tuits to report it to the author:

    % perl -MUnicode::GCString -E '$s = new Unicode::GCString "cafe\x{e9}"; say sprintf("|%s%s|", $s, " " x (6 - $s->columns)) =~ s/ /./gr'
    new: Unicode string must be given. at -e line 1.
    Exit 255
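Where Unicode::GCString is not installed, the padding arithmetic in the commands above can be roughly approximated in core Perl by counting extended grapheme clusters with `\X`. This is a sketch only: `pad_graphemes` is a hypothetical helper, and it assumes one print column per cluster, so unlike `columns` it gets wide (East Asian) and zero-width characters wrong.

```perl
use strict;
use warnings;

# Hypothetical helper: left-justify $s in a field of $width columns,
# counting one column per extended grapheme cluster (\X).
# Assumption: every cluster is exactly one column wide.
sub pad_graphemes {
    my ($s, $width) = @_;
    my $clusters = () = $s =~ /\X/g;    # count the \X matches
    my $fill = $width - $clusters;
    $fill = 0 if $fill < 0;
    return $s . (" " x $fill);
}

for my $word ("cafe\x{301}", "\x{10d}af\x{e9}") {
    my $padded = pad_graphemes($word, 6);
    my $spaces = () = $padded =~ / /g;
    printf "%d code points, %d fill columns\n", length($word), $spaces;
}
```

Both words get two fill columns even though one is five code points long and the other four.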

U::GCString aside, what should be done about %n, this thing that I can’t find anything on CPAN that uses? I propose that at the barest minimum, its documentation be amended to the effect that:

* %n does *not* write into the next *variable* as it currently misleadingly states, but rather into the next *argument*, which must evaluate to a (non-readonly) scalar variable that does *not* hold a reference. [If anyone can explain what the heck it is doing with substr()s or with \$ args, please tell me.]

* %n is unreliable on anything but 100% ASCII strings, and so cannot be used with Perl’s full native character repertoire.

I feel those are the minimal acceptable and required actions. Is there any disagreement with my proposed amendment to the documentation? What am I not understanding or forgetting? Please let me know. I honestly want to know.

Stronger but still perfectly reasonable measures include one but not both of:

* Having %n emit a mandatory warning if it has to

* (Somehow) fixing it to work correctly with code points over 127\, including and especially code points in the ever-delicate 128–255 range.

* Deprecating it with an eye toward removing it altogether as something that just didn’t work.

Obviously the first is much, much harder than the second.

Any opinions, suggestions, and in particular *experiences* regarding the mysterious %n would be greatly appreciated.

--tom

p5pRT commented 12 years ago

From @Hugmeir

Apparently this was introduced here
http://perl5.git.perl.org/perl.git/commit/fc36a67e8855d031b2a6921819d899eb149eee2d
So there's also a chance that Hugo van der Sanden is the true culprit! (The plot thickens)

I can see this being minimally useful as an alternative (which doesn't require loading B) to $num_octets = () = /\C/g, but that's hardly an argument to keep it around. So here's a +1 for deprecation, so that rather than mending the docs, we could simply remove it from them.

p5pRT commented 12 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 12 years ago

From @hvds

Brian Fraser <fraserbn@gmail.com> wrote:
:Apparently this was introduced here
:http://perl5.git.perl.org/perl.git/commit/fc36a67e8855d031b2a6921819d899eb149eee2d
:So there's also a chance that Hugo van der Sanden is the true culprit! (The
:plot thickens)

There was a bunch of rewriting of sprintf around that time, initially motivated by a security issue.

If I remember right, the problem started at the point the original code decided "this is too hard for me to format myself, let me hand it to the C library instead". The intent of the rewrite was to stop attempting to use the C library sprintf.

I don't remember who added it, but as far as I know %n was added because C library versions of sprintf had it. They still do - right now on this Ubuntu box, `man 3 sprintf` gives:

    The conversion specifier
    [...]
    n      The number of characters written so far is stored into the
           integer indicated by the int * (or variant) pointer argument.
           No argument is converted.

I do not think it was ever expected to be particularly useful for perl code, but the functionality is also directly accessible to C/XS code, for which %n might have greater utility.

Quite what it should do in a modern perl I have no idea: I'm not entirely sure even when it is useful from C, but I suspect it is more to do with "is this buffer big enough" than anything else, in which case "number of bytes" seems the more useful thing for it to record. For perl code I find it difficult to imagine any uncontrived use bar obfuscation.

Hugo

p5pRT commented 12 years ago

From tchrist@perl.com

Thank you, Hugo, for taking the time to fill us in with the history.

--tom

p5pRT commented 12 years ago

From @cpansprout

I did a naïve CPAN grep for /printf.*%n/. The only code using this that I found was <http://cpansearch.perl.org/src/COMSKIL/Comskil-JIRA-00.11329/lib/Comskil/JWand.pm>, which uses it by mistake:

        my $ilist = eval {
            $self->{client_handle}->getIssuesFromJqlSearch($jql,$self->{':max_results'})
        };
        if ($@) {
            carp sprintf("getIssuesFromJqlSearch('%s',%n): %s",$jql,$self->{':max_results'},$@);
            last;
        }

At first glance (just looking at the sprintf line), I thought it was for indenting more debugging output to follow, which would make sense.

So, by accident, I’ve come up with a use case.
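That indentation use case can be sketched without %n by formatting the prefix first and measuring it with `length`, which counts characters, not internal bytes. Everything here is illustrative: `prefix_and_indent`, the format string, and the sample query are hypothetical, not taken from Comskil::JWand.

```perl
use strict;
use warnings;

# Hypothetical helper: format a prefix and return it together with a run
# of spaces exactly as many characters wide as the prefix.
sub prefix_and_indent {
    my ($format, @args) = @_;
    my $prefix = sprintf($format, @args);
    return ($prefix, " " x length($prefix));
}

my ($prefix, $indent) = prefix_and_indent("search('%s', ", "project = PERL");

# Continuation lines line up under the end of the prefix.
print $prefix, "max=50):\n";
print $indent, "request timed out\n";
```

Because the width comes from `length`, it does not change with the string's internal representation; it is still a code point count, so combining characters would need the grapheme-aware treatment discussed earlier in the thread.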

Allowing the bytes to leak through to the Perl level is certainly a bug. There is no C code using this (afaict), and it wouldn’t be particularly helpful to know the buffer size, anyway, as it could change during the *same* sprintf operation if Unicode characters are encountered in later arguments.

So I would suggest we just fix this to work with characters. It looks like something that was simply missed during the Unicode overhaul.

--

Father Chrysostomos

p5pRT commented 12 years ago

From @Hugmeir

Heh, timely reply. I was just reading http://www.amazon.com/Professional-Perl-Programming-Simon-Cozens/dp/1861004494 which says this about %n:

"%n writes length of current output string into next variable. [...] The %n placeholder is unusual in that it assigns the length of the string generated so far, to the next item in the list (which must therefore be a variable)."

p5pRT commented 12 years ago

From @cpansprout

On Mon Nov 14 06:19:11 2011\, tom christiansen wrote:

It's quite possible I'm missing something\, since if there's one thing even scarier than reading through perlfunc's entry on sprintf formats\, it's reading through sv_vcatpvfn in sv.c. Egad\, does that ever scream for a massive rewrite!

I see nothing wrong with it.

You clearly haven’t looked in mro.c. :-)

--

Father Chrysostomos

p5pRT commented 12 years ago

From @cpansprout

On Sat Nov 12 15:51:48 2011\, tom christiansen wrote:

So it is into the next *argument*\, not the next variable (or it would have found $count)\, which must be a scalar lvalue.

Fixed in commit e38523840.

Or something. This it gets wrong:
% perl -lE '$count = "like FF"; printf "%s aged %d.%n That was %s chars.\n", "\xDF", 16, substr($count,-2), $count'
ß aged 16. That was like FF chars.

And no\, it isn't order of evaluation:
% perl -lE '$count = "like FF"; printf "%s aged %d.%n", "\xDF", 16, substr($count,-2); printf " That was %s chars.\n", $count'
ß aged 16. That was like FF chars.

Fixed in commit 69974ce61.

That's just as wrong as:
% perl -lE '$count = "like FF"; printf "%s aged %d.%n That was %s chars.\n", "\xDF", 16, \$count, $count'
ß aged 16. That was like FF chars.
% perl -lE '$count = "like FF"; printf "%s aged %d.%n", "\xDF", 16, \$count; printf " That was %s chars.\n", $count'
ß aged 16. That was like FF chars.

But it's silent. What gives?
% perl -lE '$count = "like FF"; printf "%s aged %d.%n", "\xDF", 16, ${ \$count }; printf " That was %s chars.\n", $count'
ß aged 16. That was 10 chars.

In fact\, it doesn't squawk at all:
% perl -lE '$count = "like FF"; printf "%s aged %d.%n", "\xDF", 16, \@count; printf " That was %s chars.\n", $count'
ß aged 16. That was like FF chars.

I doubt there’s anything that can be done about that. It’s the same for sub{ $_[0]=43 }->(...).
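A minimal sketch of that comparison in core Perl, without %n: assigning through `$_[0]` writes into whatever scalar was actually passed, so a temporary (such as a freshly made reference) is overwritten silently and the caller's variable is left alone. The names `$store`, `$direct`, and `$via_ref` are illustrative and do not come from the ticket.

```perl
use strict;
use warnings;

# Illustrative stand-in for %n: assigns through the @_ alias.
my $store = sub { $_[0] = 43 };

my $count = "like FF";
$store->($count);            # $_[0] is an alias for $count itself
my $direct = $count;         # now 43

$count = "like FF";
$store->(\$count);           # $_[0] aliases a temporary reference; the
my $via_ref = $count;        # assignment lands there and is thrown away

print "direct: $direct\n";   # direct: 43
print "via ref: $via_ref\n"; # via ref: like FF
```

The second call is the silent case: nothing is read-only, so nothing complains, and the stored value simply disappears with the temporary.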

But that's nothing. You probably didn't notice\, but it's lying about "the number of characters output" so far.
1% perl -le 'printf "%s aged %d.%n That was %d chars.\n", "Naivete", 16, ($count) x 2'
Naivete aged 16. That was 16 chars.
2% perl -le 'printf "%s aged %d.%n That was %d chars.\n", "Na\xEFvet\xE9", 16, ($count) x 2'
Naïveté aged 16. That was 16 chars.
3% perl -le 'printf "%s aged %d.%n That was %d chars.\n", "Nai\x{308}vete\x{301}", 16, ($count) x 2'
Naïveté aged 16. That was 20 chars.

Fixed in commit 89139cf8b.
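The three results quoted above (16, 16, 20) are consistent with %n having reported the length of Perl's internal buffer rather than a character count: only the third string is held internally as UTF-8, where each combining mark takes two bytes. What follows is a rough sketch, in core Perl and without %n, of the three quantities one might mean by "number of characters". The helper name `counts` is made up for illustration, and its byte figure is the UTF-8 encoded length, which for the Latin-1 string in example 2 is not what the old %n saw.

```perl
use strict;
use warnings;

# Return (code points, UTF-8 bytes, grapheme clusters) for a string.
sub counts {
    my ($str) = @_;
    my $bytes = $str;
    utf8::encode($bytes);                 # encode a copy into UTF-8 bytes
    my $graphemes = () = $str =~ /\X/g;   # \X matches one extended grapheme cluster
    return (length $str, length $bytes, $graphemes);
}

for my $name ("Naivete", "Na\xEFvet\xE9", "Nai\x{308}vete\x{301}") {
    printf "%d code points, %d bytes, %d graphemes\n", counts($name);
}
```

For the decomposed spelling this gives 9 code points, 11 bytes, and 7 graphemes; adding the 9 characters of " aged 16." to the 11 bytes accounts for the 20 reported in example 3.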

Count\, damn it! What good is a computer that can't count?

It won’t count. :-)

So other than...\, what is this %n thing *supposed* to be for? I really have no idea\, considering that there are exactly zero examples of it in the documentation\, at least that I could locate.

One could use it for indenting debugging output.

--

Father Chrysostomos

p5pRT commented 12 years ago

@cpansprout - Status changed from 'open' to 'resolved'

p5pRT commented 12 years ago

From tchrist@perl.com

"Father Chrysostomos via RT" <perlbug-followup@perl.org> wrote on Sun\, 01 Jan 2012 00:21:40 PST:

So other than...\, what is this %n thing *supposed* to be for? I really have no idea\, considering that there are exactly zero examples of it in the documentation\, at least that I could locate.

One could use it for indenting debugging output.

Probably not. The number of codepoints so far emitted has nothing to do with indentation level. You have to use the columns() method from Unicode::GCString for that.
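To make the disagreement concrete, here is a hedged sketch of the indentation idea done without %n, measuring the prefix with `length`. The helper `indented` is hypothetical. It lines up correctly only when every code point in the prefix occupies exactly one column, which is the limitation raised above; a column-aware measure would have to replace `length` for wide or combining characters.

```perl
use strict;
use warnings;

# Hypothetical helper: print the first line after the prefix and indent
# the remaining lines so they start in the same column.
sub indented {
    my ($prefix, @lines) = @_;
    my $pad   = ' ' x length $prefix;   # counts code points, not display columns
    my $first = shift(@lines) // '';
    return join "\n", $prefix . $first, map { $pad . $_ } @lines;
}

print indented(sprintf("%s aged %d: ", "Naivete", 16), "line one", "line two"), "\n";
```

With a prefix such as "Nai\x{308}vete\x{301} aged 16: " the padding would come out two columns too wide, since the two combining marks are counted but take up no width of their own.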

--tom

p5pRT commented 12 years ago

From @demerphq

On 1 January 2012 16:31\, Tom Christiansen <tchrist@perl.com> wrote:

"Father Chrysostomos via RT" <perlbug-followup@perl.org> wrote on Sun\, 01 Jan 2012 00:21:40 PST:

So other than...\, what is this %n thing *supposed* to be for? I really have no idea\, considering that there are exactly zero examples of it in the documentation\, at least that I could locate.

One could use it for indenting debugging output.

Probably not. The number of codepoints so far emitted has nothing to do with indentation level. You have to use the columns() method from Unicode::GCString for that.

Please stop assuming that everyone has to deal with the hairier edge cases of unicode. I think it is quite safe to say that most of us don't work with such data\, and do not care that some Perl features don't deal with them well.

Just because a feature will not behave as you would like on the typical data sets you process does not mean that it is similarly useless to everyone else\, nor is it a good reason to remove it.

Thanks\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From tchrist@perl.com

Please stop assuming that everyone has to deal with the hairier edge cases of unicode. I think it is quite safe to say that most of us don't work with such data\, and do not care that some Perl features don't deal with them well.

That's a really outrageous statement. Even if it were true -- and it's not -- this still doesn't work on plain ASCII. I can see no use for it.

--tom

p5pRT commented 12 years ago

From @cpansprout

On Sun Jan 01 00:21:39 2012\, sprout wrote:

On Sat Nov 12 15:51:48 2011\, tom christiansen wrote:

So it is into the next *argument*\, not the next variable (or it would have found $count)\, which must be a scalar lvalue.

Fixed in commit e38523840.

Or something. This it gets wrong:
% perl -lE '$count = "like FF"; printf "%s aged %d.%n That was %s chars.\n", "\xDF", 16, substr($count,-2), $count'
ß aged 16. That was like FF chars.

And no\, it isn't order of evaluation:
% perl -lE '$count = "like FF"; printf "%s aged %d.%n", "\xDF", 16, substr($count,-2); printf " That was %s chars.\n", $count'
ß aged 16. That was like FF chars.

Fixed in commit 69974ce61.

And this is basically the same as bug #3471 (which you even responded to).

--

Father Chrysostomos
