Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.91k stars 542 forks source link

[PATCH] Apparent utf8 bug in join() in 5.8.[012] #7023

Closed p5pRT closed 20 years ago

p5pRT commented 20 years ago

Migrated from rt.perl.org#24846 (status was 'resolved')

Searchable as RT24846$

p5pRT commented 20 years ago

From @obra

Yeah\, I'm still not quite sure I believe it myself\, but IO​::Scalar exercises join with UTF8 and non-UTF8 data causing RT to end up with corrupted attachments fairly often. After patching IO​::Scalar to work around this by emulating join using concatenation\, the issue disappears.


From​: Nicholas Adrian Vinen \hb@​pandora\.x256\.com

  Hello\,   I am a consultant for a company which uses RT for their internal   support. They asked me to fix a problem they were having where attaching binary files to a ticket caused the file to become corrupt sometimes. They tracked it down to the case where the mod_perl session which serves the request to add the attachment to the ticket has previously been used to perform some ticket-related operation. I finally tracked down this problem to a bug in perl. Here is a detailed description of the problem​:

  When you attach a file to a ticket using RT it saves the file you   attach into a file into /tmp. It then adds a MIME​::Body​::File record to the MIME​::Entity which represents the ticket. Later\, it calls make_singlepart() on the MIME​::Entity\, which converts the entity into a string. During this process\, it calls as_string() on the MIME​::Body​::File. This causes the file to be read in and printed into a string using the IO​::Scalar object. IO​::Scalar's print() function calls the function join() on the data as it is read in\, before that data is appended onto the destination string.

  The problem occurs inside join(). join() recycles string objects   into which it does the joining\, which it later returns. It never touches the UTF8 flag on these strings. So\, on the initial run\, it has no strings to recycle (or few)\, and when they are created they are set to ASCII. So all the results of join() are ASCII\, which is what MIME and RT wants\, as ASCII is also what is used for processing binary data. The problem is\, on the second and subsequent executions of RT within the perl system\, the recycled strings often have the UTF8 flag set. So\, join (''\, $string)\, where $string is ASCII\, will often return a UTF8 string. When this UTF8 string is later converted into ASCII it is modified\, and so the binary data is corrupted.

  The solution is to apply the following patch to perl (tested with   perl 5.8.2)\, which sets the UTF8 flag on the returned string to something sensible.

Inline Patch ```diff diff -u perl-5.8.2/doop.c perl-5.8.2-patched/doop.c --- perl-5.8.2/doop.c 2003-09-30 10:09:51.000000000 -0700 +++ perl-5.8.2-patched/doop.c 2004-01-05 23:23:13.000000000 -0800 @@ -647,6 +647,9 @@ register STRLEN len; STRLEN delimlen; STRLEN tmplen; + int utf8; + + utf8 = (SvUTF8(del)!=0); (void) SvPV(del, delimlen); /* stringify and get the delimlen */ /* SvCUR assumes it's SvPOK() and woe betide you if it's not. */ @@ -674,22 +677,37 @@ SvTAINTED_off(sv); if (items-- > 0) { - if (*mark) + if (*mark) { + utf8 += (SvUTF8(*mark)!=0); sv_catsv(sv, *mark); + } mark++; } if (delimlen) { for (; items > 0; items--,mark++) { sv_catsv(sv,del); + utf8 += (SvUTF8(*mark)!=0); sv_catsv(sv,*mark); } } else { - for (; items > 0; items--,mark++) + for (; items > 0; items--,mark++) { + utf8 += (SvUTF8(*mark)!=0); sv_catsv(sv,*mark); + } } SvSETMAGIC(sv); + if( utf8 ) + { + if( utf8 != sp-oldmark+1 && ckWARN_d(WARN_UTF8) ) + { + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Joining UTF8 and ASCII strings"); + } + SvUTF8_on(sv); + } else { + SvUTF8_off(sv); + } } void There may be other perl functions with similar problems; this is beyond the scope of my job, however I hope that the maintainers of ```

perl will be proactive in attempting to find and fix any similar problems, as the way they have added UTF8 support to perl doesn't make it obvious when such bugs exist. I'd say that any built-in function that returns a string should be checked for (a) setting the UTF8 flag at all and (b) whether the value it sets it to is sensible. Also I think warnings when mixed types of strings are passed into functions are sensible as this can be dangerous\, and as we don't know what character set the ASCII strings are in\, the routines themselves can't really handle this case properly if any extended characters are present.

  I hope this helps.

  Nicholas

On Tue\, Jan 06\, 2004 at 01​:46​:22PM -0500\, Jesse Vincent wrote​:

Hey Nicholas\,

Thanks very much for the patch\. I've forwarded it on to some

perl 5 porters who pushed back a bit. They've asked for a clear statement of _exactly_ what's being recycled\, along with a simple testcase for reproducing the bug. I'd be happy to contribute myself\, but I'm really not a C person :/

  I wish I knew! I had a really hard time reading the perl code. Here is what I could tell​: somehow calling join() in perl causes Perl_do_join to be called with 3 main arguments. One is the delimiter\, on is the array of strings to be joined\, and one is a string into which to put the result. I don't know where exactly it gets the string which is the destination - the code calls a macro which I chased several levels deep. This is the function which does it\, as far as I can tell​:

PP(pp_join) {   dSP; dMARK; dTARGET;   MARK++;   do_join(TARG\, *MARK\, MARK\, SP);   SP = MARK;   SETs(TARG);   RETURN; }

  PP() looks like this​:

#define PP OP * Perl_##s(pTHX)

  pTHX looks like this​:

#define pTHX register struct perl_thread *thr PERL_UNUSED_DECL

  TARG is the string which is being joined into. dTARGET is a macro which looks like this​:

#define dTARGET SV * GETTARGET

  GETTARGET looks like this​:

#define GETTARGET targ = PAD_SV(PL_op->op_targ)

  PAD_SV looks like this​:

#define PAD_SV(po) (PL_curpad[po])

  PL_curpad looks like this​:

#define PL_curpad (*Perl_Tcurpad_ptr(aTHX))

  aTHX is defined like this​:

#define aTHX thr

  I think you can see why I wasn't very specific :( it's a mess... and here is where I lose the trail because I can't find Perl_Tcurpad_ptr defined anywere. However\, here is what I can tell. The target comes out of the some 'curpad' member of 'thr' which is the current perl thread context. I *think* curpad is like a stack and the returned value from join goes onto the end of the stack. I think it is these values which are being recycled. Certainly the target has to come from somewhere\, and it doesn't look like it is being allocated\, it looks to me like it's being taken from an array. op_targ seems to be which element of the array that the result should go into but I'm not sure what defines this. I'd have to dig a lot more to find out and I'm already quite lost.

  The simple test case is this​: install RT3 on a server with a single process with MaxRequestsPerChildren>1\, using mod_perl\, and attach a binary file to a ticket twice in a row :) That's the simplest test case I have. I've tried to reproduce it in small scripts but I can't\, and now that I've fixed it on the servers that I have access to\, I don't really want to break it again just to write a test case. I think\, however\, the fact that the join function never sets or unsets the utf8 flag on its target string means that it can't be operating 100% correctly. Anyway\, this is as much information as I can give you as I need to get on with the next project now.

  Nicholas

P.S. I tried to cross-post the email I sent onto the rt-devel and rt-users mailing lists but was rejected because I'm not subscribed to them. Perhaps you can do that for me?

-- http​://www.bestpractical.com/rt -- Trouble Ticketing. Free.

p5pRT commented 20 years ago

From nick.ing-simmons@elixent.com

Jesse Vincent \perl5\-porters@​perl\.org writes​:

# New Ticket Created by Jesse Vincent # Please include the string​: [perl #24846] # in the subject line of all future correspondence about this issue. # \<URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=24846 >

Yeah\, I'm still not quite sure I believe it myself\, but IO​::Scalar exercises join with UTF8 and non-UTF8 data causing RT to end up with corrupted attachments fairly often.

IO​::Scalar is (or should be) largely redundant in perl5.8.* as you can

open(my $fh\,"+\<"\,\$scalar);

p5pRT commented 20 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 20 years ago

From @timbunce

On Fri\, Jan 09\, 2004 at 02​:22​:03PM +0000\, Nick Ing-Simmons wrote​:

Jesse Vincent \perl5\-porters@&#8203;perl\.org writes​:

# New Ticket Created by Jesse Vincent # Please include the string​: [perl #24846] # in the subject line of all future correspondence about this issue. # \<URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=24846 >

Yeah\, I'm still not quite sure I believe it myself\, but IO​::Scalar exercises join with UTF8 and non-UTF8 data causing RT to end up with corrupted attachments fairly often.

IO​::Scalar is (or should be) largely redundant in perl5.8.* as you can

open(my $fh\,"+\<"\,\$scalar);

IO​::Scalar maybe largely redundant in perl5.8.*\, but join() isn't.

Tim.

p5pRT commented 20 years ago

From @eserte

Jesse Vincent (via RT) \perlbug\-followup@&#8203;perl\.org writes​:

# New Ticket Created by Jesse Vincent # Please include the string​: [perl #24846] # in the subject line of all future correspondence about this issue. # \<URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=24846 >

Yeah\, I'm still not quite sure I believe it myself\, but IO​::Scalar exercises join with UTF8 and non-UTF8 data causing RT to end up with corrupted attachments fairly often. After patching IO​::Scalar to work around this by emulating join using concatenation\, the issue disappears.

Here's a test case​:

use strict; use Encode qw(is_utf8); use Test​::More qw(no_plan); my $ascii = "abc\304"; my $utf8 = "abc\x{0100}"; for ($utf8\, $ascii) {   my $res = join(""\, $_);   is(is_utf8($res)\, $_ eq $utf8); } __END__

Regards\,   Slaven

-- Slaven Rezic - slaven@​rezic.de

  tksm - Perl/Tk program for searching and replacing in multiple files   http​://ptktools.sourceforge.net/#tksm

p5pRT commented 20 years ago

From BQW10602@nifty.com

  The problem occurs inside join\(\)\. join\(\) recycles string objects
  into which it does the joining\, which it later returns\. It never

touches the UTF8 flag on these strings. So\, on the initial run\, it has no strings to recycle (or few)\, and when they are created they are set to ASCII. So all the results of join() are ASCII\, which is what MIME and RT wants\, as ASCII is also what is used for processing binary data. The problem is\, on the second and subsequent executions of RT within the perl system\, the recycled strings often have the UTF8 flag set. So\, join (''\, $string)\, where $string is ASCII\, will often return a UTF8 string. When this UTF8 string is later converted into ASCII it is modified\, and so the binary data is corrupted.

  The solution is to apply the following patch to perl \(tested with
  perl 5\.8\.2\)\, which sets the UTF8 flag on the returned string to

something sensible.

This is parhaps due to SvPOK_only_UTF8() in sv_setpv() which leaves UTF8 flag as it was.

I disagree warning when UTF8 and ASCII are mixed. I think it would upset encoding.pm which allows byte strings as in arbitrary encoding other than the system-native encoding (ASCII/Latin1 or EBCDIC).

### \A patch against perl-5.8.3 RC1

Inline Patch ```diff diff -urN perl~/doop.c perl/doop.c --- perl~/doop.c Fri Dec 19 05:47:58 2003 +++ perl/doop.c Mon Jan 12 10:08:10 2004 @@ -668,6 +668,10 @@ } sv_setpv(sv, ""); + /* sv_setpv retains old UTF8ness [perl #24846] */ + if (SvUTF8(sv)) + SvUTF8_off(sv); + if (PL_tainting && SvMAGICAL(sv)) SvTAINTED_off(sv); diff -urN perl~/t/op/join.t perl/t/op/join.t --- perl~/t/op/join.t Sat Dec 30 16:16:18 2000 +++ perl/t/op/join.t Mon Jan 12 10:34:22 2004 @@ -1,6 +1,6 @@ #!./perl -print "1..14\n"; +print "1..18\n"; @x = (1, 2, 3); if (join(':',@x) eq '1:2:3') {print "ok 1\n";} else {print "not ok 1\n";} @@ -65,3 +65,29 @@ print "ok 14\n"; } +{ # [perl #24846] $jb2 should be in bytes, not in utf8. + my $b = "abc\304"; + my $u = "abc\x{0100}"; + + sub join_into_my_variable { + my $r = join("", @_); + return $r; + } + + my $jb1 = join_into_my_variable("", $b); + my $ju1 = join_into_my_variable("", $u); + my $jb2 = join_into_my_variable("", $b); + my $ju2 = join_into_my_variable("", $u); + + print "not " unless unpack('H*', $jb1) eq unpack('H*', $b); + print "ok 15\n"; + + print "not " unless unpack('H*', $ju1) eq unpack('H*', $u); + print "ok 16\n"; + + print "not " unless unpack('H*', $jb2) eq unpack('H*', $b); + print "ok 17\n"; + + print "not " unless unpack('H*', $ju2) eq unpack('H*', $u); + print "ok 18\n"; +} ```

\z patch

Regards SADAHIRO Tomoyuki

p5pRT commented 20 years ago

From @rgs

SADAHIRO Tomoyuki wrote​:

I disagree warning when UTF8 and ASCII are mixed.

So do I.

I think it would upset encoding.pm which allows byte strings as in arbitrary encoding other than the system-native encoding (ASCII/Latin1 or EBCDIC).

### \A patch against perl-5.8.3 RC1 diff -urN perl~/doop.c perl/doop.c --- perl~/doop.c Fri Dec 19 05​:47​:58 2003 +++ perl/doop.c Mon Jan 12 10​:08​:10 2004

Thanks\, applied to bleadperl as #22117.

p5pRT commented 20 years ago

@rgs - Status changed from 'open' to 'resolved'