On Mon, 28 Nov 2016 12:34:02 GMT, rafal@zorro.ztk-rp.eu wrote:
This is a bug report for perl from rafal@zorro.ztk-rp.eu, generated with the help of perlbug 1.40 running under perl 5.20.2.

After upgrading from debian-wheezy to debian-jessie, HTML::Mason started to behave strangely with respect to UTF8 encoding. Earlier, both web pages and forms worked correctly (in UTF8) without any special setup. As of jessie with Apache 2.4, UTF8 no longer works:

1. I had to add binmode(STDOUT,'UTF8') to modules.
2. I had to decode_utf8($_) data from forms before passing them over to psql-db.

I am filing this report with example code showing erratic behavior of Text::CSV::Encoded, since I could narrow the problem down to just a few lines of test case.
========================
#!/usr/bin/perl
use Text::CSV::Encoded;

open(my $FH, shift) or die "open";
binmode($FH, ":encoding(cp1250) :raw :bytes");
local $/ = "\r\n";
my $csv = Text::CSV::Encoded->new ( {
    encoding_in => "cp1250",
    binary      => 1,
    eol         => $/,
    sep_char    => ';',
} ) or die "Cannot use CSV: ".Text::CSV->error_diag ();
$\ = "\n";
while ( <$FH> ) {
    s/\s+$//;
    print;
    if ($csv->parse( $_ )) {
        print $csv->fields();
    }
}
__END__
10;"SPӣDZIELNIA WARSZAWA";62;"TEST"
In this example:

1. the test file (provided "inline") as \ contains two specific characters from CODE-PAGE-1250, one immediately after the other.
1a. this test file IS-NOT UTF8 encoded.
2. the input stream is correctly marked as CP1250.
3. the module gets correct information about the file encoding.

... and yet the module complains about encountering a "wide-char", which in the above setup should never be possible (I think).
The result of the above program is:
$ ./wide-char test-input
10;"SPӣDZIELNIA WARSZAWA";62;"TEST"
Wide character in subroutine entry at /usr/share/perl5/Text/CSV/Encoded/Coder/Encode.pm line 37, <$FH> chunk 1.
$
This result is incorrect, since the file does not contain any "wide chars".
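One way to check point 2 of the report is to ask Perl which PerlIO layers are actually on the handle after the `binmode` call. The sketch below is not part of the original report. It uses an in-memory handle over a hypothetical sample string (`$sample_bytes`) and a hypothetical helper (`layers_after_binmode`), and it relies on the built-in `PerlIO::get_layers`. The PerlIO documentation describes `:raw` as removing or disabling layers that would affect the binary nature of the stream, so a `:raw` placed after `:encoding(cp1250)` may undo the encoding layer; the printed lists show what a given perl really does.

```perl
use strict;
use warnings;

# Hypothetical sample: the 0xD3 0xA3 byte pair from the report, as raw CP1250 bytes.
my $sample_bytes = "SP\xD3\xA3DZIELNIA\n";

# Open an in-memory handle, apply the given layer spec with binmode,
# and return the names of the layers that end up on the handle.
sub layers_after_binmode {
    my ($spec) = @_;
    open(my $fh, '<', \$sample_bytes) or die "open: $!";
    binmode($fh, $spec) or warn "binmode failed for '$spec'\n";
    return [ PerlIO::get_layers($fh) ];
}

# Compare the layer spec from the reduction with the one from the report.
for my $spec (':encoding(cp1250)', ':encoding(cp1250) :raw :bytes') {
    print "$spec => @{ layers_after_binmode($spec) }\n";
}
```

If the second list has no `encoding(...)` entry, the handle is delivering undecoded bytes, and any decoding is left entirely to Text::CSV::Encoded.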
It appears that the file does indeed contain characters which satisfy the condition required for the "Wide characters" warning. Here's what pod/perldiag.pod in perl-5.24.0 says:
#####
=item Wide character in %s

(S utf8) Perl met a wide character (>255) when it wasn't expecting one. This warning is by default on for I/O (like print). The easiest way to quiet this warning is simply to add the C<:utf8> layer to the output, e.g. C<binmode STDOUT, ':utf8'>. Another way to turn off the warning is to add C<no warnings 'utf8';> but that is often closer to cheating. In general, you are supposed to explicitly mark the filehandle with an encoding, see L\ and L<perlfunc/binmode>.
#####

If I put your test data into a file and run it through 'od -c', I observe two characters in the >255 range.
#####
$ od -c warsaw.txt
0000000   1   0   ;   "   S   P 323 243   D   Z   I   E   L   N   I   A
0000020  \n   W   A   R   S   Z   A   W   A   "   ;   6   2   ;   "   T
0000040   E   S   T   "  \n
0000045
#####
Text::CSV::Encoded is not part of the Perl 5 core distribution, so I think including it in the test script muddies the waters. Here's a pure Perl reduction:
#####
$ cat 2-130199-text-csv-encoded.pl
# perl
use strict;
use warnings;

open(my $FH, '<', 'warsaw.txt') or die "open";
binmode($FH, ":encoding(cp1250)");
while ( <$FH> ) {
    s/\s+$//;
    print "$_\n";
}
close $FH or die "close";
#####
$ perl 2-130199-text-csv-encoded.pl
Wide character in print at 2-130199-text-csv-encoded.pl line 9, <$FH> line 1.
10;"SPÓŁDZIELNIA WARSZAWA";62;"TEST"
#####
I think that warning is appropriate. However, I concede that I don't have much experience with 'cp1250', so I'm unclear what the expected behavior is. Other people on the list should comment.
Thank you very much.
The RT System itself - Status changed from 'new' to 'open'
On Mon, 28 Nov 2016 23:03:51 GMT, jkeenan wrote:
[snip]
On #p5p, khw has pointed out an error in my analysis: 'od -c' prints octal, so these characters are below \0377, which is equivalent to 255.
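The octal arithmetic can be checked with a short sketch. It is an illustration added here, not code from the thread, and the helper name `cp1250_codepoint` is hypothetical. It converts the two values printed by 'od -c' to decimal and then decodes each byte from CP1250 with the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a single CP1250 byte, given as an octal string, and return
# the Unicode code point of the resulting character.
sub cp1250_codepoint {
    my ($octal) = @_;
    my $byte = oct($octal);                      # octal string -> number
    return ord(decode('cp1250', chr($byte)));    # CP1250 byte -> code point
}

# The two values reported by 'od -c' for the non-ASCII bytes.
for my $octal ('323', '243') {
    printf "octal %s = decimal %d = 0x%02X, decodes to U+%04X\n",
        $octal, oct($octal), oct($octal), cp1250_codepoint($octal);
}
```

Both bytes are below 255 in the file (211 and 163), but the second one decodes to U+0141, which is above 255. That may be why a "Wide character" warning can still appear once the input has passed through an `:encoding(cp1250)` layer and is then printed to a handle that has no encoding layer.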
Also, in my test program I should have applied binmode to STDOUT as well.
#####
# perl
use strict;
use warnings;

open(my $FH, '<', 'warsaw.txt') or die "open";
binmode($FH, ":encoding(cp1250)");
binmode(STDOUT, ":encoding(cp1250)");
while ( <$FH> ) {
    s/\s+$//;
    print "$_\n";
}
close $FH or die "close";
#####
$ perl 2-130199-text-csv-encoded.pl
10;"SPӣDZIELNIA WARSZAWA";62;"TEST"
#####
And once I 'binmode' STDOUT, the "Wide character" warning goes away. So, notwithstanding my errors, I still think this is not a bug -- at least not in perl-5.24.0.
Thank you very much.
-- James E Keenan (jkeenan@cpan.org)
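The effect of an output layer can be reproduced without touching STDOUT or the file system. The following is a minimal sketch under stated assumptions, not code from the ticket: the sample line is rebuilt from the bytes shown by 'od -c', the helper `copy_with_output_mode` is hypothetical, and UTF-8 is used as the output encoding on the assumption that the terminal runs a UTF-8 locale. It copies the CP1250 input to an in-memory handle twice, once with no output layer and once with `:encoding(UTF-8)`, and collects any warnings.

```perl
use strict;
use warnings;

# Hypothetical sample line with the 0xD3 0xA3 bytes from the 'od -c' dump.
my $cp1250_bytes = qq{10;"SP\xD3\xA3DZIELNIA WARSZAWA";62;"TEST"\n};

# Read the sample through :encoding(cp1250) and print it to an in-memory
# handle opened with $out_mode. Returns the bytes written and the warnings seen.
sub copy_with_output_mode {
    my ($out_mode) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    my $source = $cp1250_bytes;
    open(my $in, '<:encoding(cp1250)', \$source) or die "open in: $!";
    my $written = '';
    open(my $out, $out_mode, \$written) or die "open out: $!";
    print {$out} $_ while <$in>;
    close $out or die "close out: $!";
    close $in  or die "close in: $!";
    return ($written, \@warnings);
}

# Compare an output handle without a layer against one with an encoding layer.
for my $mode ('>', '>:encoding(UTF-8)') {
    my ($bytes, $warnings) = copy_with_output_mode($mode);
    printf "%-20s %d warning(s)\n", $mode, scalar @$warnings;
    print "    $_" for @$warnings;
}
```

Whether the output layer should be cp1250 or UTF-8 depends on what the consumer of the output expects; the point is only that the output handle needs some explicit encoding once the input has been decoded.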
On Mon, 28 Nov 2016 04:34:02 -0800, rafal@zorro.ztk-rp.eu wrote:
[snip]
As it seems to make a difference if the CSV file has DOS or UNIX newlines --- can you attach the sample file? (In any case, either with DOS or UNIX newlines I don't see different behavior between Debian's perl in wheezy and jessie.)
On 28/11/2016 at 13:34, (via RT) wrote:
LANG=pl_PL.utf8 LANGUAGE=en_US:en
Maybe a wild shot, but isn't that combination asking for trouble? FWIW, see http://stackoverflow.com/a/2510548
On Tue, 29 Nov 2016 08:24:13 GMT, slaven@rezic.de wrote:
On Mon, 28 Nov 2016 04:34:02 -0800, rafal@zorro.ztk-rp.eu wrote:
[snip] As it seems to make a difference if the CSV file has DOS or UNIX newlines --- can you attach the sample file? (In any case, either with DOS or UNIX newlines I don't see different behavior between Debian's perl in wheezy and jessie.)
Rafal, can you please provide the sample file as an email attachment? We will need this for further diagnosis.
Thank you very much.
-- James E Keenan (jkeenan@cpan.org)
On Fri, 02 Dec 2016 21:55:42 GMT, jkeenan wrote:
[snip]
If there's no response from the original poster within a week, I will close this ticket.
Thank you very much.
-- James E Keenan (jkeenan@cpan.org)
On Sun, 25 Dec 2016 02:12:24 GMT, jkeenan wrote:
[snip]
Closing as per schedule. Thank you very much.
-- James E Keenan (jkeenan@cpan.org)
@jkeenan - Status changed from 'open' to 'rejected'
Migrated from rt.perl.org#130199 (status was 'rejected')
Searchable as RT130199$