Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

Text::CSV::Encoded is incorrectly forced to parse widechar #15739

Closed p5pRT closed 7 years ago

p5pRT commented 7 years ago

Migrated from rt.perl.org#130199 (status was 'rejected')

Searchable as RT130199

p5pRT commented 7 years ago

From rafal@zorro.ztk-rp.eu

Created by rafal@zorro.ztk-rp.eu

After upgrading from debian-wheezy to debian-jessie, HTML::Mason started to behave strangely with respect to UTF-8 encoding. Earlier, both web pages and forms were working correctly (in UTF-8) without any special setup. As of jessie with Apache 2.4, UTF-8 no longer works:

1. I had to add binmode(STDOUT,'UTF8') to modules.
2. I had to decode_utf8($_) data from forms before passing them over to psql-db.

I am filing this report with example code showing the erratic behavior of Text::CSV::Encoded, since I could narrow the problem down to just a few lines of test case.

========================

    #!/usr/bin/perl
    use Text::CSV::Encoded;
    open(my $FH, shift) or die "open";
    binmode($FH, ":encoding(cp1250) :raw :bytes");
    local $/ = "\r\n";
    my $csv = Text::CSV::Encoded->new ( { encoding_in => "cp1250",
      binary => 1, eol => $/, sep_char => ';',
      } ) or die "Cannot use CSV: ".Text::CSV->error_diag ();
    $\ = "\n";
    while ( <$FH> ) {
      s/\s+$//;
      print;
      if ($csv->parse( $_ )) {
        print $csv->fields();
      }
    }
    __END__
    10;"SPӣDZIELNIA WARSZAWA";62;"TEST"

In this example:

1. the test file (provided "inline" above) contains two specific characters from code page 1250, one immediately after the other.
   1a. this test file IS NOT UTF-8 encoded.
2. the input stream is correctly marked as CP1250.
3. the module gets correct information about the file's encoding.

... and yet, the module complains about encountering a "wide-char", which in the above setup should never be possible (I think).

The result of the above program is​:

    $ ./wide-char test-input
    10;"SPӣDZIELNIA WARSZAWA";62;"TEST"
    Wide character in subroutine entry at /usr/share/perl5/Text/CSV/Encoded/Coder/Encode.pm line 37, <$FH> chunk 1.
    $

This result is incorrect, since the file does not contain any "wide chars".

Perl Info ``` Flags: category=core severity=high Site configuration information for perl 5.20.2: Configured by Debian Project at Fri Jul 22 15:47:27 UTC 2016. Summary of my perl5 (revision 5 version 20 subversion 2) configuration: Platform: osname=linux, osvers=3.16.0-4-amd64, archname=x86_64-linux-gnu-thread-multi uname='linux himalia 3.16.0-4-amd64 #1 smp debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 gnulinux ' config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Dldflags= -Wl,-z,relro -Dlddlflags=-shared -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.20 -Darchlib=/usr/lib/x86_64-linux-gnu/perl/5.20 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/x86_64-linux-gnu/perl5/5.20 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.20.2 -Dsitearch=/usr/local/lib/x86_64-linux-gnu/perl/5.20.2 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dusesitecustomize -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.20.2 -des' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define use64bitint=define, use64bitall=define, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2 -g', cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include' ccversion='', gccversion='4.9.2', gccosandvers='' intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, 
longdblsize=16 ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='cc', ldflags =' -fstack-protector -L/usr/local/lib' libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.9/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt perllibs=-ldl -lm -lpthread -lc -lcrypt libc=libc-2.19.so, so=so, useshrplib=true, libperl=libperl.so.5.20 gnulibc_version='2.19' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector' Locally applied patches: DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN. DEBPKG:debian/db_file_ver - http://bugs.debian.org/340047 Remove overly restrictive DB_File version check. DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information. DEBPKG:debian/enc2xs_inc - http://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @INC directories. DEBPKG:debian/errno_ver - http://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes. DEBPKG:debian/libperl_embed_doc - http://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking DEBPKG:fixes/respect_umask - Respect umask during installation DEBPKG:debian/writable_site_dirs - Set umask approproately for site install directories DEBPKG:debian/extutils_set_libperl_path - EU:MM: set location of libperl.a under /usr/lib DEBPKG:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor DEBPKG:debian/prefix_changes - Fiddle with *PREFIX and variables written to the makefile DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets. 
DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor. DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy. DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable. DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian DEBPKG:debian/module_build_man_extensions - http://bugs.debian.org/479460 Adjust Module::Build manual page extensions for the Debian Perl policy DEBPKG:debian/prune_libs - http://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need. DEBPKG:fixes/net_smtp_docs - [rt.cpan.org #36038] http://bugs.debian.org/100195 Document the Net::SMTP 'Port' option DEBPKG:debian/perlivp - http://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local DEBPKG:debian/deprecate-with-apt - http://bugs.debian.org/747628 Point users to Debian packages of deprecated core modules DEBPKG:debian/squelch-locale-warnings - http://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts DEBPKG:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository DEBPKG:debian/patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.20.2-3+deb8u6 in patchlevel.h DEBPKG:debian/skip-kfreebsd-crash - http://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD DEBPKG:fixes/document_makemaker_ccflags - http://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags} DEBPKG:debian/find_html2text - http://bugs.debian.org/640479 Configure CPAN::Distribution with correct name of html2text DEBPKG:debian/perl5db-x-terminal-emulator.patch - http://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl DEBPKG:debian/cpan-missing-site-dirs - http://bugs.debian.org/688842 Fix CPAN::FirstTime defaults with nonexisting site dirs if a parent is writable 
DEBPKG:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http://bugs.debian.org/587650 Memoize::Storable: respect 'nstore' option not respected DEBPKG:debian/regen-skip - Skip a regeneration check in unrelated git repositories DEBPKG:fixes/regcomp-mips-optim - [perl #122817] http://bugs.debian.org/754054 Downgrade the optimization of regcomp.c on mips and mipsel due to a gcc-4.9 bug DEBPKG:debian/makemaker-pasthru - http://bugs.debian.org/758471 Pass LD settings through to subdirectories DEBPKG:fixes/perldoc-less-R - [rt.cpan.org #98636] http://bugs.debian.org/758689 Tell the 'less' pager to allow terminal escape sequences DEBPKG:fixes/pod_man_reproducible_date - http://bugs.debian.org/759405 Support POD_MAN_DATE in Pod::Man for the left-hand footer DEBPKG:fixes/io_uncompress_gunzip_inmemory - http://bugs.debian.org/747363 [rt.cpan.org #95494] Fix gunzip to in-memory file handle DEBPKG:fixes/socket_test_recv_fix - http://bugs.debian.org/758718 [perl #122657] Compare recv return value to peername in socket test DEBPKG:fixes/hurd_socket_recv_todo - http://bugs.debian.org/758718 [perl #122657] TODO checking the result of recv() on hurd DEBPKG:fixes/regexp-performance - [0fa70a0] http://bugs.debian.org/777556 [perl #123743] simpify and speed up /.*.../ handling DEBPKG:fixes/failed_require_diagnostics - http://bugs.debian.org/781120 [perl #123270] Report inaccesible file on failed require DEBPKG:fixes/array-cloning - http://bugs.debian.org/779357 [perl #124127] [902d169] fix cloning arrays with unused elements DEBPKG:fixes/perldb-threads - http://bugs.debian.org/779357 [perl #124127] [41ef2c6] lib/perl5db.pl: Restore noop lock prototype DEBPKG:fixes/CVE-2015-8607_file_spec_taint_fix - ensure File::Spec::canonpath() preserves taint DEBPKG:fixes/encode-unicode-bom - http://bugs.debian.org/798727 [rt.cpan.org #107043] Address https://rt.cpan.org/Public/Bug/Display.html?id=107043 DEBPKG:debian/encode-unicode-bom-doc - http://bugs.debian.org/798727 Document Debian 
backport of Encode::Unicode fix DEBPKG:debian/kfreebsd-softupdates - http://bugs.debian.org/796798 Work around Debian Bug#796798 DEBPKG:fixes/CVE-2016-2381_duplicate_env - remove duplicate environment variables from environ DEBPKG:debian/debugperl-compat-fix - [perl #127212] http://bugs.debian.org/810326 Disable PERL_TRACK_MEMPOOL for debugging builds DEBPKG:fixes/CVE-2015-8853_regexp_hang - http://bugs.debian.org/821848 [perl #123562] PATCH [perl #123562] Regexp-matching "hangs" DEBPKG:fixes/utf8_regexp_crash - http://bugs.debian.org/820328 [perl #124109] save_re_context(): do "local $n" with no PL_curpm DEBPKG:fixes/regcomp_whitespace_fix - http://bugs.debian.org/820328 [perl #124109] Perl_save_re_context(): re-indent after last commit DEBPKG:fixes/5.20.3/eval_label_crash - http://bugs.debian.org/822336 [perl #123652] eval {label:} crash DEBPKG:fixes/5.20.3/preserve_record_separator - http://bugs.debian.org/822336 [perl #123218] "preserve" $/ if set to a bad value DEBPKG:fixes/5.20.3/test_count_base_rs - http://bugs.debian.org/822336 Fix test count in t/base/rs.t DEBPKG:fixes/5.20.3/remove_get_magic - http://bugs.debian.org/822336 [perl #123739] Remove get-magic from $/ DEBPKG:fixes/5.20.3/speed_up_scalar_g - http://bugs.debian.org/822336 [perl #123202] speed up scalar //g against tainted strings DEBPKG:fixes/5.20.3/accidental_all_features - http://bugs.debian.org/822336 Stop $^H |= 0x1c020000 from enabling all features DEBPKG:fixes/5.20.3/multidimensional_arrays_utf8 - http://bugs.debian.org/822336 [perl #124113] Make check for multi-dimensional arrays be UTF8-aware DEBPKG:fixes/5.20.3/unquoted_utf8_heredoc_terminators - http://bugs.debian.org/822336 Allow unquoted UTF-8 HERE-document terminators DEBPKG:fixes/5.20.3/parentheses_ambiguous_warning_utf8_functions - http://bugs.debian.org/822336 Fix "...without parentheses is ambuguous" warning for UTF-8 function names DEBPKG:fixes/5.20.3/leak_namepv_copy - http://bugs.debian.org/822336 [perl #123786] don't leak the 
temp utf8 copy of namepv DEBPKG:fixes/5.20.3/h2ph_hex_constants - http://bugs.debian.org/822336 h2ph: correct handling of hex constants for the preamble DEBPKG:fixes/5.20.3/leftbracket_XTERMORDORDOR - http://bugs.debian.org/822336 [perl #123711] Fix crash with 0-5x-l{0} DEBPKG:fixes/5.20.3/fatalize_warnings_unwinding - http://bugs.debian.org/822336 [perl #123398] don't fatalize warnings during unwinding (#123398) DEBPKG:fixes/5.20.3/setpgrp - http://bugs.debian.org/822336 =?UTF-8?q?Don=E2=80=99t=20treat=20setpgrp($nonzero)=20as=20setpgr?= =?UTF-8?q?p(1)?= DEBPKG:fixes/5.20.3/death_unwinding_crash - http://bugs.debian.org/822336 [perl #124156] RT #124156: death during unwinding causes crash DEBPKG:fixes/5.20.3/stashpvn_crash - http://bugs.debian.org/822336 [perl #125541] Fix crash with %::=(); J->${\"::"} DEBPKG:fixes/5.20.3/possessive_quantifier - http://bugs.debian.org/822336 [perl #125825] PATCH: [perl 125825] {n}+ possessive quantifier broken DEBPKG:fixes/5.20.3/quoted_code_crash - http://bugs.debian.org/822336 [perl #123712] Fix /$a[/ parsing DEBPKG:fixes/5.20.3/checking_sub_inwhat - http://bugs.debian.org/822336 [perl #123712] Don't check sub_inwhat DEBPKG:fixes/5.20.3/yylex_loop - http://bugs.debian.org/822336 Fix hang with "@{" DEBPKG:fixes/5.20.3/docs/op - http://bugs.debian.org/822336 Fix apidocs for OP_TYPE_IS(_OR_WAS) - arguments separated by |, not ,. DEBPKG:fixes/5.20.3/docs/encoding - http://bugs.debian.org/822336 perlpodspec: Corrections/adds to detecting =encoding DEBPKG:fixes/5.20.3/docs/SvPV_set - http://bugs.debian.org/822336 improve SvPV_set's docs, it really shouldn't be public API DEBPKG:fixes/5.20.3/docs/autodie - http://bugs.debian.org/822336 Fix warning message regarding "use autodie" and "use open". DEBPKG:fixes/5.20.3/docs/autodie_2_26 - http://bugs.debian.org/822336 perlunicook: Note that autodie >= 2.26 should be okay with "use open". 
DEBPKG:fixes/5.20.3/docs/setenv - http://bugs.debian.org/822336 Fix setenv() replacement documentation in perlclib DEBPKG:fixes/5.20.3/docs/clib_caution - http://bugs.debian.org/822336 perlhacktips: Add caution about clib ptr returns to static memory DEBPKG:fixes/5.20.3/docs/perlunicook_typos - http://bugs.debian.org/822336 Fix minor code typos in perlunicook DEBPKG:fixes/5.20.3/docs/ook_example - http://bugs.debian.org/822336 [perl #122322] Update OOK example in perlguts DEBPKG:fixes/5.20.3/docs/study_noop - http://bugs.debian.org/822336 perlfunc: mention that study() is currently a noop DEBPKG:fixes/CVE-2016-1238/remove-dot-when-loading - [perl #127834] (perl #127834) remove . from the end of @INC if complex modules are loaded DEBPKG:fixes/CVE-2016-1238/remove-dot-in-padwalker - [perl #127834] perl5db.pl: ensure PadWalker is loaded from standard paths DEBPKG:fixes/CVE-2016-1238/remove-dot-in-dist - [perl #127834] dist/: remove . from @INC when loading optional modules DEBPKG:fixes/CVE-2016-1238/remove-dot-in-cpan - [perl #127834] cpan/: remove . from @INC when loading optional modules DEBPKG:fixes/CVE-2016-1238/customized-encode - Update customized.dat for cpan/Encode/Encode.pm DEBPKG:debian/CVE-2016-1238/test-suite-without-dot - [perl #127810] Patch unit tests to explicitly insert "." into @INC when needed. DEBPKG:debian/CVE-2016-1238/eumm-without-dot - [perl #127810] Add PERL_USE_UNSAFE_INC support to EU::MM for fortify_inc support. 
DEBPKG:debian/CVE-2016-1238/cpan-without-dot - [perl #127810] Set PERL_USE_UNSAFE_INC for cpan usage DEBPKG:debian/CVE-2016-1238/mb-without-dot - Make Module::Build set PERL_USE_UNSAFE_INC DEBPKG:debian/CVE-2016-1238/sitecustomize-in-etc - Look for sitecustomize.pl in /etc/perl rather than sitelib on Debian systems DEBPKG:fixes/xsloader-eval - [rt.cpan.org #115808] http://bugs.debian.org/829578 =?UTF-8?q?Don=E2=80=99t=20let=20XSLoader=20load=20relative=20path?= =?UTF-8?q?s?= @INC for perl 5.20.2: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.20.2 /usr/local/share/perl/5.20.2 /usr/lib/x86_64-linux-gnu/perl5/5.20 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.20 /usr/share/perl/5.20 /usr/local/lib/site_perl Environment for perl 5.20.2: HOME=/home/rafal LANG=pl_PL.utf8 LANGUAGE=en_US:en LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/home/rafal/bin PERL_BADLANG (unset) SHELL=/bin/bash ```
p5pRT commented 7 years ago

From @jkeenan

On Mon, 28 Nov 2016 12:34:02 GMT, rafal@zorro.ztk-rp.eu wrote:

This is a bug report for perl from rafal@zorro.ztk-rp.eu, generated with the help of perlbug 1.40 running under perl 5.20.2.

[snip]

This result is incorrect, since the file does not contain any "wide chars".

It appears that the file does indeed contain characters which satisfy the condition required for the "Wide character" warning. Here's what pod/perldiag.pod in perl-5.24.0 says:

#####

    =item Wide character in %s

    (S utf8) Perl met a wide character (>255) when it wasn't expecting one. This warning is by default on for I/O (like print). The easiest way to quiet this warning is simply to add the C<:utf8> layer to the output, e.g. C<binmode STDOUT, ':utf8'>. Another way to turn off the warning is to add C<no warnings 'utf8';> but that is often closer to cheating. In general, you are supposed to explicitly mark the filehandle with an encoding, see L<open> and L<perlfunc/binmode>.

#####
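To see the diagnostic in isolation, here is a minimal sketch that is not taken from the thread. It prints one character with ordinal above 255 to an in-memory filehandle, first without any layer and then with an `:encoding(cp1250)` layer, and collects the warnings each time. The helper name `print_and_collect` is hypothetical, and the in-memory handle is only an assumption made to keep the example self-contained.

```perl
use strict;
use warnings;

# Print $text to an in-memory filehandle opened with the given mode
# (e.g. '>' or '>:encoding(cp1250)') and collect any warnings raised.
# The helper name is hypothetical; it exists only for this illustration.
sub print_and_collect {
    my ($mode, $text) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $buffer = '';
    open(my $out, $mode, \$buffer) or die "open: $!";
    print {$out} $text;
    close $out or die "close: $!";
    return ($buffer, @warnings);
}

# LATIN CAPITAL LETTER L WITH STROKE: ordinal 321, which is above 255.
my $wide = "\x{141}";

# No layer on the handle: perl has to guess how to serialize the character.
my ($raw, @w1) = print_and_collect('>', $wide);
print "no layer: ", scalar(@w1), " warning(s)\n";
print "  $_" for @w1;

# With an explicit encoding layer the character maps to a single cp1250 byte.
my ($enc, @w2) = print_and_collect('>:encoding(cp1250)', $wide);
printf "cp1250 layer: %d warning(s), wrote byte 0x%02X\n", scalar(@w2), ord $enc;
```

The design point is the one perldiag makes: the warning depends on the layers of the output handle, not on how the data was read.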

If I put your test data into a file and run it through 'od -c', I observe two characters in the >255 range.

#####

    $ od -c warsaw.txt
    0000000   1   0   ;   "   S   P 323 243   D   Z   I   E   L   N   I   A
    0000020  \n   W   A   R   S   Z   A   W   A   "   ;   6   2   ;   "   T
    0000040   E   S   T   "  \n
    0000045

#####

Text::CSV::Encoded is not part of the Perl 5 core distribution, so I think including it in the test script muddies the waters. Here's a pure Perl reduction:

#####

    $ cat 2-130199-text-csv-encoded.pl
    # perl
    use strict;
    use warnings;

    open(my $FH, '<', 'warsaw.txt') or die "open";
    binmode($FH, ":encoding(cp1250)");
    while ( <$FH> ) {
      s/\s+$//;
      print "$_\n";
    }
    close $FH or die "close";

#####

    $ perl 2-130199-text-csv-encoded.pl
    Wide character in print at 2-130199-text-csv-encoded.pl line 9, <$FH> line 1.
    10;"SPÓŁDZIELNIA WARSZAWA";62;"TEST"

#####

I think that warning is appropriate. However, I concede that I don't have much experience with 'cp1250', so I'm unclear what the expected behavior is. Other people on the list should comment.

Thank you very much.

p5pRT commented 7 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 7 years ago

From @jkeenan

On Mon, 28 Nov 2016 23:03:51 GMT, jkeenan wrote:

[snip]

On #p5p khw has pointed out an error in my analysis. 'od -c' prints octal, so these two bytes (octal 323 and 243, i.e. decimal 211 and 163) are below \0377, which is equivalent to 255.
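The octal arithmetic can be checked with a short sketch (not part of the original message). It assumes only the two values `323` and `243` read off the 'od -c' dump and uses Perl's built-in `oct`.

```perl
use strict;
use warnings;

# The two non-ASCII entries from the 'od -c' dump, as octal strings.
my @octal = ('323', '243');

for my $o (@octal) {
    my $n = oct($o);    # oct() interprets its argument as an octal number
    printf "octal %s = decimal %d = hex 0x%02X (%s)\n",
        $o, $n, $n, $n > 255 ? 'above 255' : 'fits in one byte';
}
```

Both values fit in a single byte, so the raw file contains no wide characters before it is decoded.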

Also, in my test program I should have applied binmode to STDOUT as well.

#####

    # perl
    use strict;
    use warnings;

    open(my $FH, '<', 'warsaw.txt') or die "open";
    binmode($FH, ":encoding(cp1250)");
    binmode(STDOUT, ":encoding(cp1250)");
    while ( <$FH> ) {
      s/\s+$//;
      print "$_\n";
    }
    close $FH or die "close";

#####

    $ perl 2-130199-text-csv-encoded.pl
    10;"SPӣDZIELNIA WARSZAWA";62;"TEST"

#####

And once I 'binmode' STDOUT, the "Wide character" warning goes away. So, notwithstanding my errors, I still think this is not a bug -- at least not in perl-5.24.0.
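One way to see why a file made only of single-byte cp1250 characters can still trigger the warning is to decode the two bytes by hand. The sketch below is an illustration rather than anything posted in the thread; it assumes the core `Encode` module and the byte values `0xD3 0xA3` from the dump (octal 323 243).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two bytes shown by 'od -c' (octal 323 243).
my $bytes = "\xD3\xA3";

# Byte string -> Perl character string, interpreting the bytes as cp1250.
my $chars = decode('cp1250', $bytes);

# After decoding, each element is a Unicode code point, not a cp1250 byte.
my @ordinals = map { ord } split //, $chars;
printf "U+%04X%s\n", $_, ($_ > 255 ? '  (wide: ordinal > 255)' : '')
    for @ordinals;

# Encoding back to cp1250 restores the original two bytes.
my $round_trip = encode('cp1250', $chars);
print $round_trip eq $bytes ? "round trip ok\n" : "round trip differs\n";
```

The second byte decodes to U+0141, whose ordinal is 321, so the decoded string holds a wide character even though the file on disk does not. Printing it needs an encoding layer on the output handle, which is what the `binmode(STDOUT, ...)` call above supplies.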

Thank you very much.

-- James E Keenan (jkeenan@cpan.org)

p5pRT commented 7 years ago

From @eserte

On Mon, 28 Nov 2016 04:34:02 -0800, rafal@zorro.ztk-rp.eu wrote:

[snip]

As it seems to make a difference if the CSV file has DOS or UNIX newlines --- can you attach the sample file? (In any case, either with DOS or UNIX newlines I don't see different behavior between Debian's perl in wheezy and jessie)
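The newline question matters because the reported script sets `$/ = "\r\n"`. As a rough sketch (the sample rows and the helper name `count_records` are hypothetical, not from the thread), the following counts how many records a read loop would see for UNIX versus DOS line endings when the record separator is CRLF.

```perl
use strict;
use warnings;

# Count the records <$fh> returns for $data when $/ is set to $separator.
# Hypothetical helper, reading from an in-memory filehandle.
sub count_records {
    my ($data, $separator) = @_;
    local $/ = $separator;
    open(my $fh, '<', \$data) or die "open: $!";
    my @records = <$fh>;
    close $fh;
    return scalar @records;
}

my $unix = qq{10;"A";1\n20;"B";2\n};       # hypothetical two-row sample, UNIX newlines
my $dos  = qq{10;"A";1\r\n20;"B";2\r\n};   # the same rows with DOS newlines

printf "UNIX data, \$/ = CRLF: %d record(s)\n", count_records($unix, "\r\n");
printf "DOS data,  \$/ = CRLF: %d record(s)\n", count_records($dos,  "\r\n");
```

With UNIX newlines the separator never matches, so the whole input arrives as a single record and any embedded newline is passed on to the CSV parser. That is one reason the actual sample file is needed to reproduce the report.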

p5pRT commented 7 years ago

From cm.perl@abtela.com

On 28/11/2016 at 13:34, (via RT) wrote:

LANG=pl_PL.utf8
LANGUAGE=en_US:en

Maybe a wild shot, but isn't that combination asking for trouble? FWIW, see http://stackoverflow.com/a/2510548

p5pRT commented 7 years ago

From @jkeenan

On Tue, 29 Nov 2016 08:24:13 GMT, slaven@rezic.de wrote:

On Mon, 28 Nov 2016 04:34:02 -0800, rafal@zorro.ztk-rp.eu wrote:

[snip] As it seems to make a difference if the CSV file has DOS or UNIX newlines --- can you attach the sample file? (In any case, either with DOS or UNIX newlines I don't see different behavior between Debian's perl in wheezy and jessie)

Rafal, can you please provide the sample file as an email attachment? We will need this for further diagnosis.

Thank you very much.

-- James E Keenan (jkeenan@cpan.org)

p5pRT commented 7 years ago

From @jkeenan

On Fri, 02 Dec 2016 21:55:42 GMT, jkeenan wrote:

[snip]

If there's no response from the original poster within a week\, I will close this ticket.

Thank you very much.

-- James E Keenan (jkeenan@cpan.org)

p5pRT commented 7 years ago

From @jkeenan

On Sun, 25 Dec 2016 02:12:24 GMT, jkeenan wrote:

[snip]

Closing as per schedule. Thank you very much.

-- James E Keenan (jkeenan@cpan.org)

p5pRT commented 7 years ago

@jkeenan - Status changed from 'open' to 'rejected'