Perl / perl5

UTF-8 scripts with BOM not auto-detected #15960

Open p5pRT opened 7 years ago

p5pRT commented 7 years ago

Migrated from rt.perl.org#131195 (status was 'open')

Searchable as RT131195$

p5pRT commented 7 years ago

From @jimav

(I can't seem to send purely plain-text email, so I'm sending the perlbug file as an attachment to avoid undesired line wraps.)

p5pRT commented 7 years ago

From @jimav

This is a bug report for perl from jim.avera@gmail.com, generated with the help of perlbug 1.40 running under perl 5.22.1.

According to perlunicode(1):

"... if a Perl script begins with the Unicode "BOM" (UTF-16LE, UTF16-BE, or UTF-8), or if the script looks like non-"BOM"-marked UTF-16 of either endianness, Perl will correctly read in the script as the appropriate Unicode encoding."

That is true for the UTF-16 variants, but not for UTF-8.

    #!/usr/bin/perl
    #
    # Test to see if perl can auto-detect script encodings from a BOM
    # (it's best to view the output on a utf-8 terminal)
    #
    use strict; use warnings;

    # Do everything in a temporary directory
    my $tdir = "/tmp/test.dir";
    system "set -x; rm -rf $tdir; mkdir $tdir";
    # "or" rather than "||", so that a failing chdir actually dies
    chdir $tdir or die "cannot chdir to $tdir: $!";

    # Some Perl source code which uses Unicode in identifiers and strings
    my $sourcecode = <<EOF;
      use strict; use warnings;
      my \$\N{U+0444}\N{U+043E}\N{U+043E} = 42; # \$фоо = 42;
      my \$\N{U+041E}\N{U+0442}\N{U+0440}\N{U+043E} = "\N{U+2169}\N{U+216C}\N{U+2161}"; # \$Отро = "XLII";

      use open ':std', ':encoding(utf8)';

      print "ABC"
        ."\N{LEFT-POINTING DOUBLE ANGLE QUOTATION MARK}"
        ."DEF"
        ."\N{RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK}"
        ."GHI\\n";

      print "The answer is \$\N{U+0444}\N{U+043E}\N{U+043E} (\$\N{U+041E}\N{U+0442}\N{U+0440}\N{U+043E})\\n";

      exit \$\N{U+0444}\N{U+043E}\N{U+043E};
    EOF

    # Write out the perl script in various encodings, preceded by
    # the BOM character.
    #
    # According to perlunicode(1):
    # "... if a Perl script begins with the Unicode "BOM" (UTF-16LE, UTF16-BE,
    # or UTF-8), or if the script looks like non-"BOM"-marked UTF-16 of either
    # endianness, Perl will correctly read in the script as the appropriate
    # Unicode encoding."
    #
    for ('UTF-8', 'UTF-16LE', 'UTF-16BE',
         'UTF-16LE-nobom', 'UTF-16BE-nobom',
         'UTF-32LE', 'UTF-32BE') {
      print "=================================================\n";
      my $enc = $_;
      my $nobom = $enc =~ s/-nobom$//;   # true if the suffix was present
      my $path = "test_${_}.pl";
      open my $fh, ">:encoding($enc)", $path or die "cannot open $path: $!";
      print $fh "\N{U+FEFF}" unless $nobom; # the BOM character
      print $fh $sourcecode;
      close $fh or die "write error ($!)";
      system "set -x; od -N 16 -t x1 $path";   # show the first 16 bytes
      system "set -x; perl $path";             # try to run the generated script
    }
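As a reading aid for the `od` output of the test script above, here is a minimal sketch of BOM sniffing on the first bytes of a file. It is written in Python only to show the byte-level logic and is not how perl's tokenizer is implemented; the name `sniff_bom` and the table `BOMS` are hypothetical. It also ignores the corner case of a UTF-16LE file whose first character after the BOM is U+0000, which would look like UTF-32LE.

```python
import codecs

# Longest signatures first: the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM (FF FE), so UTF-32 must be tested before UTF-16.
BOMS = [
    ("UTF-32LE", codecs.BOM_UTF32_LE),
    ("UTF-32BE", codecs.BOM_UTF32_BE),
    ("UTF-8", codecs.BOM_UTF8),
    ("UTF-16LE", codecs.BOM_UTF16_LE),
    ("UTF-16BE", codecs.BOM_UTF16_BE),
]


def sniff_bom(head: bytes):
    """Return the name of the encoding whose BOM starts `head`, or None."""
    for name, bom in BOMS:
        if head.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    # Encode U+FEFF followed by some source text, as the test script does,
    # and show the leading bytes next to the sniffed encoding.
    for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        head = ("\ufeff" + "use strict;").encode(enc)[:16]
        print(enc, head[:4].hex(" "), "->", sniff_bom(head))
```

The ordering of the table is the only subtle design choice: a prefix test that tried UTF-16LE first would misreport every UTF-32LE file.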



Flags:
    category=core
    severity=low


Site configuration information for perl 5.22.1​:

Configured by Debian Project at Sun Mar 13 11​:54​:18 UTC 2016.

Summary of my perl5 (revision 5 version 22 subversion 1) configuration​:  
  Platform​:   osname=linux\, osvers=3.16.0\, archname=x86_64-linux-gnu-thread-multi   uname='linux localhost 3.16.0 #1 smp debian 3.16.0 x86_64 gnulinux '   config_args='-Dusethreads -Duselargefiles -Dcc=x86_64-linux-gnu-gcc -Dcpp=x86_64-linux-gnu-cpp -Dld=x86_64-linux-gnu-gcc -Dccflags=-DDEBIAN -Wdate-time -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Dldflags= -Wl\,-Bsymbolic-functions -Wl\,-z\,relro -Dlddlflags=-shared -Wl\,-Bsymbolic-functions -Wl\,-z\,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.22 -Darchlib=/usr/lib/x86_64-linux-gnu/perl/5.22 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/x86_64-linux-gnu/perl5/5.22 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.22.1 -Dsitearch=/usr/local/lib/x86_64-linux-gnu/perl/5.22.1 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -dEs -Duseshrplib -Dlibperl=libperl.so.5.22.1'   hint=recommended\, useposix=true\, d_sigaction=define   useithreads=define\, usemultiplicity=define   use64bitint=define\, use64bitall=define\, uselongdouble=undef   usemymalloc=n\, bincompat5005=undef   Compiler​:   cc='x86_64-linux-gnu-gcc'\, ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'\,   optimize='-O2 -g'\,   cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include'   ccversion=''\, gccversion='5.3.1 20160311'\, gccosandvers=''   intsize=4\, longsize=8\, ptrsize=8\, doublesize=8\, byteorder=12345678\, doublekind=3   d_longlong=define\, longlongsize=8\, d_longdbl=define\, longdblsize=16\, longdblkind=3   ivtype='long'\, 
ivsize=8\, nvtype='double'\, nvsize=8\, Off_t='off_t'\, lseeksize=8   alignbytes=8\, prototype=define   Linker and Libraries​:   ld='x86_64-linux-gnu-gcc'\, ldflags =' -fstack-protector-strong -L/usr/local/lib'   libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/5/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib   libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt   perllibs=-ldl -lm -lpthread -lc -lcrypt   libc=libc-2.21.so\, so=so\, useshrplib=true\, libperl=libperl.so.5.22   gnulibc_version='2.21'   Dynamic Linking​:   dlsrc=dl_dlopen.xs\, dlext=so\, d_dlsymun=undef\, ccdlflags='-Wl\,-E'   cccdlflags='-fPIC'\, lddlflags='-shared -L/usr/local/lib -fstack-protector-strong'

Locally applied patches​:   DEBPKG​:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.   DEBPKG​:debian/db_file_ver - http​://bugs.debian.org/340047 Remove overly restrictive DB_File version check.   DEBPKG​:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.   DEBPKG​:debian/enc2xs_inc - http​://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @​INC directories.   DEBPKG​:debian/errno_ver - http​://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.   DEBPKG​:debian/libperl_embed_doc - http​://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking   DEBPKG​:fixes/respect_umask - Respect umask during installation   DEBPKG​:debian/writable_site_dirs - Set umask approproately for site install directories   DEBPKG​:debian/extutils_set_libperl_path - EU​:MM​: set location of libperl.a under /usr/lib   DEBPKG​:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor   DEBPKG​:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.   DEBPKG​:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.   DEBPKG​:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.   DEBPKG​:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.   DEBPKG​:debian/mod_paths - Tweak @​INC ordering for Debian   DEBPKG​:debian/prune_libs - http​://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.   
DEBPKG​:fixes/net_smtp_docs - [rt.cpan.org #36038] http​://bugs.debian.org/100195 Document the Net​::SMTP 'Port' option   DEBPKG​:debian/perlivp - http​://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local   DEBPKG​:debian/deprecate-with-apt - http​://bugs.debian.org/747628 Point users to Debian packages of deprecated core modules   DEBPKG​:debian/squelch-locale-warnings - http​://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts   DEBPKG​:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository   DEBPKG​:debian/patchlevel - http​://bugs.debian.org/567489 List packaged patches for 5.22.1-9 in patchlevel.h   DEBPKG​:debian/skip-kfreebsd-crash - http​://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD   DEBPKG​:fixes/document_makemaker_ccflags - http​://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags}   DEBPKG​:debian/find_html2text - http​://bugs.debian.org/640479 Configure CPAN​::Distribution with correct name of html2text   DEBPKG​:debian/perl5db-x-terminal-emulator.patch - http​://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl   DEBPKG​:debian/cpan-missing-site-dirs - http​://bugs.debian.org/688842 Fix CPAN​::FirstTime defaults with nonexisting site dirs if a parent is writable   DEBPKG​:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http​://bugs.debian.org/587650 Memoize​::Storable​: respect 'nstore' option not respected   DEBPKG​:debian/regen-skip - Skip a regeneration check in unrelated git repositories   DEBPKG​:debian/makemaker-pasthru - http​://bugs.debian.org/758471 Pass LD settings through to subdirectories   DEBPKG​:fixes/pod_man_reproducible_date - http​://bugs.debian.org/759405 Support POD_MAN_DATE in Pod​::Man for the left-hand footer   DEBPKG​:debian/locale-robustness - http​://bugs.debian.org/782068 [perl #124310] Make 
t/run/locale.t survive missing locales masked by LC_ALL   DEBPKG​:fixes/podman-utc - http​://bugs.debian.org/780259 Make the embedded date from Pod​::Man reproducible   DEBPKG​:fixes/podman-utc-docs - http​://bugs.debian.org/780259 Documentation and test suite updates for UTC fix   DEBPKG​:fixes/podman-empty-date - http​://bugs.debian.org/780259 Support an empty POD_MAN_DATE environment variable   DEBPKG​:fixes/podman-pipe - http​://bugs.debian.org/777405 Better errors for man pages from standard input   DEBPKG​:debian/pod2man-customized - Update porting/customized.dat for pod2man modifications   DEBPKG​:debian/makemaker-manext - http​://bugs.debian.org/247370 Make EU​::MakeMaker honour MANnEXT settings in generated manpage headers   DEBPKG​:debian/makemaker_customized - Update t/porting/customized.dat for files patched in Debian   DEBPKG​:debian/do-not-record-build-date - [6baa8db] http​://bugs.debian.org/774422 [perl #125830] Allow overriding the compile time in "perl -V" output   DEBPKG​:fixes/podman-source-date-epoch - http​://bugs.debian.org/801621 Make Pod​::Man honor the SOURCE_DATE_EPOCH environment variable   DEBPKG​:fixes/podman-source-date-epoch-cleanups - http​://bugs.debian.org/801621 Coding style and documentation for SOURCE_EPOCH_DATE   DEBPKG​:fixes/podman-source-date-epoch-testfix - http​://bugs.debian.org/807086 Guard for building with SOURCE_DATE_EPOCH or POD_MAN_DATE set   DEBPKG​:debian/devel-ppport-reproducibility - http​://bugs.debian.org/801523 Sort the list of XS code files when generating RealPPPort.xs   DEBPKG​:fixes/encode-unicode-bom - http​://bugs.debian.org/798727 [rt.cpan.org #107043] Address https://rt.cpan.org/Public/Bug/Display.html?id=107043   DEBPKG​:debian/encode-unicode-bom-doc - http​://bugs.debian.org/798727 Document Debian backport of Encode​::Unicode fix   DEBPKG​:debian/kfreebsd-softupdates - http​://bugs.debian.org/796798 Work around Debian Bug#796798   DEBPKG​:fixes/autodie-scope - http​://bugs.debian.org/798096 Fix a 
scoping issue with "no autodie" and the "system" sub   DEBPKG​:debian/debugperl-compat-fix - [perl #127212] http​://bugs.debian.org/810326 Disable PERL_TRACK_MEMPOOL for debugging builds   DEBPKG​:fixes/CVE-2015-8607_file_spec_taint_fix - http​://bugs.debian.org/810719 [perl #126862] ensure File​::Spec​::canonpath() preserves taint   DEBPKG​:fixes/mkstemp-umask - http​://bugs.debian.org/810924 [perl #127322] [e57270b] Fix umask for mkstemp(3) calls   DEBPKG​:fixes/crosscompile-no-targethost - [perl #127234] Fix the Configure escape with usecrosscompile but no targethost   DEBPKG​:fixes/podlators-no-encode - [rt.cpan.org #111156] Degrade gracefully if utf8 is requested but Encode is not available   DEBPKG​:debian/cross-time-hires - [rt.cpan.org #111391] Add an environment variable to skip running configuration probes   DEBPKG​:fixes/encode-unicode-pod - Unicode.pm​: Fix POD error   DEBPKG​:fixes/memoize-pod - [rt.cpan.org #89441] Fix POD errors in Memoize   DEBPKG​:fixes/ok-pod - Added encoding for pod.   DEBPKG​:fixes/CVE-2016-2381_duplicate_env - remove duplicate environment variables from environ


@INC for perl 5.22.1:
    /home/jima/perl5/lib/perl5/5.22.1/x86_64-linux-gnu-thread-multi
    /home/jima/perl5/lib/perl5/5.22.1
    /home/jima/perl5/lib/perl5/x86_64-linux-gnu-thread-multi
    /home/jima/perl5/lib/perl5
    /home/jima/lib/perl
    /etc/perl
    /usr/local/lib/x86_64-linux-gnu/perl/5.22.1
    /usr/local/share/perl/5.22.1
    /usr/lib/x86_64-linux-gnu/perl5/5.22
    /usr/share/perl5
    /usr/lib/x86_64-linux-gnu/perl/5.22
    /usr/share/perl/5.22
    /usr/local/lib/site_perl
    /usr/lib/x86_64-linux-gnu/perl-base
    .


Environment for perl 5.22.1:
    HOME=/home/jima
    LANG=en_US.UTF-8
    LANGUAGE=en_US
    LC_COLLATE=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/jima/perl5/bin:/home/jima/bin:/home/jima/jima_tools/x86_64/bin:/home/jima/jima_tools/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/bin/X11:/usr/local/bin:/usr/local/sbin:/usr/games:/usr/local/games:/usr/lib/jvm/java-8-oracle/bin:/usr/lib/jvm/java-8-oracle/db/bin:/usr/lib/jvm/java-8-oracle/jre/bin:.
    PERL5LIB=/home/jima/perl5/lib/perl5:/home/jima/lib/perl
    PERL_BADLANG (unset)
    PERL_LOCAL_LIB_ROOT=/home/jima/perl5
    PERL_MB_OPT=--install_base "/home/jima/perl5"
    PERL_MM_OPT=INSTALL_BASE=/home/jima/perl5
    SHELL=/bin/bash

p5pRT commented 7 years ago

From @mauke

On 23.04.2017 at 03:11, (via RT) wrote:

> According to perlunicode(1): "... if a Perl script begins with the Unicode "BOM" (UTF-16LE, UTF16-BE, or UTF-8), or if the script looks like non-"BOM"-marked UTF-16 of either endianness, Perl will correctly read in the script as the appropriate Unicode encoding."
>
> That is true for UTF-16 variants, but not UTF-8.

Duplicate of https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121292 ?

-- Lukas Mai <plokinom@gmail.com>

From demerphq <demerphq@gmail.com> (Sun, 23 Apr 2017 11:27:04 +0200)

On 23 April 2017 at 11:13, Lukas Mai <plokinom@gmail.com> wrote:

> Duplicate of https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121292 ?

If it is, I will just say that I think this issue could be reopened. I think that ticket was decided wrongly.

I think we should have respected the docs and added support for UTF-8 BOMs. Strictly speaking they are not required, but they are common in Windows workflows, and I don't see what harm is caused by respecting them as compared to respecting UTF-16 BOMs. As far as I can tell the only difference is that with UTF-16 a BOM is required to properly discriminate UTF-16BE from UTF-16LE data, whereas UTF-8 strictly speaking is endianness-neutral. However, on Windows it is traditional to use BOMs to signal any format of Unicode, so we force people using UTF-8 on Windows to scrub their BOMs. I never understood why, especially since most people who object to this are on *nix platforms where such BOMs almost never show up. (I remember getting bitten by UTF-8 BOMs when I worked on Windows a lot, but have never seen a UTF-8 BOM since I switched to *nix.)

Maybe we should re-enable this on Windows, or make it be a build option.
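For readers unfamiliar with the "scrubbing" mentioned above, the following is a rough sketch of what removing a UTF-8 BOM from a script amounts to at the byte level. It is an illustration in Python, not a tool that ships with perl; `strip_utf8_bom` and `scrub_file` are hypothetical names, and the sketch assumes the file is small enough to read into memory.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def scrub_file(path: str) -> bool:
    """Rewrite `path` without its UTF-8 BOM; return True if one was removed."""
    with open(path, "rb") as fh:
        original = fh.read()
    cleaned = strip_utf8_bom(original)
    if len(cleaned) == len(original):
        return False  # no BOM found, leave the file untouched
    with open(path, "wb") as fh:
        fh.write(cleaned)
    return True
```

Working on raw bytes rather than decoded text keeps the rest of the file exactly as it was, including its line endings.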

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

From Lukas Mai <plokinom@gmail.com> (Sun, 23 Apr 2017 12:14:27 +0200)

On 23.04.2017 at 11:27, demerphq wrote:

> Maybe we should re-enable this on Windows, or make it be a build option.

The problem I'm worried about is that we already see problems from users who write scripts on Windows (or copy them from somewhere in Windows format), then run them on Unix, only to get:

    $ ./my_script.pl
    ./my_script.pl: No such file or directory

when my_script.pl clearly exists. This failure mode is caused by the shebang line containing an invisible carriage return:

    #!/usr/bin/perl\r

Similarly, an invisible BOM at the beginning would completely break the "#!" mechanism. That's why I think we shouldn't encourage it.
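Both failure modes described above can be spotted by looking at the raw first line of a script. The sketch below is a hypothetical checker written in Python for illustration; `diagnose_shebang` is not an existing tool. It assumes the usual Unix behavior in which the kernel treats a file as an interpreter script only if its first two bytes are `#!`, and takes everything up to the newline as the interpreter line.

```python
import codecs


def diagnose_shebang(head: bytes) -> list:
    """List reasons why the first line in `head` may not work as a "#!" line."""
    problems = []
    first_line = head.split(b"\n", 1)[0]
    if first_line.startswith(codecs.BOM_UTF8):
        # The file starts with EF BB BF, so "#!" is not at offset 0.
        problems.append("UTF-8 BOM before '#!'")
    elif not first_line.startswith(b"#!"):
        problems.append("file does not start with '#!'")
    if first_line.endswith(b"\r"):
        # CRLF line endings leave "\r" attached to the interpreter path.
        problems.append("carriage return at end of interpreter line")
    return problems


if __name__ == "__main__":
    samples = [
        b"#!/usr/bin/perl\nprint 1;\n",
        b"#!/usr/bin/perl\r\nprint 1;\r\n",
        b"\xef\xbb\xbf#!/usr/bin/perl\nprint 1;\n",
    ]
    for sample in samples:
        print(sample[:20], diagnose_shebang(sample))
```

An empty list means neither problem was found; it does not guarantee that the interpreter path itself exists.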

PS​: I like "Perl5 Porteros" :-)

-- Lukas Mai <plokinom@gmail.com>

From demerphq <demerphq@gmail.com> (Sun, 23 Apr 2017 13:10:48 +0200)

From: Zefram <zefram@fysh.org>
Date: Sun, 23 Apr 2017 20:33:12 +0100
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion

Bisecting shows that the warning started appearing for that test script at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte locale".

Attempting to minimise the test script, it turns out that the "use experimental" line is not required for any reason relating to smartmatch, but simply for its effect on lexical warning flags. Anything touching lexical warnings will do, such as the simpler "use warnings". And thus enabling all warnings produces an additional warning that sheds some light on the matter:

    $ perl ../rt131190
    Malformed UTF-8 character (empty string) in pattern match (m//) at ../rt131190 line 8.
    Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8.

Looks like the problem is that the check for wide characters should be passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end of string it perceives a malformed character, for which it warns about malformation and substitutes in a replacement character, which is wide and therefore triggers the wide character warning.

-zefram

p5pRT commented 7 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 7 years ago

From @demerphq

On 23 April 2017 at 11:13, Lukas Mai <plokinom@gmail.com> wrote:

On 23.04.2017 at 03:11, (via RT) wrote:

# New Ticket Created by # Please include the string: [perl #131195] # in the subject line of all future correspondence about this issue. # <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=131195 >

According to perlunicode(1): "... if a Perl script begins with the Unicode "BOM" (UTF-16LE, UTF16-BE, or UTF-8), or if the script looks like non-"BOM"-marked UTF-16 of either endianness, Perl will correctly read in the script as the appropriate Unicode encoding."

That is true for UTF-16 variants\, but not UTF-8.

Duplicate of https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121292 ?

If it is I will just say that I think this issue could be reopened. I think that ticket was decided wrongly.

I think we should have respected the docs and added support for UTF-8 BOMs. Strictly speaking they are not required, but they are common in Windows workflows, and I don't see what harm is caused by respecting them as compared to respecting UTF-16 BOMs. As far as I can tell the only difference is that with UTF-16 a BOM is required to properly discriminate UTF-16BE from UTF-16LE data, whereas UTF-8, strictly speaking, is endianness-neutral. However, on Windows it is traditional to use BOMs to signal any format of Unicode, so we force people using UTF-8 on Windows to scrub their BOMs. I never understood why, especially since most people who object to this are on *nix platforms where such BOMs almost never show up. (I remember getting bitten by UTF-8 BOMs when I worked on Windows a lot, but have never seen a UTF-8 BOM since I switched to *nix.)

Maybe we should re-enable this on Windows, or make it be a build option.
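To make the byte-level difference between the BOM variants concrete, here is a minimal sketch of signature detection on the first bytes of a file. The helper name `bom_encoding` and the table `@BOMS` are hypothetical, introduced only for illustration; this is not a description of what perl's tokenizer does internally.

```perl
use strict;
use warnings;

# Byte signatures of U+FEFF in each encoding form. Order matters: the
# UTF-32LE signature begins with the UTF-16LE one, so it is tested first.
# (A UTF-16LE file whose first character after the BOM is U+0000 would be
# misread as UTF-32LE; this sketch accepts that ambiguity.)
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: return the encoding name implied by a leading BOM,
# or undef if the byte string does not start with one.
sub bom_encoding {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($sig, $name) = @$entry;
        return $name if substr($bytes, 0, length $sig) eq $sig;
    }
    return;
}
```

For UTF-16 the signature carries real information (byte order); for UTF-8 it only says "this is UTF-8", which is the asymmetry discussed above.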

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 7 years ago

From @mauke

On 23.04.2017 at 11:27, demerphq wrote:

On 23 April 2017 at 11:13, Lukas Mai <plokinom@gmail.com> wrote:

Duplicate of https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121292 ?

If it is I will just say that I think this issue could be reopened. I think that ticket was decided wrongly.

I think we should have respected the docs and added support for UTF-8 BOMs. Strictly speaking they are not required, but they are common in Windows workflows, and I don't see what harm is caused by respecting them as compared to respecting UTF-16 BOMs. As far as I can tell the only difference is that with UTF-16 a BOM is required to properly discriminate UTF-16BE from UTF-16LE data, whereas UTF-8, strictly speaking, is endianness-neutral. However, on Windows it is traditional to use BOMs to signal any format of Unicode, so we force people using UTF-8 on Windows to scrub their BOMs. I never understood why, especially since most people who object to this are on *nix platforms where such BOMs almost never show up. (I remember getting bitten by UTF-8 BOMs when I worked on Windows a lot, but have never seen a UTF-8 BOM since I switched to *nix.)

Maybe we should re-enable this on Windows, or make it be a build option.

The problem I'm worried about is that we already see problems from users who write scripts on Windows (or copy them from somewhere in Windows format), then run them on Unix, only to get:

    $ ./my_script.pl
    ./my_script.pl: No such file or directory

when my_script.pl clearly exists. This failure mode is caused by the shebang line containing an invisible carriage return:

    #!/usr/bin/perl\r

Similarly, an invisible BOM at the beginning would completely break the "#!" mechanism. That's why I think we shouldn't encourage it.

PS: I like "Perl5 Porteros" :-)

-- Lukas Mai <plokinom@gmail.com>

p5pRT commented 7 years ago

From @demerphq

On 23 April 2017 at 12:14, Lukas Mai <plokinom@gmail.com> wrote:

On 23.04.2017 at 11:27, demerphq wrote:

On 23 April 2017 at 11:13, Lukas Mai <plokinom@gmail.com> wrote:

Duplicate of https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121292 ?

If it is I will just say that I think this issue could be reopened. I think that ticket was decided wrongly.

I think we should have respected the docs and added support for UTF-8 BOMs. Strictly speaking they are not required, but they are common in Windows workflows, and I don't see what harm is caused by respecting them as compared to respecting UTF-16 BOMs. As far as I can tell the only difference is that with UTF-16 a BOM is required to properly discriminate UTF-16BE from UTF-16LE data, whereas UTF-8, strictly speaking, is endianness-neutral. However, on Windows it is traditional to use BOMs to signal any format of Unicode, so we force people using UTF-8 on Windows to scrub their BOMs. I never understood why, especially since most people who object to this are on *nix platforms where such BOMs almost never show up. (I remember getting bitten by UTF-8 BOMs when I worked on Windows a lot, but have never seen a UTF-8 BOM since I switched to *nix.)

Maybe we should re-enable this on Windows, or make it be a build option.

The problem I'm worried about is that we already see problems from users who write scripts on Windows (or copy them from somewhere in Windows format), then run them on Unix, only to get:

    $ ./my_script.pl
    ./my_script.pl: No such file or directory

when my_script.pl clearly exists. This failure mode is caused by the shebang line containing an invisible carriage return:

    #!/usr/bin/perl\r

Similarly, an invisible BOM at the beginning would completely break the "#!" mechanism. That's why I think we shouldn't encourage it.

Interesting. My response to that is "so let's make that work as well, and not inconvenience our users." I mean if we see the \r maybe we should just assume the file has Windows line endings and DTRT.

My point here is that it seems to me that most of these failure modes are of that irritating type where Perl knows what is wrong, and could do something reasonable, but doesn't.

PS: I like "Perl5 Porteros" :-)

I think that was the name given to it by someone I first replied to on the list. Gmail remembered it, and despite a few lazy attempts to fix it Gmail has stubbornly refused to use anything else. I gave up caring after a while. :-)

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 7 years ago

From @mauke

On 23.04.2017 at 13:10, demerphq wrote:

On 23 April 2017 at 12:14, Lukas Mai <plokinom@gmail.com> wrote:

The problem I'm worried about is that we already see problems from users who write scripts on Windows (or copy them from somewhere in Windows format), then run them on Unix, only to get:

    $ ./my_script.pl
    ./my_script.pl: No such file or directory

when my_script.pl clearly exists. This failure mode is caused by the shebang line containing an invisible carriage return:

    #!/usr/bin/perl\r

Similarly, an invisible BOM at the beginning would completely break the "#!" mechanism. That's why I think we shouldn't encourage it.

Interesting. My response to that is "so let's make that work as well, and not inconvenience our users." I mean if we see the \r maybe we should just assume the file has Windows line endings and DTRT.

My point here is that it seems to me that most of these failure modes are of that irritating type where Perl knows what is wrong, and could do something reasonable, but doesn't.

If you want to make that work, you have to go out and patch all unixish kernels. Perl doesn't even run because there is no file called "/usr/bin/perl\r" on the system.

(I suppose you could fix that by doing `ln -s perl $'/usr/bin/perl\r'` as part of the install step, but ... eugh.)

But even that won't help you with a BOM: Either it will fail outright (unknown executable format (not ELF, doesn't start with "#!")) or the shell will "helpfully" try to run it as a shell script. That's why http://www.unicode.org/faq/utf_bom.html#bom10 says "Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided."
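Since both failure modes happen before perl is ever started, they can only be caught by inspecting the file from outside. The following is a rough sketch of such a check, assuming the caller has already read the first bytes of the script in raw mode. The helper name `shebang_problems` and its messages are hypothetical, made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: given the first bytes of a script, list reasons why
# a Unix kernel would not treat its first line as a usable "#!" line.
sub shebang_problems {
    my ($head) = @_;
    my @problems;

    # The kernel compares the first two bytes with "#!"; a UTF-8 BOM
    # (EF BB BF) in front of them defeats that comparison.
    if ($head =~ s/\A\xEF\xBB\xBF//) {
        push @problems, 'UTF-8 BOM before "#!"';
    }

    # Take the first line, without its terminating LF.
    my ($line) = $head =~ /\A([^\n]*)/;

    if ($line !~ /\A#!/) {
        push @problems, 'first line is not a "#!" line';
    }
    elsif ($line =~ /\r\z/) {
        # The CR becomes part of the interpreter path or its argument.
        push @problems, 'carriage return at end of "#!" line';
    }
    return @problems;
}
```

A caller would typically open the script with the `:raw` layer, `read` a few hundred bytes, and print whatever the function returns. An empty list means neither problem was found.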

-- Lukas Mai <plokinom@gmail.com>

p5pRT commented 7 years ago

From @jimav

On Sun, 23 Apr 2017 14:32:33 -0700, plokinom@gmail.com wrote:

(I suppose you could fix that by doing `ln -s perl $'/usr/bin/perl\r'` as part of the install step, but ... eugh.) But even that won't help you with a BOM: Either it will fail outright (unknown executable format (not ELF, doesn't start with "#!")) or the shell will "helpfully" try to run it as a shell script...

IMO, #! support is semi-off-topic. The problem at hand is that you can't say

    perl file.pl

and have it work if file.pl starts with a UTF-8 BOM. As noted by others, you automatically get a BOM when saving a file in UTF-8 format on Windows.

I just don't see how any harm could come to *nix users if Perl recognizes the BOM *and* acts accordingly (right now perl recognizes the BOM but simply ignores it, and decodes the rest of the file incorrectly).

Deliberately making life harder for users, even (gasp) users on Windows, should be done only with very compelling reasons!

I don't think a file starting with a BOM could legitimately contain non-Unicode characters. If there is a BOM, the file was created by Unicode-aware software (e.g. Notepad), and absent a bug or major user shenanigans, the file is certain to contain Unicode characters encoded as the BOM indicates.

BTW, a BOM is not invisible if you look at the file with vim -b.
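For readers who want to see or remove the BOM without an editor, here is a small sketch in the spirit of the `od -N 16 -t x1` call in the test script. Both helper names (`leading_hex`, `strip_utf8_bom`) are hypothetical and operate on a byte string that the caller has read in raw mode. This is only a user-side workaround sketch, not a proposal for how perl itself should behave.

```perl
use strict;
use warnings;

# Hypothetical helper: render the first $n bytes of a byte string as hex,
# similar in spirit to `od -N 16 -t x1`.
sub leading_hex {
    my ($bytes, $n) = @_;
    $n //= 16;
    return join ' ', map { sprintf '%02x', ord } split //, substr($bytes, 0, $n);
}

# Hypothetical helper: drop a leading UTF-8 BOM (EF BB BF), if there is one,
# and return the remaining bytes unchanged.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}
```

A file that starts with a UTF-8 BOM shows `ef bb bf` as its first three hex bytes. After `strip_utf8_bom` the content begins directly with the script text, so a `#!` line is once again the first thing in the file.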