Closed p5pRT closed 9 years ago
Apologies if this has already been discussed. I don't subscribe to p5p but have searched the list before posting, and didn't find anything relating to this issue.
This patch replaces the example/recommendation of using File::Slurp with Path::Tiny in perlfaq5.
The reasoning behind this is an outstanding critical bug with File::Slurp:
https://rt.cpan.org/Public/Bug/Display.html?id=83126
This is a bug report for perl from mcgrath.martin@gmail.com, generated with the help of perlbug 1.40 running under perl 5.19.12.
 an array to a file so that accessing an element of the array actually
 accesses the corresponding line in the file.

-If you want to load the entire file, you can use the L<File::Slurp>
+If you want to load the entire file, you can use the L<Path::Tiny>
 module to do it in one simple and efficient step:

-    use File::Slurp;
+    use Path::Tiny;

-    my $all_of_it = read_file($filename); # entire file in scalar
-    my @all_lines = read_file($filename); # one line per element
+    my $all_of_it = path($filename)->slurp; # entire file in scalar
+    my @all_lines = path($filename)->lines; # one line per element

 Or you can read the entire file contents into a scalar like this:
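For readers without Path::Tiny installed, the two slurps in the patched FAQ text can be sketched with core Perl alone (a sketch only; Path::Tiny wraps the same idea plus error handling and encoding support):

```perl
use strict;
use warnings;

# Slurp an entire file into one scalar, core-Perl style.
sub slurp_scalar {
    my ($filename) = @_;
    open my $fh, '<', $filename or die "Can't open $filename: $!";
    local $/;               # undef the record separator: read to EOF
    return scalar <$fh>;
}

# Slurp a file as a list, one line per element.
sub slurp_lines {
    my ($filename) = @_;
    open my $fh, '<', $filename or die "Can't open $filename: $!";
    return <$fh>;           # list context: one line per element
}
```
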
Flags: category=docs severity=wishlist
Site configuration information for perl 5.19.12:
Configured by marto at Sat May 10 12:15:37 BST 2014.
Summary of my perl5 (revision 5 version 19 subversion 12) configuration:
  Derived from: f78f6d13e94ec2059acb931e3ca33869aa58f1e7
  Platform:
    osname=linux, osvers=3.11.0-20-generic, archname=x86_64-linux
    uname='linux marto-virtualbox 3.11.0-20-generic #34-ubuntu smp tue apr 1 20:40:25 utc 2014 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Dusedevel -Dprefix=/home/marto/bleedperl'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.8.1', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.8/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.17.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.17'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'
Locally applied patches: uncommitted-changes
@INC for perl 5.19.12:
    /home/marto/bleedperl/lib/site_perl/5.19.12/x86_64-linux
    /home/marto/bleedperl/lib/site_perl/5.19.12
    /home/marto/bleedperl/lib/5.19.12/x86_64-linux
    /home/marto/bleedperl/lib/5.19.12
    .
Environment for perl 5.19.12:
    HOME=/home/marto
    LANG=en_GB.UTF-8
    LANGUAGE=en_GB:en
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/lib/lightdm/lightdm:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
    PERL_BADLANG (unset)
    SHELL=/bin/bash
On Wednesday, 2014-05-14, 5:11, Martin McGrath (via RT) wrote:
# New Ticket Created by Martin McGrath # Please include the string: [perl #121870] # in the subject line of all future correspondence about this issue. # <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121870 >
Apologies if this has already been discussed. I don't subscribe to p5p but have searched the list before posting, and didn't find anything relating to this issue.
This patch replaces the example/recommendation of using File::Slurp with Path::Tiny in perlfaq5.
The reasoning behind this is an outstanding critical bug with File::Slurp:
In addition to the File::Slurp / perlfaq problem, shouldn't this:
sysread treats any :encoding(...) as effectively :utf8. Thus, requesting { binmode => ":encoding(UTF-8)" } (e.g. strict UTF-8 compliance) actually results in Perl's lax, insecure utf8 decoding being used instead. ... More importantly, it will interpret any encoding as :utf8, even for example :encoding(UTF-16).
be classified as a serious issue in sysread?
Though, I think, one could well argue that trying to read anything other than bytes (be it utf8 of any kind, or any other encoding) with sysread is a dubious proposition to begin with.
This is a bug report for perl from mcgrath.martin@gmail.com, generated with the help of perlbug 1.40 running under perl 5.19.12.
The RT System itself - Status changed from 'new' to 'open'
On Wed, May 14, 2014 at 4:14 PM, Jarkko Hietaniemi <jhi@iki.fi> wrote:
In addition to the File::Slurp / perlfaq problem, shouldn't this:
sysread treats any :encoding(...) as effectively :utf8. Thus, requesting { binmode => ":encoding(UTF-8)" } (e.g. strict UTF-8 compliance) actually results in Perl's lax, insecure utf8 decoding being used instead. ... More importantly, it will interpret any encoding as :utf8, even for example :encoding(UTF-16).
be classified as a serious issue in sysread?
Though, I think, one could well argue that trying to read anything other than bytes (be it utf8 of any kind, or any other encoding) with sysread is a dubious proposition to begin with.
Yes. This is essentially a core bug that's leaking out because File::Slurp is stupid. sysread might as well die in this particular case IMO, but I'm not sure if we can retroactively make it do that.
Leon
Interesting. Have you suggested this as a Perl::Critic policy too?
-- Ed Avis <eda@waniasset.com>
On Wednesday, 2014-05-14, 10:23, Leon Timmermans wrote:
Yes. This is essentially a core bug that's leaking out because File::Slurp is stupid. sysread might as well die in this particular case IMO, but I'm not sure if we can retroactively make it do that.
Probably not for 5.20, harhar... but certainly we could try purposefully breaking this for 5.21.
Yes. This is essentially a core bug that's leaking out because File::Slurp is stupid. sysread might as well die in this particular case IMO, but I'm not sure if we can retroactively make it do that.
I'll file a bug on this.
Leon
On 05/14/2014 10:23 AM, Leon Timmermans wrote:
On Wed, May 14, 2014 at 4:14 PM, Jarkko Hietaniemi <jhi@iki.fi> wrote:
In addition to the File::Slurp / perlfaq problem, shouldn't this:
sysread treats any :encoding(...) as effectively :utf8. Thus, requesting { binmode => ":encoding(UTF-8)" } (e.g. strict UTF-8 compliance) actually results in Perl's lax, insecure utf8 decoding being used instead. ... More importantly, it will interpret any encoding as :utf8, even for example :encoding(UTF-16).
be classified as a serious issue in sysread?
Though, I think, one could well argue that trying to read anything other than bytes (be it utf8 of any kind, or any other encoding) with sysread is a dubious proposition to begin with.
Yes. This is essentially a core bug that's leaking out because File::Slurp is stupid. sysread might as well die in this particular case IMO, but I'm not sure if we can retroactively make it do that.
and how would you go about fixing it in file::slurp? i am not a unicode/utf guru by any stretch (i say i am an ascii bigot! :). would just doing a raw sysread all the time and then converting the buffer as directed by binmode work? any other suggestions for how to read utf files with ease and speed?
thanx,
uri
-- Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http://PerlHunter.com
On Wed, May 14, 2014 at 5:14 PM, Uri Guttman <uri@stemsystems.com> wrote:
and how would you go about fixing it in file::slurp? i am not a unicode/utf guru by any stretch (i say i am an ascii bigot! :). would just doing a raw sysread all the time and then converting the buffer as directed by binmode work? any other suggestions for how to read utf files with ease and speed?
File::Slurp is nothing more than a workaround for a performance bug in perl. There is no reason «my $foo = do { local $/; <$fh> };» would need to be any slower than what File::Slurp is doing right now. In fact I'm planning to fix just that in 5.21.
Leon
On Wednesday, 2014-05-14, 11:30, Leon Timmermans wrote:
On Wed, May 14, 2014 at 5:14 PM, Uri Guttman <uri@stemsystems.com> wrote:
and how would you go about fixing it in file::slurp? i am not a unicode/utf guru by any stretch (i say i am an ascii bigot! :). would just doing a raw sysread all the time and then converting the buffer as directed by binmode work? any other suggestions for how to read utf files with ease and speed?
File::Slurp is nothing more than a workaround for a performance bug in perl. There is no reason «my $foo = do { local $/; <$fh> };» would need to be any slower than what File::Slurp is doing right now. In fact I'm planning to fix just that in 5.21.
Leon
FWIW re the UTF-8 validity/strictness: I think we're well within our rights to break things, quoting the standards (via the UTF-8 Wikipedia page):
RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences." ... The Unicode Standard requires decoders to "...treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."
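A sketch of what that strictness looks like from Perl, using the core Encode module with a check flag (FB_CROAK makes ill-formed input fatal, as the RFC requires of a conforming decoder; note the strict 'UTF-8' encoding name, as opposed to the lax 'utf8'):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# "\xC3\x28" is an ill-formed UTF-8 sequence: a lead byte followed by
# a non-continuation byte. A strict decoder must reject it.
my $ill_formed = "\xC3\x28";
my $decoded_ok = eval { decode('UTF-8', $ill_formed, FB_CROAK | LEAVE_SRC); 1 };

# A well-formed sequence decodes fine: "\xC3\xA9" is U+00E9 ("é").
my $well_formed = "\xC3\xA9";
my $text = decode('UTF-8', $well_formed, FB_CROAK | LEAVE_SRC);
```

(LEAVE_SRC is combined with FB_CROAK so the source scalar is not consumed in place, a documented side effect of checked decoding.)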
On Wed, May 14, 2014 at 12:34 PM, Jarkko Hietaniemi <jhi@iki.fi> wrote:
RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences." ... The Unicode Standard requires decoders to "...treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."
It's worth considering two separate issues:
* invalid UTF-8 encoding
* private/nonchar codepoints
As I understand it, :utf8 is lax in both respects. I'd be very supportive of fixing the first, even if it breaks backwards compatibility, but don't think we should address the second.
-- David Golden <xdg@xdg.me> Twitter/IRC: @xdg
On 2014-05-14 17:30, Leon Timmermans wrote:
On Wed, May 14, 2014 at 5:14 PM, Uri Guttman <uri@stemsystems.com> wrote:
and how would you go about fixing it in file::slurp? i am not a unicode/utf guru by any stretch (i say i am an ascii bigot! :). would just doing a raw sysread all the time and then converting the buffer as directed by binmode work? any other suggestions for how to read utf files with ease and speed?
File::Slurp is nothing more than a workaround for a performance bug in perl. There is no reason «my $foo = do { local $/; <$fh> };» would need to be any slower than what File::Slurp is doing right now. In fact I'm planning to fix just that in 5.21.
Not sure if this is still true for the latest Perls, but you better write it as:
my $foo; { local $/; $foo = <$fh> }
or it will use twice the memory.
-- Ruud
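Ruud's variant rendered as a complete sketch (whether the do-block form still copies on current Perls is version-dependent, as he notes): declare the target first and assign into it inside the block, so the slurped contents go straight into the target variable rather than through the do-block's return value.

```perl
use strict;
use warnings;

# Assumed: $filename names an existing, readable file.
sub slurp_inplace {
    my ($filename) = @_;
    open my $fh, '<', $filename or die "Can't open $filename: $!";
    my $contents;
    {
        local $/;            # slurp mode; $/ is restored on block exit
        $contents = <$fh>;   # assign directly into the target variable
    }
    return $contents;
}
```
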
On 05/14/2014 11:30 AM, Leon Timmermans wrote:
On Wed, May 14, 2014 at 5:14 PM, Uri Guttman <uri@stemsystems.com> wrote:
and how would you go about fixing it in file::slurp? i am not a unicode/utf guru by any stretch (i say i am an ascii bigot! :). would just doing a raw sysread all the time and then converting the buffer as directed by binmode work? any other suggestions for how to read utf files with ease and speed?
File::Slurp is nothing more than a workaround for a performance bug in perl. There is no reason «my $foo = do { local $/; <$fh> };» would need to be any slower than what File::Slurp is doing right now. In fact I'm planning to fix just that in 5.21.
if you think that is all there is to it, then do the workaround. most coders think read_file($filename) is a much better piece of code than the infamous do/local hack. it isn't just a speed up (and check out the benchmark script to see how much it blows away do/local) but it also has a better api, better error handling options, and more flexible ways to get the data back. of course you can do atomic writing on your own. edit_file and edit_file_lines would be trivial for you to write with open/read/write and all the other options. and where is the equivalent of write_file with do local? gonna need at least an open call and a print along with error handling and more.
over 500 modules and distros use file::slurp. tell them all to use the do/local thing. i am sure you will be rewarded with gold and jewels.
in the mean time, instead of saying it uses sysread and can't handle utf correctly, how about saying how to properly handle it? maybe those 500 modules will like that instead.
uri
-- Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http://PerlHunter.com
On 14 May 2014 15:45, Ed Avis via RT <perlbug-followup@perl.org> wrote:
Interesting. Have you suggested this as a Perl::Critic policy too?
Not as yet, I've forked the repo and will work on this tonight.
Thanks
On Wednesday, 2014-05-14, 22:57, Uri Guttman wrote:
in the mean time, instead of saying it uses sysread and can't handle utf correctly, how about saying how to properly handle it? maybe those 500 modules will like that instead.
Based on some torture testing I just did, sysread() isn't the only thing broken regarding :encoding(blah)... working on it, but it will be 5.21 before fixes happen.
On Wed, May 14, 2014 at 10:57:41PM -0400, Uri Guttman wrote:
in the mean time, instead of saying it uses sysread and can't handle utf correctly, how about saying how to properly handle it? maybe those 500 modules will like that instead.
Don't use sysread() if there are any layers - since it ignores everything but the :utf8 flag[1]
I'd tend to fall back to read(); whether that meets your performance requirements I don't know.
Tony
[1] in a scary way, e.g. if your file is UCS-2BE and you open with :encoding(UCS-2BE), sysread() will (try to) treat that UCS-2BE data as UTF-8 data.
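To see why that is scary: the same octets mean entirely different text under the two encodings. A sketch with core Encode (no sysread involved, just the decoding mismatch Tony describes):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "AB" in UCS-2BE is the four octets 00 41 00 42.
my $octets = encode('UCS-2BE', 'AB');

# Decoded as the encoding it was written in, we get the text back.
my $right = decode('UCS-2BE', $octets);

# Decoded as UTF-8, the same octets are *also* well-formed (each byte
# is an ASCII-range sequence), so the corruption is silent: "\0A\0B".
my $wrong = decode('UTF-8', $octets);
```

The second decode does not even fail; it quietly yields NUL-laced garbage, which is exactly the failure mode that makes ignoring layers dangerous.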
On Thu, May 15, 2014 at 4:57 AM, Uri Guttman <uri@stemsystems.com> wrote:
if you think that is all there is to it, then do the workaround. most coders think read_file($filename) is a much better piece of code than the infamous do/local hack. it isn't just a speed up (and check out the benchmark script to see how much it blows away do/local) but also has a better api, better error handling options, more flexible ways to get the data back. of course you can do atomic writing on your own.
All of that is true, but you don't need hundreds of lines of conceptually buggy code to achieve that.
edit_file and edit_file_lines would be trivial for you to write with open/read/write and all the other options. and where is the equivalent of write_file with do local? gonna need at least an open call and a print along with error handling and more.
edit_file and edit_file_lines are horribly broken by ignoring encodings altogether.
over 500 modules and distros use file::slurp. tell them all to use the do/local thing. i am sure you will be rewarded with gold and jewels.
in the mean time, instead of saying it uses sysread and can't handle utf correctly, how about saying how to properly handle it? maybe those 500 modules will like that instead.
Don't use sysread, at least not unless you're absolutely sure it gives the right answer. It's really that simple.
Leon
On 05/15/2014 08:43 AM, Tony Cook wrote:
On Wed, May 14, 2014 at 10:57:41PM -0400, Uri Guttman wrote:
in the mean time, instead of saying it uses sysread and can't handle utf correctly, how about saying how to properly handle it? maybe those 500 modules will like that instead.
Don't use sysread() if there are any layers - since it ignores everything but the :utf8 flag[1]
I'd tend to fall back to read(); whether that meets your performance requirements I don't know.
so what about sysread in raw mode and then converting the buffer to the requested format? that should bypass any sysread issues and then rely on the utf conversion subs directly.
thanx,
uri
-- Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http://PerlHunter.com
On 05/15/2014 08:54 AM, Leon Timmermans wrote:
On Thu, May 15, 2014 at 4:57 AM, Uri Guttman <uri@stemsystems.com> wrote:
if you think that is all there is to it, then do the workaround. most coders think read_file($filename) is a much better piece of code than the infamous do/local hack. it isn't just a speed up (and check out the benchmark script to see how much it blows away do/local) but also has a better api, better error handling options, more flexible ways to get the data back. of course you can do atomic writing on your own.
All of that is true, but you don't need hundreds of lines of conceptually buggy code to achieve that.
and there aren't many bug reports on those hundreds of lines. in fact most are utf related. the code is very solid and that isn't just my opinion.
edit_file and edit_file_lines would be trivial for you to write with open/read/write and all the other options. and where is the equivalent of write_file with do local? gonna need at least an open call and a print along with error handling and more.
edit_file and edit_file_lines are horribly broken by ignoring encodings altogether.
they don't ignore them. they pass on the encoding options to read_file and write_file.
over 500 modules and distros use file::slurp. tell them all to use the do/local thing. i am sure you will be rewarded with gold and jewels.
in the mean time, instead of saying it uses sysread and can't handle utf correctly, how about saying how to properly handle it? maybe those 500 modules will like that instead.
Don't use sysread, at least not unless you're absolutely sure it gives the right answer. It's really that simple.
and you didn't answer the question about calling sysread in raw mode and encoding afterwards. try answering that and being useful here. it isn't my fault that sysread encoding is broken and i wrote file::slurp 11 years ago when sysread was fine and dandy.
uri
-- Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http://PerlHunter.com
On Thu, May 15, 2014 at 9:26 AM, Uri Guttman <uri@stemsystems.com> wrote:
so what about sysread in raw mode and then converting the buffer to the requested format? that should bypass any sysread issues and then rely on the utf conversion subs directly.
The main problem I'm aware of is that there is no guarantee the buffer ends on a character boundary if the data are in a multi-byte encoding. The end of the buffer might have a partial character in it and the rest of that character hasn't been read in yet. You can't convert a partial character.
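That failure mode is easy to demonstrate as a sketch: split a multi-byte character across a pretend buffer boundary and strict decoding of the first buffer fails, while the complete sequence decodes cleanly.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# U+20AC (the euro sign) is three octets in UTF-8: E2 82 AC.
my $euro = "\xE2\x82\xAC";

# Pretend a block read stopped after two octets, mid-character.
my $partial = substr($euro, 0, 2);
my $partial_ok = eval { decode('UTF-8', $partial, FB_CROAK | LEAVE_SRC); 1 };

# The complete sequence decodes without complaint.
my $char = decode('UTF-8', $euro, FB_CROAK | LEAVE_SRC);
```
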
On 05/15/2014 10:46 AM, Craig A. Berry wrote:
On Thu, May 15, 2014 at 9:26 AM, Uri Guttman <uri@stemsystems.com> wrote:
so what about sysread in raw mode and then converting the buffer to the requested format? that should bypass any sysread issues and then rely on the utf conversion subs directly.
The main problem I'm aware of is that there is no guarantee the buffer ends on a character boundary if the data are in a multi-byte encoding. The end of the buffer might have a partial character in it and the rest of that character hasn't been read in yet. You can't convert a partial character.
the whole idea of slurp is to read in the entire file into a string. any partial chars at the end are the fault of whatever wrote the file. this conversion wouldn't be done on a block by block basis (and i see your point there as it finally is clear why sysread would be broken if used that way).
so again, i will ask the simple question to which i have not gotten a proper answer from all you unicode experts. if file::slurp reads in an ENTIRE file with raw mode sysread into a single string, and THEN converts it to the requested format, will that work? given the actual description of the bug above, i can't see why it won't work. and it will be an EASY fix (patches welcome as slurp is now on github under perlhunter).
so i will be blunt here, put up or shut up. i am asking for help on this unicode/utf issue where i am very weak. i know many of you are experts on perl/utf so what is the best way to do this? what subs do i call to do the conversion of a full buffer? i can make sysread do raw mode and delay the binmode option to a later conversion.
thanx,
uri
-- Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http://PerlHunter.com
On 15 May 2014 17:09, Uri Guttman <uri@stemsystems.com> wrote:
On 05/15/2014 10:46 AM, Craig A. Berry wrote:
On Thu, May 15, 2014 at 9:26 AM, Uri Guttman <uri@stemsystems.com> wrote:
so what about sysread in raw mode and then converting the buffer to the requested format? that should bypass any sysread issues and then rely on the utf conversion subs directly.
The main problem I'm aware of is that there is no guarantee the buffer ends on a character boundary if the data are in a multi-byte encoding. The end of the buffer might have a partial character in it and the rest of that character hasn't been read in yet. You can't convert a partial character.
the whole idea of slurp is to read in the entire file into a string. any partial chars at the end are the fault of whatever wrote the file. this conversion wouldn't be done on a block by block basis (and i see your point there as it finally is clear why sysread would be broken if used that way).
so again, i will ask the simple question to which i have not gotten a proper answer from all you unicode experts. if file::slurp reads in an ENTIRE file with raw mode sysread into a single string, and THEN converts it to the requested format, will that work? given the actual description of the bug above, i can't see why it won't work. and it will be an EASY fix (patches welcome as slurp is now on github under perlhunter).
so i will be blunt here, put up or shut up. i am asking for help on this unicode/utf issue where i am very weak. i know many of you are experts on perl/utf so what is the best way to do this? what subs do i call to do the conversion of a full buffer? i can make sysread do raw mode and delay the binmode option to a later conversion.
Reading the full file and then calling Encode::decode() on it with the right flags would work fine.
cheers, Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
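Yves' recipe, sketched end to end: open raw, slurp the whole file, then hand the octets to Encode::decode with a strict check flag. (The helper name is illustrative, not File::Slurp's API, and this handles only an :encoding-style conversion, not other layers.)

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: slurp raw octets, decode once at the end.
sub slurp_decoded {
    my ($filename, $encoding) = @_;
    $encoding //= 'UTF-8';
    open my $fh, '<:raw', $filename or die "Can't open $filename: $!";
    my $octets = do { local $/; <$fh> };
    # FB_CROAK: die on ill-formed input instead of substituting U+FFFD;
    # LEAVE_SRC keeps decode from consuming $octets in place.
    return decode($encoding, $octets, FB_CROAK | LEAVE_SRC);
}
```
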
On 05/15/2014 11:31 AM, demerphq wrote:
On 15 May 2014 17:09, Uri Guttman <uri@stemsystems.com> wrote:
On 05/15/2014 10:46 AM, Craig A. Berry wrote:
On Thu, May 15, 2014 at 9:26 AM, Uri Guttman <uri@stemsystems.com> wrote:
so what about sysread in raw mode and then converting the buffer to the requested format? that should bypass any sysread issues and then rely on the utf conversion subs directly.
The main problem I'm aware of is that there is no guarantee the buffer ends on a character boundary if the data are in a multi-byte encoding. The end of the buffer might have a partial character in it and the rest of that character hasn't been read in yet. You can't convert a partial character.
the whole idea of slurp is to read in the entire file into a string. any partial chars at the end are the fault of whatever wrote the file. this conversion wouldn't be done on a block by block basis (and i see your point there as it finally is clear why sysread would be broken if used that way).
so again, i will ask the simple question to which i have not gotten a proper answer from all you unicode experts. if file::slurp reads in an ENTIRE file with raw mode sysread into a single string, and THEN converts it to the requested format, will that work? given the actual description of the bug above, i can't see why it won't work. and it will be an EASY fix (patches welcome as slurp is now on github under perlhunter).
so i will be blunt here, put up or shut up. i am asking for help on this unicode/utf issue where i am very weak. i know many of you are experts on perl/utf so what is the best way to do this? what subs do i call to do the conversion of a full buffer? i can make sysread do raw mode and delay the binmode option to a later conversion.
Reading the full file and then calling Encode::decode() on it with the right flags would work fine.
yves,
thanks for the direct answer. now the question is should i just pass in what i get in the bin_mode option to decode or something else? i have used decode before but i don't recall if it uses the same args as binmode.
thanx,
uri
-- Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http://PerlHunter.com
On Thu, 15 May 2014, Uri Guttman wrote:
thanks for the direct answer. now the question is should i just pass in what i get in the bin_mode option to decode or something else? i have used decode before but i don't recall if it uses the same args as binmode.
This will only sort of work. With binmode you can set multiple layers and also do things unrelated to character sets (like gzip encode/decode, line ending translation, etc.).
Encode can accept anything that would be passed to the ":encoding(...)" layer, as well as ":utf8", but the two don't really correspond.
I suspect a better fix would be to simply _not_ use sysread if binmode is anything besides ":utf8" or ":raw".
-dave
/*============================================================ http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) ============================================================*/
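Dave's rule rendered as a sketch (the helper and its return values are illustrative, not File::Slurp code): fall back from sysread to buffered PerlIO whenever the requested binmode is anything beyond :raw or :utf8.

```perl
use strict;
use warnings;

# Hypothetical: classify a requested binmode string. sysread is only
# plausible when no layer transformation beyond the :utf8 flag is
# involved (and :utf8 itself carries the lax-decoding caveat discussed
# earlier in the thread).
sub io_strategy {
    my ($binmode) = @_;
    return 'sysread' if !defined $binmode
                     || $binmode eq ':raw'
                     || $binmode eq ':utf8';
    # :encoding(...), :crlf, :gzip, stacked layers, ... need PerlIO.
    return 'buffered';
}
```
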
On 05/15/2014 12:09 PM, Dave Rolsky wrote:
On Thu, 15 May 2014, Uri Guttman wrote:
thanks for the direct answer. now the question is should i just pass in what i get in the bin_mode option to decode or something else? i have used decode before but i don't recall if it uses the same args as binmode.
This will only sort of work. With binmode you can set multiple layers and also do things unrelated to character sets (like gzip encode/decode, line ending translation, etc.).
Encode can accept anything that would be passed to the ":encoding(...)" layer, as well as ":utf8", but the two don't really correspond.
I suspect a better fix would be to simply _not_ use sysread if binmode is anything besides ":utf8" or ":raw".
what about adding another option that is passed directly to decode? read() would still fail as files are read in blocks and that would also trigger the same error if a char was partially read. what i can do is always do :raw on sysread and call decode when the binmode is something allowed by decode like utf8 or something. it would ignore it when the mode is requested to be :raw (could be a binary file). i don't see why i can't keep sysread if somehow the user can specify how to decode. an alternative is for callers to always use :raw and then do decode on their own. that could be added to the pod saying if the new decode feature doesn't work for them, roll your own.
thanx,
uri
-- Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http://PerlHunter.com
On Thu, May 15, 2014 at 11:09 AM, Uri Guttman <uri@stemsystems.com> wrote:
so again, i will ask the simple question to which i have not gotten a proper answer from all you unicode experts. if file::slurp reads in an ENTIRE file with raw mode sysread into a single string, and THEN converts it to the requested format, will that work?
Yes, but it won't handle layers other than :encoding, and it won't handle already-opened handles with layers.
On 05/15/2014 01:16 PM, Eric Brine wrote:
On Thu, May 15, 2014 at 11:09 AM, Uri Guttman <uri@stemsystems.com> wrote:
so again, i will ask the simple question to which i have not gotten a proper answer from all you unicode experts. if file::slurp reads in an ENTIRE file with raw mode sysread into a single string, and THEN converts it to the requested format, will that work?
Yes, but it won't handle layers other than :encoding, and it won't handle already-opened handles with layers.
that can be documented. most callers of read_file won't use it with already open files. maybe for sockets and other handles (e.g. …) which may not have layers. also with slurping the only common layer i know of is already done, which is line splitting (works on all platforms). so the decoding is the only issue. if you need more complex stuff, then do it yourself. this is for common basic read-in-a-whole-file with utf decoding as needed. and that is by far the common case and what the module is intended for. it isn't meant to replace complex i/o, just make common i/o simpler and faster.
thanx,
uri
-- Uri Guttman - The Perl Hunter
The Best Perl Jobs, The Best Perl Hackers
http://PerlHunter.com
On Thu, May 15, 2014 at 4:31 PM, Uri Guttman <uri@stemsystems.com> wrote:
and there aren't many bug reports on those hundreds of lines. in fact most are utf related. the code is very solid and that isn't just my opinion.
It's just *completely* broken with encodings, which is a fairly critical bug. It just happens to get explicit crlf wrong. Its dealing with layers is dubious at best and flat out wrong more often than it should.
they don't ignore them. they pass on the encoding options to read_file and write_file.
Sorry, it seems I had misread the docs there.
Don't use sysread, at least not unless you're absolutely sure it gives the right answer. It's really that simple.
and you didn't answer the question about calling sysread in raw mode and encoding afterwards. try answering that and being useful here. it isn't my fault that sysread encoding is broken and i wrote file::slurp 11 years ago when sysread was fine and dandy.
Yes, you can, but why would you go through these complications? What makes you think you'd be faster/better/whatever than

    sub read_file ($filename, $encoding = 'utf8') {
        open my $fh, "<:encoding($encoding)", $filename
            or die "Can't open $filename: $!";
        return do { local $/; <$fh> };
    }

(the open mode needs double quotes so $encoding actually interpolates)

It's not that this can't be made to work; it's that it's 100+ lines of code doing tricky things for optimizations that could be achieved using significantly less code, and you're proposing to add more kludges to this.
Leon
On 05/15/2014 02:10 PM\, Leon Timmermans wrote:
On Thu\, May 15\, 2014 at 4:31 PM\, Uri Guttman \uri@​stemsystems\.com wrote:
and there aren't many bug reports on those hundreds of lines. in fact most are utf related. the code is very solid and that isn't just my opinion.
It's just *completely* broken with encodings\, which is a fairly critical bug. It just happens to get explicit crlf wrong. Its dealing with layers is dubious at best and flat out wrong more often than it should.
and why would you need to explicitly pass in cdlf when it can do line splitting already? and as i said\, layers aren't a critical part of the concept. if you want complex i/o do it yourself. slurping is mostly for common text files (think config\, source\, etc.) which don't need anything but a possible decode. this isn't line by line or some other layered thing.
they don't ignore them. they pass on the encoding options to read_file and
write_file.
Sorry\, it seems I had misread the docs there.
i am not that dumb as to not pass in those options. the idea of those is to use read_file/write_file as they are working. eating my own dog food.
Don't use sysread\, at least not unless you're absolutely sure it gives the
right answer. It's really that simple.
and you didn't answer the question about calling sysread in raw mode and encoding afterwards. try answering that and being useful here. it isn't my fault that sysread encoding is broken and i wrote file::slurp 11 years ago when sysread was fine and dandy.
Yes\, you can\, but why would you go through these complications? What makes you think you'd be faster/better/whatever than
    sub read_file($filename, $encoding = 'utf8') {
        open my $fh, "<:encoding($encoding)", $filename
            or die "Can't open $filename: $!";
        return do { local $/; <$fh> };
    }
so do that yourself. and what about all the users out there who don't need that and want a simple call? you can say the same thing for so many modules which may have a simple workaround in 6 lines of code but do a whole lot more. so stop your rant on this because slurp is used all over and it isn't going away. don't use it if you don't want to. helping to fix the decode problem is useful, as others here have done.
It's not that this can't be made to work\, it's that a 100+ lines of code doing tricky things for optimizations that can be achieved using significantly less code\, and you're proposing to add more kludges to this.
tricky things for optimizations? you still haven't really read the code. most of the 'tricky' code is handling options\, line splitting on winblows (that is a bad special case)\, and such. very little other than looping over a LARGE sysread call of 1mb is done for optimization. and run the benchmark already. see the difference in MANY styles of slurping in files. choose a file size\, etc. the do/local way is almost the slowest possible way and FUGLY as all hell. i would never allow someone to write that code when i review it. even a sub wrapper on that is better but then you might as well use file::slurp. and yes\, i review code for all my work\, both consulting and placement so i have some experience on what good code looks like.
uri
-- Uri Guttman - The Perl Hunter The Best Perl Jobs\, The Best Perl Hackers http://PerlHunter.com
On Thu\, May 15\, 2014 at 2:04 PM\, Uri Guttman \uri@​stemsystems\.com wrote:
also with slurping the only common layer i know is already done which is line splitting (works on all platforms). so the decoding is the only issue
:crlf is also very common.
On 05/15/2014 04:00 PM\, Eric Brine wrote:
On Thu\, May 15\, 2014 at 2:04 PM\, Uri Guttman \uri@​stemsystems\.com wrote:
also with slurping the only common layer i know is already done which is line splitting (works on all platforms). so the decoding is the only issue
:crlf is also very common.
from what i see in perldoc perlio it just converts crlf to \n on platforms that need it. read_file has already done that for a long long time. this is true for both files slurped into a scalar and into an array of lines. so that isn't anything i need to be concerned about with layers and decoding.
thanx\,
uri
-- Uri Guttman - The Perl Hunter The Best Perl Jobs\, The Best Perl Hackers http://PerlHunter.com
On Thu\, May 15\, 2014 at 10:00 PM\, Eric Brine \ikegami@​adaelis\.com wrote:
On Thu\, May 15\, 2014 at 2:04 PM\, Uri Guttman \uri@​stemsystems\.com wrote:
also with slurping the only common layer i know is already done which is line splitting (works on all platforms). so the decoding is the only issue
:crlf is also very common.
«binmode => ':crlf'» is completely broken in File::Slurp. On Unix it's a no-op; on Windows it will turn crlf translation *off*. Because:
${$buf_ref} =~ s/\015\012/\n/g if $is_win32 && !$opts->{'binmode'} ;
Emulating PerlIO layers really is a deep rabbit hole.
Leon
On 05/15/2014 04:25 PM\, Leon Timmermans wrote:
On Thu\, May 15\, 2014 at 10:00 PM\, Eric Brine \ikegami@​adaelis\.com wrote:
On Thu\, May 15\, 2014 at 2:04 PM\, Uri Guttman \uri@​stemsystems\.com wrote:
also with slurping the only common layer i know is already done which is line splitting (works on all platforms). so the decoding is the only issue
:crlf is also very common.
«binmode => ':crlf'» is completely broken in File::Slurp. On Unix it's a no-op; on Windows it will turn crlf translation *off*. Because:
${$buf_ref} =~ s/\015\012/\n/g if $is_win32 && !$opts->{'binmode'} ;
Emulating PerlIO layers really is a deep rabbit hole.
so you don't pass in :crlf to slurp. is that so hard to understand?? it isn't NEEDED. and it is emulated just fine and it works. no reports on that feature in a long time. try to find something constructive to say or just keep quiet on this. i get it. you don't like the module. good for you. doesn't mean you have to keep ripping it with unhelpful comments and such. crlf is handled - period. no need to deal with the :crlf layer. next?
uri
-- Uri Guttman - The Perl Hunter The Best Perl Jobs\, The Best Perl Hackers http://PerlHunter.com
Just to be difficult\, has anyone benchmarked
sub read_file($) { `/bin/cat \Q$_[0]\E` } # or "type" on windows
as an alternative\, recently?
On Thu\, May 15\, 2014 at 4:43 PM\, Uri Guttman \uri@​stemsystems\.com wrote:
crlf is handled - period.
Yes\, but you were talking of breaking it by providing an implementation that only supported :encoding. I simply reminded you that you also needed to support :crlf too.
i get it. you don't like the module. good for you. doesn't mean you have to
keep ripping it with unhelpful comments and such.
Yeah\, we hate it so much we just had to try to help you fix it???
On Thu\, May 15\, 2014 at 4:50 PM\, David Nicol \davidnicol@​gmail\.com wrote:
Just to be difficult\, has anyone benchmarked
sub read_file($) { `/bin/cat \Q$_[0]\E` } # or "type" on windows
as an alternative\, recently?
Backticks read the same way readline does (8K at a time)\, so it's slower than using read to read from a file. (There's quite a number of other issues too\, in general and as a File::Slurp implementation.)
use Benchmark qw( cmpthese );
    sub _read {
        my $buf;
        open(my $fh, '<', $ARGV[0]) or die $!;
        my $sz = -s $fh;
        read($fh, $buf, $sz);
    }
    sub _sysread {
        my $buf;
        open(my $fh, '<', $ARGV[0]) or die $!;
        my $sz = -s $fh;
        sysread($fh, $buf, $sz);
    }
    cmpthese(-5, {
        qx      => sub { my $buf = `/bin/cat \Q$ARGV[0]\E`; 1 },
        read    => \&_read,
        sysread => \&_read,
    });
               Rate      qx sysread    read
    qx       59.7/s      --    -87%    -87%
    sysread   451/s    655%      --     -1%
    read      455/s    662%      1%      --
1.5MB file
On 05/16/2014 12:33 AM\, Eric Brine wrote:
On Thu\, May 15\, 2014 at 4:43 PM\, Uri Guttman \uri@​stemsystems\.com wrote:
crlf is handled - period.
Yes\, but you were talking of breaking it by providing an implementation that only supported :encoding. I simply reminded you that you also needed to support :crlf too.
how does it need to be supported more than it is now? it can be ignored if passed in as crlf conversion always happens now on winblows.
i get it. you don't like the module. good for you. doesn't mean you have to
keep ripping it with unhelpful comments and such.
Yeah\, we hate it so much we just had to try to help you fix it???
i was referring to leon\, not you.
thanx\,
uri
-- Uri Guttman - The Perl Hunter The Best Perl Jobs\, The Best Perl Hackers http://PerlHunter.com
On 05/16/2014 12:46 AM\, Eric Brine wrote:
On Thu\, May 15\, 2014 at 4:50 PM\, David Nicol \davidnicol@​gmail\.com wrote:
Just to be difficult\, has anyone benchmarked
    sub read_file($) { `/bin/cat \Q$_[0]\E` } # or "type" on windows
as an alternative\, recently?
Backticks read the same way readline does (8K at a time)\, so it's slower than using read to read from a file. (There's quite a number of other issues too\, in general and as a File::Slurp implementation.)

use Benchmark qw( cmpthese );
    sub _read {
        my $buf;
        open(my $fh, '<', $ARGV[0]) or die $!;
        my $sz = -s $fh;
        read($fh, $buf, $sz);
    }
    sub _sysread {
        my $buf;
        open(my $fh, '<', $ARGV[0]) or die $!;
        my $sz = -s $fh;
        sysread($fh, $buf, $sz);
    }
    cmpthese(-5, {
        qx      => sub { my $buf = `/bin/cat \Q$ARGV[0]\E`; 1 },
        read    => \&_read,
        sysread => \&_read,
    });
is that a mistake with both read and sysread calling _read?
               Rate      qx sysread    read
    qx       59.7/s      --    -87%    -87%
    sysread   451/s    655%      --     -1%
    read      455/s    662%      1%      --
those last two are basically the same result as they call the same code. sysread should be much faster than read. file::slurp has a good benchmark script in extras which can easily be added to. it has many variations of slurping already in there and you can set the file size and other things.
thanx\,
uri
-- Uri Guttman - The Perl Hunter The Best Perl Jobs\, The Best Perl Hackers http://PerlHunter.com
Leon Timmermans writes:
There is no reason «my $foo = do { local $/; <$fh> };» would need to be any slower than what File::Slurp is doing right now. In fact I'm planning to fix just that in 5.21.
That's great to hear\, Leon. As well as speeding up people using that idiom directly\, it'll also help users of Path::Class and Path::Tiny\, both of which have ->slurp methods which are implemented using it.
Uri Guttman writes:
On 05/15/2014 02:10 PM\, Leon Timmermans wrote:
    sub read_file($filename, $encoding = 'utf8') {
        open my $fh, "<:encoding($encoding)", $filename
            or die "Can't open $filename: $!";
        return do { local $/; <$fh> };
    }
so do that your self. and what about all the users out there who don't need that and want a simple call?
Indeed\, File::Slurp provides a convenient interface for many users\, who will wish to use it for that reason regardless of any efficiency concerns.
I interpreted Leon's suggestion above not as something users should do _instead_ of using read_file\, but as a potential _implementation_ of read_file in File::Slurp (or at least the nub of such).
Once using <$fh> has been sped up sufficiently\, one possibility would be for File::Slurp to switch to using that\, thereby simplifying its implementation and avoiding any of the issues with using sysread.
Best wishes
Smylers -- Why drug companies shouldn't be allowed to hide research results that they don't like: http://gu.com/p/3zat8 — UK gov wasted millions on Tamiflu Sign the AllTrials petition: http://www.alltrials.net/
I just wrote:
As well as speeding up people using that idiom directly\, ...
Erm\, I meant speeding up their code\, not actually speeding up the people.
Smylers -- Why drug companies shouldn't be allowed to hide research results that they don't like: http://gu.com/p/3zat8 — UK gov wasted millions on Tamiflu Sign the AllTrials petition: http://www.alltrials.net/
On Thu\, May 15\, 2014 at 10:43 PM\, Uri Guttman \uri@​stemsystems\.com wrote:
so you don't pass in :crlf to slurp. is that so hard to understand?? it isn't NEEDED. and it is emulated just fine and it works. no reports on that feature in a long time.
You need it to (portably) open text files created on Windows. That is not an obscure use-case.
try to find something constructive to say or just keep quiet on this. i get it. you don't like the module. good for you. doesn't mean you have to keep ripping it with unhelpful comments and such. crlf is handled - period. no need to deal with the :crlf layer. next?
How is pointing out bugs "unhelpful comments"? How is offering solutions (like not using sysread when that's not appropriate) not "constructive"?
Leon
On Fri\, May 16\, 2014 at 1:38 AM\, Uri Guttman \uri@​stemsystems\.com wrote:
is that a mistake with both read and sysread calling _read?
Of course. Was super duper tired. Fixed:
               Rate      qx    read sysread
    qx       62.3/s      --    -86%    -91%
    read      430/s    591%      --    -40%
    sysread   718/s   1052%     67%      --
On 05/16/2014 07:38 AM\, Leon Timmermans wrote:
On Thu\, May 15\, 2014 at 10:43 PM\, Uri Guttman \uri@​stemsystems\.com wrote:
so you don't pass in :crlf to slurp. is that so hard to understand?? it isn't NEEDED. and it is emulated just fine and it works. no reports on that feature in a long time.
You need it to (portably) open text files created on Windows. That is not an obscure use-case.
so it is a trivial fix to check for :crlf\, do the same line conversion as now and not set the actual layer. this isn't rocket surgery.
uri
-- Uri Guttman - The Perl Hunter The Best Perl Jobs\, The Best Perl Hackers http://PerlHunter.com
On 05/16/2014 08:12 AM\, Eric Brine wrote:
On Fri\, May 16\, 2014 at 1:38 AM\, Uri Guttman \uri@​stemsystems\.com wrote:
is that a mistake with both read and sysread calling _read?
Of course. Was super duper tired. Fixed:
               Rate      qx    read sysread
    qx       62.3/s      --    -86%    -91%
    read      430/s    591%      --    -40%
    sysread   718/s   1052%     67%      --
more like what i have seen. that is a lot of speed to make up. i don't see how do/local or read can get to sysread speed given the i/o layers overhead. but as i said\, even with a speedup\, the slurp api is cleaner. and replacing the implementation isn't an issue as i said most of the code is handling options\, errors and such. little code other than sysread and large read sizes are optimizations. i even found that just the overhead of option and error checking and handling is noticeable in benchmarks. a plain sysread in the benchmark script beats out read_file for just that reason. i keep telling you all to look at and run the benchmark.
also the most common homebrew code i have seen is more like this:
    $text = join('', <$fh>);
the use of $/ is not well known so advocating it is not a good solution IMNSHO.
thanx\,
uri
-- Uri Guttman - The Perl Hunter The Best Perl Jobs\, The Best Perl Hackers http://PerlHunter.com
On Fri\, May 16\, 2014 at 4:29 PM\, Uri Guttman \uri@​stemsystems\.com wrote:
more like what i have seen. that is a lot of speed to make up. i don't see how do/local or read can get to sysread speed given the i/o layers overhead.
It's not so much layer overhead per se\, but buffering overhead in a specific case that doesn't need buffering. Turning it off will remarkably increase the performance. Something along the lines of:
    sub read_file ($filename, $layer = ':unix') {
        open my $fh, "<:$layer", $filename
            or die "Couldn't open $filename: $!";
        read $fh, my $buffer, -s $fh
            or die "Couldn't read $filename: $!";
        return $buffer;
    }
is just as fast as the code in File::Slurp (the read itself is the bottleneck\, there isn't much left to optimize).
As soon as you do anything that uses a buffer\, these performance benefits go away. This includes crlf translation\, line-based IO and decoding.
Leon
On Wed May 14 02:11:21 2014\, mcgrath.martin@gmail.com wrote:
Apologies if this has already been discussed. I don't subscribe to p5p but have searched the list before posting\, and didn't find anything relating to this issue.
This patch replaces the example/recommendation of using File::Slurp with Path::Tiny in perlfaq5.
The reasoning behind this is an outstanding critical bug with File::Slurp:
Discussion in this RT appears to have shifted to how best to fix File::Slurp. File::Slurp is not a core module\, so discussion of how to improve it would normally have been conducted in its own bug queue (presumably\, https://rt.cpan.org//Dist/Display.html?Queue=File-Slurp). I realize that shutting off discussion in p5p at this point would be Canutian\, but can I at least pose these questions:
Do we want to modify perlfaq5 in the manner suggested by the OP two days ago?
Do we want to modify perlfaq5 at all?
If we do\, does this go into 5.20?
My two cents:
File::Slurp is a well-known library. I can recall lightning talks Uri gave about it at YAPCs a decade ago. Whatever problems it has can be hashed out\, preferably in its own bug queue\, and don't warrant its deletion from perlfaq5. "Mend it\, don't end it." Perhaps some discussion of Path::Tiny's slurping capabilities ought to be added to perlfaq5. And\, no\, this does not go into 5.20.
Thank you very much. Jim Keenan
* James E Keenan via RT \perlbug\-followup@​perl\.org [2014-05-16T17:56:32]
If we do\, does this go into 5.20?
No.
-- rjbs
On Fri\, May 16\, 2014 at 10:29 AM\, Uri Guttman \uri@​stemsystems\.com wrote:
more like what i have seen. that is a lot of speed to make up. i don't see how do/local or read can get to sysread speed given the i/o layers overhead.
FWIW\, if you open the file with :unix\, then you don't get :perlio and then you don't have the buffering. At that point\, read() on the file size (for ordinary files)\, does sysread looping in C -- i.e. adding up bytes read in chunks in C and not doing byte arithmetic and logic using perl ops.
In my benchmarking\, that was as fast or faster in some cases. I've tried to quickly reconstruct that here:
Benchmark: https://gist.github.com/dagolden/c4dad291cd0b7156f268
Source: https://gist.github.com/dagolden/973bcfa11816e2d96c65
(Small was 50k\, medium was 500k\, large was 5000k)
I'm not claiming it's an entirely fair benchmark\, but it does show a way to get even faster performance in some circumstances.
Path::Tiny has a bunch of method call overhead and abstraction (plus locking) that the others don't have\, which is why it's a bit slower.
David
On 05/16/2014 10:53 PM\, David Golden wrote:
On Fri\, May 16\, 2014 at 10:29 AM\, Uri Guttman \uri@​stemsystems\.com wrote:
more like what i have seen. that is a lot of speed to make up. i don't see how do/local or read can get to sysread speed given the i/o layers overhead.
FWIW\, if you open the file with :unix\, then you don't get :perlio and then you don't have the buffering. At that point\, read() on the file size (for ordinary files)\, does sysread looping in C -- i.e. adding up bytes read in chunks in C and not doing byte arithmetic and logic using perl ops.
and why would that be better than calling sysread directly? given the large read size of 1mb (most text files are way below that) you only get one loop and little overhead. i benchmarked a variation which does only one read on smaller files but the overhead of checking that was seen (compared to a bare sysread with no extra code).
In my benchmarking\, that was as fast or faster in some cases. I've tried to quickly reconstruct that here:
Benchmark: https://gist.github.com/dagolden/c4dad291cd0b7156f268
Source: https://gist.github.com/dagolden/973bcfa11816e2d96c65
(Small was 50k\, medium was 500k\, large was 5000k)
I'm not claiming it's an entirely fair benchmark\, but it does show a way to get even faster performance in some circumstances.
Path::Tiny has a bunch of method call overhead and abstraction (plus locking) that the others don't have\, which is why it's a bit slower.
you can easily add an entry into the benchmark that comes with file::slurp. i browsed the code of path::tiny today (not a tiny module! :). if you want to talk about some ideas back and forth\, i am open to it.
thanx\,
uri
-- Uri Guttman - The Perl Hunter The Best Perl Jobs\, The Best Perl Hackers http://PerlHunter.com
Migrated from rt.perl.org#121870 (status was 'resolved')
Searchable as RT121870$