Closed p5pRT closed 17 years ago
Is there any possibility of having chomp() be modified to recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty confusion for people working in a cross-platform environment is when the same exact script gives very different results on the same exact file depending on whether you are running under *nix or Windows. (Particularly an issue with Samba because people wind up reading under one system files created under the other.)
Yes, the current behaviour works as documented. But it leads to code not doing what people expect, and a confused person can easily spend several hours confused...
Cheers, Ben
In message <OF47CC4F38.CF9B333E-ON8525682E.0054550B@trepp.com>, Ben_Tilly@trepp.com writes:
: Is there any possibility of having Perl's chomp() command be modified to
: recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty
: confusion for people working in a cross-platform environment is when
: identical Perl scripts give very different results on the same exact file
: depending on whether you are running under *nix or Windows. (Particularly
: an issue with Samba because people wind up reading under one system files
: created under the other.)
Have a look at <URL:http://language.perl.com/ppt/src/nlcvt/nlcvt> for an example of how to do what you want (or at least something similar).
Greg
When I said, "I talked to" I meant that. I don't need the pointer - I know how to handle it. But I wind up answering questions from people who do not. In my experience most of them use chomp() so the confusion is preventable. (After all you expect chomp() to get rid of line endings, right?)
Cheers, Ben
I expect chomp() to remove one and only one terminating instance of the precise string to which $/ has been set; no more, no less. What were you expecting?
--tom
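A minimal sketch of the behaviour Tom describes, assuming a line that was read from a DOS-format file without any newline translation. The variable names are illustrative only and do not come from the thread.

```perl
use strict;
use warnings;

# Assumed input: DOS-format lines read without newline translation.
my $with_lf_sep   = "hello\r\n";
my $with_crlf_sep = "hello\r\n";
my $doubled       = "hello\n\n";

{
    local $/ = "\n";
    chomp $with_lf_sep;     # removes "\n" only, the "\r" stays behind
    chomp $doubled;         # removes exactly one "\n", not both
}
{
    local $/ = "\r\n";
    chomp $with_crlf_sep;   # removes the whole "\r\n" pair
}

printf "%d %d %d\n",
    length($with_lf_sep), length($with_crlf_sep), length($doubled);
```

The stray "\r" left in the first case is the source of the confusion reported at the top of this ticket.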
At 10.22 -0500 1999.11.19, Ben_Tilly@trepp.com wrote:
> I have just talked to one too many people who have been bitten by this...
> Is there any possibility of having Perl's chomp() command be modified to recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty confusion for people working in a cross-platform environment is when identical Perl scripts give very different results on the same exact file depending on whether you are running under *nix or Windows. (Particularly an issue with Samba because people wind up reading under one system files created under the other.)
> Yes, the current behaviour works as documented. But it leads to code not doing what people expect, and in many cases a confused person will spend several hours confused...
Well, these same people will also have a problem with readline and <>. And I certainly don't want behavior where readline and chomp treat different things as record separators.
I would like to see, perhaps, a regex IRS, so you could do:
$/ = qr/(?:\015\012?|\012)/;
or whatever. Of course, that is flawed, in that it won't catch the special (usually broken) case of a file having CR, LF, or CRLF mixed in the same file. Oh well.
Another solution would involve per-filehandle IRS, where you could call a function (say, textmode()) that would inspect the filehandle and set the IRS appropriately for that filehandle. This is more subject to failure for sockets, though, because it would involve reading, looking at the data, and then seeking back to the beginning.
I have a prototype of something that tied filehandles to do this, but it fails with sockets, and doesn't do anything for chomp() anyway (and didn't work great anyway because of some flaws in tied filehandles and prototypes ... I was using 5.004, I don't know if the flaws have been fixed or whatnot).
--
Chris Nandor mailto:pudge@pobox.com http://pudge.net/
%PGPKey = ('B76E72AD', [1024, '0824090B CE73CA10 1FF77F13 8180B6B6'])
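Note that $/ holds a plain string, not a pattern, so the qr// assignment above is a proposal rather than working Perl. A rough user-level approximation, assuming the whole input fits in memory, is to slurp the data and split it on a pattern. `split_records` is a hypothetical helper name introduced here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: break a slurped buffer into records on CRLF, CR, or LF.
sub split_records {
    my ($buffer) = @_;
    # CRLF is listed first so the pair counts as one separator, not two.
    return split /\015\012|\015|\012/, $buffer;
}

# Sample data mixing all three conventions.
my @records = split_records("one\015\012two\012three\015four\012");
print scalar(@records), " records\n";
```

Unlike a single fixed separator, this approach tolerates mixed endings in one file, at the cost of reading everything up front, which is exactly what does not work for sockets.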
Ben_Tilly@trepp.com wrote
> Is there any possibility of having Perl's chomp() command be modified to recognize \n, \r, and \r\n as line-endings to chomp?
I hope not. chomp() should match the string in $?, no more or less.
Your problem is not with chomp(). Rather it is with the I/O subsystem. If you are reading a file as a newline-terminated text file, then what your Perl code should see is "\n" and nothing else.
You can achieve this with tied filehandles, but I understand that isn't what you're looking for.
I think the "right" way of doing this is by providing some sort of filter apparatus on files. Things of this sort were discussed in the context of unicode. I don't recall where (if anywhere) that ended.
Mike Guy
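Perl releases that postdate this thread (5.8 and later) gained PerlIO layers, which are one form of the filter apparatus discussed here. A minimal sketch using the `:crlf` layer follows, with a temporary file standing in for a DOS-format input. The file contents are made up for the example, and `:crlf` does not translate a lone \r.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small DOS-format file in binary mode so the bytes are exact.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "alpha\r\nbeta\r\n";
close $out or die "close failed: $!";

# Reading through the :crlf layer presents each CRLF pair as a single "\n".
open my $in, '<:crlf', $path or die "open failed: $!";
my @lines = <$in>;
close $in;

# The default $/ is "\n", so a plain chomp leaves nothing behind.
chomp @lines;
print join(',', @lines), "\n";
```

With the translation done at the I/O layer, chomp() keeps its documented meaning and the calling code needs no changes.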
> I think the "right" way of doing this is by providing some sort of filter apparatus on files.
Well, that makes two of us.
> Things of this sort were discussed in the context of unicode. I don't recall where (if anywhere) that ended.
Check mjd's summaries?
--tom
I wrote
> I hope not. chomp() should match the string in $?, no more or less.
$/
Damn shift key.
Mike Guy
Tom Christiansen writes:
: >When I said, "I talked to" I meant that. I don't need the pointer - I know
: >how to handle it. But I wind up answering questions from people who do
: >not. In my experience most of them use chomp() so the confusion is
: >preventable. (After all you expect chomp() to get rid of line endings,
: >right?)
:
: I expect chomp() to remove one and only one terminating instance of the
: precise string to which $/ has been set; no more, no less. What were
: you expecting?
I expect people to expect Perl to do the right thing.
Larry
M.J.T. Guy writes:
: I think the "right" way of doing this is by providing some sort of
: filter apparatus on files. Things of this sort were discussed in
: the context of unicode. I don't recall where (if anywhere) that
: ended.
Ended? It hasn't started yet...
(Can you tell I've spent too much time rewriting the Camel book today? :-)
Yes, input filters should handle this. And a good case can be made that the *default* input filter should handle it, along with UTF-8 recognition. It also has to be blazing fast, of course, along with reading your mind. But that's Perl for the coarse. Or something like that.
Larry
>: I expect chomp() to remove one and only one terminating instance of the
>: precise string to which $/ has been set; no more, no less. What were
>: you expecting?
>I expect people to expect Perl to do the right thing.
And that would be what, sniff around the stdio buffer the first time you play with it and figure out what it smells like?
--tom
Tom Christiansen writes:
: >: I expect chomp() to remove one and only one terminating instance of the
: >: precise string to which $/ has been set; no more, no less. What were
: >: you expecting?
:
: >I expect people to expect Perl to do the right thing.
:
: And that would be what, sniff around the stdio buffer the first time you
: play with it and figure out what it smells like?
Why do you say "you"? Did I say I expect Perl to do the right thing? :-)
Seriously, we are entering an era when dwimmerly action on input will be a necessary evil. I could wish it were otherwise, but my supply of divine fiats is low. And I don't think anyone else has enough fiats to pull it off either. For the near future I only see a chaotic dance around the UTF-8 strange attractor, in part because a lot of butterflies are flapping their wings near the UTF-16 attractor instead. We're going to live in interesting times, whether or not that's an ancient Chinese curse.
Larry
Tom Christiansen <tchrist@jhereg.perl.com> writes:
>>: I expect chomp() to remove one and only one terminating instance of the
>>: precise string to which $/ has been set; no more, no less. What were
>>: you expecting?
>
>>I expect people to expect Perl to do the right thing.
>
>And that would be what, sniff around the stdio buffer the first time you
>play with it and figure out what it smells like?
That is far from daft. sv_gets() (the internals of readline) would know what it had used to find the end of the line. It could leave the information around for chomp to use.
But the "right thing" is just to return \n as a logical newline however it was represented in the buffer (unless in binmode of course). Then chomp'ing \n is fine.
>--tom
-- Nick Ing-Simmons
Nick Ing-Simmons <nick@ing-simmons.net> writes:
| Tom Christiansen <tchrist@jhereg.perl.com> writes:
| >>: I expect chomp() to remove one and only one terminating instance of the
| >>: precise string to which $/ has been set; no more, no less. What were
| >>: you expecting?
| >
| >>I expect people to expect Perl to do the right thing.
| >
| >And that would be what, sniff around the stdio buffer the first time you
| >play with it and figure out what it smells like?
|
| That is far from daft. sv_gets() (the internals of readline) would know
| what it had used to find the end of the line. It could leave the
| information around for chomp to use.
But as soon as you have a program that opens multiple files of differing formats, this breaks down. You end up with taint-like tracing of strings to track which form of file each string came from.
Which means that...
| But the "right thing" is
| just to return \n as a logical newline however it was represented in the
| buffer (unless in binmode of course). Then chomp'ing \n is fine.
is really a much better choice. That just leaves the issue of determining the right filtering to do for an output file so that it matches the input file it is derived from or the target it is being written to or whatever is the most significant issue - which the programmer will have to deal with.
-- John Macdonald jmm@jmm.pickering.elegant.com
>> Things of this sort were discussed in the context of unicode. I don't recall where (if anywhere) that ended.
> Check mjd's summaries?
http://www.perl.com/pub/1999/11/p5pdigest/THISWEEK-19991114.html#More_About_Line_Disciplines
http://www.perl.com/pub/1999/11/p5pdigest/THISWEEK-19991107.html#Record_Separators_that_Contain_NUL
My summary of the summaries:
1. Larry said it would be important to have `line disciplines' settable on filehandles, and that it would be important for the default ones to be fast.
2. Sam Tregar said he would do it, but I don't know if he will.
3. This is the third week in a row that it has cropped up.
John Macdonald <jmm@elegant.com> writes:
>| That is far from daft. sv_gets() (the internals of readline) would know
>| what it had used to find the end of the line. It could leave the
>| information around for chomp to use.
>
>But as soon as you have a program that opens multiple files of
>differing formats, this breaks down. You end up with taint-like
>tracing of strings to track which form of file each string came
>from.
Yes, the EOLN string would have to be annotated on the SV somewhere presumably as "magic". Having chomp look for "EOLN magic" on the SV would be easy to do. The 'set' part of the magic would clear the field.
>Which means that...
>
>| But the "right thing" is
>| just to return \n as a logical newline however it was represented in the
>| buffer (unless in binmode of course). Then chomp'ing \n is fine.
>
>is really a much better choice.
I know ;-) I am delinquent in implementing it.
>That just leaves the issue of
>determining the right filtering to do for an output file so that
>it matches the input file it is derived from or the target it is
>being written to or whatever is the most significant issue -
>which the programmer will have to deal with.
-- Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.
Nick Ing-Simmons wrote :
| John Macdonald <jmm@elegant.com> writes:
| >
| > | That is far from daft. sv_gets() (the internals of readline) would know
| > | what it had used to find the end of the line. It could leave the
| > | information around for chomp to use.
| >
| >But as soon as you have a program that opens multiple files of
| >differing formats, this breaks down. You end up with taint-like
| >tracing of strings to track which form of file each string came
| >from.
|
| Yes, the EOLN string would have to be annotated on the SV somewhere
| presumably as "magic". Having chomp look for "EOLN magic" on the SV
| would be easy to do. The 'set' part of the magic would clear the field.
It gets messier...
$para = "$file1_lines$file2_lines$file3_lines";
Which of the three EOLN magics gets assigned to $para?
| >Which means that...
| >
| > | But the "right thing" is
| > | just to return \n as a logical newline however it was represented in the
| > | buffer (unless in binmode of course). Then chomp'ing \n is fine.
| >
| >is really a much better choice.
|
| I know ;-) I am delinquent in implementing it.
|
| >That just leaves the issue of
| >determining the right filtering to do for an output file so that
| >it matches the input file it is derived from or the target it is
| >being written to or whatever is the most significant issue -
| >which the programmer will have to deal with.
--
objects:                                | John Macdonald
Think of them as data with an attitude. | jmm@elegant.com
Tom Christiansen writes:
: >When I said, "I talked to" I meant that. I don't need the pointer - I know
: >how to handle it. But I wind up answering questions from people who do
: >not. In my experience most of them use chomp() so the confusion is
: >preventable. (After all you expect chomp() to get rid of line endings,
: >right?)
:
: I expect chomp() to remove one and only one terminating instance of the
: precise string to which $/ has been set; no more, no less. What were
: you expecting?
I consider $/ the mechanism through which you can change "line ending" to "paragraph ending" etc...
> I expect people to expect Perl to do the right thing.
Ah yes, now I remember why I love this language. :-)
The suggestion that I heard which I most like is letting $/ be an RE. So you can make $/ be /\n|\r\n?/ and it magically "does the right thing" on virtually any file. However integrating this logic into the RE engine could be interesting. After all if you apply the pattern I gave to a string that ends with \r, it will match even if the next character to be read is \n. This is manifestly not the right thing to do. :-(
Ben
PS Random note. A random idea a co-worker and I have been throwing around (based on my massively speeding up a program by doing this) is "lazy concatenation" of strings. If someone is building up a string through interpolation and concatenation, it makes sense to internally use something closer to an array of strings, and then join that into one string if you ever need to. (print() has no need to join them, the RE engine does.) This should transparently accelerate a lot of current Perl code...
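The idea in the PS is an interpreter-level change, but the same pattern can be approximated by hand in user code. A minimal sketch follows, collecting fragments in an array and joining only when a single string is needed. The data is invented for the example, and whether this is faster depends on the workload, which is not measured here.

```perl
use strict;
use warnings;

# Collect fragments instead of repeatedly appending to one scalar.
my @pieces;
push @pieces, "row $_\n" for 1 .. 3;

# print() accepts a list, so no join is needed for output.
print @pieces;

# Join only when one string is really required, e.g. for a regex match.
my $text  = join '', @pieces;
my $count = () = $text =~ /^row/mg;
```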
> The suggestion that I heard which I most like is letting $/ be an RE. So you can make $/ be /\n|\r\n?/ and it magically "does the right thing" on virtually any file. However integrating this logic into the RE engine could be interesting. After all if you apply the pattern I gave to a string that ends with \r, it will match even if the next character to be read is \n. This is manifestly not the right thing to do. :-(
Anything that deviates from the notion of internally representing the line terminator as a single character (the virtual "\n") is a grave error.
--tom
At 07.54 -0700 1999.11.22, Tom Christiansen wrote:
>> The suggestion that I heard which I most like is letting $/ be an RE. So you can make $/ be /\n|\r\n?/ and it magically "does the right thing" on virtually any file. However integrating this logic into the RE engine could be interesting. After all if you apply the pattern I gave to a string that ends with \r, it will match even if the next character to be read is \n. This is manifestly not the right thing to do. :-(
> Anything that deviates from the notion of internally representing the line terminator as a single character (the virtual "\n") is a grave error.
I had another idea ... which may be completely useless, but it is interesting to think about. If $/ is a regex, then it is only a regex until it matches the first time, at which point $/ becomes equal to $1. So:
$/ = qr/(\015?\012|\015)/;
As soon as it sees \015\012, \012, or \015, $/ becomes whatever it matched. Again, this would require per-filehandle IRS to be useful. Not advocating, just throwing it out for fun. :D
--
Chris Nandor mailto:pudge@pobox.com http://pudge.net/
%PGPKey = ('B76E72AD', [1024, '0824090B CE73CA10 1FF77F13 8180B6B6'])
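Since $/ cannot actually hold a regex, the "match once, then stick" idea can only be imitated in user code. A minimal sketch, assuming a sample of the data is available before reading starts (true for files, not for sockets, as noted earlier in the thread). `detect_eol` is a hypothetical helper, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line terminator found in a sample,
# falling back to "\012" when the sample contains none.
sub detect_eol {
    my ($sample) = @_;
    return $sample =~ /(\015?\012|\015)/ ? $1 : "\012";
}

# Stand-in for a DOS-format file, read through an in-memory filehandle.
my $data = "first\015\012second\015\012";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
binmode $fh;

my @lines;
{
    # The detected terminator is fixed for this handle's reads and chomps.
    local $/ = detect_eol($data);
    @lines = <$fh>;
    chomp @lines;
}
close $fh;
```

Because readline and chomp both consult the same localized $/, they stay in agreement, which addresses the earlier objection about the two treating different things as record separators.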
John Macdonald <jmm@elegant.com> writes:
>Nick Ing-Simmons wrote :
>| John Macdonald <jmm@elegant.com> writes:
>| >
>| > | That is far from daft. sv_gets() (the internals of readline) would know
>| > | what it had used to find the end of the line. It could leave the
>| > | information around for chomp to use.
>| >
>| >But as soon as you have a program that opens multiple files of
>| >differing formats, this breaks down. You end up with taint-like
>| >tracing of strings to track which form of file each string came
>| >from.
>|
>| Yes, the EOLN string would have to be annotated on the SV somewhere
>| presumably as "magic". Having chomp look for "EOLN magic" on the SV
>| would be easy to do. The 'set' part of the magic would clear the field.
>
>It gets messier...
>
>    $para = "$file1_lines$file2_lines$file3_lines";
>
>Which of the three EOLN magics gets assigned to $para?
In my purely fictional implementation none would.
-- Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.
Nick Ing-Simmons wrote :
| John Macdonald <jmm@elegant.com> writes:
| >Nick Ing-Simmons wrote :
| > | John Macdonald <jmm@elegant.com> writes:
| > | >
| > | > | That is far from daft. sv_gets() (the internals of readline) would know
| > | > | what it had used to find the end of the line. It could leave the
| > | > | information around for chomp to use.
| > | >
| > | >But as soon as you have a program that opens multiple files of
| > | >differing formats, this breaks down. You end up with taint-like
| > | >tracing of strings to track which form of file each string came
| > | >from.
| > |
| > | Yes, the EOLN string would have to be annotated on the SV somewhere
| > | presumably as "magic". Having chomp look for "EOLN magic" on the SV
| > | would be easy to do. The 'set' part of the magic would clear the field.
| >
| >It gets messier...
| >
| >    $para = "$file1_lines$file2_lines$file3_lines";
| >
| >Which of the three EOLN magics gets assigned to $para?
|
| In my purely fictional implementation none would.
so then:
chomp $file3_lines;
chomp $para;
could remove different values from two strings with the same termination value originating from the same source file line, which would violate the principle of least astonishment for some users.
Good thing this whole issue is fictional. :-)
--
objects:                                | John Macdonald
Think of them as data with an attitude. | jmm@elegant.com
On Mon, 22 Nov 1999, Mark-Jason Dominus wrote:
> 2. Sam Tregar said he would do it, but I don't know if he will.
Unfortunately Sam Tregar is just a novice perl hacker! I've been poking around a bit but I'm not convinced I've even found the right place to start working yet.
If this is a high-priority item, perhaps someone more experienced should consider giving it a try.
-sam
[Ben_Tilly@trepp.com - Thu Nov 18 23:18:09 1999]:
> Is there any possibility of having chomp() be modified to recognize \n, \r, and \r\n as line-endings to chomp?
Do you mean that chomp(), rather than being equivalent to:
s{$/\z}{};
should be:
s{(\r|\n|\r\n)\z}{};
?
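A sketch of the second form as a standalone helper, with \r\n tried first so the pair is removed as a unit. `chomp_any` is a hypothetical name, not an existing Perl function. Separately, a faithful emulation of the first form would need \Q$/\E, since $/ may contain regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: strip exactly one trailing CRLF, CR, or LF.
sub chomp_any {
    my ($text) = @_;
    # \z anchors at the true end of the string, so only the last
    # terminator is removed, never an earlier one.
    $text =~ s/(?:\015\012|\015|\012)\z//;
    return $text;
}

print length(chomp_any("dos\015\012")), "\n";
```

Like the built-in, this removes at most one terminator, so "text\n\n" keeps its first newline.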
The RT System itself - Status changed from 'stalled' to 'open'
[Ben_Tilly@trepp.com - Thu Nov 18 23:18:09 1999]:
> Is there any possibility of having chomp() be modified to recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty confusion for people working in a cross-platform environment is when the same exact script gives very different results on the same exact file depending on whether you are running under *nix or Windows. (Particularly an issue with Samba because people wind up reading under one system files created under the other.)
The record separator, which defaults to '\n' on Unix, will change depending on the operating system Perl is running on - chomp() relies heavily upon the value of $/. Chomping native files shouldn't cause much noise, whereas chomping 'foreign' files with differing line endings would require that you localize $/ in the scope of operation.
Example:
{
    local $/ = "\r\n";
    $chomped = chomp(@lines);
}
> Yes, the current behaviour works as documented. But it leads to code not doing what people expect, and a confused person can easily spend several hours confused...
I'd say it's rather clearly documented, with no lack of accurate description. Although the behaviour requested is desirable, it doesn't seem possible to integrate the inevitable changes to doop.c:Perl_do_chomp, where the record separator, known as global PL_rs, is extensively utilized and relied upon - allowing for multiple values would require tremendous changes and, furthermore, would likely break backwards compatibility.
Rejected, mostly for backwards compatibility reasons.
@rgs - Status changed from 'open' to 'rejected'
Migrated from rt.perl.org#1807 (status was 'rejected')
Searchable as RT1807$