Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

chomp() can be confusing #876

Closed p5pRT closed 17 years ago

p5pRT commented 24 years ago

Migrated from rt.perl.org#1807 (status was 'rejected')

Searchable as RT1807$

p5pRT commented 24 years ago

From Ben_Tilly@trepp.com

Is there any possibility of having chomp() be modified to recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty confusion for people working in a cross-platform environment is when the same exact script gives very different results on the same exact file depending on whether you are running under *nix or Windows. (Particularly an issue with Samba because people wind up reading under one system files created under the other.)

Yes, the current behaviour works as documented. But it leads to code not doing what people expect, and a confused person can easily spend several hours confused...

Cheers, Ben

p5pRT commented 24 years ago

From Ben_Tilly@trepp.com

Is there any possibility of having Perl's chomp() command be modified to recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty confusion for people working in a cross-platform environment is when identical Perl scripts give very different results on the same exact file depending on whether you are running under *nix or Windows. (Particularly an issue with Samba because people wind up reading under one system files created under the other.)

Yes, the current behaviour works as documented. But it leads to code not doing what people expect, and in many cases a confused person will spend several hours confused...

Cheers, Ben

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

In message <OF47CC4F38.CF9B333E-ON8525682E.0054550B@trepp.com>, Ben_Tilly@trepp.com writes:

: Is there any possibility of having Perl's chomp() command be modified to
: recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty
: confusion for people working in a cross-platform environment is when
: identical Perl scripts give very different results on the same exact file
: depending on whether you are running under *nix or Windows. (Particularly
: an issue with Samba because people wind up reading under one system files
: created under the other.)

Have a look at <URL:http://language.perl.com/ppt/src/nlcvt/nlcvt> for an example of how to do what you want (or at least something similar).

Greg

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

In message <OF47CC4F38.CF9B333E-ON8525682E.0054550B@trepp.com>, Ben_Tilly@trepp.com writes:

: Is there any possibility of having Perl's chomp() command be modified to
: recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty
: confusion for people working in a cross-platform environment is when
: identical Perl scripts give very different results on the same exact file
: depending on whether you are running under *nix or Windows. (Particularly
: an issue with Samba because people wind up reading under one system files
: created under the other.)

Have a look at <URL:http://language.perl.com/ppt/src/nlcvt/nlcvt> for an example of how to do what you want (or at least something similar).

When I said, "I talked to" I meant that. I don't need the pointer - I know how to handle it. But I wind up answering questions from people who do not. In my experience most of them use chomp() so the confusion is preventable. (After all you expect chomp() to get rid of line endings, right?)

Cheers\, Ben

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

When I said, "I talked to" I meant that. I don't need the pointer - I know how to handle it. But I wind up answering questions from people who do not. In my experience most of them use chomp() so the confusion is preventable. (After all you expect chomp() to get rid of line endings, right?)

I expect chomp() to remove one and only one terminating instance of the precise string to which $/ has been set; no more, no less. What were you expecting?
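The behaviour described here can be checked with a small sketch. The helper name `chomp_with` is made up for illustration; it only wraps the built-in chomp() with a localized $/.

```perl
use strict;
use warnings;

# chomp removes at most one trailing copy of whatever $/ currently holds.
sub chomp_with {
    my ($separator, $text) = @_;
    local $/ = $separator;      # restored automatically on scope exit
    my $removed = chomp $text;  # number of characters removed
    return ($text, $removed);
}

my ($unix)    = chomp_with("\n",   "line\n");    # "line"
my ($dos)     = chomp_with("\n",   "line\r\n");  # "line\r": the CR survives
my ($doubled) = chomp_with("\n",   "line\n\n");  # "line\n": only one removed
my ($crlf)    = chomp_with("\r\n", "line\r\n");  # "line"
```

The second case is the one the original report complains about: a CRLF-terminated line chomped with the default separator keeps its carriage return.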

--tom

p5pRT commented 24 years ago

From @pudge

At 10.22 -0500 1999.11.19, Ben_Tilly@trepp.com wrote:

I have just talked to one too many people who have been bitten by this...

Is there any possibility of having Perl's chomp() command be modified to recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty confusion for people working in a cross-platform environment is when identical Perl scripts give very different results on the same exact file depending on whether you are running under *nix or Windows. (Particularly an issue with Samba because people wind up reading under one system files created under the other.)

Yes, the current behaviour works as documented. But it leads to code not doing what people expect, and in many cases a confused person will spend several hours confused...

Well, these same people will also have a problem with readline and <>. And I certainly don't want behavior where readline and chomp treat different things as record separators.

I would like to see, perhaps, a regex IRS, so you could do:

  $/ = qr/(?:\015\012?|\012)/;

or whatever. Of course, that is flawed, in that it won't catch the special (usually broken) case of a file having CR, LF, or CRLF mixed in the same file. Oh well.
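As far as I know, $/ is still treated as a plain string rather than a pattern, so the assignment above is a wish rather than working code. A rough user-level stand-in, assuming the text has already been read into memory, is to split on the pattern directly. The function name `split_any_eol` is hypothetical.

```perl
use strict;
use warnings;

# Split already-read text into lines on CRLF, lone CR, or lone LF.
# \015\012? is tried first so a CRLF pair counts as one terminator.
sub split_any_eol {
    my ($text) = @_;
    return split /\015\012?|\012/, $text;
}

my @lines = split_any_eol("one\015\012two\012three\015four");
# @lines is ("one", "two", "three", "four")
```

Because the whole text is available at once, this version also copes with mixed terminators, at the cost of holding the file in memory.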

Another solution would involve per-filehandle IRS, where you could call a function (say, textmode()) that would inspect the filehandle and set the IRS appropriately for that filehandle. This is more subject to failure for sockets, though, because it would involve reading, looking at the data, and then seeking back to the beginning.
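A minimal sketch of that inspect-and-rewind idea, assuming a seekable handle. `detect_eol` is a made-up name, not an existing Perl function, and the 4096-byte sample size is arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: peek at the start of a seekable handle, guess the
# line terminator, then rewind so the caller can read from the beginning.
# If the sample happens to end on a CR, a following LF would be missed.
sub detect_eol {
    my ($fh) = @_;
    binmode $fh;                        # look at raw bytes, no translation
    my $got = read($fh, my $chunk, 4096);
    seek($fh, 0, 0) or die "cannot rewind: $!";  # fails on sockets and pipes
    return "\012" unless $got;          # empty or unreadable: assume LF
    return $1 if $chunk =~ /(\015\012|\015|\012)/;
    return "\012";                      # no terminator seen in the sample
}

# Usage with an in-memory file standing in for a CRLF file on disk.
my $data = "first\015\012second\015\012";
open my $fh, '<', \$data or die "open failed: $!";
my $eol = detect_eol($fh);
my @lines;
{
    local $/ = $eol;            # readline and chomp now agree on the separator
    chomp(@lines = <$fh>);
}
close $fh;
```

Setting $/ once for both the read and the chomp keeps the two in step, which is the property asked for earlier in this message.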

I have a prototype of something that tied filehandles to do this, but it fails with sockets, and doesn't do anything for chomp() anyway (and didn't work great anyway because of some flaws in tied filehandles and prototypes ... I was using 5.004, I don't know if the flaws have been fixed or whatnot).

-- Chris Nandor mailto:pudge@pobox.com http://pudge.net/ %PGPKey = ('B76E72AD', [1024, '0824090B CE73CA10 1FF77F13 8180B6B6'])

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Ben_Tilly@​trepp.com wrote

Is there any possibility of having Perl's chomp() command be modified to recognize \n, \r, and \r\n as line-endings to chomp?

I hope not. chomp() should match the string in $?, no more or less.

Your problem is not with chomp(). Rather it is with the I/O subsystem. If you are reading a file as a newline-terminated text file, then what your Perl code should see is "\n" and nothing else.

You can achieve this with tied filehandles, but I understand that isn't what you're looking for.

I think the "right" way of doing this is by providing some sort of filter apparatus on files. Things of this sort were discussed in the context of unicode. I don't recall where (if anywhere) that ended.

Mike Guy

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

I think the "right" way of doing this is by providing some sort of filter apparatus on files.

Well, that makes two of us.

Things of this sort were discussed in the context of unicode. I don't recall where (if anywhere) that ended.

Check mjd's summaries?

--tom

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

I wrote

I hope not. chomp() should match the string in $?, no more or less.

$/

Damn shift key.

Mike Guy

p5pRT commented 24 years ago

From @TimToady

Tom Christiansen writes:
: >When I said, "I talked to" I meant that. I don't need the pointer - I know
: >how to handle it. But I wind up answering questions from people who do
: >not. In my experience most of them use chomp() so the confusion is
: >preventable. (After all you expect chomp() to get rid of line endings,
: >right?)
:
: I expect chomp() to remove one and only one terminating instance of the
: precise string to which $/ has been set; no more, no less. What were
: you expecting?

I expect people to expect Perl to do the right thing.

Larry

p5pRT commented 24 years ago

From @TimToady

M.J.T. Guy writes:
: I think the "right" way of doing this is by providing some sort of
: filter apparatus on files. Things of this sort were discussed in
: the context of unicode. I don't recall where (if anywhere) that
: ended.

Ended? It hasn't started yet...

(Can you tell I've spent too much time rewriting the Camel book today? :-)

Yes, input filters should handle this. And a good case can be made that the *default* input filter should handle it, along with UTF-8 recognition. It also has to be blazing fast, of course, along with reading your mind. But that's Perl for the coarse. Or something like that.

Larry

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

: I expect chomp() to remove one and only one terminating instance of the
: precise string to which $/ has been set; no more, no less. What were
: you expecting?

I expect people to expect Perl to do the right thing.

And that would be what, sniff around the stdio buffer the first time you play with it and figure out what it smells like?

--tom

p5pRT commented 24 years ago

From @TimToady

Tom Christiansen writes:
: >: I expect chomp() to remove one and only one terminating instance of the
: >: precise string to which $/ has been set; no more, no less. What were
: >: you expecting?
:
: >I expect people to expect Perl to do the right thing.
:
: And that would be what, sniff around the stdio buffer the first time you
: play with it and figure out what it smells like?

Why do you say "you"? Did I say I expect Perl to do the right thing? :-)

Seriously, we are entering an era when dwimmerly action on input will be a necessary evil. I could wish it were otherwise, but my supply of divine fiats is low. And I don't think anyone else has enough fiats to pull it off either. For the near future I only see a chaotic dance around the UTF-8 strange attractor, in part because a lot of butterflies are flapping their wings near the UTF-16 attractor instead. We're going to live in interesting times, whether or not that's an ancient Chinese curse.

Larry

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Tom Christiansen <tchrist@jhereg.perl.com> writes:

: I expect chomp() to remove one and only one terminating instance of the
: precise string to which $/ has been set; no more, no less. What were
: you expecting?

I expect people to expect Perl to do the right thing.

And that would be what, sniff around the stdio buffer the first time you play with it and figure out what it smells like?

That is far from daft. sv_gets() (the internals of readline) would know what it had used to find the end of the line. It could leave the information around for chomp to use.

But the "right thing" is just to return \n as a logical newline however it was represented in the buffer (unless in binmode of course). Then chomp'ing \n is fine.

--tom -- Nick Ing-Simmons
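Later versions of Perl (5.8 onward) grew I/O layers, and the `:crlf` layer does something close to what is described here for CRLF files. The sketch below writes a small DOS-format temporary file and reads it back through that layer; as I understand it, a lone CR is not translated, so old Mac-style files would still need separate handling.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a DOS-format file byte for byte, with no newline translation.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "alpha\015\012beta\015\012";
close $out or die "close failed: $!";

# The :crlf layer rewrites each CRLF pair to "\n" as the data is read,
# so chomp with the default $/ of "\n" is enough.
open my $in, '<:crlf', $path or die "open failed: $!";
chomp(my @lines = <$in>);
close $in;
```

With the translation done in the I/O layer, neither chomp() nor $/ needs to know which convention the file used.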

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Nick Ing-Simmons <nick@ing-simmons.net> writes:
Tom Christiansen <tchrist@jhereg.perl.com> writes:
>>: I expect chomp() to remove one and only one terminating instance of the
>>: precise string to which $/ has been set; no more, no less. What were
>>: you expecting?
>
>>I expect people to expect Perl to do the right thing.
>
>And that would be what, sniff around the stdio buffer the first time you
>play with it and figure out what it smells like?

That is far from daft. sv_gets() (the internals of readline) would know
what it had used to find the end of the line. It could leave the
information around for chomp to use.

But as soon as you have a program that opens multiple files of differing formats, this breaks down. You end up with taint-like tracing of strings to track which form of file each string came from.

Which means that...

|| But the "right thing" is
|| just to return \n as a logical newline however it was represented in the
|| buffer (unless in binmode of course). Then chomp'ing \n is fine.

is really a much better choice. That just leaves the issue of determining the right filtering to do for an output file so that it matches the input file it is derived from or the target it is being written to or whatever is the most significant issue - which the programmer will have to deal with.

-- John Macdonald jmm@​jmm.pickering.elegant.com

p5pRT commented 24 years ago

From @mjdominus

Things of this sort were discussed in the context of unicode. I don't recall where (if anywhere) that ended.

Check mjd's summaries?

http​://www.perl.com/pub/1999/11/p5pdigest/THISWEEK-19991114.html#More_About_Line_Disciplines

http​://www.perl.com/pub/1999/11/p5pdigest/THISWEEK-19991107.html#Record_Separators_that_Contain_NUL

My summary of the summaries​:

1. Larry said it would be important to have `line disciplines' settable on filehandles, and that it would be important for the default ones to be fast.

2. Sam Tregar said he would do it, but I don't know if he will.

3. This is the third week in a row that it has cropped up.

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

John Macdonald <jmm@elegant.com> writes:

||
|| That is far from daft. sv_gets() (the internals of readline) would know
|| what it had used to find the end of the line. It could leave the
|| information around for chomp to use.

But as soon as you have a program that opens multiple files of differing formats, this breaks down. You end up with taint-like tracing of strings to track which form of file each string came from.

Yes, the EOLN string would have to be annotated on the SV somewhere presumably as "magic". Having chomp look for "EOLN magic" on the SV would be easy to do. The 'set' part of the magic would clear the field.

Which means that...

|| But the "right thing" is
|| just to return \n as a logical newline however it was represented in the
|| buffer (unless in binmode of course). Then chomp'ing \n is fine.

is really a much better choice.

I know ;-) I am delinquent in implementing it.

That just leaves the issue of determining the right filtering to do for an output file so that it matches the input file it is derived from or the target it is being written to or whatever is the most significant issue - which the programmer will have to deal with.

-- Nick Ing-Simmons <nik@tiuk.ti.com> Via, but not speaking for: Texas Instruments Ltd.

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Nick Ing-Simmons wrote :
John Macdonald <jmm@elegant.com> writes:
>
> That is far from daft. sv_gets() (the internals of readline) would know
> what it had used to find the end of the line. It could leave the
> information around for chomp to use.
>
>But as soon as you have a program that opens multiple files of
>differing formats, this breaks down. You end up with taint-like
>tracing of strings to track which form of file each string came
>from.

Yes, the EOLN string would have to be annotated on the SV somewhere
presumably as "magic". Having chomp look for "EOLN magic" on the SV
would be easy to do. The 'set' part of the magic would clear the field.

It gets messier...

  $para = "$file1_lines$file2_lines$file3_lines";

Which of the three EOLN magics gets assigned to $para?

>Which means that...
>
> But the "right thing" is
> just to return \n as a logical newline however it was represented in the
> buffer (unless in binmode of course). Then chomp'ing \n is fine.
>
>is really a much better choice.

I know ;-) I am delinquent in implementing it.

>That just leaves the issue of
>determining the right filtering to do for an output file so that
>it matches the input file it is derived from or the target it is
>being written to or whatever is the most significant issue -
>which the programmer will have to deal with.

-- objects​: | John Macdonald   Think of them as data with an attitude. | jmm@​elegant.com

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Tom Christiansen writes:
: >When I said, "I talked to" I meant that. I don't need the pointer - I know
: >how to handle it. But I wind up answering questions from people who do
: >not. In my experience most of them use chomp() so the confusion is
: >preventable. (After all you expect chomp() to get rid of line endings,
: >right?)
:
: I expect chomp() to remove one and only one terminating instance of the
: precise string to which $/ has been set; no more, no less. What were
: you expecting?

I consider $/ the mechanism through which you can change "line ending" to "paragraph ending" etc...

I expect people to expect Perl to do the right thing.

Ah yes, now I remember why I love this language. :-)

The suggestion that I heard which I most like is letting $/ be an RE. So you can make $/ be /\n|\r\n?/ and it magically "does the right thing" on virtually any file. However integrating this logic into the RE engine could be interesting. After all if you apply the pattern I gave to a string that ends with \r, it will match even if the next character to be read is \n. This is manifestly not the right thing to do. :-(
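The buffer-boundary problem can be made concrete with a user-level sketch. This is not how Perl's readline works internally; `make_line_splitter` is a hypothetical helper that is fed chunks as they arrive and holds back a trailing CR until it can see whether an LF follows.

```perl
use strict;
use warnings;

# Hypothetical incremental splitter: feed it chunks as they arrive and it
# returns the complete lines found so far, terminators removed.
sub make_line_splitter {
    my $pending = '';
    return sub {
        my ($chunk, $at_eof) = @_;
        $pending .= $chunk;
        my @lines;
        # A CR that is the last byte seen so far is ambiguous: the next
        # chunk may start with LF. The lookahead leaves it pending.
        while ($pending =~ s/\A([^\015\012]*)(?:\015\012|\012|\015(?=.))//s) {
            push @lines, $1;
        }
        if ($at_eof && length $pending) {
            $pending =~ s/\015\z//;   # a final lone CR is a terminator
            push @lines, $pending;
            $pending = '';
        }
        return @lines;
    };
}

my $feed = make_line_splitter();
my @first  = $feed->("one\015", 0);         # CR held back: ()
my @second = $feed->("\012two\015thr", 0);  # ("one", "two")
my @last   = $feed->("ee\012", 1);          # ("three")
```

The price of getting the boundary case right is one character of lookahead, which is why the same idea is awkward on interactive input such as sockets.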

Ben

PS Random note. A random idea a co-worker and I have been throwing around (based on my massively speeding up a program by doing this) is "lazy concatenation" of strings. If someone is building up a string through interpolation and concatenation, it makes sense to internally use something closer to an array of strings, and then join that into one string if you ever need to. (print() has no need to join them, the RE engine does.) This should transparently accelerate a lot of current Perl code...
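Done by hand in user code, the idea looks roughly like the sketch below. It only illustrates the shape of the technique; whether it is faster than repeated `.=` depends on the workload and the Perl version, and no measurement is implied here.

```perl
use strict;
use warnings;

# Collect fragments in an array and join them only when a single string
# is actually required.
my @pieces;
push @pieces, "record $_\n" for 1 .. 3;

# print accepts a list, so no join is needed for output:
#   print @pieces;

# A regex match does need one string, so join at that point.
my $joined = join '', @pieces;
my $count  = () = $joined =~ /^record/mg;   # number of matches: 3
```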

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

The suggestion that I heard which I most like is letting $/ be an RE. So you can make $/ be /\n|\r\n?/ and it magically "does the right thing" on virtually any file. However integrating this logic into the RE engine could be interesting. After all if you apply the pattern I gave to a string that ends with \r, it will match even if the next character to be read is \n. This is manifestly not the right thing to do. :-(

Anything that deviates from the notion of internally representing the line terminator as a single character (the virtual "\n") is a grave error.

--tom

p5pRT commented 24 years ago

From @pudge

At 07.54 -0700 1999.11.22, Tom Christiansen wrote:

The suggestion that I heard which I most like is letting $/ be an RE. So you can make $/ be /\n|\r\n?/ and it magically "does the right thing" on virtually any file. However integrating this logic into the RE engine could be interesting. After all if you apply the pattern I gave to a string that ends with \r, it will match even if the next character to be read is \n. This is manifestly not the right thing to do. :-(

Anything that deviates from the notion of internally representing the line terminator as a single character (the virtual "\n") is a grave error.

I had another idea ... which may be completely useless, but it is interesting to think about. If $/ is a regex, then it is only a regex until it matches the first time, at which point $/ becomes equal to $1. So:

  $/ = qr/(\015?\012|\015)/;

As soon as it sees \015\012, \012, or \015, $/ becomes whatever it matched. Again, this would require per-filehandle IRS to be useful. Not advocating, just throwing it out for fun. :D

-- Chris Nandor mailto:pudge@pobox.com http://pudge.net/ %PGPKey = ('B76E72AD', [1024, '0824090B CE73CA10 1FF77F13 8180B6B6'])

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

John Macdonald <jmm@elegant.com> writes:

Nick Ing-Simmons wrote :
John Macdonald <jmm@elegant.com> writes:
>
> That is far from daft. sv_gets() (the internals of readline) would know
> what it had used to find the end of the line. It could leave the
> information around for chomp to use.
>
>But as soon as you have a program that opens multiple files of
>differing formats, this breaks down. You end up with taint-like
>tracing of strings to track which form of file each string came
>from.

Yes, the EOLN string would have to be annotated on the SV somewhere
presumably as "magic". Having chomp look for "EOLN magic" on the SV
would be easy to do. The 'set' part of the magic would clear the field.

It gets messier...

$para = "$file1_lines$file2_lines$file3_lines";

Which of the three EOLN magics gets assigned to $para?

In my purely fictional implementation none would.

-- Nick Ing-Simmons <nik@tiuk.ti.com> Via, but not speaking for: Texas Instruments Ltd.

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Nick Ing-Simmons wrote :
John Macdonald <jmm@elegant.com> writes:
>Nick Ing-Simmons wrote :
> John Macdonald <jmm@elegant.com> writes:
> >
> > That is far from daft. sv_gets() (the internals of readline) would know
> > what it had used to find the end of the line. It could leave the
> > information around for chomp to use.
> >
> >But as soon as you have a program that opens multiple files of
> >differing formats, this breaks down. You end up with taint-like
> >tracing of strings to track which form of file each string came
> >from.
>
> Yes, the EOLN string would have to be annotated on the SV somewhere
> presumably as "magic". Having chomp look for "EOLN magic" on the SV
> would be easy to do. The 'set' part of the magic would clear the field.
>
>It gets messier...
>
> $para = "$file1_lines$file2_lines$file3_lines";
>
>Which of the three EOLN magics gets assigned to $para?

In my purely fictional implementation none would.

so then​:

  chomp $file3_lines;
  chomp $para;

could remove different values from two strings with the same termination value originating from the same source file line, which would violate the principle of least astonishment for some users.

Good thing this whole issue is fictional. :-)

-- objects​: | John Macdonald   Think of them as data with an attitude. | jmm@​elegant.com

p5pRT commented 24 years ago

From @samtregar

On Mon, 22 Nov 1999, Mark-Jason Dominus wrote:

2. Sam Tregar said he would do it, but I don't know if he will.

Unfortunately Sam Tregar is just a novice perl hacker! I've been poking around a bit but I'm not convinced I've even found the right place to start working yet.

If this is a high-priority item, perhaps someone more experienced should consider giving it a try.

-sam

p5pRT commented 19 years ago

From @schwern

[Ben_Tilly@trepp.com - Thu Nov 18 23:18:09 1999]:

Is there any possibility of having chomp() be modified to recognize \n, \r, and \r\n as line-endings to chomp?

Do you mean that chomp(), rather than being equivalent to:

  s{$/\z}{};

should be​:

  s{(\r|\n|\r\n)\z}{};

?
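For anyone who wants the second behaviour today without changing chomp() itself, it can be wrapped in an ordinary subroutine. `chomp_any` below is a hypothetical helper, not a core function; the alternation lists the CRLF pair first for readability, although the \z anchor would find the pair either way.

```perl
use strict;
use warnings;

# Hypothetical helper, not a core function: strip one trailing CRLF, CR,
# or LF from each argument in place, whatever $/ is set to.
sub chomp_any {
    my $removed = 0;
    for (@_) {                      # @_ aliases the caller's variables
        next unless defined;
        $removed += length $1 if s/(\015\012|\015|\012)\z//;
    }
    return $removed;                # total characters removed, like chomp
}

my @lines = ("unix\012", "dos\015\012", "oldmac\015", "bare");
my $n = chomp_any(@lines);   # @lines: ("unix", "dos", "oldmac", "bare"), $n: 4
```

Keeping this as a separate function leaves the documented link between chomp() and $/ untouched, which is the compatibility concern raised elsewhere in this ticket.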

p5pRT commented 19 years ago

The RT System itself - Status changed from 'stalled' to 'open'

p5pRT commented 19 years ago

From schubiger@cpan.org

[Ben_Tilly@trepp.com - Thu Nov 18 23:18:09 1999]:

Is there any possibility of having chomp() be modified to recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty confusion for people working in a cross-platform environment is when the same exact script gives very different results on the same exact file depending on whether you are running under *nix or Windows. (Particularly an issue with Samba because people wind up reading under one system files created under the other.)

The record separator, which defaults to '\n' on Unix, will change depending on the operating system Perl is running on - chomp() relies heavily upon the value of $/. Chomping native files shouldn't cause much noise, whereas chomping 'foreign' files with differing line endings would require that you localize $/ in the scope of operation.

Example:

  {
    local $/ = "\r\n";
    $chomped = chomp(@lines);
  }

Yes, the current behaviour works as documented. But it leads to code not doing what people expect, and a confused person can easily spend several hours confused...

I'd say, it's rather clearly documented, without lack of accurate description. Although the behaviour requested is desirable, it doesn't seem possible to integrate the inevitable changes to doop.c:Perl_do_chomp, where the record separator, known as global PL_rs, is extensively utilized and relied upon - allowing for multiple values would require tremendous changes and furthermore, would likely break backwards compatibility.

p5pRT commented 17 years ago

From @rgs

Rejected, mostly for backwards compatibility reasons.

p5pRT commented 17 years ago

@rgs - Status changed from 'open' to 'rejected'