Open mpastell opened 7 years ago
Inserting \r
bytes into input or output that weren't present in the contents of a file or string isn't a behavior we should perpetuate. Code that's written and executed on unix also needs to be able to deal correctly with \r\n
files.
I forgot to say that adding \r
when writing doesn't seem that useful. Although Python only inserts '\r' when writing text files and that can also be changed.
When reading \r\n
is translated to \n
by default which I think is very useful and I have never had the need to change the default during 10 years.
It is also useful to read the original Python PEP https://www.python.org/dev/peps/pep-0278/
I'd be fine with making any line-oriented functions more permissive in what they considered to be a line ending. But I think I/O should generally be binary-faithful. \r\n
newlines are almost always best dealt with using dos2unix
if you have write permissions on the file in question.
I wasn't much of a fan of https://github.com/JuliaLang/julia/pull/14073, but parsing of Julia source files does normalize line endings from \r\n
to \n
for the sake of multiline string literals. I don't think we should do the same with all non-Julia I/O.
I think code should to be written in a way that there is no need to call dos2unix
. I would find it strange to tell Windows users not to use \r\n
in their files.
Of course using the Python way is not the only solution, but I think it works very well.
For instance I recently had a bug in Weave.jl (https://github.com/mpastell/Weave.jl/issues/72) related to this, which I fixed using replace(s, "\r\n", "\n")
on all input files. Also #656 would have been avoided using the universal newlines approach.
Line-oriented functions do work with \r\n
and \n
(you just need to remember \r
exists when you are developing on Windows). I'm not sure if they should also work with other EOL characters as well.
I just opened #19815 . It would not be fixed with this proposal, but shows I'm not the only one who forgot to deal with \r
;)
After giving it some thought I agree fixing line oriented functions as @tkelman suggested is a good solution. I think it would be best to follow unicode standard (http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf ) and consider all of the following as line ends.
Then
readline
to break on all of the above optionsreadlines
and eachline
use readline
so it means they will follow same behaviorchomp
should also be fixed to work with NEL and CRsplitlines
according to #19759newline2lf
and that can be used to convert line endings to \n
(usign utf8proc)If this is OK I can try to implement it and make pull request.
Actually the Unicode standard also has the following recommendation
I think it is more common to keep the separator in the return value as readline
currently does. Any ideas?
+1, but I'd start with a PR to change eachline
and chomp
; whether we need splitlines
and newline2lf
is a different question.
Regarding whether to include the newlines character(s), could you check what popular languages do (in particular recent ones like Rust, Swift and Go)? We could always add an argument to choose the behavior.
Rust and Swift seem to drop newlines by default, Python doesn't.
https://doc.rust-lang.org/std/io/trait.BufRead.html#method.lines https://developer.apple.com/reference/swift/1641199-readline https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
It'd be good to add dropping the character as an option to Julia readline
as well, but making it the default and changing existing behavior would probably break a lot of code.
OK, thanks for checking. Changing the default would make sense to me. It should be possible by first adding the new argument and deprecating the one-argument method in 0.6, and then changing its meaning in 1.0. Maybe it would be accepted if you can make a PR shortly.
@StefanKarpinski any thoughts on this?
I can try implement the changes to readline
, but changing the default behavior breaks many thinks including the REPL. I think this would need strong agreement that changing the default behavior is the right thing to do and needs work from others as well.
I would be fine with keeping the current default (as I'm used to it in Python anyway) and just adding the variant that removes the newlines as an option.
Would it be OK to implement readline
as
function readline(s::IO)
linefeeds = ['\n', '\r', '\u85', '\u0B', '\u0c', '\u2028', '\u2029']
out = IOBuffer()
while !eof(s)
c = read(s, Char)
write(out, c)
if c in linefeeds
if c == '\r' && !eof(s) && Base.peek(s) == 0x0a
write(out, read(s, Char))
end
break
end
end
return String(take!(out))
end
I'm not sure if options to readline
are actually needed as you can simply use
chomp.(readlines(file))
To get rid of EOL characters.
Anyway if you want to change the default, you'll have to first provide a version with an argument, to allow for a non-breaking deprecation period. So you may as well start with that. Then seeing how much code would need to be changed in Base (and whether it would simplify code or not) would be an interesting data point to make a decision.
The version using chomp
is going to be quite slower currently as it needs to create SubString
(not sure whether it could really be optimized in the future by changing the String
type, cf. https://github.com/JuliaLang/julia/issues/16107). In general it seems to me that skipping the newline characters directly from readline
should be faster, since you already look for them, which chomp
would have to do once again.
How about?
function readline(s::IO, chomp = true; nl2lf=false)
nl2lf && (chomp = false)
linefeeds = ['\n', '\r', '\u85', '\u0B', '\u0c', '\u2028', '\u2029']
out = IOBuffer()
while !eof(s)
c = read(s, Char)
if c in linefeeds
if c == '\r' && !eof(s) && Base.peek(s) == 0x0a
!(nl2lf || chomp) && write(out, c)
c = read(s, Char)
!chomp && write(out, c)
else
nl2lf && (c = '\n')
!chomp && write(out, c)
end
break
else
write(out, c)
end
end
return String(take!(out))
end
If chomp = true
EOL will be omitted and if nl2lf = true
EOL will be converted to '\n' .
Did I understand correctly that if we want to make chomp = true
the default in the future we should define
function readline(s::IO)
readline(s, false)
end
and add the approriate deprecation warning?
I guess we could have a chomp::Bool=false
argument to eachline
and readlines
. I think a more thoughtful approach needs to be fleshed out for end-of-line handling and encodings, etc.
I think a more thoughtful approach needs to be fleshed out for end-of-line handling and encodings, etc.
For now I/O streams are assumed to contain UTF-8 text, since they have no encoding information attached to them anyway. When passing a custom string type to readline
, the conversion to UTF-8 is supposed to happen when creating the IOBuffer
from the string object.
Custom string types are free to implement their own readline
function if they do not use UTF-8. But it would make sense to provide a generic readline(::AbstractString)
method which does not go through IOBuffer
: that way custom string types do not have to implement streaming conversion to UTF-8, they just need to support getindex
. Or we could provide a default IOBuffer(::AbstractString)
implementation based on getindex
.
I have added support for \r
and added chomp::Bool=false for
readline` in #19877 . Adding support for other characters (if there is a need) based on that would not be too difficult.
The code that detects line ends also works for most encodings as they share the same control characters.
I think it would be useful to have universal newlines mode for
open
similar to Python. A short description of the mode foropen
from Python 3.6 docs https://docs.python.org/3.6/library/functions.html#open .Currently this needs to be handled manually in Julia which I think is a bit annoying and easy to enough to forget.
So I suggest the following:
open
as a flag to mode or a new keyword argumentreadstring
,readlines
,eachline