JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.65k stars 5.48k forks source link

Julep: Support universal newlines and make it default for Text IO #19785

Open mpastell opened 7 years ago

mpastell commented 7 years ago

I think it would be useful to have universal newlines mode for open similar to Python. A short description of the mode for open from Python 3.6 docs https://docs.python.org/3.6/library/functions.html#open .

Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller.

When writing output to the stream any '\n' characters written are translated to the system default line separator, os.linesep.

Currently this needs to be handled manually in Julia which I think is a bit annoying and easy to enough to forget.

So I suggest the following:

  1. Add support for universal newlines to open as a flag to mode or a new keyword argument
  2. Make universal newlines the default for methods reading from a stream and returning a string e.g. readstring, readlines, eachline
tkelman commented 7 years ago

Inserting \r bytes into input or output that weren't present in the contents of a file or string isn't a behavior we should perpetuate. Code that's written and executed on unix also needs to be able to deal correctly with \r\n files.

mpastell commented 7 years ago

I forgot to say that adding \r when writing doesn't seem that useful. Although Python only inserts '\r' when writing text files and that can also be changed.

When reading \r\n is translated to \n by default which I think is very useful and I have never had the need to change the default during 10 years.

It is also useful to read the original Python PEP https://www.python.org/dev/peps/pep-0278/

tkelman commented 7 years ago

I'd be fine with making any line-oriented functions more permissive in what they considered to be a line ending. But I think I/O should generally be binary-faithful. \r\n newlines are almost always best dealt with using dos2unix if you have write permissions on the file in question.

I wasn't much of a fan of https://github.com/JuliaLang/julia/pull/14073, but parsing of Julia source files does normalize line endings from \r\n to \n for the sake of multiline string literals. I don't think we should do the same with all non-Julia I/O.

mpastell commented 7 years ago

I think code should to be written in a way that there is no need to call dos2unix. I would find it strange to tell Windows users not to use \r\n in their files.

Of course using the Python way is not the only solution, but I think it works very well.

For instance I recently had a bug in Weave.jl (https://github.com/mpastell/Weave.jl/issues/72) related to this, which I fixed using replace(s, "\r\n", "\n") on all input files. Also #656 would have been avoided using the universal newlines approach.

Line-oriented functions do work with \r\n and \n (you just need to remember \r exists when you are developing on Windows). I'm not sure if they should also work with other EOL characters as well.

mpastell commented 7 years ago

I just opened #19815 . It would not be fixed with this proposal, but shows I'm not the only one who forgot to deal with \r ;)

mpastell commented 7 years ago

After giving it some thought I agree fixing line oriented functions as @tkelman suggested is a good solution. I think it would be best to follow unicode standard (http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf ) and consider all of the following as line ends.

unicode

Then

  1. Make readline to break on all of the above options
  2. readlines and eachline use readline so it means they will follow same behavior
  3. chomp should also be fixed to work with NEL and CR
  4. I would like to add splitlines according to #19759
  5. I suggest adding newline2lf and that can be used to convert line endings to \n (usign utf8proc)

If this is OK I can try to implement it and make pull request.

mpastell commented 7 years ago

Actually the Unicode standard also has the following recommendation

readline

I think it is more common to keep the separator in the return value as readline currently does. Any ideas?

nalimilan commented 7 years ago

+1, but I'd start with a PR to change eachline and chomp; whether we need splitlines and newline2lf is a different question.

Regarding whether to include the newlines character(s), could you check what popular languages do (in particular recent ones like Rust, Swift and Go)? We could always add an argument to choose the behavior.

mpastell commented 7 years ago

Rust and Swift seem to drop newlines by default, Python doesn't.

https://doc.rust-lang.org/std/io/trait.BufRead.html#method.lines https://developer.apple.com/reference/swift/1641199-readline https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files

It'd be good to add dropping the character as an option to Julia readline as well, but making it the default and changing existing behavior would probably break a lot of code.

nalimilan commented 7 years ago

OK, thanks for checking. Changing the default would make sense to me. It should be possible by first adding the new argument and deprecating the one-argument method in 0.6, and then changing its meaning in 1.0. Maybe it would be accepted if you can make a PR shortly.

mpastell commented 7 years ago

@StefanKarpinski any thoughts on this?

I can try implement the changes to readline, but changing the default behavior breaks many thinks including the REPL. I think this would need strong agreement that changing the default behavior is the right thing to do and needs work from others as well.

I would be fine with keeping the current default (as I'm used to it in Python anyway) and just adding the variant that removes the newlines as an option.

mpastell commented 7 years ago

Would it be OK to implement readlineas

function readline(s::IO)
    linefeeds = ['\n', '\r', '\u85', '\u0B', '\u0c', '\u2028', '\u2029']
    out = IOBuffer()
    while !eof(s)
        c = read(s, Char)
        write(out, c)
        if c in linefeeds
            if c == '\r' && !eof(s) && Base.peek(s) == 0x0a
                write(out, read(s, Char))
            end
            break
        end
    end

    return String(take!(out))
end

I'm not sure if options to readline are actually needed as you can simply use

chomp.(readlines(file))

To get rid of EOL characters.

nalimilan commented 7 years ago

Anyway if you want to change the default, you'll have to first provide a version with an argument, to allow for a non-breaking deprecation period. So you may as well start with that. Then seeing how much code would need to be changed in Base (and whether it would simplify code or not) would be an interesting data point to make a decision.

The version using chomp is going to be quite slower currently as it needs to create SubString (not sure whether it could really be optimized in the future by changing the String type, cf. https://github.com/JuliaLang/julia/issues/16107). In general it seems to me that skipping the newline characters directly from readline should be faster, since you already look for them, which chomp would have to do once again.

mpastell commented 7 years ago

How about?

function readline(s::IO, chomp = true; nl2lf=false)
    nl2lf && (chomp = false)

    linefeeds = ['\n', '\r', '\u85', '\u0B', '\u0c', '\u2028', '\u2029']
    out = IOBuffer()
    while !eof(s)
        c = read(s, Char)
        if c in linefeeds
            if c == '\r' && !eof(s) && Base.peek(s) == 0x0a
                !(nl2lf || chomp) && write(out, c)
                c = read(s, Char)
                !chomp && write(out, c)
            else
                nl2lf && (c = '\n')
                !chomp && write(out, c)
            end

            break
        else
            write(out, c)
        end
    end

    return String(take!(out))
end

If chomp = true EOL will be omitted and if nl2lf = true EOL will be converted to '\n' .

Did I understand correctly that if we want to make chomp = true the default in the future we should define

function readline(s::IO)
    readline(s, false)
end

and add the approriate deprecation warning?

StefanKarpinski commented 7 years ago

I guess we could have a chomp::Bool=false argument to eachline and readlines. I think a more thoughtful approach needs to be fleshed out for end-of-line handling and encodings, etc.

nalimilan commented 7 years ago

I think a more thoughtful approach needs to be fleshed out for end-of-line handling and encodings, etc.

For now I/O streams are assumed to contain UTF-8 text, since they have no encoding information attached to them anyway. When passing a custom string type to readline, the conversion to UTF-8 is supposed to happen when creating the IOBuffer from the string object.

Custom string types are free to implement their own readline function if they do not use UTF-8. But it would make sense to provide a generic readline(::AbstractString) method which does not go through IOBuffer: that way custom string types do not have to implement streaming conversion to UTF-8, they just need to support getindex. Or we could provide a default IOBuffer(::AbstractString) implementation based on getindex.

mpastell commented 7 years ago

I have added support for \r and added chomp::Bool=false forreadline` in #19877 . Adding support for other characters (if there is a need) based on that would not be too difficult.

The code that detects line ends also works for most encodings as they share the same control characters.