JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Remotecall_fetch errors when returning a large array on Windows #13578

Closed: Allardvm closed this issue 6 years ago

Allardvm commented 8 years ago

I encounter an error when using distributed processing as part of a simulation study. Following the pmap implementation, I use one master core to hand out tasks to a set of worker cores. Each worker processes a task and then returns some data in an array, which the master writes to a file. Even with a single worker in addition to the master, I encounter the EINVAL error below when the returned array is quite large (~2.4 GB). The array fits in the available memory (the machine has 16 GB, with over 9 GB free at the time of the error).

fatal error on 2: 
LoadError: ProcessExitedException()
while loading In[3], in expression starting on line 155

 in yieldto at /Applications/Julia.app/Contents/Resources/julia/lib/julia/sys.dylib
 in wait at /Applications/Julia.app/Contents/Resources/julia/lib/julia/sys.dylib (repeats 3 times)
 in take! at /Applications/Julia.app/Contents/Resources/julia/lib/julia/sys.dylib (repeats 2 times)
 in remotecall_fetch at multi.jl:729
 in wrapper at In[2]:8

ERROR: write: invalid argument (EINVAL)
 in yieldto at /Applications/Julia.app/Contents/Resources/julia/lib/julia/sys.dylib
 in wait at /Applications/Julia.app/Contents/Resources/julia/lib/julia/sys.dylib
 in stream_wait at /Applications/Julia.app/Contents/Resources/julia/lib/julia/sys.dylib
 in uv_write at stream.jl:944
 in buffer_or_write at stream.jl:966
 in write at stream.jl:993
 in serialize_array_data at serialize.jl:164
 in serialize at serialize.jl:181
 in serialize at serialize.jl:418
 in send_msg_ at multi.jl:222
 in send_msg_now at multi.jl:173
 in deliver_result at multi.jl:808
 [inlined code] from multi.jl:893
 in anonymous at task.jl:892
Worker 2 terminated.
ERROR (unhandled task failure): readcb: invalid argument (EINVAL)

The following minimal working example reproduces the error on my Mac, but results in a slightly different error and backtrace on a similar Windows computer:

addprocs(1)

@everywhere function testthis()
    databuffer = zeros(300000000)
    return databuffer
end

function wrapper()
    for i_wpid in workers()
        result = remotecall_fetch(i_wpid, testthis)
    end
end

wrapper()

The error does not occur when I omit the addprocs(1) in the above code.

Below is my versioninfo, including memory information:

Julia Version 0.4.0
Commit 0ff703b* (2015-10-08 06:20 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
  WORD_SIZE: 64
  uname: Darwin 15.0.0 Darwin Kernel Version 15.0.0: Wed Aug 26 16:57:32 PDT 2015; root:xnu-3247.1.106~1/RELEASE_X86_64 x86_64 i386
Memory: 16.0 GB (9407.90625 MB free)

My attempts to fix this have not been successful so far. Any help is greatly appreciated.

amitmurthy commented 8 years ago

I thought this had gone away but apparently not.

./julia -p 1 -e "remotecall_fetch(()->zeros(2^28), 2)" fails while ./julia -p 1 -e "remotecall_fetch(()->zeros(2^28-1), 2)" succeeds.

amitmurthy commented 8 years ago

const sz = 2^28

s=listen(10000)
@schedule while true
    a = accept(s)
    @schedule while true
        data = read(a, Float64, sz)
    end
end

c=connect("localhost", 10000)
write(c, zeros(sz))

This is a generic socket IO issue. sz at 2^28-1 succeeds while 2^28 fails.
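
As a sanity check on that boundary (plain arithmetic, for illustration): 2^28 Float64 values occupy 2^31 bytes, one past the largest signed 32-bit integer, while 2^28 - 1 values stay just under it.

println(2^28 * sizeof(Float64))        # 2147483648 == 2^31, the failing size
println((2^28 - 1) * sizeof(Float64))  # 2147483640, the succeeding size
println(typemax(Int32))                # 2147483647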

Allardvm commented 8 years ago

Thanks for the quick response! I looked into this a bit more and I believe I found why this, as you suggested, had previously gone away but is now rearing its ugly head again.

As part of the write(c, zeros(sz)) call in the generic version you posted above, the program eventually calls uv_write(s::AsyncStream, p, n::Integer) at stream.jl:929 (in Julia Version 0.4.0). There, line 933

err = ccall(:jl_uv_write, Int32, (Ptr{Void}, Ptr{Void}, UInt, Ptr{Void}, Ptr{Void}), s, p, n, uvw, uv_jl_writecb_task::Ptr{Void})

results in the following error when sz >= 2^28:

ERROR (unhandled task failure): AssertionError: d != C_NULL
 in uv_writecb_task at /Applications/Julia.app/Contents/Resources/julia/lib/julia/sys.dylib
 in jlcapi_uv_writecb_task_19223 at /Applications/Julia.app/Contents/Resources/julia/lib/julia/sys.dylib
 in process_events at /Applications/Julia.app/Contents/Resources/julia/lib/julia/sys.dylib
 in wait at /Applications/Julia.app/Contents/Resources/julia/lib/julia/sys.dylib (repeats 2 times)
 in stream_wait at stream.jl:60
 in accept at stream.jl:1061
 in accept at socket.jl:338
 [inlined code] from In[1]:5
 in anonymous at task.jl:63

This does not crash the program by itself; the stream_wait(ct) on line 944 does. However, commit bb744fd14f01223961b0d2c74229a0ce001872e5 got rid of the try/finally statement that used to surround the above code and also changed the call

err = ccall(:jl_uv_write, Int32, (Ptr{Void}, Ptr{Void}, UInt, Ptr{Void}, Ptr{Void}), handle(s), p, n, uvw, uv_jl_writecb_task::Ptr{Void})

into the current

err = ccall(:jl_uv_write, Int32, (Ptr{Void}, Ptr{Void}, UInt, Ptr{Void}, Ptr{Void}), s, p, n, uvw, uv_jl_writecb_task::Ptr{Void})

I presume these changes cause the current error, but since I do not know the reasoning behind these changes I am unsure how to proceed from here.

amitmurthy commented 8 years ago

A good thing would be to confirm whether https://github.com/JuliaLang/julia/commit/bb744fd14f01223961b0d2c74229a0ce001872e5 is indeed the cause. Could you test with the commit just before it, and also with the commit itself?

Allardvm commented 8 years ago

It looks like bb744fd14f01223961b0d2c74229a0ce001872e5 may not be the cause after all. I did a git checkout e254aa84a69935495a84e07eaebf0aab61e85b33, which, as far as I can see, is the commit right before 37ade410e79e9fda3f2d08ebc63b7254a62793be was merged into master (37ade410 is the master commit that corresponds to bb744fd14f01223961b0d2c74229a0ce001872e5, which was committed to the release-0.4 branch).

Using the resultant binary

Julia Version 0.5.0-dev+133
Commit e254aa8* (2015-09-14 18:09 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)
  CPU: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

to run your generic working example

const sz = 2^28

s = listen(10000)
@schedule while true
    a = accept(s)
    @schedule while true
        data = read(a, Float64, sz)
    end
end

c = connect("localhost", 10000)
write(c, zeros(sz))

still results in

ERROR: write: invalid argument (EINVAL)
 in yieldto at ./task.jl:71
 in wait at ./task.jl:367
 in stream_wait at ./stream.jl:293
 in uv_write at stream.jl:952
 in buffer_or_write at stream.jl:965
 in write at stream.jl:1005

The same error occurs when using the v0.3.11 Julia binary.

So far I have only used the Julia binaries, so I am new to building Julia from source. If you would like me to test this a bit differently, please let me know.

Allardvm commented 8 years ago

The discussion in these two closed issues might be of use as they seem related: #9497, #6567. The test code in #6567 still results in an EINVAL error when called with at least 1 worker present.

amitmurthy commented 8 years ago

This is the problem:

len in uv_buf_t is an unsigned int

See https://github.com/libuv/libuv/blob/ef47e8b2127c45ee58cae6b6f8ace5d6fde695b2/include/uv.h#L434

and our usage here: https://github.com/JuliaLang/julia/blob/706600408aba8b142c47c2bc887bde0d9bf774cf/src/jl_uv.c#L356

amitmurthy commented 8 years ago

We will need to provide more than one buffer to libuv. Will submit a patch.

amitmurthy commented 8 years ago

Nope, that was not it. The issue is still one of an unsigned int overflow somewhere. Trying to track it down.

amitmurthy commented 8 years ago

libuv seems to be using the writev system call for the actual socket write, and I found this line in the OSX writev man page that may explain the error:

 [EINVAL]           The sum of the iov_len values in the iov array overflows a 32-bit integer.

from https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/write.2.html
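
That condition lines up with the original report (a quick check, for illustration): the 300,000,000-element Float64 array comes to 2.4e9 bytes, past the signed 32-bit limit.

println(300_000_000 * sizeof(Float64))  # 2400000000 bytes
println(typemax(Int32))                 # 2147483647, so the sum overflows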

cc @vtjnash

Allardvm commented 8 years ago

Looks like that may indeed cause the OSX issue, considering that your generic working example runs perfectly on Windows.

However, there is still a related issue in here somewhere, since my original working example (below) doesn't run correctly on Windows either. The code produces the EINVAL error reported above on OSX, but immediately triggers a BSOD on Windows 10. I just lost my debugging efforts when I tried to run the code, which also prevents me from providing a decent backtrace.

addprocs(1)

@everywhere function testthis()
    databuffer = zeros(300000000)
    return databuffer
end

function wrapper()
    for i_wpid in workers()
        result = remotecall_fetch(i_wpid, testthis)
    end
end

wrapper()

Allardvm commented 8 years ago

I'm all but certain that the EINVAL error on OSX is related to the writev exception you mentioned. I have manually traced your generic working example to the following call in libuv/src/unix/stream.c:

n = write(uv__stream_fd(stream), iov[0].iov_base, iov[0].iov_len);

Here, iov[0].iov_len contains the value 2147483648, which indeed overflows a signed 32-bit int. I guess this is some sort of documented bug in OSX, since the Open Group only calls for an EINVAL when it would overflow an ssize_t, which is usually based on an unsigned* int.

I see Go also ran into this issue: https://github.com/golang/go/issues/7812 and fixed it by implementing a cap on writes at 2 << 30 - 1.

*Correction: ssize_t should be the size of a signed int. This makes OSX's behavior more expected, although I would expect the {SSIZE_MAX} to be based on a 64-bit int on a 64-bit machine, which it doesn't seem to be. The behavior of write (not writev) when iov[0].iov_len > {SSIZE_MAX} is left undefined by POSIX, although I haven't found documentation on OSX's implementation yet.

StefanKarpinski commented 8 years ago

Can't we just emulate large writes by splitting them into batches smaller than 2^31-1? That seems friendlier than Go's approach of disallowing really large writes.
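
For illustration, a minimal sketch of that batching idea at the Julia level, assuming a hypothetical write_chunked helper and a conservative 2^30-byte cap (the real fix would live in the stream code or libuv):

# Hypothetical sketch: split one large write into sub-2^31-byte batches.
# MAX_CHUNK and write_chunked are illustrative names, not existing API.
const MAX_CHUNK = 2^30  # 1 GiB, comfortably below the 2^31 - 1 limit

function write_chunked(io::IO, data::Vector{UInt8})
    offset = 1
    while offset <= length(data)
        n = min(MAX_CHUNK, length(data) - offset + 1)
        write(io, data[offset:offset + n - 1])  # slicing copies; a real fix would write in place
        offset += n
    end
    return length(data)
end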

Allardvm commented 8 years ago

Yeah, that's what Amit suggested as well and is how other languages have implemented it. I think the main thing to decide on is where to implement this. Should it be in libuv, or should we do this in Julia? There is some talk about that in the pull request above, but I'm also still trying to figure out how this can be best implemented in a way that works everywhere, rather than just on OS X.

ScottPJones commented 8 years ago

@Allardvm:

*Correction: ssize_t should be the size of a signed int. This makes OSX's more expected, although I would expect the {SSIZE_MAX} to be based on a 64 bit int on a 64 bit machine, which it doesn't seem to be.

size_t is only guaranteed to be able to hold the size of the largest object (it is unsigned). ssize_t is a signed version of size_t. It is not necessarily the size of an int; it can be larger (or even smaller). System programming models these days are generally 32-bit ILP32, 64-bit LP64 (Linux/Unix), or 64-bit LLP64 (Windows-64). On LP64 and LLP64 systems, size_t is 64 bits. I don't think Julia needs to worry about segmented memory systems, where size_t might be 16 or 32 bits but pointers were 32 or 48 bits respectively. (Those systems are why uintptr_t and intptr_t should be used instead of size_t/ssize_t to hold pointers cast to integers.)
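
(These widths are easy to check from Julia via its C type aliases; on an LP64 or LLP64 system the first two print 8 while a plain int stays at 4:)

# size_t / ssize_t are pointer-sized on LP64 and LLP64 systems,
# while a plain int stays 4 bytes even on 64-bit platforms.
println(sizeof(Csize_t))   # 8 on 64-bit systems
println(sizeof(Cssize_t))  # 8 on 64-bit systems
println(sizeof(Cint))      # 4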

This really seems like a documented "feature" on OS X, where a signed int (32-bit) is used to sum up the different sizes, instead of an ssize_t (64-bit on 64-bit systems). Linux / AIX / Solaris all allow up to SSIZE_MAX. Emulating large writes by splitting them up is not correct either, because the write should be atomic. You'd need to malloc a buffer big enough for the total size, copy the sections to the buffer, do a single write, then free the buffer.
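
(A minimal sketch of that coalescing approach, with hypothetical names, just to illustrate the trade-off: one allocation and copy buys a single write call for the whole payload:)

# Hypothetical sketch: coalesce several buffers into one so the kernel
# sees a single write, preserving atomicity at the cost of a full copy.
function write_coalesced(io::IO, parts::Vector{Vector{UInt8}})
    total = sum(map(length, parts))
    buf = zeros(UInt8, total)
    offset = 1
    for p in parts
        copy!(buf, offset, p, 1, length(p))  # copy! was renamed copyto! in later Julia
        offset += length(p)
    end
    write(io, buf)  # one write for the whole payload
end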

vtjnash commented 8 years ago

write isn't atomic anyways, so that is irrelevant (http://linux.die.net/man/3/write)

If you follow my cross ref, you will see write exhibits the same behaviors as writev.

ScottPJones commented 8 years ago

From http://man7.org/linux/man-pages/man2/readv.2.html:

The data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes (but see pipe(7) for an exception); analogously, readv() is guaranteed to read a contiguous block of data from the file, regardless of read operations performed in other threads or processes that have file descriptors referring to the same open file description (see open(2)).

A write may be interrupted, and you may be able to see a torn write if the system crashes, but they are supposed to be atomic in the sense that multiple processes output don't get intermingled. As it notes above, there are some exceptions, such as the one with pipes, where atomicity isn't guaranteed if the size is greater than the size of PIPE_BUF

@vtjnash The top of the reference you showed says that it isn't even necessarily valid for any Linux implementation:

This manual page is part of the POSIX Programmer's Manual. The Linux implementation of this interface may differ (consult the corresponding Linux manual page for details of Linux behavior), or the interface may not be implemented on Linux.

Even so, it implicitly does make the guarantee, with the following exception:

Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.

That documentation also makes it clear that if you have opened the file with the O_APPEND flag, the seek and the write are atomic with respect to other writers:

If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.

If write didn't make any guarantees of atomicity, it would make many programming tasks very difficult: imagine if you always had to grab a lock on a device just to ensure that your output operations didn't get interleaved with those of other processes.

So, to sum this up, it is important that this not be done by doing multiple writes. It is much better to just get the error, instead of losing the logical integrity of your data.

(although maybe another write and writev call could be added, for people who are certain that other processes can't be attempting to write to the file at the same time).

vtjnash commented 8 years ago

Unless you are arguing that 2GB is < 4096, or that we should be using the local filesystem as the network interface, I suggest you should start grabbing those locks. (And no, I don't find it hard to imagine a world where resource contention requires locks. That's parallel programming 101.)

This is not a discussion about O_APPEND semantics. The two operations are not related, other than the use of the same term for different purposes

ScottPJones commented 8 years ago

No, what I'm saying is that splitting writes would break things for large writes, depending on the type of device being written to. Grabbing locks when it is not required is a performance killer; maybe that wasn't taught in your Parallel Programming 101. The libuv API that Julia uses is broken: although it says that uv_write/uv_read are equivalent to pwritev/preadv, they return a 32-bit int, not a 32/64-bit ssize_t, and the uv_buf_init function takes an unsigned int instead of a size_t. @amitmurthy according to the libuv documentation, the uv_buf_t len field actually is a size_t (ULONG on Windows), which means that on Windows it is always limited to 32 bits, but elsewhere it is being limited to 32 bits by the uv_buf_init constructor.
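
(The relevant width is visible from Julia: Culong maps to the platform's unsigned long, which is what keeps the Windows case pinned at 32 bits even where len is declared as ULONG:)

# unsigned long / ULONG is 4 bytes on Windows (LLP64) but 8 bytes on
# Linux and OS X (LP64), so a size_t-or-ULONG len field only gains
# 64-bit range on the Unix side.
println(sizeof(Culong))  # 4 on Windows, 8 on 64-bit Unix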

ScottPJones commented 8 years ago

@amitmurthy I just noticed what look like bugs in the Julia jl_uv code. The fields base and len are documented as read-only in the libuv documentation, and uv_buf_t structures are supposed to be created via uv_buf_init, which returns a uv_buf_t by value; however, all of the Julia wrapper code simply sets buf[0].base / buf[0].len directly.

vtjnash commented 8 years ago

That function was added for FFI convenience since the order of the fields is OS-dependent. It's unnecessary when used from C.

amitmurthy commented 8 years ago

https://github.com/JuliaLang/libuv/pull/33 fixed this for UNIX. Presumably still an issue on Windows.

Allardvm commented 8 years ago

Yes, this is still an issue on Windows. Running the code below results in an immediate BSOD on Windows 7 in the current release (v0.4.3) and the most recent nightly (v0.5.0-dev+3116).

addprocs(1)

@everywhere function foo()
    databuffer = zeros(300000000)
    return databuffer
end

function bar()
    for i_wpid in workers()
        result = remotecall_fetch(i_wpid, foo)
    end
end

bar()

The issue seems a bit different from the UNIX one, since the working example that triggered an EINVAL error due to large reads/writes on UNIX (before the fix in https://github.com/JuliaLang/libuv/pull/33) does not trigger an error on Windows:

const sz = 2^28

s = listen(10000)
@schedule while true
    a = accept(s)
    @schedule while true
        data = read(a, Float64, sz)
    end
end

c = connect("localhost", 10000)
write(c, zeros(sz))