Closed sbromberger closed 7 years ago
... and bisect is broken:
error: Your local changes to the following files would be overwritten by checkout:
CMakeLists.txt
src/openssl_stream.c
Please, commit your changes or stash them before you can switch branches.
Aborting
make[1]: *** [libgit2/CMakeLists.txt] Error 1
make: *** [julia-deps] Error 2
Will try some brute-force.
No crashes, but errors here:
| | |_| | | | (_| | | Version 0.4.0-dev+6033 (2015-07-17 02:56 UTC)
_/ |\__'_|_|_|\__'_| | Commit efcc709 (12 days old master)
|__/ | x86_64-apple-darwin14.4.0
julia> @everywhere using LightGraphs
exception on 1: exception on 3: exception on 5: exception on 4: exception on 2: ERROR: MethodError: `base_include` has no method matching base_include(::UTF8String, ::ASCIIString, ::Tuple{Int64,ASCIIString})
Closest candidates are:
base_include(::Nullable{Union{UTF8String,ASCIIString}}, ::AbstractString, ::Any)
base_include(::AbstractString, ::Any)
base_include(::Nullable{Union{UTF8String,ASCIIString}}, ::AbstractString)
...
in eval at sysimg.jl:14
in anonymous at multi.jl:1297
in run_work_thunk at multi.jl:584
in run_work_thunk at multi.jl:593
in anonymous at task.jl:8
I will assume that this is a separate issue and will attempt to pinpoint the version that crashes.
did you precompile this package?
@jakebolewski at one time, yes, but ~/.julia/lib/v0.4 is currently empty.
According to bisect results,
e8a1c7440f47707be3329775fac91f0c4bf9c27d is the first bad commit
commit e8a1c7440f47707be3329775fac91f0c4bf9c27d
Author: Tim Holy <tim.holy@gmail.com>
Date: Sat Jul 11 10:13:49 2015 -0500
Add missing base_include method
This fixes errors that crop up with multiple workers, e.g.,
ERROR: MethodError: `base_include` has no method matching base_include(::ASCIIString, ::ASCIIString, ::Tuple{Int64,ASCIIString})
Closest candidates are:
base_include(::Nullable{Union{UTF8String,ASCIIString}}, ::AbstractString, ::Any)
base_include(::AbstractString, ::Any)
base_include(::Nullable{Union{UTF8String,ASCIIString}}, ::AbstractString)
...
in eval at sysimg.jl:14
in anonymous at multi.jl:1303
in run_work_thunk at multi.jl:584
...
Perhaps you'd prefer a call-site fix?
:040000 040000 dc8f6d703af0bc05b1dc811013d73a713fdf05dc 21a496ea73cbb8fcedeecb601d7a978e9067e283 M base
Validating that the previous version actually works... there were several errors encountered throughout.
The version prior to bisect's "first bad commit" turns out to produce the errors shown in https://github.com/JuliaLang/julia/issues/12381#issuecomment-126144754, so it looks like Tim's commit fixed that bug but may have uncovered the crash bug.
cc @timholy
There doesn't appear to be any way e8a1c7440f47707be3329775fac91f0c4bf9c27d is the real culprit. CC @vtjnash?
re: https://github.com/JuliaLang/julia/issues/12381#issuecomment-126144337, if you go back far enough that dependencies changed versions you'll often need to do `make -C deps distclean-libgit2`, or similar for pcre. Make sure you're doing `make cleanall` at each step of bisect just to be sure (the only deps that `cleanall` deletes by default are small ones).
The following is with no cleaning before rebuild. I get different behavior on two machines:
Julia Version 0.4.0-dev+6033
Commit efcc709* (2015-07-17 02:56 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.3
> julia -p 4
Hangs (for at least several minutes) with:
ERROR: AssertionError:
in init_worker at ./multi.jl:1051
in start_worker at multi.jl:964
in process_options at ./client.jl:265
in _start at ./client.jl:411
Different machine
Julia Version 0.4.0-dev+6033
Commit efcc709* (2015-07-17 02:56 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.3
> julia -p 4
> @everywhere using PowerSeries
exception on 1: exception on 4: exception on 2: ERROR: MethodError: `base_include` has no method matching base_include(::ASCIIString, ::ASCIIString, ::Tuple{Int64,ASCIIString})
Closest candidates are:
base_include(::Nullable{Union{UTF8String,ASCIIString}}, ::AbstractString, ::Any)
base_include(::AbstractString, ::Any)
base_include(::Nullable{Union{UTF8String,ASCIIString}}, ::AbstractString)
...
in eval at sysimg.jl:14
in anonymous at multi.jl:1297
in run_work_thunk at multi.jl:584
in run_work_thunk at multi.jl:593
in anonymous at task.jl:8
...
Warning: requiring "PowerSeries" did not define a corresponding module.
A git bisect with `make cleanall` leads me to 416a23ee as the first bad commit. 88bb2e9 is the last commit where `./julia -e "addprocs(2); @everywhere using FactCheck"` succeeds for me. (No pre-compilation used anywhere).
I also find that one to be strange as a culprit, but CCing @ScottPJones anyway.
I'm not at home now, but I can't see how it would have an effect, unless there were invalid UTF-8 data that wasn't detected before, but you would have seen a `UnicodeError` then.
Interesting. What I posted above was performed on OSX. On Linux, both commits work just fine. Current master on Linux fails, though, as on OSX. If nobody beats me to it I will bisect again on both systems tomorrow.
I don't think a bisect is entirely required for this one: from the issue description above, it's apparent that there's a race condition between the call to `using` on node 1 and the call to `using` on the other nodes that is not being properly accounted for in the changes to the `require` logic.
Bisects are not at all accurate for things like race conditions - they can only give you some idea as to some bounds where the bug was introduced.
Yes, it looks like a race condition. The type of error produced (and whether they occur at all) changes from run to run.
For testing, I find the bug more likely to occur with four processes than two.
If I do
make cleanall
make distclean
make -C deps distcleanall
Then building commit 88bb2e9 shows the bug.
FWIW, sleeping a little makes it pass for me again on current master:
./julia -e "addprocs(6); @everywhere sleep(0.1); using JSON, FactCheck, Compat"
Calling it without `@everywhere` altogether works as well:
./julia -e "addprocs(6); using JSON, FactCheck, Compat"
sleeping a little makes everything pass
https://github.com/JuliaLang/julia/pull/12581 does seem to fix the segfault on my local machine. I'd ask other folks to test it out.
This is what I get.
amitm@amitm-macbookpro:~/Work/julia/julia$ julia -p 4
_
_ _ _(_)_ | A fresh approach to technical computing
(_) | (_) (_) | Documentation: http://docs.julialang.org
_ _ _| |_ __ _ | Type "help()" for help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 0.4.0-dev+6662 (2015-08-13 16:13 UTC)
_/ |\__'_|_|_|\__'_| | amitm/loading_fix/296a7db* (fork: 1 commits, 2 days)
|__/ | x86_64-linux-gnu
julia> @everywhere using LightGraphs
INFO: Precompiling module LightGraphs...
WARNING: Module StatsFuns uuid did not match cache file
WARNING: Module StatsFuns uuid did not match cache file
WARNING: Module StatsFuns uuid did not match cache file
WARNING: Module StatsFuns uuid did not match cache file
WARNING: node state is inconsistent: node 2 failed to load cache from /home/amitm/.julia/lib/v0.4/LightGraphs.ji
WARNING: node state is inconsistent: node 3 failed to load cache from /home/amitm/.julia/lib/v0.4/LightGraphs.ji
WARNING: node state is inconsistent: node 4 failed to load cache from /home/amitm/.julia/lib/v0.4/LightGraphs.ji
WARNING: node state is inconsistent: node 5 failed to load cache from /home/amitm/.julia/lib/v0.4/LightGraphs.ji
and
julia>
amitm@amitm-macbookpro:~/Work/julia/julia$ ./julia -e "addprocs(6); using JSON, FactCheck, Compat"
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module DataStructures should explicitly import < from Base
WARNING: module DataStructures should explicitly import <= from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
WARNING: module JSON should explicitly import colon from Base
Warnings and errors but no segfault.
why are you getting "WARNING: node state is inconsistent" there? that generally is going to be really, really bad.
I cleaned `.cache`.
Now with `julia -p4`, I see
julia> @everywhere using LightGraphs
WARNING: replacing module LightGraphs
WARNING: Method definition ==(Base.Pair{Int64, Int64}, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32.
WARNING: Method definition show(Base.IO, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35.
WARNING: replacing module LightGraphs
WARNING: Method definition ==(Base.Pair{Int64, Int64}, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32.
WARNING: Method definition show(Base.IO, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35.
WARNING: replacing module LightGraphs
WARNING: replacing module LightGraphs
WARNING: Method definition ==(Base.Pair{Int64, Int64}, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32.
WARNING: Method definition show(Base.IO, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35.
WARNING: Method definition ==(Base.Pair{Int64, Int64}, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:32.
WARNING: Method definition show(Base.IO, Base.Pair{Int64, Int64}) in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35 overwritten in module LightGraphs at /home/amitm/.julia/v0.4/LightGraphs/src/core.jl:35.
Why did it precompile the first time around and not now?
Now getting
seth@schroeder:~/dev/julia/julia$ julia -p 4
_
_ _ _(_)_ | A fresh approach to technical computing
(_) | (_) (_) | Documentation: http://docs.julialang.org
_ _ _| |_ __ _ | Type "help()" for help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 0.4.0-dev+6817 (2015-08-18 15:25 UTC)
_/ |\__'_|_|_|\__'_| | Commit 77bef6e (0 days old master)
|__/ | x86_64-apple-darwin14.5.0
julia> @everywhere using LightGraphs
WARNING: replacing module LightGraphs
WARNING: Method definition ==(Base.Pair{Int64, Int64}, Base.Pair{Int64, Int64}) in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:24 overwritten in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:24.
WARNING: Method definition show(Base.IO, Base.Pair{Int64, Int64}) in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:27 overwritten in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:27.
WARNING: replacing module LightGraphs
WARNING: replacing module LightGraphs
WARNING: replacing module LightGraphs
WARNING: Method definition ==(Base.Pair{Int64, Int64}, Base.Pair{Int64, Int64}) in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:24 overwritten in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:24.
WARNING: Method definition show(Base.IO, Base.Pair{Int64, Int64}) in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:27 overwritten in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:27.
WARNING: Method definition ==(Base.Pair{Int64, Int64}, Base.Pair{Int64, Int64}) in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:24 overwritten in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:24.
WARNING: Method definition show(Base.IO, Base.Pair{Int64, Int64}) in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:27 overwritten in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:27.
WARNING: Method definition ==(Base.Pair{Int64, Int64}, Base.Pair{Int64, Int64}) in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:24 overwritten in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:24.
WARNING: Method definition show(Base.IO, Base.Pair{Int64, Int64}) in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:27 overwritten in module LightGraphs at /Users/seth/.julia/v0.4/LightGraphs/src/core.jl:27.
I tried it on a fresh build made with `make cleanall`, with a package that does not use precompilation:
➜ LMCLUS git:(master) ✗ julia-dev
_
_ _ _(_)_ | A fresh approach to technical computing
(_) | (_) (_) | Documentation: http://docs.julialang.org
_ _ _| |_ __ _ | Type "help()" for help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 0.4.0-dev+6817 (2015-08-18 15:25 UTC)
_/ |\__'_|_|_|\__'_| | Commit 77bef6e* (0 days old master)
|__/ | x86_64-linux-gnu
julia> addprocs(1)
1-element Array{Int64,1}:
2
julia> nprocs()
2
julia> @everywhere using LMCLUS
WARNING: replacing module LMCLUS
WARNING: could not import MultivariateStats.PCA into LMCLUS
WARNING: could not import MultivariateStats.fit into LMCLUS
WARNING: could not import MultivariateStats.principalratio into LMCLUS
signal (11): Segmentation fault
unknown function (ip: 0x7f64226a509a)
unknown function (ip: 0x7f6422621ecb)
unknown function (ip: 0x7f6422621ef1)
jl_get_global at /home/art/Development/julia-nightly/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7f6422667344)
unknown function (ip: 0x7f6422668263)
unknown function (ip: 0x7f6422669081)
unknown function (ip: 0x7f64226694e6)
unknown function (ip: 0x7f642267cb0f)
unknown function (ip: 0x7f642267d3d9)
jl_load_file_string at /home/art/Development/julia-nightly/usr/bin/../lib/libjulia.so (unknown line)
include_string at loading.jl:228
jl_apply_generic at /home/art/Development/julia-nightly/usr/bin/../lib/libjulia.so (unknown line)
include_from_node1 at ./loading.jl:269
jl_apply_generic at /home/art/Development/julia-nightly/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7f6422668a43)
unknown function (ip: 0x7f6422667e61)
unknown function (ip: 0x7f642267c6e8)
unknown function (ip: 0x7f642267ce52)
unknown function (ip: 0x7f642267caa5)
unknown function (ip: 0x7f642267d3d9)
jl_load_file_string at /home/art/Development/julia-nightly/usr/bin/../lib/libjulia.so (unknown line)
include_string at loading.jl:228
jl_apply_generic at /home/art/Development/julia-nightly/usr/bin/../lib/libjulia.so (unknown line)
include_from_node1 at ./loading.jl:269
jl_apply_generic at /home/art/Development/julia-nightly/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7f6422668a43)
unknown function (ip: 0x7f6422667e61)
unknown function (ip: 0x7f642267c6e8)
jl_toplevel_eval_in at /home/art/Development/julia-nightly/usr/bin/../lib/libjulia.so (unknown line)
require at ./loading.jl:203
unknown function (ip: 0x7f641f42a53c)
jl_apply_generic at /home/art/Development/julia-nightly/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7f642267b9c5)
unknown function (ip: 0x7f642267c95b)
jl_toplevel_eval_in at /home/art/Development/julia-nightly/usr/bin/../lib/libjulia.so (unknown line)
eval at ./sysimg.jl:14
jl_apply_generic at /home/art/Development/julia-nightly/usr/bin/../lib/libjulia.so (unknown line)
anonymous at multi.jl:1348
jl_f_apply at /home/art/Development/julia-nightly/usr/bin/../lib/libjulia.so (unknown line)
anonymous at multi.jl:889
run_work_thunk at multi.jl:642
jlcall_run_work_thunk_21126 at (unknown line)
jl_apply_generic at /home/art/Development/julia-nightly/usr/bin/../lib/libjulia.so (unknown line)
anonymous at task.jl:889
unknown function (ip: 0x7f642266e650)
unknown function (ip: (nil))
Worker 2 terminated.ERROR: ProcessExitedException()
in yieldto at ./task.jl:75
in wait at ./task.jl:371
in wait at ./task.jl:286
in wait at ./channels.jl:93
in take! at ./channels.jl:82
in take! at ./multi.jl:789
in remotecall_fetch at multi.jl:726
in remotecall_fetch at multi.jl:731
in anonymous at multi.jl:1350
in sync_end at ./task.jl:413
in anonymous at multi.jl:1359
ERROR (unhandled task failure): EOFError: read end of file
in sync_end at ./task.jl:413
in anonymous at multi.jl:1359
But if I precompile the package in advance - no problems:
➜ LMCLUS git:(master) ✗ julia-dev
_
_ _ _(_)_ | A fresh approach to technical computing
(_) | (_) (_) | Documentation: http://docs.julialang.org
_ _ _| |_ __ _ | Type "help()" for help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 0.4.0-dev+6817 (2015-08-18 15:25 UTC)
_/ |\__'_|_|_|\__'_| | Commit 77bef6e* (0 days old master)
|__/ | x86_64-linux-gnu
julia> Base.compilecache(:LMCLUS)
"/home/art/.julia/lib/v0.4/LMCLUS.ji"
julia>
➜ LMCLUS git:(master) ✗ julia-dev
_
_ _ _(_)_ | A fresh approach to technical computing
(_) | (_) (_) | Documentation: http://docs.julialang.org
_ _ _| |_ __ _ | Type "help()" for help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 0.4.0-dev+6817 (2015-08-18 15:25 UTC)
_/ |\__'_|_|_|\__'_| | Commit 77bef6e* (0 days old master)
|__/ | x86_64-linux-gnu
julia> addprocs(1)
1-element Array{Int64,1}:
2
julia> @everywhere using LMCLUS
WARNING: replacing module LMCLUS
julia>
Just tried on OSX and Linux using latest master (Commit e5e8ed5*), and don't get any crashes when running (for X in 1:8):
julia -e "@everywhere using LightGraphs" -p X
I do get the `WARNING: node state is inconsistent` though.
Using `@everywhere using` with an empty compilation cache triggers it. Before each of the following I run `rm -rf ~/.julia/lib/v0.4`. This works:
> julia -e "@everywhere using LightGraphs"
Adding one or more workers shows the warning:
> julia -e "@everywhere using LightGraphs" -p 1
WARNING: Module StatsFuns uuid did not match cache file
WARNING: node state is inconsistent: node 2 failed to load cache from /Users/rene/.julia/lib/v0.4/LightGraphs.ji
No warning when omitting `@everywhere`:
> julia -e "using LightGraphs" -p 1
forgot to cc @stevengj
I suspect that what is going on here is:
1) `@everywhere using Foo` for a `Foo` with `__precompile__()` triggers a `compilecache(:Foo)` on the master node.
2) The worker nodes do not call `compilecache` (because `__precompile__` only takes effect on the master node); they either load from the `.jl` file or from the `.ji` file (if the latter exists).
3) Maybe the workers are reading `Foo.ji` in a half-written state and crashing?
Possible solutions:
1) […] `nprocs > 1`
2) Change the `@everywhere` macro to recognize `@everywhere using Foo` and transform it to `import Foo; @everywhere using Foo`. (The `import Foo` statement should work: it precompiles on the master node, then loads the binary and broadcasts it IIRC.)
3) Have `Base.compilecache` write the file atomically: write to a temporary file and then do a `rename`, so that a corrupt half-written `.ji` file is never present.
The third option seems best to me, since atomic `rename` is usually the best practice for writing files in general.
(Also, it would be nicer if the deserialization import threw an error on a corrupt `.ji` file rather than crashing, if indeed that is what is happening.)
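The write-to-a-temporary-file-then-rename idea from the third option can be sketched as follows. This is only an illustration in current Julia syntax, not the change that was made to `Base.compilecache`; the helper name `atomic_write` is hypothetical. One caveat, as far as I know: `mv(...; force=true)` may remove an existing destination before renaming, so what this sketch guarantees is that a reader never sees a partially written file, not that the file is never briefly absent.

```julia
# Hypothetical helper (not part of Base): publish `data` at `path` so that
# readers never observe a half-written file.
function atomic_write(path::AbstractString, data)
    dir = dirname(abspath(path))
    # Create the temporary file in the destination directory so that the
    # final rename stays on one filesystem.
    tmppath, io = mktemp(dir)
    try
        try
            write(io, data)
        finally
            close(io)            # flush and close before publishing
        end
        # Only a fully written file is ever moved into place.
        mv(tmppath, path; force=true)
    catch
        rm(tmppath; force=true)  # clean up the temporary file on failure
        rethrow()
    end
    return path
end
```

A caller would use it as `atomic_write(joinpath(cachedir, "Foo.ji"), bytes)`, where `cachedir` and `bytes` are whatever the caller already has.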
Using `import` before `@everywhere` worked for me.
I implemented the `rename` on general principles, but it doesn't seem to solve the `WARNING: Module StatsFuns uuid did not match cache file`.
Even when the cache file already exists, however, `julia -e "@everywhere using LightGraphs" -p 1` gives `WARNING: replacing module LightGraphs`. There is a basic problem here because `@everywhere using` actually imports the module twice on all the workers:
1) `using Foo` on the master node imports it everywhere.
2) `using Foo` on a worker node only imports on the worker.
If the latter happens before the former, I guess the module will get imported twice, leading to the problems we are seeing. (At best, you get a warning.)
It seems like the `using` logic really needs to know whether it is happening in an `@everywhere` or similar statement to avoid this.
While it would be great to have `@everywhere using` work without showing any warnings, I believe using `@everywhere using` is simply a relic from when `using` did not yet auto-load on all workers. The warnings are due to the race condition of running import twice.
Shall we just live with this for now and simply discourage using `@everywhere using`?
(The original issue was a segfault, which no longer occurs.)
The `package_locks` mechanism was supposed to (used to?) solve this. If you try to do `using X` multiple times at once on a worker, it should actually happen only once.
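The "happen only once" behaviour described here can be illustrated with a small per-name lock table. This is a sketch of the general idea using cooperative tasks within one process, not the actual `package_locks` code in Base; the names `load_locks`, `loaded_names`, and `load_once` are hypothetical.

```julia
# Hypothetical sketch: concurrent requests to load the same name run the
# loader once; later requests wait for the first one to finish.
const load_locks = Dict{Symbol,Condition}()
const loaded_names = Set{Symbol}()

function load_once(loader::Function, name::Symbol)
    name in loaded_names && return false      # already loaded earlier
    if haskey(load_locks, name)
        wait(load_locks[name])                # another task is loading it
        return false
    end
    cond = Condition()
    load_locks[name] = cond                   # claim the load
    try
        loader()
        push!(loaded_names, name)
    finally
        delete!(load_locks, name)
        notify(cond)                          # wake every waiting task
    end
    return true                               # this call did the load
end
```

The return value reports whether this particular call ran the loader. If the loader throws, waiting tasks are still woken and return `false` even though nothing was loaded; a fuller version would propagate the failure.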
@stevengj It makes sense that your change would fix the segfault. Is there evidence of some other problem as well, or are we done here?
We haven't had any indication that it is still segfaulting. It would be good to have an issue for eliminating the warning, but probably that should be a separate issue.
Should we then take this off the 0.4 milestone list?
I think there are a few improvements that can be made:
1) to reduce the window of inconsistency, `find_all_in_cache_path` should block if `package_locks[mod]` indicates that the node is in the process of calling `compilecache` for that module
2) to work harder to reduce this window for inconsistencies, `__precompile__` should be handled on worker nodes by first attempting to convince node 1 to `compilecache` the package (instead of ignoring this directive on worker nodes) before deciding whether to abort or continue running the source file
3) the broadcast of top-level import from node 1 should include a conditional check of `isdefined(Main, mod)` to block accidental redefinition (unless the user explicitly does `@everywhere reload("Mod")`)
4) to reduce potential confusion, rename `require` to `reload` and deprecate the old name entirely
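The conditional check in item 3 might look roughly like the following. This is a sketch in current Julia syntax, not code from Base; `import_if_missing` is a hypothetical name, and `Base64` in the usage note is only a stand-in for whichever module is being broadcast.

```julia
# Hypothetical guard: import `mod` into Main only if Main does not already
# have a binding of that name, so a second request cannot redefine it.
function import_if_missing(mod::Symbol)
    isdefined(Main, mod) && return false
    # Expr(:import, Expr(:., mod)) is the AST of `import <mod>`.
    Core.eval(Main, Expr(:import, Expr(:., mod)))
    return true
end
```

Calling `import_if_missing(:Base64)` twice in a row performs the import the first time and is a no-op the second time.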
Is the following deserialization error on workers a manifestation of this race condition?
# higher number of workers relative to available cores seems to make it easier to reproduce
# e.g., try with 8 if 4x doesn’t work
workers = 4*Sys.CPU_CORES;
addprocs(workers);
@everywhere begin
import Distributions
immutable ParameterUnivariate{U<:Distributions.UnivariateDistribution}
dist::U
end
end
param = ParameterUnivariate(Distributions.Normal());
pmap(x->x, fill(param, 100));
results in reloading module warnings and then a large number of workers exiting with error:
ERROR: TypeError: ParameterUnivariate: in U, expected U<:Distributions.Distribution{Distributions.Univariate,S<:Distributions.ValueSupport}, got Type{Distributions.Normal}
in deserialize_datatype at serialize.jl:646
...
Moving `import Distributions` outside of the `@everywhere` block as `using Distributions` seems to fix it. Reproducible on 0.4.5 and 0.5.
We faced the same issue in DecisionTree.jl, and I've boiled it down to this. No precompilation necessary (on Julia 0.4, OSX)
# B.jl
module B
end
# C.jl
module C
abstract AbstractAbstract
end
# A.jl
module A
using B # can be any module
include("incl.jl") # problem disappears if the import is done in A.jl
end
# incl.jl
import C: AbstractAbstract
type Obj <: AbstractAbstract end
then interactively:
addprocs(3)
@everywhere using A
> On worker 2: UndefVarError: AbstractAbstract not defined
`import A; @everywhere using A` is the best way to do this at the moment, I think.
I'm not sure this is a closed issue (I'm on 0.5.0).
This was my workaround (I like `reload` for debugging purposes):
for p in procs()
@fetchfrom p reload("Package")
end
@pearcemc, do `import Package; @everywhere using Package`.
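For readers on current Julia, the same import-first pattern can be written as a small script. This is a sketch under a few assumptions: `using Distributed` is needed on Julia 1.x (on the 0.4/0.5 versions discussed in this thread these functions lived in Base), two local workers can be started, and `LinearAlgebra` is only a stand-in for `Package`.

```julia
using Distributed

addprocs(2)  # start two local worker processes

# Step 1: load (and, if needed, precompile) on the master first.
import LinearAlgebra

# Step 2: only then bring the module into scope on every process.
@everywhere using LinearAlgebra

# Check that each worker now has the module bound in Main.
loaded = [remotecall_fetch(() -> isdefined(Main, :LinearAlgebra), p)
          for p in workers()]
```

Doing the master-side `import` before `@everywhere` means any cache file is complete before the workers try to load it.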
Fixed by #21718?
Per https://groups.google.com/d/msg/julia-users/FjGXSTzvfmc/j0ZDG629IwAJ
This works on
0.4.0-dev+5008 (2015-05-26 16:08 UTC) Commit 0855ec9
Next step: `git bisect`. Stand by.