Open Cody-G opened 8 years ago
Is there any chance you can capture this with --inline=no? That will slow things down tremendously, so it may not be practical. Unless you can trap it using a really tiny sub-image?
I should clarify that part of the reason I asked this is that line 280 is this line, which doesn't make much sense as the source of the problem.
Oops, I had some local changes (just comments, so not relevant to the error). So for me line 280 corresponds to this line.
I'm also running it without inlining now to try to catch the error in case it's helpful.
Okay I did trigger the error without inlining. Most, but not all, of the line numbers are the same. I also noticed some warnings which I think are unrelated, but I'm pasting them too:
WARNING: Module Reexport uuid did not match cache file
WARNING: Module Reexport uuid did not match cache file
WARNING: deserialization checks failed while attempting to load cache from /home/cody/git/juliapackages_new/lib/v0.4/RegisterMismatchCuda.ji
WARNING: deserialization checks failed while attempting to load cache from /home/cody/git/juliapackages_new/lib/v0.4/RegisterMismatchCuda.ji
WARNING: Module Reexport uuid did not match cache file
WARNING: deserialization checks failed while attempting to load cache from /home/cody/git/juliapackages_new/lib/v0.4/RegisterMismatchCuda.ji
From worker 2: Worker 2 is working on 1
From worker 3: Worker 3 is working on 2
From worker 4: Worker 4 is working on 3
ERROR: LoadError: On worker 2:
UndefRefError: access to undefined reference
in getindex at /home/cody/git/juliapackages_new/v0.4/OCPI/src/Bidirectional.jl:280
in copy! at multidimensional.jl:582
in getindex at subarray.jl:607
in worker at /home/cody/git/juliapackages_new/v0.4/BlockRegistrationScheduler/src/RegisterWorkerAperturesMismatch.jl:124
in worker at /home/cody/git/juliapackages_new/v0.4/BlockRegistrationScheduler/src/RegisterWorkerShell.jl:148
in anonymous at multi.jl:907
in run_work_thunk at multi.jl:645
[inlined code] from multi.jl:907
in anonymous at task.jl:63
in remotecall_fetch at multi.jl:731
in remotecall_fetch at multi.jl:734
[inlined code] from /home/cody/git/juliapackages_new/v0.4/BlockRegistrationScheduler/src/RegisterDriver.jl:84
in anonymous at task.jl:447
...and 2 other exceptions.
That line number makes much more sense: it suggests that one of the fields of the ArrayZInterp is undefined. But the weird thing is that neither ArrayZInterp nor ArrayZSeq has an inner constructor, and without an inner constructor I don't think it's possible to have an undefined field. So I'm still puzzled.
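To make the reasoning about inner constructors concrete, here is a minimal sketch of how an undefined field can arise. The types `WithDefault` and `WithInner` are hypothetical and are not the real ArrayZInterp or ArrayZSeq. The sketch uses current Julia syntax (`mutable struct`); on the 0.4 series used in this thread the keyword is `type`.

```julia
# Hypothetical types for illustration; not the real ArrayZInterp/ArrayZSeq.
mutable struct WithDefault
    data::Vector{Float64}          # only the default constructor exists: WithDefault(data)
end

mutable struct WithInner
    data::Vector{Float64}
    WithInner() = new()            # incomplete initialization leaves `data` unassigned
end

a = WithDefault([1.0, 2.0])
b = WithInner()

println(isdefined(a, :data))       # the field was supplied at construction
println(isdefined(b, :data))       # the field was never assigned

# Reading the unassigned field raises the same kind of error as in the backtrace.
err = try
    b.data
catch e
    e
end
println(err isa UndefRefError)
```

With only the default constructor every field has to be passed in, which is why the absence of an inner constructor makes the error surprising.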
How about trying this: define a copy! method with signature
copy!{T}(dest::AbstractArray{T,4}, src::ArrayZInterp{T})
and first check that this is what gets called (e.g., make the body something like error("here we are!") and check that the backtrace is the same starting with that getindex at subarray.jl:607). Then you should be able to use this to learn more about what's happening. If you haven't used it before, isdefined could be quite handy, e.g., isdefined(src, :data) and isdefined(src, :framenumber).
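The diagnostic suggested above could look roughly like the following. This is a sketch under assumptions, not the package's code: `FakeZInterp` is a hypothetical stand-in (only the field names `data` and `framenumber` come from this thread), and `undefined_fields` and `checked_copy!` are made-up helper names. It is written in current Julia syntax (`where {T}`, `copyto!`); on 0.4 the method would be spelled `copy!{T}(...)` as in the signature above.

```julia
# Hypothetical stand-in for ArrayZInterp; only the field names come from the thread.
mutable struct FakeZInterp{T}
    data::Array{T,4}
    framenumber::Int
    FakeZInterp{T}() where {T} = new{T}()               # leaves `data` unassigned
    FakeZInterp{T}(data, n) where {T} = new{T}(data, n)
end

# Names of the fields of `x` that are currently unassigned.
undefined_fields(x) = [f for f in fieldnames(typeof(x)) if !isdefined(x, f)]

# Diagnostic copy: fail loudly, naming the unassigned fields, before touching any data.
function checked_copy!(dest::AbstractArray{T,4}, src::FakeZInterp{T}) where {T}
    bad = undefined_fields(src)
    isempty(bad) || error("undefined fields in src: ", bad)
    copyto!(dest, src.data)
end

broken = FakeZInterp{Float64}()
println(undefined_fields(broken))

ok = FakeZInterp{Float64}(ones(2, 2, 2, 2), 1)
dest = zeros(2, 2, 2, 2)
checked_copy!(dest, ok)
```

Because the check runs on whichever process executes the copy, the error message would show up in the worker's backtrace and say which field is missing, instead of a bare UndefRefError.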
Long-term, we might want to define such a copy! method anyway, since it could be more efficient than looping over scalar indexes. But that's a topic for another day.
Just noticed that it's quite possible that --inline=no doesn't get passed to the workers. You could use
addprocs(n; exeflags=`--inline=no`)
I should also say that now the backtrace makes sense, so running with --inline=no is a moot point.
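For completeness, one way to check that the flag actually reaches the workers is sketched below. It assumes a current Julia, where `addprocs` lives in the `Distributed` standard library and `remotecall_fetch` takes the function first (on 0.4 both are in Base and the worker id comes first). `Base.JLOptions()` is an internal, undocumented struct, so reading its `can_inline` field is an assumption that may not hold on every version.

```julia
using Distributed

# Start two workers with inlining disabled.
pids = addprocs(2; exeflags=`--inline=no`)

# Ask each worker for its own inlining option (0 means inlining is off).
flags = [remotecall_fetch(() -> Base.JLOptions().can_inline, p) for p in pids]

for (p, f) in zip(pids, flags)
    println("worker ", p, ": can_inline = ", f)
end

rmprocs(pids...)
```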
One other thought: try changing those types to immutable. They can't have undefined fields.
I wonder if this could by any chance be the same as #36, with a different error message?
When using multiple workers I am getting this error fairly often. When I just try running the same script again it works after one or two tries. My guess is that multiple workers are trying to access the image (in this case a subarray into a Bidirectional image) simultaneously and depending on the timing, the reference is sometimes unavailable. I'm not sure how to debug this. For now just running the script again seems to work, but I'm happy to try to fix this if someone can suggest a direction.