JuliaParallel / Dagger.jl

A framework for out-of-core and parallel execution

KeyError: key Dagger not found #509

Closed droodman closed 2 months ago

droodman commented 2 months ago

Just copied an example from the documentation in a new Julia session...

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.3 (2024-04-30)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using Distributed, Dagger

julia> addprocs(4);

julia> X = Dagger.@shard myid()
ERROR: On worker 2:
KeyError: key Dagger [d58978e5-989f-55fb-8d15-ea34adc7bf54] not found
Stacktrace:
  [1] getindex
    @ .\dict.jl:498 [inlined]
  [2] macro expansion
    @ .\lock.jl:267 [inlined]
  [3] root_module
    @ .\loading.jl:1878
  [4] deserialize_module
    @ C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Serialization\src\Serialization.jl:994
  [5] handle_deserialize
    @ C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Serialization\src\Serialization.jl:896
  [6] deserialize
    @ C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Serialization\src\Serialization.jl:814
  [7] deserialize_datatype
    @ C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Serialization\src\Serialization.jl:1398
  [8] handle_deserialize
    @ C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Serialization\src\Serialization.jl:867
  [9] deserialize
    @ C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Serialization\src\Serialization.jl:814
 [10] handle_deserialize
    @ C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Serialization\src\Serialization.jl:874
 [11] deserialize
    @ C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Serialization\src\Serialization.jl:814 [inlined]
 [12] deserialize_msg
    @ C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Distributed\src\messages.jl:87
 [13] #invokelatest#2
    @ .\essentials.jl:892 [inlined]
 [14] invokelatest
    @ .\essentials.jl:889 [inlined]
 [15] message_handler_loop
    @ C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Distributed\src\process_messages.jl:176
 [16] process_tcp_streams
    @ C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Distributed\src\process_messages.jl:133
 [17] #103
    @ C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Distributed\src\process_messages.jl:121
Stacktrace:
  [1] remotecall_fetch(::Function, ::Distributed.Worker; kwargs::@Kwargs{})
    @ Distributed C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Distributed\src\remotecall.jl:465
  [2] remotecall_fetch(::Function, ::Distributed.Worker)
    @ Distributed C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Distributed\src\remotecall.jl:454
  [3] remotecall_fetch
    @ C:\Users\drood\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\Distributed\src\remotecall.jl:492 [inlined]
  [4] OSProc(pid::Int64)
    @ Dagger C:\Users\drood\.julia\packages\Dagger\5F8wE\src\processor.jl:110
  [5] iterate
    @ .\generator.jl:47 [inlined]
  [6] collect_to!
    @ .\array.jl:892 [inlined]
  [7] collect_to_with_first!
    @ .\array.jl:870 [inlined]
  [8] _collect(c::Vector{Int64}, itr::Base.Generator{Vector{…}, Type{…}}, ::Base.EltypeUnknown, isz::Base.HasShape{1})
    @ Base .\array.jl:864
  [9] collect_similar
    @ .\array.jl:763 [inlined]
 [10] map
    @ .\abstractarray.jl:3285 [inlined]
 [11] Context
    @ C:\Users\drood\.julia\packages\Dagger\5F8wE\src\context.jl:34 [inlined]
 [12] eager_context()
    @ Dagger.Sch C:\Users\drood\.julia\packages\Dagger\5F8wE\src\sch\eager.jl:9
 [13] shard(f::Any; procs::Nothing, workers::Nothing, per_thread::Bool)
    @ Dagger C:\Users\drood\.julia\packages\Dagger\5F8wE\src\chunks.jl:185
 [14] shard(f::Any)
    @ Dagger C:\Users\drood\.julia\packages\Dagger\5F8wE\src\chunks.jl:180
 [15] top-level scope
    @ C:\Users\drood\.julia\packages\Dagger\5F8wE\src\chunks.jl:223
Some type information was truncated. Use `show(err)` to see complete types.
JamesWrigley commented 2 months ago

I think the issue is that Dagger is loaded before the workers are added; if you load it afterwards, it works:

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.0-beta1 (2024-04-10)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using Distributed

julia> addprocs(4)
4-element Vector{Int64}:
 2
 3
 4
 5

julia> using Dagger

julia> X = Dagger.@shard myid()
Dagger.Shard(Dict{Dagger.Processor, Dagger.Chunk}(OSProc(1) => Dagger.Chunk{Int64, MemPool.DRef, OSProc, ProcessScope}(Int64, UnitDomain(), MemPool.DRef(1, 16, 0x0000000000000008), OSProc(1), ProcessScope: worker == 1, false), OSProc(2) => Dagger.Chunk{Int64, MemPool.DRef, OSProc, ProcessScope}(Int64, UnitDomain(), MemPool.DRef(2, 0, 0x0000000000000008), OSProc(2), ProcessScope: worker == 2, false), OSProc(3) => Dagger.Chunk{Int64, MemPool.DRef, OSProc, ProcessScope}(Int64, UnitDomain(), MemPool.DRef(3, 0, 0x0000000000000008), OSProc(3), ProcessScope: worker == 3, false), OSProc(4) => Dagger.Chunk{Int64, MemPool.DRef, OSProc, ProcessScope}(Int64, UnitDomain(), MemPool.DRef(4, 0, 0x0000000000000008), OSProc(4), ProcessScope: worker == 4, false), OSProc(5) => Dagger.Chunk{Int64, MemPool.DRef, OSProc, ProcessScope}(Int64, UnitDomain(), MemPool.DRef(5, 0, 0x0000000000000008), OSProc(5), ProcessScope: worker == 5, false)))

AFAIK this is a limitation of Distributed.jl rather than Dagger itself. Where did you see that example in the docs?

droodman commented 2 months ago

Ah, yes, that does fix it.

But I think it points up a gap in the documentation. The example is from the documentation in the sense that the line of interest, X = Dagger.@shard myid(), is on the quick start page. I wanted to try it in the Julia session, so I did what seemed the obvious thing to me. myid() is in Distributed, so I loaded that with using. While I was at it, I loaded Dagger. Then I ran addprocs(). Then I ran the command of interest. It crashed. I thought, oh, I guess Dagger is not worth the trouble. More of a quick end than a quick start!

If it is easy to get a crash when using Dagger, then I think how to avoid that should be prominent on the quick start page. Put another way, there isn't a complete example on the quick start page that includes the using commands and whatever other setup is needed. Or is it possible for Dagger to detect the condition that causes the crash and provide a helpful message?

JamesWrigley commented 2 months ago

Yeah that's fair, I added some docs about it in #510.

> Or is it possible for Dagger to detect the condition that causes the crash and provide a helpful message?

I don't think this is possible in Dagger itself; it would need to be added in Distributed. What's happening is that the master process executes some code (like Dagger.@shard) that serializes Dagger objects and sends them to the workers. If the workers don't have Dagger loaded, they see a name like Dagger in the object's type and cannot deserialize the object, because they don't know anything about the Dagger module.
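
For anyone landing here later, the load order discussed above can be sketched as a complete script. This is a minimal sketch, not the text added in #510. It assumes Dagger is installed in the active environment and that four local workers are wanted, as in the original report. The explicit `@everywhere using Dagger` line is the usual Distributed idiom for loading a package on all processes; it is redundant when Dagger is loaded after `addprocs`, and is shown because it should also cover the case where Dagger was loaded first.

```julia
using Distributed

# Add the workers first. A package loaded on the master process after this
# point is also loaded on the workers that already exist.
addprocs(4)

using Dagger

# Assumption: if Dagger had been loaded *before* addprocs, loading it
# explicitly on every process should have the same effect.
@everywhere using Dagger

# Every worker now knows the Dagger module, so the Dagger objects sent by
# the master process can be deserialized there.
X = Dagger.@shard myid()
```

The failing session in the original report differs only in that `using Dagger` ran before `addprocs(4)` and nothing loaded Dagger on the new workers afterwards.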