JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License
27 stars 10 forks source link

Failure when run inside module #105

Open gbruer15 opened 1 month ago

gbruer15 commented 1 month ago

I am trying to run some parallel code defined inside a module via include_string, and I can't quite figure out how it works or how it's supposed to work.

It seems like a bug for such a simple case to not work, so I'm making an issue here. But it may be something that can't be fixed with Julia's current module system.

Main question: Is there a way to use a sandbox module without breaking Distributed.jl?

Short failing code

A simple remote call works when run in module Main:

using Distributed
worker_ids = addprocs(2)
remotecall_wait(() -> println("I want to run this remotely"), 2)

But doesn't work when in a different module:

module sandbox
using Distributed
worker_ids = addprocs(2)
remotecall_wait(() -> println("I want to run this remotely"), 2)
end
ERROR: On worker 2:
UndefVarError: sandbox not defined
Stacktrace:
  [1] deserialize_module
    @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:996
  [2] handle_deserialize
    @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:895
  [3] deserialize

Test scripts

I made a test script to figure out how to work with this. I run the script in a sandbox module and in the Main module.

Driver script mfe_driver.jl.

filename = ARGS[1]
include_string(Module(:sandbox_mod), read(filename, String), filename)

Escape-to-main method: mfe.jl.

The second is a script that escapes the sandbox module to get the function to run remotely: mfe.jl.

mfe.jl ```julia using Distributed worker_ids = addprocs(2) try println("================== Main process - try to run remotely") try remotecall_wait(() -> println("I want to run this remotely"), 2) catch e showerror(stdout, e) end println() println() println("================== Main process - define and run function locally") function I_want_to_run_this_remotely() println("I want to run this remotely") end I_want_to_run_this_remotely() println() println() println("================== Main process - run remotely") try remotecall_wait(I_want_to_run_this_remotely, 2) catch e showerror(stdout, e) end println() println() println("================== Main process - define function in local Main and run locally") @everywhere [1] begin using Distributed println(" process $(myid()) has module $(@__MODULE__)") function I_want_to_run_this_remotely() println("I want to run this remotely") end end Main.I_want_to_run_this_remotely() println() println() println("================== Main process - run remotely") try remotecall_wait(Main.I_want_to_run_this_remotely, 2) catch e showerror(stdout, e) end println() println() println("================== Main process - define function in everyone's Main") @everywhere begin using Distributed println(" process $(myid()) has module $(@__MODULE__)") function I_want_to_run_this_remotely() println("I want to run this remotely") end end println() println() println("================== Main process - run remotely") try remotecall_wait(Main.I_want_to_run_this_remotely, 2) catch e showerror(stdout, e) end finally rmprocs(worker_ids) end ```
Output of "julia mfe.jl" ``` ================== Main process - try to run remotely From worker 2: I want to run this remotely ================== Main process - define and run function locally I want to run this remotely ================== Main process - run remotely From worker 2: I want to run this remotely ================== Main process - define function in local Main and run locally process 1 has module Main I want to run this remotely ================== Main process - run remotely On worker 2: UndefVarError: #I_want_to_run_this_remotely not defined Stacktrace: [1] deserialize_datatype @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:1364 [2] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:866 [3] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [4] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:873 [5] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [inlined] [6] deserialize_msg @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:87 [7] #invokelatest#2 @ ./essentials.jl:729 [inlined] [8] invokelatest @ ./essentials.jl:726 [inlined] [9] message_handler_loop @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:176 [10] process_tcp_streams @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133 [11] #103 @ ./task.jl:484 ================== Main process - define function in everyone's Main process 1 has module Main From worker 2: process 2 has module Main From worker 3: process 3 has module Main ================== Main process - run remotely From worker 2: I want to run this remotely ```
Output of "julia mfe_driver.jl mfe.jl" ``` ================== Main process - try to run remotely On worker 2: UndefVarError: sandbox_mod not defined Stacktrace: [1] deserialize_module @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:996 [2] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:895 [3] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [4] deserialize_datatype @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:1363 [5] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:866 [6] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [7] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:873 [8] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [inlined] [9] deserialize_msg @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:87 [10] #invokelatest#2 @ ./essentials.jl:729 [inlined] [11] invokelatest @ ./essentials.jl:726 [inlined] [12] message_handler_loop @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:176 [13] process_tcp_streams @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133 [14] #103 @ ./task.jl:484 ================== Main process - define and run function locally I want to run this remotely ================== Main process - run remotely On worker 2: UndefVarError: sandbox_mod not defined Stacktrace: [1] deserialize_module @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:996 [2] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:895 [3] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [4] deserialize_datatype @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:1363 [5] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:866 [6] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [7] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:873 [8] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [inlined] [9] deserialize_msg @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:87 [10] #invokelatest#2 @ ./essentials.jl:729 [inlined] [11] invokelatest @ ./essentials.jl:726 [inlined] [12] message_handler_loop @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:176 [13] process_tcp_streams @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133 [14] #103 @ ./task.jl:484 ================== Main process - define function in local Main and run locally process 1 has module Main I want to run this remotely ================== Main process - run remotely On worker 2: UndefVarError: #I_want_to_run_this_remotely not defined Stacktrace: [1] deserialize_datatype @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:1364 [2] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:866 [3] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [4] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:873 [5] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [inlined] [6] deserialize_msg @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:87 [7] #invokelatest#2 @ ./essentials.jl:729 [inlined] [8] invokelatest @ ./essentials.jl:726 [inlined] [9] message_handler_loop @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:176 [10] process_tcp_streams @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133 [11] #103 @ ./task.jl:484 ================== Main process - define function in everyone's Main process 1 has module Main From worker 3: process 3 has module Main From worker 2: process 2 has module Main ================== Main process - run remotely From worker 2: I want to run this remotely ```

Define-sandbox-everywhere method: mfe_define_module.jl

I don't want to escape the sandbox to define things in the Main scope, so I tried to work within the sandbox.

This file tries to define the sandbox module for the worker processes.

mfe_define_module.jl ```julia using Distributed worker_ids = addprocs(2) try println("================== Main process - determine current module information") current_module_local = split(string(@__MODULE__), ".")[end] println(" Driver module: $(@__MODULE__)") println(" Driver module string: $current_module_local") println() println() println("================== Main process - define module in workers") @everywhere worker_ids begin using Distributed driver_module = try Module(Symbol(Meta.parse($current_module_local))) catch e showerror(stdout, e) end println("Process $(myid()): ", driver_module, " ", typeof(driver_module)) end println() println() driver_module = @__MODULE__ println("================== Main process - import module in workers") try @everywhere worker_ids begin println("Process $(myid()): ", driver_module, " ", typeof(driver_module)) end catch e showerror(stdout, e) end println() println() println("================== Main process - try to run remotely") try remotecall_wait(() -> println("I want to run this remotely"), 2) catch e showerror(stdout, e) end println() println() println("================== Main process - define and run function locally") function I_want_to_run_this_remotely() println("I want to run this remotely") end I_want_to_run_this_remotely() println() println() println("================== Main process - run remotely") try remotecall_wait(I_want_to_run_this_remotely, 2) catch e showerror(stdout, e) end finally rmprocs(worker_ids) end ```
Output of "julia mfe_driver.jl mfe_define_module.jl" ``` ================== Main process - determine current module information Driver module: Main.sandbox_mod Driver module string: sandbox_mod ================== Main process - define module in workers From worker 2: Process 2: Main.sandbox_mod Module From worker 3: Process 3: Main.sandbox_mod Module ================== Main process - import module in workers From worker 2: Process 2: Main.sandbox_mod Module From worker 3: Process 3: Main.sandbox_mod Module ================== Main process - try to run remotely On worker 2: UndefVarError: sandbox_mod not defined Stacktrace: [1] deserialize_module @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:996 [2] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:895 [3] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [4] deserialize_datatype @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:1363 [5] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:866 [6] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [7] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:873 [8] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [inlined] [9] deserialize_msg @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:87 [10] #invokelatest#2 @ ./essentials.jl:729 [inlined] [11] invokelatest @ ./essentials.jl:726 [inlined] [12] message_handler_loop @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:176 [13] process_tcp_streams @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133 [14] #103 @ ./task.jl:484 ================== Main process - define and run function locally I want to run this remotely ================== Main process - run remotely On worker 2: UndefVarError: sandbox_mod not defined Stacktrace: [1] deserialize_module @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:996 [2] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:895 [3] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [4] deserialize_datatype @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:1363 [5] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:866 [6] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [7] handle_deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:873 [8] deserialize @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Serialization/src/Serialization.jl:813 [inlined] [9] deserialize_msg @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:87 [10] #invokelatest#2 @ ./essentials.jl:729 [inlined] [11] invokelatest @ ./essentials.jl:726 [inlined] [12] message_handler_loop @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:176 [13] process_tcp_streams @ ~/.julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133 [14] #103 @ ./task.jl:484 ```

Version info

I tested this with Julia installed by juliaup version 1.13.0, with Julia versions below.

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

and

Julia Version 1.11.0-rc1
Commit 3a35aec36d1 (2024-06-25 10:23 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
JamesWrigley commented 1 month ago

Not an expert on the internals so take this with a grain of salt, but my understanding is that Distributed by design expects all processes in the cluster to have the same global state. And defining new modules or importing things changes that state, which causes the errors you see.

andreyz4k commented 1 month ago

Not an expert on the internals so take this with a grain of salt, but my understanding is that Distributed by design expects all processes in the cluster to have the same global state. And defining new modules or importing things changes that state, which causes the errors you see.

The problem is that when you create new workers with addprocs they don't get the same global state, and there is no simple way to initialize them properly. And in this case, you get into Catch-22 when your init code is a part of a module, but you can't use it in the worker because the module is not loaded.

jpsamaroo commented 1 month ago

You should generally not be performing side-effects like this at the top-level of a module - our module top-level is basically "compile-time", if that helps think about whether this is a good or bad idea. You could maybe do this in __init__ to run it during module initialization, as the module is already compiled and loaded at that point. But personally I would make your own init function and have users call that directly, or something more "explicit" like that.

andreyz4k commented 1 month ago

I have a slightly different setup from the topic starter, and I can say that the problem is not related to his code being in the module's top level. If I have a module with a function that creates a new process with addprocs and then tries to run anything on it, I get an error that the module is not defined. It's not possible to run import/using because it's not at the top level, and it's impossible to run any more complicated code because it's defined as part of the module, which is not loaded.

jpsamaroo commented 1 month ago

@andreyz4k can you provide an MWE?

andreyz4k commented 1 month ago

Ok, here it goes

module distributed_test

using Distributed

function test()
    new_pids = addprocs(1)
    try
        # @spawnat new_pids[1] using distributed_test
        # Can't call using because we're not at the top level
        @fetchfrom new_pids[1] println("I want to run this remotely")
    finally
        rmprocs(new_pids)
    end
end

function test2()
    new_pids = addprocs(1)
    try
        ex = Expr(
            :toplevel,
            :(task_local_storage()[:SOURCE_PATH] = $(get(task_local_storage(), :SOURCE_PATH, nothing))),
            :(using distributed_test),
        )
        Distributed.remotecall_eval(Main, new_pids[1], ex)
        println("Finished import")
        @fetchfrom new_pids[1] println("I want to run this remotely")
    finally
        rmprocs(new_pids)
    end
end

end

The first test is what I would prefer — clean code, no manual initialization. Here is the result

julia> using distributed_test

julia> distributed_test.test()
ERROR: On worker 9:
KeyError: key distributed_test [3f6dfd82-8c60-4c67-bce8-86bcb295ad2e] not found
Stacktrace:
  [1] getindex
    @ ./dict.jl:498 [inlined]
  [2] macro expansion
    @ ./lock.jl:267 [inlined]
  [3] root_module
    @ ./loading.jl:1878
  [4] deserialize_module
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:994
  [5] handle_deserialize
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:896
  [6] deserialize
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814
  [7] deserialize_datatype
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:1398
  [8] handle_deserialize
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:867
  [9] deserialize
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814
 [10] handle_deserialize
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:874
 [11] deserialize
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814 [inlined]
 [12] deserialize_msg
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/messages.jl:87
 [13] #invokelatest#2
    @ ./essentials.jl:892 [inlined]
 [14] invokelatest
    @ ./essentials.jl:889 [inlined]
 [15] message_handler_loop
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:176
 [16] process_tcp_streams
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
 [17] #103
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
Stacktrace:
 [1] remotecall_fetch(::Function, ::Distributed.Worker; kwargs::@Kwargs{})
   @ Distributed ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:465
 [2] remotecall_fetch(::Function, ::Distributed.Worker)
   @ Distributed ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:454
 [3] remotecall_fetch
   @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:492 [inlined]
 [4] test()
   @ distributed_test ~/distributed_test/distributed_test/src/distributed_test.jl:11
 [5] top-level scope
   @ REPL[34]:1

In the second test, I tried to adapt the trick from @everywhere macro to go around the limitation that using can be used only at the file's top level. Still no luck, but the log is slightly different

julia> distributed_test.test2()
Finished import
ERROR: On worker 8:
UndefVarError: `#19#20` not defined
Stacktrace:
  [1] deserialize_datatype
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:1399
  [2] handle_deserialize
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:867
  [3] deserialize
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814
  [4] handle_deserialize
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:874
  [5] deserialize
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Serialization/src/Serialization.jl:814 [inlined]
  [6] deserialize_msg
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/messages.jl:87
  [7] #invokelatest#2
    @ ./essentials.jl:892 [inlined]
  [8] invokelatest
    @ ./essentials.jl:889 [inlined]
  [9] message_handler_loop
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:176
 [10] process_tcp_streams
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
 [11] #103
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
Stacktrace:
 [1] remotecall_fetch(::Function, ::Distributed.Worker; kwargs::@Kwargs{})
   @ Distributed ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:465
 [2] remotecall_fetch(::Function, ::Distributed.Worker)
   @ Distributed ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:454
 [3] remotecall_fetch
   @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:492 [inlined]
 [4] test2()
   @ distributed_test ~/distributed_test/distributed_test/src/distributed_test.jl:27
 [5] top-level scope
   @ REPL[33]:1
andreyz4k commented 1 month ago

Well, it looks like I found a partial workaround

module distributed_test

using Distributed

function test4()
    new_pids = addprocs(1)
    try
        ex = Expr(
            :toplevel,
            :(task_local_storage()[:SOURCE_PATH] = $(get(task_local_storage(), :SOURCE_PATH, nothing))),
            :(using distributed_test),
        )
        Distributed.remotecall_eval(Main, new_pids[1], ex)
        println("Finished import")

        remotecall_wait(inner_func, new_pids[1])
    finally
        rmprocs(new_pids)
    end
end

function inner_func()
    println("I want to run this remotely")
end

export inner_func

end

When the called function is exported, it can be accessed in the global scope, where the worker does the evaluation, so it succeeds.

julia> distributed_test.test4()
Finished import
      From worker 11:   I want to run this remotely
Future(11, 1, 44, ReentrantLock(nothing, 0x00000000, 0x00, Base.GenericCondition{Base.Threads.SpinLock}(Base.IntrusiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), (0, 4297724304, 0)), nothing)

But if I'm trying to use a macro like @fetchfrom new_pids[1] inner_func(), it fails because it creates a closure that is not exported upon calling using distributed_test.

jpsamaroo commented 1 month ago

Ahh ok that makes sense - this is related to the discussion we had in the JuliaHPC call about having addprocs load user-defined startup code before it returns, so that serialized values work as you'd expect. I agree that the first example could probably work once we implement such auto-load support.

andreyz4k commented 1 month ago

I'm not sure if it can be solved by just running some user-defined startup code. I've also tried to add an exeflags parameter to addprocs with -L flag and a file that does import, but it looked like it didn't work for some reason.

In any case, the main problem here is that serialization is done in the module scope, and deserialization in the worker is done in Main. I don't think that there is any way to change that in the initialization script. Furthermore, module context can be different for different invocations sent to the same worker, so it should be an invocation-level setting.

andreyz4k commented 1 month ago

I did some more experiments and the temporary fix is to use remotecall_eval everywhere, for example, like this

macro fetchfrom2(p, expr)
    :(Distributed.remotecall_eval(@__MODULE__, $(esc(p)), $(esc(expr))))
end

remote2(p, f) = (args...; kwargs...) -> begin
    thunk = :($f($args...; $kwargs...))
    Distributed.remotecall_eval(@__MODULE__, p, thunk)
end

It still requires the same initialization hack, but I won't need to export everything I want to run. I will use this method in my project for now, but it would be great if this issue could be fixed in the main Distributed codebase.