JuliaLang / julia

unpredictable behavior of `Random.seed!` on workers using Distributed #51225

Open vancleve opened 1 year ago

vancleve commented 1 year ago

I'm having trouble setting the seed on worker tasks using Distributed. If I set the seed and then get a random number in one remote call, the result is correct. If I set the seed and then get the random number in two remote calls, the random number is different each time.

MWE

using Distributed

addprocs(1)
@everywhere using Random

# works on main process
Random.seed!(1234); rand() 
fetch(@spawnat 1 (Random.seed!(1234); rand()))

# works in one call
fetch(@spawnat 2 (Random.seed!(1234); rand()))

# doesn't work in two calls
fetch(@spawnat 2 Random.seed!(1234)); fetch(@spawnat 2 rand())
fetch(@spawnat 2 Random.seed!(1234)); fetch(@spawnat 2 rand())

Output:

0.32597672886359486
0.32597672886359486

0.32597672886359486

0.23226466961312875
0.11712298602126525

versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 

vchuravy commented 1 year ago

Note that on the remote process all RPCs are executed on separate tasks.

You can observe the same behaviour locally:

julia> using Random

julia> using Base.Threads: @spawn

julia> wait(@spawn Random.seed!(1234))

julia> fetch(@spawn rand())
0.9531220182290453

julia> wait(@spawn Random.seed!(1234))

julia> fetch(@spawn rand())
0.18986972383585565

This is expected behavior; see https://docs.julialang.org/en/v1/stdlib/Random/

However, the default RNG is thread-safe as of Julia 1.3 (using a per-thread RNG up to version 1.6, and per-task thereafter).

Since version 1.7, Julia uses a per-task random number generator, so seeding it only influences the task itself and its children.
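A minimal local sketch of the per-task behavior described above, assuming Julia 1.7 or later. The helper names `seeded_draw` and `seeded_child_draw` are hypothetical and exist only for illustration. Seeding and drawing inside the same task, or seeding in a task and drawing in one of its children, should be reproducible, which matches the "works in one call" case from the MWE.

```julia
using Random

# Hypothetical helper: seed the task-local RNG and draw inside the SAME task.
function seeded_draw(seed)
    t = Threads.@spawn begin
        Random.seed!(seed)   # seeds only this task's RNG
        rand()
    end
    return fetch(t)
end

# Hypothetical helper: seed in an outer task, then draw in a child task.
# The child's RNG state is derived from the state of the task that spawned it.
function seeded_child_draw(seed)
    outer = Threads.@spawn begin
        Random.seed!(seed)
        fetch(Threads.@spawn rand())
    end
    return fetch(outer)
end

println(seeded_draw(1234) == seeded_draw(1234))
println(seeded_child_draw(1234) == seeded_child_draw(1234))
```

Seeding in one task and drawing in an unrelated sibling task, as the two separate `@spawnat` calls do, gives no such guarantee.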

If you want deterministic behavior you will need to use a non-default RNG. Do note that remote-process calls execute concurrently and thus you may need to protect that global RNG with a lock.
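One way this suggestion could look is sketched below, assuming Julia 1.7 or later (for `Xoshiro`). The names `WORKER_RNG`, `WORKER_RNG_LOCK`, `seed_worker!`, and `worker_rand` are hypothetical, not part of Random or Distributed. The sketch is exercised with local tasks, mirroring the local reproduction above.

```julia
using Random

# Hypothetical process-global RNG, independent of the task-local default RNG.
const WORKER_RNG = Xoshiro(0)
# Lock guarding WORKER_RNG, since calls may run concurrently on separate tasks.
const WORKER_RNG_LOCK = ReentrantLock()

# Reseed the shared RNG while holding the lock.
function seed_worker!(seed)
    lock(WORKER_RNG_LOCK) do
        Random.seed!(WORKER_RNG, seed)
    end
    return nothing
end

# Draw from the shared RNG while holding the lock.
worker_rand() = lock(WORKER_RNG_LOCK) do
    rand(WORKER_RNG)
end

# Seeding and drawing now happen in different tasks, yet the draws agree.
wait(Threads.@spawn seed_worker!(1234))
a = fetch(Threads.@spawn worker_rand())
wait(Threads.@spawn seed_worker!(1234))
b = fetch(Threads.@spawn worker_rand())
println(a == b)
```

With Distributed, the definitions would presumably go inside an `@everywhere begin ... end` block, followed by calls such as `remotecall_fetch(seed_worker!, 2, 1234)` and `remotecall_fetch(worker_rand, 2)`. The lock only prevents data races; if several calls draw concurrently, the order in which they receive numbers is still not deterministic.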

We should improve the documentation on this.

vancleve commented 1 year ago

Got it! Thanks for the help!

nhz2 commented 1 year ago

I am doing the following as a workaround for this.

julia> using Distributed

julia> addprocs(1);

julia> @everywhere using Random

julia> @everywhere const rng_copy = copy(Random.default_rng());

julia> fetch(@spawnat 2 let 
           Random.seed!(1234)
           copy!(getglobal(Main,:rng_copy), Random.default_rng())
           nothing
       end)

julia> fetch(@spawnat 2 let 
           copy!(Random.default_rng(), getglobal(Main,:rng_copy))
           x = rand()
           copy!(getglobal(Main,:rng_copy), Random.default_rng())
           x
       end)
0.32597672886359486

While this works on 1.9, I'm pretty sure this code may break in future releases because copy! isn't the documented way to save or load the state of the TaskLocalRNG from what I can tell.

nhz2 commented 1 year ago

A related issue in Pluto: https://github.com/fonsp/Pluto.jl/issues/2290

vtjnash commented 1 year ago

We do actually have tests that copy works like that, so it is unlikely that it would change or break that example.

nhz2 commented 1 year ago

In 1.10 a similar example is broken.

julia> using Distributed

julia> addprocs(1);

julia> @everywhere begin 
           using Random
           myrand() = fetch(Threads.@spawn(rand()))
           const rng_copy = copy(Random.default_rng())
       end

julia> fetch(@spawnat 2 let 
           Random.seed!(1234)
           copy!(getglobal(Main,:rng_copy), Random.default_rng())
           nothing
       end)

julia> fetch(@spawnat 2 let 
           copy!(Random.default_rng(), getglobal(Main,:rng_copy))
           x = myrand()
           copy!(getglobal(Main,:rng_copy), Random.default_rng())
           x
       end)

julia> fetch(@spawnat 2 let 
           copy!(Random.default_rng(), getglobal(Main,:rng_copy))
           x = myrand()
           copy!(getglobal(Main,:rng_copy), Random.default_rng())
           x
       end)

julia> Random.seed!(1234);

julia> myrand()

julia> myrand()

In 1.9 both remote and local calls return the same random sequence.

In 1.10.0-beta2 the remote call always returns the constant 0.47487231547644215, so in this case, using copy! seems to break the rng of any spawned tasks.

rfourquet commented 1 year ago

> In 1.10.0-beta2 the remote call always returns the constant 0.47487231547644215, so in this case, using copy! seems to break the rng of any spawned tasks.

Yeah, in 1.10 spawning a task no longer updates the current task's local RNG state (TaskLocalRNG()), unlike in 1.9, so what you are seeing is expected: calling myrand() doesn't mutate Random.default_rng().
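A small sketch for checking this on a given Julia version. The helper name `spawn_changes_parent_stream` is hypothetical. It compares the first number drawn after seeding with the first number drawn after seeding and then spawning a task. Based on the explanation above, one would expect `true` on 1.9 and `false` on 1.10, but no output is asserted here.

```julia
using Random

# Hypothetical helper: does spawning a task change what the current task draws next?
function spawn_changes_parent_stream(seed)
    Random.seed!(seed)
    baseline = rand()              # first draw right after seeding

    Random.seed!(seed)
    wait(Threads.@spawn nothing)   # spawn a trivial child task between seed and draw
    after_spawn = rand()           # first draw after seeding plus one spawn

    return baseline != after_spawn
end

println(spawn_changes_parent_stream(1234))
```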

vtjnash commented 1 year ago

So that might now be an issue with Test.guardseed, then. That copy seems like it is supposed to work.