Closed. MilesCranmer closed this issue 1 month ago.
I'm seeing the same thing. My feeling is that this effect is amplified when logging at high frequency. When I throttle the logging sufficiently, the problem seems to disappear.
That would be consistent with the example Miles shared. I am guessing SR.jl would be logging at quite a high rate. That said, I am completely clueless as to how to solve it. Maybe someone from the Python-Julia Interop could help out here.
I will post this on Slack later today to seek out help (feel free to post it there yourself; I don't have Slack access for the next few hours).
@avik-pal I wonder if it could be an interaction with Python multiprocessing? Is there a way to prevent wandb from spawning a separate process to communicate with wandb?
What happens if you pass in settings=wandb.Settings(; start_method="thread") to init?
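For reference, here is a sketch of what that suggestion might look like from the Julia side. This is an assumption, not confirmed against the package: it assumes `Wandb.wandb` exposes the underlying Python module and that `WandbLogger` forwards extra keyword arguments through to `wandb.init` (check the Wandb.jl docs for the exact API):

```julia
using Wandb

# Hypothetical sketch: ask wandb to start its service as a thread
# rather than a separate process. Assumes WandbLogger forwards
# unrecognized keyword arguments to wandb.init.
lg = WandbLogger(;
    project = "Wandb.jl",
    settings = Wandb.wandb.Settings(start_method = "thread"),
)
```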
Hm, it seems like my original example doesn't even start. Is Wandb.jl incompatible with the latest Python?
julia> using SymbolicRegression, Wandb, Logging, MLJBase
ERROR: InitError: Python: ModuleNotFoundError: No module named 'distutils'
Python stacktrace:
[1] <module>
@ /private/var/folders/1h/xyppkvx52cl6w3_h8bw_gdqh0000gr/T/tmp.GdDLQLXHGq/.CondaPkg/env/lib/python3.12/site-packages/wandb/env.py:16
[2] <module>
@ /private/var/folders/1h/xyppkvx52cl6w3_h8bw_gdqh0000gr/T/tmp.GdDLQLXHGq/.CondaPkg/env/lib/python3.12/site-packages/wandb/util.py:57
[3] <module>
@ /private/var/folders/1h/xyppkvx52cl6w3_h8bw_gdqh0000gr/T/tmp.GdDLQLXHGq/.CondaPkg/env/lib/python3.12/site-packages/wandb/sdk/lib/config_util.py:10
[4] <module>
@ /private/var/folders/1h/xyppkvx52cl6w3_h8bw_gdqh0000gr/T/tmp.GdDLQLXHGq/.CondaPkg/env/lib/python3.12/site-packages/wandb/sdk/wandb_helper.py:6
[5] <module>
@ /private/var/folders/1h/xyppkvx52cl6w3_h8bw_gdqh0000gr/T/tmp.GdDLQLXHGq/.CondaPkg/env/lib/python3.12/site-packages/wandb/sdk/__init__.py:24
[6] <module>
@ /private/var/folders/1h/xyppkvx52cl6w3_h8bw_gdqh0000gr/T/tmp.GdDLQLXHGq/.CondaPkg/env/lib/python3.12/site-packages/wandb/__init__.py:27
Stacktrace:
[1] pythrow()
@ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:92
[2] errcheck
@ ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:10 [inlined]
[3] pyimport(m::String)
@ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/builtins.jl:1444
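The `No module named 'distutils'` error is because distutils was removed from the Python standard library in 3.12, and older wandb releases still import it. One possible workaround, assuming CondaPkg.jl is managing the environment here, is to pin Python below 3.12 in the project's CondaPkg.toml (upgrading wandb itself to a release that no longer imports distutils is the better long-term fix):

```toml
# CondaPkg.toml: pin Python below 3.12 so the stdlib still ships distutils.
# (The bound is illustrative; remove it once wandb is upgraded.)
[deps]
python = "<3.12"
```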
It seems to work for me at first, but then it segfaults.
> what happens if you pass in settings=wandb.Settings(; start_method="thread") to init?

That doesn't seem to work.
Seems like the conda versions weren't updated for whatever reason. I have updated the code to install the pip version, which is the latest one. @MilesCranmer, can you check whether https://github.com/avik-pal/Wandb.jl/pull/38 fixes the installation issue you had?
It looks like an issue with PythonCall.jl being called from threads other than the first Julia thread (https://juliapy.github.io/PythonCall.jl/stable/faq/#Is-PythonCall/JuliaCall-thread-safe?). The example at the top works for me when I run Julia with one thread and segfaults with 2+ threads. The following minimal example behaves the same way: it works with one Julia thread and segfaults with 2+ threads.
using Wandb, Logging
# Initialize the project
lg = WandbLogger(; project = "Wandb.jl", name = nothing)
# Set logger globally / in scope / in combination with other loggers
global_logger(lg)
# Logging Values
function log_wandb()
Wandb.log(lg, Dict("accuracy" => 0.9, "loss" => 0.3))
end
Threads.@threads for i in 1:1000
log_wandb()
end
I found a potential solution here to ensure that Python functions are called from the main thread.
using Wandb, Logging, ThreadPools
# Initialize the project
lg = WandbLogger(; project = "Wandb.jl", name = nothing)
# Set logger globally / in scope / in combination with other loggers
global_logger(lg)
# Logging Values
macro pythread(expr)
quote
fetch(@tspawnat 1 begin
$(esc(expr))
end)
end
end
function log_wandb()
@pythread begin
Wandb.log(lg, Dict("accuracy" => 0.9, "loss" => 0.3))
end
end
Threads.@threads for i in 1:1000
log_wandb()
end
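A variation on the same idea that avoids the ThreadPools dependency (a sketch, not tested against Wandb.jl): funnel all Python calls through a Channel that is drained by an @async task. @async tasks are sticky to the thread that created them, so when the consumer is created from the main task, every job runs on Julia thread 1.

```julia
using Wandb, Logging

lg = WandbLogger(; project = "Wandb.jl", name = nothing)
global_logger(lg)

# Jobs are closures; the channel is buffered so producers never block.
const py_jobs = Channel{Function}(Inf)

# @async is sticky to the spawning thread, so every job below runs on
# Julia thread 1, which is the only thread PythonCall allows into Python.
@async for f in py_jobs
    f()
end

log_wandb() = put!(py_jobs, () -> Wandb.log(lg, Dict("accuracy" => 0.9, "loss" => 0.3)))

Threads.@threads for i in 1:1000
    log_wandb()
end
```

Note that put! here is fire-and-forget: if completion matters (e.g. before the process exits), close the channel and wait for the consumer task to drain it.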
Multi-threaded calls to PythonCall are documented not to be allowed. However, this problem is not limited to multi-threading on the Julia side: the cases where I have seen this happen, and the original example above, do not use multiple Julia threads.
Did anyone try https://github.com/JuliaPy/PythonCall.jl/pull/520 on this?
The segfaults indeed all seem to be fixed with https://github.com/JuliaPy/PythonCall.jl/pull/520 :tada:
That's great!
> Did anyone try https://github.com/JuliaPy/PythonCall.jl/pull/520 on this?

Yes, that seems to fix it! :tada:
Awesome! I will bump the compat for PythonCall once that PR lands.
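Once the PR lands, the compat bump would be a one-line change along these lines (the version number is hypothetical; use whichever PythonCall release actually includes the fix):

```toml
# Project.toml of Wandb.jl (sketch; the exact lower bound depends on the
# PythonCall release that ships JuliaPy/PythonCall.jl#520)
[compat]
PythonCall = "0.9"
```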
@avik-pal I've been trying out TensorBoardLogger.jl and Wandb.jl in https://github.com/MilesCranmer/SymbolicRegression.jl/pull/277, and I find that:
Here's a MWE. First, install SymbolicRegression in the logger branch with:
Then, test it out with Wandb with:
which generates the segfault:
You can verify that running it with TensorBoardLogger does not produce any issues. So I'm not sure what's going wrong here... Here's my system info:
but hopefully you are able to reproduce.
For the record, I also see this on v0.4.4 with PyCall.jl.