Open zzccchen opened 4 months ago
Can you try with turbo=False, bumper=False? Those options are experimental and get PySR to use libraries which are bleeding edge. When they work, they are really fast, but they can also cause crashes (especially on Windows).
Regrettably, I tried the turbo=False, bumper=False parameters and the crash still occurred.
Could automatically setting --heap-size-hint=2730M cause this problem?
Hm, can you show the rest of your code?
from pysr import PySRRegressor
# data load code
X_123e = data_X_123e.to_numpy()
y_123e = data_y_123e.to_numpy()
sr_model = PySRRegressor(
    binary_operators=[
        "*",
        "+",
        "-",
        "/",
    ],
    unary_operators=["square", "cube", "exp", "log", "sqrt"],
    maxsize=80,
    maxdepth=10,
    niterations=100,
    populations=32,
    population_size=100,
    ncycles_per_iteration=550,
    constraints={
        "/": (-1, 9),
        "^": (-1, 5),
        "exp": 6,
        "square": 6,
        "cube": 6,
        "log": 6,
        "sqrt": 6,
        "abs": 9,
    },
    nested_constraints={
        "square": {"square": 0, "cube": 0, "exp": 1},
        "cube": {"square": 0, "cube": 0, "exp": 1},
        "exp": {"square": 0, "cube": 0, "exp": 0},
        "sqrt": {"sqrt": 0, "log": 0},
        "log": {"log": 0},
    },
    complexity_of_operators={
        "square": 2,
        "cube": 3,
        "exp": 3,
        "log": 3,
        "sqrt": 2,
    },
    complexity_of_constants=4,
    adaptive_parsimony_scaling=150.0,
    weight_add_node=0.79,
    weight_insert_node=5.1,
    weight_delete_node=1.7,
    weight_do_nothing=0.21,
    weight_mutate_constant=0.048,
    weight_mutate_operator=0.47,
    weight_swap_operands=0.1,
    weight_randomize=0.23,
    weight_simplify=0.5,
    weight_optimize=0.5,
    crossover_probability=0.066,
    perturbation_factor=0.076,
    cluster_manager=None,
    precision=32,
    turbo=True,
    bumper=True,
    progress=True,
    elementwise_loss="""
    function loss_fnc(prediction, target)
        percentage_error = abs((prediction - target) / target) * 100
        return percentage_error
    end
    """,
    multithreading=False,
    equation_file=symbol_regression_csv_path,
)
complexity_of_variables = []  # list of complexity
sr_model.fit(
    X_123e, y_123e, complexity_of_variables=complexity_of_variables
)
Here is the main code of the workflow.
I also run the above code inside a nested loop to test different feature data sets and the stability of the symbolic regression results. A single loop iteration takes about 2.2 minutes. The program crashes after running for 3-4 hours, after about 80-110 rounds.
That looks good. Great to see all those options being used! 🙂
(Random comment: your elementwise loss divides by the target, so make sure the target > 0, otherwise one target will dominate. But I’m assuming you’re aware of that!)
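To illustrate the point about dividing by the target, here is a minimal Python sketch that mirrors the Julia loss from the code above. The function name and the sample values are made up for illustration and are not taken from the original data.

```python
def percentage_error(prediction, target):
    """Python mirror of the Julia elementwise loss: |(prediction - target) / target| * 100."""
    return abs((prediction - target) / target) * 100


# The same absolute error (0.5) on two hypothetical targets of very different size.
large_target_loss = percentage_error(100.5, 100.0)  # about 0.5
small_target_loss = percentage_error(0.501, 0.001)  # about 50000

print(large_target_loss, small_target_loss)
```

A single target close to zero produces a loss several orders of magnitude larger than the others, so it can dominate the search.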
Other comment: can you try with multithreading=True? With it set to False, and with procs>0 (the default), it will use multiple Julia processes. But if you just use multi-threading instead, it will start up much faster and hopefully be more stable. With multi-processing it is launching new Julia processes every single time it searches. (This is a weakness in the current codebase; I would like to eventually store the processes within PySRRegressor so multiprocessing has fast startup too.)
You can also set multithreading=False, procs=0 to use serial mode.
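The three modes described above can be summarized in a small sketch. The `multithreading` and `procs` keyword names come from this thread (PySR v0.19.0); the helper `parallelism_kwargs` and the mode labels are hypothetical, and newer PySR releases may expose these options differently.

```python
def parallelism_kwargs(mode, procs=4):
    """Return the PySRRegressor keyword arguments for one of the modes discussed above.

    `mode` labels are illustrative names, not PySR identifiers.
    """
    if mode == "multithreading":
        # One Julia process, several threads: fast startup.
        return {"multithreading": True}
    if mode == "multiprocessing":
        # Several Julia worker processes, launched again for every search.
        return {"multithreading": False, "procs": procs}
    if mode == "serial":
        # No parallelism at all.
        return {"multithreading": False, "procs": 0}
    raise ValueError(f"unknown mode: {mode!r}")


# Usage (requires pysr to be installed):
#   from pysr import PySRRegressor
#   model = PySRRegressor(niterations=100, **parallelism_kwargs("serial"))
print(parallelism_kwargs("serial"))
```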
But it’s curious that it crashes. Since it runs for a few hours, did you notice anything else happening, like the memory usage gradually increasing over that time and not going down?
If I use multithreading instead of multiprocessing, the calculation speed will drop from 30it/s to 7it/s on my device, which is a bit unacceptable to me. In addition, I have made sure that my y_true values are all greater than 0. And the memory usage does not fluctuate when the program crashes, occupying only 30% of the total memory.
Maybe try multithreading=True again, but this time, before loading PySR, set a larger thread count:
import os
# Environment variable values must be strings, so convert the number explicitly.
os.environ["PYTHON_JULIACALL_THREADS"] = str(num_cores * 2)
Where num_cores is the number of CPU cores. The factor of 2 is so there’s some redundancy, but you could try more or less depending on performance.
The default behavior of PySR is to start Julia with --threads='auto', which is actually fewer than the number of available cores (so it doesn’t take up the whole CPU). But for high performance you can increase the usage.
The full list of available juliacall environment variables is here: https://juliapy.github.io/PythonCall.jl/stable/juliacall/#julia-config
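A minimal sketch of computing the thread count from the machine itself, assuming `os.cpu_count()` is an acceptable stand-in for the number of CPU cores (it reports logical CPUs). The helper name `julia_thread_count` is hypothetical; only the `PYTHON_JULIACALL_THREADS` variable comes from the discussion above.

```python
import os


def julia_thread_count(num_cores=None, factor=2):
    """Return the thread count as a string, since environment values must be strings."""
    cores = num_cores if num_cores is not None else (os.cpu_count() or 1)
    return str(max(1, int(cores * factor)))


# Must run before `import pysr`, because the variable is read when Julia starts.
os.environ["PYTHON_JULIACALL_THREADS"] = julia_thread_count()
print(os.environ["PYTHON_JULIACALL_THREADS"])
```

The factor is exposed as a parameter so it can be tuned up or down depending on measured performance.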
I tried
import os
os.environ["PYTHON_JULIACALL_THREADS"] = "64"
# or
os.environ["PYTHON_JULIACALL_THREADS"] = "64"
os.environ["PYTHON_JULIACALL_PROCS"] = "64"
But it did not improve the calculation speed; the processor usage was only 20-30%. I am using a 24c32t 14900K processor.
To confirm, this was before importing PySR right? As a test, if you set it to 1, the CPU usage should only be 1 core.
Also note that the PROCS env variable won’t have any effect.
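One way to make sure the variable is set early enough is to guard against the relevant modules having already been imported. This is only a sketch: the helper `set_julia_threads` is hypothetical, and checking `sys.modules` for `juliacall` and `pysr` is an assumption about what "before importing PySR" means in practice.

```python
import os
import sys


def set_julia_threads(n_threads):
    """Set PYTHON_JULIACALL_THREADS, refusing if Julia may already have been started."""
    loaded = [name for name in ("juliacall", "pysr") if name in sys.modules]
    if loaded:
        raise RuntimeError(
            f"{', '.join(loaded)} already imported; the thread setting would be ignored"
        )
    os.environ["PYTHON_JULIACALL_THREADS"] = str(n_threads)


# With the value 1, CPU usage during the search should stay at about one core.
set_julia_threads(1)
```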
I had a similar problem after I gave up on Windows and moved to Ubuntu 24.04 LTS. I also used a tool (tm5) to test the memory. After testing for 1 hour, there was no error and the temperature was stable at 45℃. It doesn't seem to be a hardware problem. This problem is so strange.
Traceback (most recent call last):
File "/home/zc/Documents/GitHub/MLPIP/notebooks/TC/S2_symbol_regression/S202_sr_123e.py", line 192, in <module>
sr_model.fit(
File "/home/zc/miniconda3/envs/MLPIP_ENV_PIP/lib/python3.11/site-packages/pysr/sr.py", line 2088, in fit
self._run(X, y, runtime_params, weights=weights, seed=seed)
File "/home/zc/miniconda3/envs/MLPIP_ENV_PIP/lib/python3.11/site-packages/pysr/sr.py", line 1890, in _run
out = SymbolicRegression.equation_search(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zc/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl", line 223, in __call__
return self._jl_callmethod($(pyjl_methodnum(pyjlany_call)), args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
[1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
@ Base ./stream.jl:410
[2] (::Base.var"#wait_locked#739")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
@ Base ./stream.jl:949
[3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
@ Base ./stream.jl:955
[4] unsafe_read
@ ./io.jl:774 [inlined]
[5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
@ Base ./io.jl:773
[6] read!
@ ./io.jl:775 [inlined]
[7] deserialize_hdr_raw
@ ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/messages.jl:167 [inlined]
[8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:172
[9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
[10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
juliacall.JuliaError: TaskFailedException
Stacktrace:
[1] wait
@ ./task.jl:352 [inlined]
[2] fetch
@ ./task.jl:372 [inlined]
[3] _main_search_loop!(state::SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Node{Float32}, Distributed.Future, Distributed.RemoteChannel}, datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Vector{Int64}}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, true, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}})
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:882
[4] _equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Vector{Int64}}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, true, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, saved_state::Nothing)
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:599
[5] equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}; niterations::Int64, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Vector{Int64}}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, true, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, parallelism::String, numprocs::Int64, procs::Nothing, addprocs_function::Nothing, heap_size_hint_in_bytes::Nothing, runtests::Bool, saved_state::Nothing, return_state::Bool, verbosity::Int64, progress::Bool, v_dim_out::Val{1})
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:571
[6] equation_search
@ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:449 [inlined]
[7] #equation_search#26
@ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:412 [inlined]
[8] equation_search
@ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:360 [inlined]
[9] #equation_search#28
@ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:442 [inlined]
[10] pyjlany_call(self::typeof(equation_search), args_::Py, kwargs_::Py)
@ PythonCall.JlWrap ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl:36
[11] _pyjl_callmethod(f::Any, self_::Ptr{PythonCall.C.PyObject}, args_::Ptr{PythonCall.C.PyObject}, nargs::Int64)
@ PythonCall.JlWrap ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/base.jl:72
[12] _pyjl_callmethod(o::Ptr{PythonCall.C.PyObject}, args::Ptr{PythonCall.C.PyObject})
@ PythonCall.JlWrap.Cjl ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/C.jl:63
nested task error: Distributed.ProcessExitedException(423)
Stacktrace:
[1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
@ Base ./task.jl:931
[2] wait()
@ Base ./task.jl:995
[3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
@ Base ./condition.jl:130
[4] wait
@ ./condition.jl:125 [inlined]
[5] take_buffered(c::Channel{Any})
@ Base ./channels.jl:477
[6] take!(c::Channel{Any})
@ Base ./channels.jl:471
[7] take!(::Distributed.RemoteValue)
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:726
[8] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID; kwargs::@Kwargs{})
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:461
[9] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID)
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:454
[10] remotecall_fetch
@ ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:492 [inlined]
[11] call_on_owner
@ ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:565 [inlined]
[12] fetch(r::Distributed.Future)
@ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:619
[13] (::SymbolicRegression.var"#67#72"{SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Node{Float32}, Distributed.Future, Distributed.RemoteChannel}, Int64, Int64})()
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:984
Just to confirm, there is no crash now? Just that this message is printed?
I see this message sometimes during testing. So far, it has seemed to be harmless, and has never caused a crash; it simply indicates that one of the worker processes has exited, due to the search returning, and the @async fetch call on that worker failed.
However, if this is what is causing the error, perhaps it is not harmless, and we should close the asynchronous fetch tasks before the worker processes are killed.
I do think it would be better if there was a way to get multithreading to be faster, by increasing PYTHON_JULIACALL_THREADS before importing pysr. Windows multiprocessing seems to occasionally have issues for unknown reasons, and has been quite hard to debug, whereas multithreading has been quite stable.
This message appears when the search process reaches about 30%, and then the search process stops. I can try to reproduce it again to see if it crashes. Also, does using the slurm backend help avoid this problem?
Thanks. So if this reproduces on ubuntu, it seems like a deeper issue. Can you share your data so that I can reproduce it on my machine? If there is some script I can run which reproduces the error exactly on my computer it will be easier to help debug it.
Also, the more minimal the code, the easier it will be for me to debug it. So perhaps try (1) reducing the dataset size, (2) creating conditions that cause the error to occur earlier during training, (3) using fewer parameters of PySR.
I guess this might be hard to make a smaller MWE but (2) would be most useful.
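When the real data cannot be shared, one option for a reproducer is a small synthetic dataset with a fixed seed. The sketch below is an assumption-laden illustration: the function name, the value ranges, and the ground-truth formula are all invented, and the targets are kept strictly positive so that a percentage-error loss stays well defined.

```python
import math
import random


def make_synthetic_dataset(n_rows=200, n_features=3, seed=0):
    """Build a small, reproducible dataset (needs n_features >= 2)."""
    rng = random.Random(seed)
    X = [[rng.uniform(0.5, 2.0) for _ in range(n_features)] for _ in range(n_rows)]
    # Hypothetical ground truth, chosen only so that every target is > 0.
    y = [row[0] ** 2 + math.exp(row[1]) for row in X]
    return X, y


X, y = make_synthetic_dataset()
print(len(X), len(X[0]), min(y) > 0)
```

The lists can be converted with `numpy.asarray` before being passed to `fit`, and `n_rows` can be reduced further to shorten each round.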
The Slurm backend is only if you’re using a Slurm computing cluster, but won’t be available otherwise.
> To confirm, this was before importing PySR right? As a test, if you set it to 1, the CPU usage should only be 1 core.
> Also note that the PROCS env variable won’t have any effect.
I have confirmed this point. If I use os.environ["PYTHON_JULIACALL_THREADS"] = "1", it will warn: Warning: You are using multithreading mode, but only one thread is available. Try starting julia with --threads=auto.
> Thanks. So if this reproduces on ubuntu, it seems like a deeper issue. Can you share your data so that I can reproduce it on my machine? If there is some script I can run which reproduces the error exactly on my computer it will be easier to help debug it.
> Also, the more minimal the code, the easier it will be for me to debug it. So perhaps try (1) reducing the dataset size, (2) creating conditions that cause the error to occur earlier during training, (3) using fewer parameters of PySR.
> I guess this might be hard to make a smaller MWE but (2) would be most useful.
> The Slurm backend is only if you’re using a Slurm computing cluster, but won’t be available otherwise.
Thank you very much. I need to apply for permission to share the relevant code and data. In addition, I have an Ubuntu 20 server running a single-node Slurm. In a preliminary test, the calculation speed is consistent with multi-processing. I can test on that device to confirm whether it is a device problem.
> Just to confirm, there is no crash now? Just that this message is printed?
> I see this message sometimes during testing. So far, it has seemed to be harmless, and has never caused a crash; it simply indicates that one of the worker processes has exited, due to the search returning, and the @async fetch call on that worker failed.
> However, if this is what is causing the error, perhaps it is not harmless, and we should close the asynchronous fetch tasks before the worker processes are killed.
I have confirmed that this message causes the search process to be interrupted. I temporarily bypassed the crash by using try...except Exception... in the Python code, but the memory requested by Julia was not released. This caused my memory to fill up after 3 crashes. Can we use a try-finally block in the Julia source code to improve the stability of the program?
I think I have found a temporary solution for the time being, which is to manually end the julia process after each search.
import time, os
time.sleep(10)
os.system("killall julia")
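A related approach, not proposed in this thread and shown here only as a sketch, is to run each search in its own Python subprocess, so that memory held by the embedded Julia runtime is returned to the operating system when that subprocess exits. The helper name `run_isolated` and the script name in the usage comment are hypothetical, and whether leftover Julia worker processes are also cleaned up in every case is not something this thread confirms.

```python
import subprocess
import sys


def run_isolated(python_args, timeout=None):
    """Run `python <python_args...>` in a fresh interpreter and report the outcome.

    A crash in the child does not take down the parent loop.
    """
    result = subprocess.run(
        [sys.executable, *python_args],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr


# Usage with a hypothetical per-round script:
#   for round_id in range(100):
#       code, out, err = run_isolated(["one_search.py", str(round_id)])
#       if code != 0:
#           print(f"round {round_id} failed:", err[-500:])
code, out, err = run_isolated(["-c", "print('ok')"])
print(code, out.strip())
```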
Thanks. That is good to know.
I do think the way SymbolicRegression.jl launches processes is a bit problematic for large-scale use-cases at the moment. The way it works is that it calls addprocs from within SymbolicRegression.equation_search. This was designed for convenience of users, especially on the Python side, but as far as I can tell it's not well-supported behavior in Julia, which means it needs to do some very fragile things like manually copying function definitions to workers.
What would be better is if PySR did one of the following alternative strategies:
- … mpiexec manually, rather than launch the multi-processor search from a single Python session. However, it is nice that MPI has support as a standard on every cluster, so we wouldn't need to rely on different cluster manager-specific scripts.
- … PySRRegressor object itself would call addprocs, and store the processes as an attribute of the regressor object. It can pass these to equation_search via the procs keyword argument, in which case SymbolicRegression.jl will simply use them.
- … jl.seval is called with an @everywhere in front of it, thus executing each Julia snippet on all processes. This also means that it would be harder for users to use jl.seval themselves.
- … Main context, which might interfere with other Python+Julia packages in the future.
I'm not sure how much work each of these options would be. They might be fairly easy to get working though. But it would definitely require some Julia coding (if you are up for it).
Just going to keep this open until there's a better solution than a manual workaround. Ideally the workaround shouldn't be needed.
What happened?
The program crashed while using PySR, with an error message indicating a memory access violation (EXCEPTION_ACCESS_VIOLATION). This error occurred during the garbage collection process.
Version
v0.19.0
Operating System
Windows
Package Manager
pip
Interface
Script (i.e., python my_script.py)
Relevant log output
Extra Info
turbo=True, bumper=True