[BUG]: EXCEPTION_ACCESS_VIOLATION during garbage collection in PySR

zzccchen commented 4 months ago

What happened?

The program crashed while using PySR, with an error message indicating a memory access violation (EXCEPTION_ACCESS_VIOLATION). This error occurred during the garbage collection process.

Version

v0.19.0

Operating System

Windows

Package Manager

pip

Interface

Script (i.e., python my_script.py)

Relevant log output

[ Info: Automatically setting `--heap-size-hint=2730M` on each Julia process. You can configure this with the `heap_size_hint_in_bytes` parameter.
[ Info: Importing SymbolicRegression on workers as well as extensions Bumper, LoopVectorization.
[ Info: Finished!
[ Info: Copying definition of loss_fnc to workers...
[ Info: Finished!
[ Info: Started!
32.1%┣██████████                    ┫ 1.0k/3.2k [00:40<01:26, 25it/s]
Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ffa6106a6b0 -- gc_mark_outrefs at C:/workdir/src\gc.c:2527 [inlined]
gc_mark_and_steal at C:/workdir/src\gc.c:2746
in expression starting at none:0
gc_mark_outrefs at C:/workdir/src\gc.c:2527 [inlined]
gc_mark_and_steal at C:/workdir/src\gc.c:2746
gc_mark_loop_parallel at C:/workdir/src\gc.c:2885
jl_gc_mark_threadfun at C:/workdir/src\partr.c:142
uv__thread_start at /workspace/srcdir/libuv\src/win\thread.c:111
beginthreadex at C:\Windows\System32\msvcrt.dll (unknown line)
endthreadex at C:\Windows\System32\msvcrt.dll (unknown line)
BaseThreadInitThunk at C:\Windows\System32\KERNEL32.DLL (unknown line)
RtlUserThreadStart at C:\Windows\SYSTEM32\ntdll.dll (unknown line)
Allocations: 9815735891 (Pool: 9517376769; Big: 298359122); GC: 69400

Extra Info

turbo=True, bumper=True

MilesCranmer commented 4 months ago

Can you try with turbo=False, bumper=False? Those options are experimental and get PySR to use libraries which are bleeding edge. When they work, they are really fast, but they can also cause crashes (especially on Windows).

zzccchen commented 4 months ago

Unfortunately, I tried turbo=False, bumper=False and the crash still occurred.

zzccchen commented 4 months ago

Could automatically setting --heap-size-hint=2730M cause this problem?

MilesCranmer commented 4 months ago

Hm, can you show the rest of your code?

zzccchen commented 4 months ago

from pysr import PySRRegressor

# data load code

X_123e = data_X_123e.to_numpy()
y_123e = data_y_123e.to_numpy()

sr_model = PySRRegressor(
    binary_operators=[
        "*",
        "+",
        "-",
        "/",
    ],
    unary_operators=["square", "cube", "exp", "log", "sqrt"],
    maxsize=80, 
    maxdepth=10,  
    niterations=100, 
    populations=32, 
    population_size=100, 
    ncycles_per_iteration=550, 
    constraints={
        "/": (-1, 9),
        "^": (-1, 5),
        "exp": 6,
        "square": 6,
        "cube": 6,
        "log": 6,
        "sqrt": 6,
        "abs": 9,
    },
    nested_constraints={
        "square": {"square": 0, "cube": 0, "exp": 1},
        "cube": {"square": 0, "cube": 0, "exp": 1},
        "exp": {"square": 0, "cube": 0, "exp": 0},
        "sqrt": {"sqrt": 0, "log": 0},
        "log": {"log": 0},
    },
    complexity_of_operators={
        "square": 2,
        "cube": 3,
        "exp": 3,
        "log": 3,
        "sqrt": 2,
    },
    complexity_of_constants=4,
    adaptive_parsimony_scaling=150.0,
    weight_add_node=0.79,
    weight_insert_node=5.1,
    weight_delete_node=1.7,
    weight_do_nothing=0.21,
    weight_mutate_constant=0.048,
    weight_mutate_operator=0.47,
    weight_swap_operands=0.1,
    weight_randomize=0.23,
    weight_simplify=0.5,
    weight_optimize=0.5,
    crossover_probability=0.066,
    perturbation_factor=0.076,
    cluster_manager=None,
    precision=32,
    turbo=True,
    bumper=True,
    progress=True,
    elementwise_loss="""
    function loss_fnc(prediction, target)
        percentage_error = abs((prediction - target) / target) * 100
        return percentage_error
    end
    """,
    multithreading=False,
    equation_file=symbol_regression_csv_path,
)

complexity_of_variables = [] # list of complexity
sr_model.fit(
    X_123e, y_123e, complexity_of_variables=complexity_of_variables
)

Here is the main code of the workflow.

zzccchen commented 4 months ago

I also run the above code inside a multi-level loop to test different feature data sets and the stability of the symbolic regression results. A single iteration takes about 2.2 minutes. The program crashes after running for 3-4 hours, after about 80-110 iterations.

MilesCranmer commented 4 months ago

That looks good. Great to see all those options being used! 🙂

(Random comment: your elementwise loss divides by the target, so make sure the target > 0, otherwise one target will dominate. But I’m assuming you’re aware of that!)
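
To illustrate why the target values matter for this loss, here is a small Python sketch that mirrors the percentage-error loss from the script above. It is a re-implementation for illustration only, not PySR code, and `percentage_error` and `check_targets` are hypothetical helper names. It shows how a target close to zero can produce an error that swamps all the others.

```python
def percentage_error(prediction: float, target: float) -> float:
    """Python mirror of the Julia `loss_fnc` above (illustration only)."""
    return abs((prediction - target) / target) * 100


def check_targets(targets) -> None:
    """Raise if any target is not strictly positive (the condition suggested above)."""
    bad = [t for t in targets if t <= 0]
    if bad:
        raise ValueError(f"{len(bad)} target(s) are <= 0, e.g. {bad[0]}")


# The same absolute error of 0.1 gives very different percentage errors:
large_target_error = percentage_error(100.1, 100.0)  # about 0.1 %
small_target_error = percentage_error(0.101, 0.001)  # about 10000 %

check_targets([100.0, 0.001])  # passes: both targets are positive
```

Running a check like this on the targets before calling fit would catch zero or negative values early.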

Other comment: can you try with multithreading=True? With it set to False, and with procs>0 (the default), it will use multiple Julia processes. But if you just use multi-threading instead, it will start up much faster and hopefully be more stable. With multi-processing it is launching new Julia processes every single time it searches. (This is a weakness in the current codebase; I would like to eventually store the processes within PySRRegressor so multiprocessing has fast startup too.)

You can also set multithreading=False, procs=0 to use serial mode.
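
The three modes described above can be summarized as keyword arguments. This is only a sketch: `multithreading` and `procs` are the parameters named in this thread, while `parallelism_kwargs` and the mode labels are invented here for illustration.

```python
def parallelism_kwargs(mode: str) -> dict:
    """Return PySRRegressor keyword arguments for one parallelism mode.

    Only `multithreading` and `procs` come from the discussion above;
    the mode names are labels invented for this sketch.
    """
    modes = {
        # Threads inside a single Julia process.
        "multithreading": {"multithreading": True},
        # Separate Julia worker processes (procs > 0 is the default,
        # so procs is left unset here).
        "multiprocessing": {"multithreading": False},
        # No parallelism at all.
        "serial": {"multithreading": False, "procs": 0},
    }
    if mode not in modes:
        raise ValueError(f"unknown mode {mode!r}; choose from {sorted(modes)}")
    return dict(modes[mode])


serial_kwargs = parallelism_kwargs("serial")
# Usage (requires pysr to be installed):
#   from pysr import PySRRegressor
#   model = PySRRegressor(niterations=100, **serial_kwargs)
```

Keeping the mode in one place makes it easy to rerun the same script in each mode when checking which one crashes.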

But it’s curious that it crashes. Since it runs for a few hours, did you notice anything else happening, like the memory usage gradually increasing over that time and not going down?

zzccchen commented 4 months ago

If I use multithreading instead of multiprocessing, the calculation speed drops from 30it/s to 7it/s on my device, which is not acceptable to me. In addition, I have made sure that my y_true values are all greater than 0. The memory usage does not fluctuate when the program crashes; it occupies only 30% of the total memory.

MilesCranmer commented 4 months ago

Maybe try multithreading=True again, but this time, before loading PySR, set a larger thread count:

import os
# Environment variables must be strings; set this before importing pysr.
os.environ["PYTHON_JULIACALL_THREADS"] = str(num_cores * 2)

Where num_cores is the number of CPU cores. The factor of 2 is so there’s some redundancy but you could try more or less depending on performance.

The default behavior of PySR is to start Julia with --threads='auto' which is actually fewer than the number of available cores (so it doesn’t take up the whole CPU). But for high performance you can increase the usage.

The full list of available juliacall environment variables is here: https://juliapy.github.io/PythonCall.jl/stable/juliacall/#julia-config
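
A slightly more defensive version of the same idea, as a sketch: `set_julia_threads` is a hypothetical helper, and the check on `sys.modules` assumes that Julia starts when `juliacall` is first imported, so the variable would most likely be ignored after that point.

```python
import os
import sys


def set_julia_threads(num_threads: int) -> str:
    """Set PYTHON_JULIACALL_THREADS; call this before importing pysr/juliacall."""
    if "juliacall" in sys.modules:
        # Julia has already started in this process, so the setting
        # would most likely have no effect.
        raise RuntimeError("set PYTHON_JULIACALL_THREADS before importing pysr")
    value = str(num_threads)  # os.environ only accepts strings
    os.environ["PYTHON_JULIACALL_THREADS"] = value
    return value


# os.cpu_count() reports logical CPUs and may return None, hence the fallback.
num_cores = os.cpu_count() or 1
set_julia_threads(num_cores * 2)

# Only now import PySR:
#   from pysr import PySRRegressor
```

Setting the value to 1 and watching CPU usage is a quick way to confirm the variable is actually being picked up.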

zzccchen commented 4 months ago

I tried

import os
os.environ["PYTHON_JULIACALL_THREADS"] = "64"
# or
os.environ["PYTHON_JULIACALL_THREADS"] = "64"
os.environ["PYTHON_JULIACALL_PROCS"] = "64"

But it did not improve the calculation speed; the processor usage was only 20-30%. I am using a 24-core/32-thread 14900K processor.

MilesCranmer commented 4 months ago

To confirm, this was before importing PySR, right? As a test, if you set it to 1, the CPU usage should only be 1 core.

Also note that the PROCS env variable won’t have any effect.

zzccchen commented 4 months ago

I had a similar problem after I gave up on Windows and moved to Ubuntu 24.04 LTS. I also used a tool (tm5) to test the memory. After testing for 1 hour, there were no errors and the temperature was stable at 45℃. It doesn't seem to be a hardware problem. This problem is so strange.

Traceback (most recent call last):
  File "/home/zc/Documents/GitHub/MLPIP/notebooks/TC/S2_symbol_regression/S202_sr_123e.py", line 192, in <module>
    sr_model.fit(
  File "/home/zc/miniconda3/envs/MLPIP_ENV_PIP/lib/python3.11/site-packages/pysr/sr.py", line 2088, in fit
    self._run(X, y, runtime_params, weights=weights, seed=seed)
  File "/home/zc/miniconda3/envs/MLPIP_ENV_PIP/lib/python3.11/site-packages/pysr/sr.py", line 1890, in _run
    out = SymbolicRegression.equation_search(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zc/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl", line 223, in __call__
    return self._jl_callmethod($(pyjl_methodnum(pyjlany_call)), args, kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base ./stream.jl:410
  [2] (::Base.var"#wait_locked#739")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:949
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:955
  [4] unsafe_read
    @ ./io.jl:774 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:773
  [6] read!
    @ ./io.jl:775 [inlined]
  [7] deserialize_hdr_raw
    @ ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
juliacall.JuliaError: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:352 [inlined]
  [2] fetch
    @ ./task.jl:372 [inlined]
  [3] _main_search_loop!(state::SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Node{Float32}, Distributed.Future, Distributed.RemoteChannel}, datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Vector{Int64}}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, true, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}})
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:882
  [4] _equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Vector{Int64}}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, true, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, saved_state::Nothing)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:599
  [5] equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}; niterations::Int64, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Vector{Int64}}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, true, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, parallelism::String, numprocs::Int64, procs::Nothing, addprocs_function::Nothing, heap_size_hint_in_bytes::Nothing, runtests::Bool, saved_state::Nothing, return_state::Bool, verbosity::Int64, progress::Bool, v_dim_out::Val{1})
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:571
  [6] equation_search
    @ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:449 [inlined]
  [7] #equation_search#26
    @ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:412 [inlined]
  [8] equation_search
    @ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:360 [inlined]
  [9] #equation_search#28
    @ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:442 [inlined]
 [10] pyjlany_call(self::typeof(equation_search), args_::Py, kwargs_::Py)
    @ PythonCall.JlWrap ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl:36
 [11] _pyjl_callmethod(f::Any, self_::Ptr{PythonCall.C.PyObject}, args_::Ptr{PythonCall.C.PyObject}, nargs::Int64)
    @ PythonCall.JlWrap ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/base.jl:72
 [12] _pyjl_callmethod(o::Ptr{PythonCall.C.PyObject}, args::Ptr{PythonCall.C.PyObject})
    @ PythonCall.JlWrap.Cjl ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/C.jl:63

    nested task error: Distributed.ProcessExitedException(423)
    Stacktrace:
      [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
        @ Base ./task.jl:931
      [2] wait()
        @ Base ./task.jl:995
      [3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
        @ Base ./condition.jl:130
      [4] wait
        @ ./condition.jl:125 [inlined]
      [5] take_buffered(c::Channel{Any})
        @ Base ./channels.jl:477
      [6] take!(c::Channel{Any})
        @ Base ./channels.jl:471
      [7] take!(::Distributed.RemoteValue)
        @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:726
      [8] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID; kwargs::@Kwargs{})
        @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:461
      [9] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID)
        @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:454
     [10] remotecall_fetch
        @ ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:492 [inlined]
     [11] call_on_owner
        @ ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:565 [inlined]
     [12] fetch(r::Distributed.Future)
        @ Distributed ~/miniconda3/envs/MLPIP_ENV_PIP/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:619
     [13] (::SymbolicRegression.var"#67#72"{SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Node{Float32}, Distributed.Future, Distributed.RemoteChannel}, Int64, Int64})()
        @ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:984

MilesCranmer commented 4 months ago

Just to confirm, there is no crash now? Just that this message is printed?

I see this message sometimes during testing. So far, it has seemed to be harmless, and has never caused a crash – it simply indicates that one of the worker processes has exited, due to the search returning, and the @async fetch call on that worker failed.

However, if this is what is causing the error, perhaps it is not harmless, and we should close the asynchronous fetch tasks before the worker processes are killed.

MilesCranmer commented 4 months ago

I do think it would be better if there was a way to get multithreading to be faster, by increasing PYTHON_JULIACALL_THREADS before importing pysr. Windows multiprocessing seems to occasionally have issues for unknown reasons, and has been quite hard to debug, whereas multithreading has been quite stable.

zzccchen commented 4 months ago

This message appears when the search process reaches about 30%, and then the search process stops. I can try to reproduce it again to see if it crashes. Also, does using the slurm backend help avoid this problem?

MilesCranmer commented 4 months ago

Thanks. So if this reproduces on ubuntu, it seems like a deeper issue. Can you share your data so that I can reproduce it on my machine? If there is some script I can run which reproduces the error exactly on my computer it will be easier to help debug it.

Also, the more minimal the code, the easier it will be for me to debug it. So perhaps try (1) reducing the dataset size, (2) creating conditions that cause the error to occur earlier during training, (3) using fewer parameters of PySR.

I guess it might be hard to make a smaller MWE, but (2) would be most useful.

The Slurm backend is only if you’re using a Slurm computing cluster, but won’t be available otherwise.

zzccchen commented 4 months ago

> To confirm, this was before importing PySR, right? As a test, if you set it to 1, the CPU usage should only be 1 core.
>
> Also note that the PROCS env variable won’t have any effect.

I have confirmed this point. If I use os.environ["PYTHON_JULIACALL_THREADS"] = "1", it prints the warning: "You are using multithreading mode, but only one thread is available. Try starting julia with --threads=auto."

zzccchen commented 4 months ago

> Thanks. So if this reproduces on ubuntu, it seems like a deeper issue. Can you share your data so that I can reproduce it on my machine? If there is some script I can run which reproduces the error exactly on my computer it will be easier to help debug it.
>
> Also, the more minimal the code, the easier it will be for me to debug it. So perhaps try (1) reducing the dataset size, (2) creating conditions that cause the error to occur earlier during training, (3) using fewer parameters of PySR.
>
> I guess it might be hard to make a smaller MWE, but (2) would be most useful.
>
> The Slurm backend is only if you’re using a Slurm computing cluster, but won’t be available otherwise.

Thank you very much. I need to apply for permission to share the relevant code and data. In addition, I have an Ubuntu 20 server running a single-node Slurm. In a preliminary test, the calculation speed was consistent with multiprocessing. I can test on that device to confirm whether it is a device problem.

zzccchen commented 4 months ago

> Just to confirm, there is no crash now? Just that this message is printed?
>
> I see this message sometimes during testing. So far, it has seemed to be harmless, and has never caused a crash – it simply indicates that one of the worker processes has exited, due to the search returning, and the @async fetch call on that worker failed.
>
> However, if this is what is causing the error, perhaps it is not harmless, and we should close the asynchronous fetch tasks before the worker processes are killed.

I have confirmed that this error interrupts the search process. I temporarily worked around the crash by using try...except Exception... in the Python code, but the memory allocated by Julia was not released, so my memory filled up after 3 crashes. Could a try-finally block be used in the Julia source code to improve the stability of the program?

error_log.txt

zzccchen commented 4 months ago

I think I have found a temporary workaround, which is to manually kill the julia processes after each search.

import time, os
time.sleep(10)
os.system("killall julia")
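
A possible alternative to killing processes by name, sketched here under assumptions: run each round of the loop in its own Python subprocess, so that the embedded Julia runtime and its memory go away when that subprocess exits. This is a different technique from the killall workaround and has not been tested against this crash; `run_round` and the placeholder child code are hypothetical.

```python
import subprocess
import sys


def run_round(child_code: str, timeout_s: float = 600.0) -> bool:
    """Run one round in a fresh Python process; return True on a clean exit."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_code],
            timeout=timeout_s,
            capture_output=True,
            text=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising this exception.
        return False
    if result.returncode != 0:
        # Show the end of the child's stderr to help diagnose the failure.
        print(result.stderr[-2000:], file=sys.stderr)
    return result.returncode == 0


# Placeholder children; in practice the child would import pysr, fit,
# and save its results to disk before exiting.
ok = run_round("print('one search round would run here')")
crashed = run_round("import sys; sys.exit(3)")
```

With this structure a crashed round only costs that round, and the outer loop can retry or skip it. Worker processes started by the child may still need to be checked for separately.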

MilesCranmer commented 4 months ago

Thanks. That is good to know.

I do think the way SymbolicRegression.jl launches processes is a bit problematic for large-scale use-cases at the moment. The way it works is that it calls addprocs from within SymbolicRegression.equation_search. This was designed for convenience of users, especially on the Python side, but as far as I can tell it's not well-supported behavior in Julia, which means it needs to do some very fragile things like manually copying function definitions to workers.

What would be better is if PySR did one of the following alternative strategies:

- For big jobs, use MPI directly, via MPI.jl. This would require the user to call mpiexec manually, rather than launch the multi-processor search from a single Python session. However, it is nice that MPI has support as a standard on every cluster, so we wouldn't need to rely on different cluster manager-specific scripts.
- Explore @oschulz's ParallelProcessingTools.jl as an alternative. This uses an elastic manager – which is actually designed for the things PySR is doing, like adding and removing workers. (Right now PySR basically misuses Distributed.jl to start new processes, send code to them, and finally kill them at the end of a search. It works and it's convenient, but I'm not sure it is a sustainable solution.)
- Start the workers from the Python side, rather than within Julia. Basically, the PySRRegressor object itself would call addprocs, and store the processes as an attribute of the regressor object. It can pass these to equation_search via the procs keyword argument, in which case SymbolicRegression.jl will simply use them.
  - However, this would require rewriting some of the Python side of things so that each jl.seval is called with an @everywhere in front of it – thus executing each Julia snippet on all processes. This also means that it would be harder for users to use jl.seval themselves.
  - This approach would also mean that we could wrap PySR in a Julia module, rather than the current approach of running everything in Julia's Main context – which might interfere with other Python+Julia packages in the future.

I'm not sure how much work each of these options would be. They might be fairly easy to get working though. But it would definitely require some Julia coding (if you are up for it).

MilesCranmer commented 4 months ago

Just going to keep this open until there's a better solution than a manual workaround. Ideally the workaround shouldn't be needed.
