[BUG]: Memory leak(?) when using batching with large dataset (>450K items)

What happened?

It seems like when letting PySR run forever after a while it gets an OOM error after a while, but only when using a large dataset. I can watch it steadily grow memory.

I used the following setup in each of my tests changing the sample_size 10000 value to a specific value or commenting the three lines out to run the whole dataset:

import numpy as np
import json
from pysr import PySRRegressor

with open('test2.json', 'r') as file:
    data = json.load(file)

data_array = np.array(data)

sample_size = min(10000, len(data_array))
random_indices = np.random.choice(len(data_array), size=sample_size, replace=False)
data_array = data_array[random_indices]

# Split the array into X and y
X = data_array[:, :5]
y = data_array[:, 5]

model = PySRRegressor(
    procs=14,
    populations=32,
    population_size=200,
    ncycles_per_iteration=1000,
    niterations=10000000,
    binary_operators=["-", "+", "*", "/", "^"],
    unary_operators=[
        "square",
    ],
    nested_constraints={
        "square": {"square": 1 }
    },
    constraints={
        "^": (-1, 8),
    },
    elementwise_loss="""
loss(prediction, target) = try 
    trunc(Int, prediction) == trunc(Int, target) ? 0 : (prediction - target)^2
catch
    (prediction - target)^2
end
""",
    maxsize=40,
    maxdepth=10,
    complexity_of_constants=2,
    batching=True,
    batch_size=200,
    heap_size_hint_in_bytes=200000000,
)

model.fit(X, y)

My dataset has 460K records which I know isn't advised, but it's for a niche problem. The memory issues appears to only happen when running on a dataset over a certain size.

I saw a comment about heap_size_hint_in_bytes needing to be set and I've played with values for it, but it doesn't appear to change the behavior. I've set it to 1.2 GB and also 200MB for instance. I've tried smaller batch sizes like 50 and it doesn't appear to change the behavior either. None of the other settings appear to change things either. I've tried procs=8 and smaller populations and population_size, and smaller ncycles_per_iteration.

100K random records then WSL starts at 3.85 GB. At 10 minutes 4GB. At 50 minutes 4.3 GB. At 1 hour 3.2 GB. At 1.5 hours 3 GBs. No issues.

200K random records then WSL starts at ~4GB. At 20 minutes 4.2GB, At 1 hour 3.6GB. At 2 hours 15 minutes 3.5GB. No issues.

300K random records then WSL starts at ~4.3GB. At 20 minutes 5.3GB. Grew to 5.9 GBs then dropped to 5.2GB at 30 minutes. 1 hour 6GB. 1 hour 15 minutes 5.4GB. 1 hour 30 minutes 4.7GB. 1 hour 32 minutes 4.4GB. No issues.

400k random records then WSL starts at ~4.2GB. 3 minutes 4.5GB. 8 hours 30 minutes 7.7GB.

460K then WSL starts at ~4.8GB. 2 minutes 5.2GB. 30 minutes 9.6GB, 40 minutes 12.3GB. 55 minutes 14.5 GB. 1 hour 15.1GB. 1 hour 10 minutes 15.4GB. 1 hour 19 minutes 15.5 GB. 1 hour 26 minutes OOM. I ran this also using:

sample_size = min(460309, len(data_array))

Just to be sure and it failed at 1 hour 23 minutes so no difference.

I've attached my test2.json file. test2.json

Version

0.19.4

Operating System

Linux

Package Manager

pip

Interface

IPython Terminal

Relevant log output

No response

Extra Info

No response

How much memory does your machine have? I fear this OOM error might be “real” in the sense that it is legitimately not enough memory for doing the calculations up to this maxsize and in this many processes (with this large a dataset) at once. Keep in mind that even with batching, it will still periodically evaluate and compute gradients on the entire dataset (although perhaps in the future we could aggregate over batches rather than all at once…). I think it may be trending towards more complex expressions which is why the memory usage increases later in training.

Although I don’t want to rule out a memory leak issue, it could still be the case.

Maybe you could try with fewer processes, but the same 200MB heap size limit, and see if the OOM still occurs?

Thanks for the detailed report!

How much memory does your machine have?

I have 32GB with around 16GB free when I'm testing. I have tried running this with procs=8 with the same behavior.

Maybe you could try with fewer processes, but the same 200MB heap size limit, and see if the OOM still occurs?

I just ran it with procs=4, same settings as above with sample_size = min(460309, len(data_array)). WSL starts at 4.8GB. 5 minutes 6GB. 15 minutes 7.7GB. 33 minutes 11.2GB. 40 minutes 12GB. 45 minutes 11.6GB. 50 minutes 12.8GB. 1 hour 14.3GB. 1 hour 10 minutes 15.4GB. 1 hour 20 minutes 15.6GB. And OOM at 01:20:10.

This seems to be independent of procs based on this testing.

I think it may be trending towards more complex expressions which is why the memory usage increases later in training.

Is it continuously updating this memory with new expressions? If it could offload to storage feasibly that would be nice. (I have TBs of storage). The loss seems to slowly go down, so I'm interested in having this run forever.

Thanks, that's interesting. Let me walk you through my thought process on debugging –

So a vector of length 460k with precision=64 (default) is about 4 MB. For a maxsize of 40, this means that at worst, let's say an expression evaluation will have need to allocate 21 arrays of that size (one for each leaf), meaning 84 MB of memory usage per evaluation. Then, with 14 processes running concurrently, this is about 1.2 GB of memory being used for a single evaluation at every process at the same time.

Now, one tricky part is that PySR can also use multithreading during evaluations, at this part of the code: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/cd23a6e25c64d00565c3ae3905d06dc3c63033ed/src/SingleIteration.jl#L112.

    @threads_if !(options.deterministic) for j in 1:(pop.n)
        if options.should_simplify
            tree = pop.members[j].tree
            tree = simplify_tree!(tree, options.operators)
            if tree isa Node
                tree = combine_operators(tree, options.operators)
            end
            pop.members[j].tree = tree
        end
        if options.should_optimize_constants && do_optimization[j]
            # TODO: Might want to do full batch optimization here?
            pop.members[j], array_num_evals[j] = optimize_constants(
                dataset, pop.members[j], options
            )
        end
    end

By default threading is enabled (unless you set the environment variable PYTHON_JULIACALL_THREADS and JULIA_NUM_THREADS to 1, which means 1 thread per Julia process)

this means if each process has like 14 threads (which might be automatically set if you have 14 cores), at worst, assuming all the processes hit this part of the code at the same time, there is 16.8 GB of memory usage. That is the very worst case though. And since Julia uses garbage collection, memory is not deallocated immediately (unless you were to use bumper=True which uses bump allocation instead of garbage collection), so it could be that memory hangs around and looks similar to a memory leak.

Maybe that could be something to try first. Does bumper=True help at all? Or do you still see the same behavior?

It's weird that using fewer processes doesn't help. Maybe you could also try multithreading=False, procs=0 which should use serial mode? That should be easier on memory consumption.

precision=64 (default)

https://astroautomata.com/PySR/api/ This says default is 32.

Maybe you could also try multithreading=False, procs=0 which should use serial mode?

I ran that test overnight and it indeed runs without issues. I ran it for over 10 hours and it never went over 2GB of memory. I'd probably need to play with settings though as the loss kind of stopped lowering it seemed. (It was above the levels I got when using procs=14 and such I assume due to less variation or something).

Does bumper=True help at all?

OOM at 5 hours and 51 minutes. That does appear to help.

I'll try running it with lower procs and bumper to see what happens.

Sorry you are totally right about the precision, oops!

Really interesting that bumper didn’t help fix it. If that has no effect then I am a bit confused as to the cause. Do you have another machine to test this on, preferably one that isn’t Windows? I know that Windows can sometimes have issues with multiprocessing in Julia. Maybe the garbage collection is having issues.

It would also be interesting to test this on Julia 1.11 (not yet released) as I know they’ve improved the garbage collection in a certain way. Perhaps it can help with issues like this.

Though I admit I’m confused as to the actual source of the memory usage here since the bumper option didn’t change it.

Lastly, have you also tried multithreading instead of multiprocessing, and how did it change things?

Oh wait it sounds like bumper actually did help, since the OOM happened 4x later? I guess it’s just the remaining cause of the continual increase.

Are you also able to monitor per-process memory consumption? My current guess is that the head process is using all of the memory while the children processes use something close to that requested (200 MB). This is because JuliaCall does not yet give a way to size a heap size hint: https://github.com/JuliaPy/PythonCall.jl/issues/546 which would be required to set it for the head process (children processes are set up from within the backend, so should work fine).

Lastly, have you also tried multithreading instead of multiprocessing, and how did it change things?

Since multithreading=True by default I've been using a single process this whole time it seems. As seen here:

When using the above procs=14 setups WSL is using 9.8GB at 3 hours top -H -p <pid> looks like:

So I've been using threads. I tested multiprocessing and it failed after 32 minutes outputting this:

Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base ./stream.jl:410
  [2] (::Base.var"#wait_locked#739")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:949
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:955
  [4] unsafe_read
    @ ./io.jl:774 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:773
  [6] read!
    @ ./io.jl:775 [inlined]
  [7] deserialize_hdr_raw
    @ ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
---------------------------------------------------------------------------
JuliaError                                Traceback (most recent call last)
Cell In[1], line 54
     19 y = data_array[:, 5]
     21 model = PySRRegressor(
     22     multithreading=False,
     23     procs=16,
   (...)
     51     bumper=True
     52 )
---> 54 model.fit(X, y)

File ~/centripetalcatmull/lib/python3.10/site-packages/pysr/sr.py:2087, in PySRRegressor.fit(self, X, y, Xresampled, weights, variable_names, complexity_of_variables, X_units, y_units)
   2084     self._checkpoint()
   2086 # Perform the search:
-> 2087 self._run(X, y, runtime_params, weights=weights, seed=seed)
   2089 # Then, after fit, we save again, so the pickle file contains
   2090 # the equations:
   2091 if not self.temp_equation_file:

File ~/centripetalcatmull/lib/python3.10/site-packages/pysr/sr.py:1890, in PySRRegressor._run(self, X, y, runtime_params, weights, seed)
   1887     jl_y_variable_names = None
   1889 PythonCall.GC.disable()
-> 1890 out = SymbolicRegression.equation_search(
   1891     jl_X,
   1892     jl_y,
   1893     weights=jl_weights,
   1894     niterations=int(self.niterations),
   1895     variable_names=jl_array([str(v) for v in self.feature_names_in_]),
   1896     display_variable_names=jl_array(
   1897         [str(v) for v in self.display_feature_names_in_]
   1898     ),
   1899     y_variable_names=jl_y_variable_names,
   1900     X_units=jl_array(self.X_units_),
   1901     y_units=(
   1902         jl_array(self.y_units_)
   1903         if isinstance(self.y_units_, list)
   1904         else self.y_units_
   1905     ),
   1906     options=options,
   1907     numprocs=cprocs,
   1908     parallelism=parallelism,
   1909     saved_state=self.julia_state_,
   1910     return_state=True,
   1911     addprocs_function=cluster_manager,
   1912     heap_size_hint_in_bytes=self.heap_size_hint_in_bytes,
   1913     progress=progress and self.verbosity > 0 and len(y.shape) == 1,
   1914     verbosity=int(self.verbosity),
   1915 )
   1916 PythonCall.GC.enable()
   1918 self.julia_state_stream_ = jl_serialize(out)

File ~/.julia/packages/PythonCall/sQSpa/src/JlWrap/any.jl:240, in __call__(self, *args, **kwargs)
    238     return ValueBase.__dir__(self) + self._jl_callmethod($(pyjl_methodnum(pyjlany_dir)))
    239 def __call__(self, *args, **kwargs):
--> 240     return self._jl_callmethod($(pyjl_methodnum(pyjlany_call)), args, kwargs)
    241 def __bool__(self):
    242     return True

JuliaError: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:352 [inlined]
  [2] fetch
    @ ./task.jl:372 [inlined]
  [3] _main_search_loop!(state::SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Node{Float32}, Distributed.Future, Distributed.RemoteChannel}, datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, false, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}})
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:882
  [4] _equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, false, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, saved_state::Nothing)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:599
  [5] equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}; niterations::Int64, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, false, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, parallelism::String, numprocs::Int64, procs::Nothing, addprocs_function::Nothing, heap_size_hint_in_bytes::Int64, runtests::Bool, saved_state::Nothing, return_state::Bool, verbosity::Int64, progress::Bool, v_dim_out::Val{1})
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:571
  [6] equation_search
    @ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:449 [inlined]
  [7] #equation_search#26
    @ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:412 [inlined]
  [8] equation_search
    @ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:360 [inlined]
  [9] #equation_search#28
    @ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:442 [inlined]
 [10] pyjlany_call(self::typeof(equation_search), args_::Py, kwargs_::Py)
    @ PythonCall.JlWrap ~/.julia/packages/PythonCall/sQSpa/src/JlWrap/any.jl:40
 [11] _pyjl_callmethod(f::Any, self_::Ptr{PythonCall.C.PyObject}, args_::Ptr{PythonCall.C.PyObject}, nargs::Int64)
    @ PythonCall.JlWrap ~/.julia/packages/PythonCall/sQSpa/src/JlWrap/base.jl:73
 [12] _pyjl_callmethod(o::Ptr{PythonCall.C.PyObject}, args::Ptr{PythonCall.C.PyObject})
    @ PythonCall.JlWrap.Cjl ~/.julia/packages/PythonCall/sQSpa/src/JlWrap/C.jl:63

    nested task error: Distributed.ProcessExitedException(14)
    Stacktrace:
      [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
        @ Base ./task.jl:931
      [2] wait()
        @ Base ./task.jl:995
      [3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
        @ Base ./condition.jl:130
      [4] wait
        @ ./condition.jl:125 [inlined]
      [5] take_buffered(c::Channel{Any})
        @ Base ./channels.jl:477
      [6] take!(c::Channel{Any})
        @ Base ./channels.jl:471
      [7] take!(::Distributed.RemoteValue)
        @ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:726
      [8] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID; kwargs::@Kwargs{})
        @ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:461
      [9] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID)
        @ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:454
     [10] remotecall_fetch
        @ ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:492 [inlined]
     [11] call_on_owner
        @ ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:565 [inlined]
     [12] fetch(r::Distributed.Future)
        @ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:619
     [13] (::SymbolicRegression.var"#67#72"{SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Node{Float32}, Distributed.Future, Distributed.RemoteChannel}, Int64, Int64})()
        @ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:984

My program:

import numpy as np
import json
from pysr import PySRRegressor

with open('test2.json', 'r') as file:
    data = json.load(file)

data_array = np.array(data)

print(len(data_array));

sample_size = min(460309, len(data_array))
print(sample_size);
random_indices = np.random.choice(len(data_array), size=sample_size, replace=False)
data_array = data_array[random_indices]

# Split the array into X and y
X = data_array[:, :5]
y = data_array[:, 5]

model = PySRRegressor(
    multithreading=False,
    procs=16,
    populations=16,
    population_size=50,
    ncycles_per_iteration=1000,
    niterations=10000000,
    binary_operators=["-", "+", "*", "/", "^"],
    unary_operators=[
        "square",
    ],
    nested_constraints={
        "square": {"square": 1 }
    },
    constraints={
        "^": (-1, 8),
    },
    elementwise_loss="""
loss(prediction, target) = try 
    trunc(Int, prediction) == trunc(Int, target) ? 0 : (prediction - target)^2
catch
    (prediction - target)^2
end
""",
    maxsize=40,
    maxdepth=10,
    complexity_of_constants=2,
    batching=True,
    batch_size=1000,
    heap_size_hint_in_bytes=200000000,
    bumper=True
)

model.fit(X, y)

(The ncycles_per_iteration is high because lower values had the head worker at 99%).

Since multithreading=True by default I've been using a single process this whole time it seems. As seen here:

Ah, the heap_size_hint_in_bytes not affecting things makes much more sense now. The heap_size_hint_in_bytes is only used when multithreading=False.

This parameter is basically used to prevent OOM errors. When the memory usage gets close to the heap size hint, Julia's garbage collection gets much more aggressive. However, in PySR, this parameter only gets set in new processes. This is because Julia is not aware of the memory usage of other Julia processes, so it can help to set it in advance for newly spawned processes.

For a single process, it's usually not needed. I guess it is here because the garbage collection isn't working hard enough on WSL (?). Right now you can't set the heap size hint on the main Julia process if using PythonCall.jl. So, let me prioritise https://github.com/JuliaPy/PythonCall.jl/issues/546 and try to add this, and we can see if it helps at all.

Ok I made a PR to PythonCall.jl here: https://github.com/JuliaPy/PythonCall.jl/pull/547. Once that gets in you can see if that fixes it (will be available via the PYTHON_JULIACALL_HEAP_SIZE_HINT environment variable; see https://julialang.org/blog/2023/04/julia-1.9-highlights/#memory_usage_hint_for_the_gc_with_--heap-size-hint)

MilesCranmer / PySR