Open sirisian opened 3 months ago
How much memory does your machine have? I fear this OOM error might be “real” in the sense that it is legitimately not enough memory for doing the calculations up to this maxsize and in this many processes (with this large a dataset) at once. Keep in mind that even with batching, it will still periodically evaluate and compute gradients on the entire dataset (although perhaps in the future we could aggregate over batches rather than all at once…). I think it may be trending towards more complex expressions which is why the memory usage increases later in training.
Although I don’t want to rule out a memory leak issue, it could still be the case.
Maybe you could try with fewer processes, but the same 200MB heap size limit, and see if the OOM still occurs?
Thanks for the detailed report!
How much memory does your machine have?
I have 32GB with around 16GB free when I'm testing. I have tried running this with procs=8 with the same behavior.
Maybe you could try with fewer processes, but the same 200MB heap size limit, and see if the OOM still occurs?
I just ran it with procs=4, same settings as above with sample_size = min(460309, len(data_array))
. WSL starts at 4.8GB. 5 minutes 6GB. 15 minutes 7.7GB. 33 minutes 11.2GB. 40 minutes 12GB. 45 minutes 11.6GB. 50 minutes 12.8GB. 1 hour 14.3GB. 1 hour 10 minutes 15.4GB. 1 hour 20 minutes 15.6GB. And OOM at 01:20:10.
This seems to be independent of procs based on this testing.
I think it may be trending towards more complex expressions which is why the memory usage increases later in training.
Is it continuously updating this memory with new expressions? If it could offload to storage feasibly that would be nice. (I have TBs of storage). The loss seems to slowly go down, so I'm interested in having this run forever.
Thanks, that's interesting. Let me walk you through my thought process on debugging –
So a vector of length 460k with precision=64
(default) is about 4 MB. For a maxsize of 40, this means that at worst, let's say an expression evaluation will have need to allocate 21 arrays of that size (one for each leaf), meaning 84 MB of memory usage per evaluation. Then, with 14 processes running concurrently, this is about 1.2 GB of memory being used for a single evaluation at every process at the same time.
Now, one tricky part is that PySR can also use multithreading during evaluations, at this part of the code: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/cd23a6e25c64d00565c3ae3905d06dc3c63033ed/src/SingleIteration.jl#L112.
@threads_if !(options.deterministic) for j in 1:(pop.n)
if options.should_simplify
tree = pop.members[j].tree
tree = simplify_tree!(tree, options.operators)
if tree isa Node
tree = combine_operators(tree, options.operators)
end
pop.members[j].tree = tree
end
if options.should_optimize_constants && do_optimization[j]
# TODO: Might want to do full batch optimization here?
pop.members[j], array_num_evals[j] = optimize_constants(
dataset, pop.members[j], options
)
end
end
By default threading is enabled (unless you set the environment variable PYTHON_JULIACALL_THREADS
and JULIA_NUM_THREADS
to 1
, which means 1 thread per Julia process)
this means if each process has like 14 threads (which might be automatically set if you have 14 cores), at worst, assuming all the processes hit this part of the code at the same time, there is 16.8 GB of memory usage. That is the very worst case though. And since Julia uses garbage collection, memory is not deallocated immediately (unless you were to use bumper=True
which uses bump allocation instead of garbage collection), so it could be that memory hangs around and looks similar to a memory leak.
Maybe that could be something to try first. Does bumper=True
help at all? Or do you still see the same behavior?
It's weird that using fewer processes doesn't help. Maybe you could also try multithreading=False, procs=0
which should use serial mode? That should be easier on memory consumption.
precision=64 (default)
https://astroautomata.com/PySR/api/ This says default is 32.
Maybe you could also try multithreading=False, procs=0 which should use serial mode?
I ran that test overnight and it indeed runs without issues. I ran it for over 10 hours and it never went over 2GB of memory. I'd probably need to play with settings though as the loss kind of stopped lowering it seemed. (It was above the levels I got when using procs=14 and such I assume due to less variation or something).
Does bumper=True help at all?
OOM at 5 hours and 51 minutes. That does appear to help.
I'll try running it with lower procs and bumper to see what happens.
Sorry you are totally right about the precision, oops!
Really interesting that bumper didn’t help fix it. If that has no effect then I am a bit confused as to the cause. Do you have another machine to test this on, preferably one that isn’t Windows? I know that Windows can sometimes have issues with multiprocessing in Julia. Maybe the garbage collection is having issues.
It would also be interesting to test this on Julia 1.11 (not yet released) as I know they’ve improved the garbage collection in a certain way. Perhaps it can help with issues like this.
Though I admit I’m confused as to the actual source of the memory usage here since the bumper option didn’t change it.
Lastly, have you also tried multithreading instead of multiprocessing, and how did it change things?
Oh wait it sounds like bumper actually did help, since the OOM happened 4x later? I guess it’s just the remaining cause of the continual increase.
Are you also able to monitor per-process memory consumption? My current guess is that the head process is using all of the memory while the children processes use something close to that requested (200 MB). This is because JuliaCall does not yet give a way to size a heap size hint: https://github.com/JuliaPy/PythonCall.jl/issues/546 which would be required to set it for the head process (children processes are set up from within the backend, so should work fine).
Lastly, have you also tried multithreading instead of multiprocessing, and how did it change things?
Since multithreading=True by default I've been using a single process this whole time it seems. As seen here:
When using the above procs=14 setups WSL is using 9.8GB at 3 hours top -H -p <pid>
looks like:
So I've been using threads. I tested multiprocessing and it failed after 32 minutes outputting this:
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
@ Base ./stream.jl:410
[2] (::Base.var"#wait_locked#739")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
@ Base ./stream.jl:949
[3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
@ Base ./stream.jl:955
[4] unsafe_read
@ ./io.jl:774 [inlined]
[5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
@ Base ./io.jl:773
[6] read!
@ ./io.jl:775 [inlined]
[7] deserialize_hdr_raw
@ ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/messages.jl:167 [inlined]
[8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:172
[9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
@ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:133
[10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
@ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:121
---------------------------------------------------------------------------
JuliaError Traceback (most recent call last)
Cell In[1], line 54
19 y = data_array[:, 5]
21 model = PySRRegressor(
22 multithreading=False,
23 procs=16,
(...)
51 bumper=True
52 )
---> 54 model.fit(X, y)
File ~/centripetalcatmull/lib/python3.10/site-packages/pysr/sr.py:2087, in PySRRegressor.fit(self, X, y, Xresampled, weights, variable_names, complexity_of_variables, X_units, y_units)
2084 self._checkpoint()
2086 # Perform the search:
-> 2087 self._run(X, y, runtime_params, weights=weights, seed=seed)
2089 # Then, after fit, we save again, so the pickle file contains
2090 # the equations:
2091 if not self.temp_equation_file:
File ~/centripetalcatmull/lib/python3.10/site-packages/pysr/sr.py:1890, in PySRRegressor._run(self, X, y, runtime_params, weights, seed)
1887 jl_y_variable_names = None
1889 PythonCall.GC.disable()
-> 1890 out = SymbolicRegression.equation_search(
1891 jl_X,
1892 jl_y,
1893 weights=jl_weights,
1894 niterations=int(self.niterations),
1895 variable_names=jl_array([str(v) for v in self.feature_names_in_]),
1896 display_variable_names=jl_array(
1897 [str(v) for v in self.display_feature_names_in_]
1898 ),
1899 y_variable_names=jl_y_variable_names,
1900 X_units=jl_array(self.X_units_),
1901 y_units=(
1902 jl_array(self.y_units_)
1903 if isinstance(self.y_units_, list)
1904 else self.y_units_
1905 ),
1906 options=options,
1907 numprocs=cprocs,
1908 parallelism=parallelism,
1909 saved_state=self.julia_state_,
1910 return_state=True,
1911 addprocs_function=cluster_manager,
1912 heap_size_hint_in_bytes=self.heap_size_hint_in_bytes,
1913 progress=progress and self.verbosity > 0 and len(y.shape) == 1,
1914 verbosity=int(self.verbosity),
1915 )
1916 PythonCall.GC.enable()
1918 self.julia_state_stream_ = jl_serialize(out)
File ~/.julia/packages/PythonCall/sQSpa/src/JlWrap/any.jl:240, in __call__(self, *args, **kwargs)
238 return ValueBase.__dir__(self) + self._jl_callmethod($(pyjl_methodnum(pyjlany_dir)))
239 def __call__(self, *args, **kwargs):
--> 240 return self._jl_callmethod($(pyjl_methodnum(pyjlany_call)), args, kwargs)
241 def __bool__(self):
242 return True
JuliaError: TaskFailedException
Stacktrace:
[1] wait
@ ./task.jl:352 [inlined]
[2] fetch
@ ./task.jl:372 [inlined]
[3] _main_search_loop!(state::SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Node{Float32}, Distributed.Future, Distributed.RemoteChannel}, datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, false, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}})
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:882
[4] _equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}, ropt::SymbolicRegression.SearchUtilsModule.RuntimeOptions{:multiprocessing, 1, true}, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, false, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, saved_state::Nothing)
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:599
[5] equation_search(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, @NamedTuple{}, Nothing, Nothing, Nothing, Nothing}}; niterations::Int64, options::Options{SymbolicRegression.CoreModule.OptionsStructModule.ComplexityMapping{Int64, Int64}, DynamicExpressions.OperatorEnumModule.OperatorEnum, Node, false, true, nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, parallelism::String, numprocs::Int64, procs::Nothing, addprocs_function::Nothing, heap_size_hint_in_bytes::Int64, runtests::Bool, saved_state::Nothing, return_state::Bool, verbosity::Int64, progress::Bool, v_dim_out::Val{1})
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:571
[6] equation_search
@ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:449 [inlined]
[7] #equation_search#26
@ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:412 [inlined]
[8] equation_search
@ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:360 [inlined]
[9] #equation_search#28
@ ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:442 [inlined]
[10] pyjlany_call(self::typeof(equation_search), args_::Py, kwargs_::Py)
@ PythonCall.JlWrap ~/.julia/packages/PythonCall/sQSpa/src/JlWrap/any.jl:40
[11] _pyjl_callmethod(f::Any, self_::Ptr{PythonCall.C.PyObject}, args_::Ptr{PythonCall.C.PyObject}, nargs::Int64)
@ PythonCall.JlWrap ~/.julia/packages/PythonCall/sQSpa/src/JlWrap/base.jl:73
[12] _pyjl_callmethod(o::Ptr{PythonCall.C.PyObject}, args::Ptr{PythonCall.C.PyObject})
@ PythonCall.JlWrap.Cjl ~/.julia/packages/PythonCall/sQSpa/src/JlWrap/C.jl:63
nested task error: Distributed.ProcessExitedException(14)
Stacktrace:
[1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
@ Base ./task.jl:931
[2] wait()
@ Base ./task.jl:995
[3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
@ Base ./condition.jl:130
[4] wait
@ ./condition.jl:125 [inlined]
[5] take_buffered(c::Channel{Any})
@ Base ./channels.jl:477
[6] take!(c::Channel{Any})
@ Base ./channels.jl:471
[7] take!(::Distributed.RemoteValue)
@ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:726
[8] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID; kwargs::@Kwargs{})
@ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:461
[9] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID)
@ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:454
[10] remotecall_fetch
@ ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:492 [inlined]
[11] call_on_owner
@ ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:565 [inlined]
[12] fetch(r::Distributed.Future)
@ Distributed ~/centripetalcatmull/julia_env/pyjuliapkg/install/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:619
[13] (::SymbolicRegression.var"#67#72"{SymbolicRegression.SearchUtilsModule.SearchState{Float32, Float32, Node{Float32}, Distributed.Future, Distributed.RemoteChannel}, Int64, Int64})()
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/9q4ZC/src/SymbolicRegression.jl:984
My program:
import numpy as np
import json
from pysr import PySRRegressor
with open('test2.json', 'r') as file:
data = json.load(file)
data_array = np.array(data)
print(len(data_array));
sample_size = min(460309, len(data_array))
print(sample_size);
random_indices = np.random.choice(len(data_array), size=sample_size, replace=False)
data_array = data_array[random_indices]
# Split the array into X and y
X = data_array[:, :5]
y = data_array[:, 5]
model = PySRRegressor(
multithreading=False,
procs=16,
populations=16,
population_size=50,
ncycles_per_iteration=1000,
niterations=10000000,
binary_operators=["-", "+", "*", "/", "^"],
unary_operators=[
"square",
],
nested_constraints={
"square": {"square": 1 }
},
constraints={
"^": (-1, 8),
},
elementwise_loss="""
loss(prediction, target) = try
trunc(Int, prediction) == trunc(Int, target) ? 0 : (prediction - target)^2
catch
(prediction - target)^2
end
""",
maxsize=40,
maxdepth=10,
complexity_of_constants=2,
batching=True,
batch_size=1000,
heap_size_hint_in_bytes=200000000,
bumper=True
)
model.fit(X, y)
(The ncycles_per_iteration is high because lower values had the head worker at 99%).
Since multithreading=True by default I've been using a single process this whole time it seems. As seen here:
Ah, the heap_size_hint_in_bytes
not affecting things makes much more sense now. The heap_size_hint_in_bytes
is only used when multithreading=False
.
This parameter is basically used to prevent OOM errors. When the memory usage gets close to the heap size hint, Julia's garbage collection gets much more aggressive. However, in PySR, this parameter only gets set in new processes. This is because Julia is not aware of the memory usage of other Julia processes, so it can help to set it in advance for newly spawned processes.
For a single process, it's usually not needed. I guess it is here because the garbage collection isn't working hard enough on WSL (?). Right now you can't set the heap size hint on the main Julia process if using PythonCall.jl. So, let me prioritise https://github.com/JuliaPy/PythonCall.jl/issues/546 and try to add this, and we can see if it helps at all.
Ok I made a PR to PythonCall.jl here: https://github.com/JuliaPy/PythonCall.jl/pull/547. Once that gets in you can see if that fixes it (will be available via the PYTHON_JULIACALL_HEAP_SIZE_HINT
environment variable; see https://julialang.org/blog/2023/04/julia-1.9-highlights/#memory_usage_hint_for_the_gc_with_--heap-size-hint)
What happened?
It seems like when letting PySR run forever after a while it gets an OOM error after a while, but only when using a large dataset. I can watch it steadily grow memory.
I used the following setup in each of my tests changing the sample_size 10000 value to a specific value or commenting the three lines out to run the whole dataset:
My dataset has 460K records which I know isn't advised, but it's for a niche problem. The memory issues appears to only happen when running on a dataset over a certain size.
I saw a comment about heap_size_hint_in_bytes needing to be set and I've played with values for it, but it doesn't appear to change the behavior. I've set it to 1.2 GB and also 200MB for instance. I've tried smaller batch sizes like 50 and it doesn't appear to change the behavior either. None of the other settings appear to change things either. I've tried procs=8 and smaller populations and population_size, and smaller ncycles_per_iteration.
100K random records then WSL starts at 3.85 GB. At 10 minutes 4GB. At 50 minutes 4.3 GB. At 1 hour 3.2 GB. At 1.5 hours 3 GBs. No issues.
200K random records then WSL starts at ~4GB. At 20 minutes 4.2GB, At 1 hour 3.6GB. At 2 hours 15 minutes 3.5GB. No issues.
300K random records then WSL starts at ~4.3GB. At 20 minutes 5.3GB. Grew to 5.9 GBs then dropped to 5.2GB at 30 minutes. 1 hour 6GB. 1 hour 15 minutes 5.4GB. 1 hour 30 minutes 4.7GB. 1 hour 32 minutes 4.4GB. No issues.
400k random records then WSL starts at ~4.2GB. 3 minutes 4.5GB. 8 hours 30 minutes 7.7GB.
460K then WSL starts at ~4.8GB. 2 minutes 5.2GB. 30 minutes 9.6GB, 40 minutes 12.3GB. 55 minutes 14.5 GB. 1 hour 15.1GB. 1 hour 10 minutes 15.4GB. 1 hour 19 minutes 15.5 GB. 1 hour 26 minutes OOM. I ran this also using:
Just to be sure and it failed at 1 hour 23 minutes so no difference.
I've attached my test2.json file. test2.json
Version
0.19.4
Operating System
Linux
Package Manager
pip
Interface
IPython Terminal
Relevant log output
No response
Extra Info
No response