[BUG] OSError: exception: access violation reading

TadeuNP commented 1 year ago

PySR throws an "OSError: exception: access violation reading" error. It seems to occur often when fitting a model many times (tried with the exact same settings and data). Occurs both in Jupyter and when running from a Python file.

Visual Studio Code outputs the following:

Traceback (most recent call last):
  File "c:\Users\Tadeu\Desktop\pysr-access-violation.py", line 44, in <module>
    model.fit(X_train , dx , variable_names=["x", "y", "z"])
  File "C:\Users\Tadeu\anaconda3\envs\thesis\lib\site-packages\pysr\sr.py", line 1792, in fit
    self._run(X, y, mutated_params, weights=weights, seed=seed)
  File "C:\Users\Tadeu\anaconda3\envs\thesis\lib\site-packages\pysr\sr.py", line 1493, in _run
    Main = init_julia(self.julia_project, julia_kwargs=julia_kwargs)
  File "C:\Users\Tadeu\anaconda3\envs\thesis\lib\site-packages\pysr\julia_helpers.py", line 180, in init_julia
    Julia(**julia_kwargs)
  File "C:\Users\Tadeu\anaconda3\envs\thesis\lib\site-packages\julia\core.py", line 519, in __init__
    self._call("const PyCall = Base.require({0})".format(PYCALL_PKGID))
  File "C:\Users\Tadeu\anaconda3\envs\thesis\lib\site-packages\julia\core.py", line 554, in _call
    ans = self.api.jl_eval_string(src.encode('utf-8'))
OSError: exception: access violation reading 0x0000025A5A9D1000
Exception ignored in atexit callback: <_FuncPtr object at 0x0000025A5A83AF60>
OSError: exception: access violation reading 0x0000025A5A9D1000

Windows 11
Julia 1.8.3
Python 3.10.9
Installed with pip
PySR 0.11.14 (just updated from .11 in an attempt to fix this)

PySR settings and a minimal example:

import numpy as np
from scipy.integrate import odeint
from pysr import PySRRegressor

model = PySRRegressor(
    model_selection="best",
    niterations=30,
    population_size=90,
    binary_operators=["+", "*", "/", "-"],
    loss="loss(x, y) = (x - y)^2",
    warm_start=True,
)

# Goodwin oscillator: returns [dx/dt, dy/dt, dz/dt] for the state x = (x, y, z).
def goodwin(x, t, a1=5, a2=5, a3=5, c1=0.5, c2=0.5, c3=0.5, n=10, K=1):
    return [
        a1 * K**n / (K**n + x[2]**n) - c1 * x[0],
        a2 * x[0] - c2 * x[1],
        a3 * x[1] - c3 * x[2],
    ]

# Time grid: `t` was not defined in the snippet as posted; these values are an assumption.
t = np.linspace(0, 10, 1000)

initial_cond = np.random.uniform(0, 5, 3)
sol = odeint(goodwin, initial_cond, t)
x, y, z = sol[:, 0], sol[:, 1], sol[:, 2]
X_train = np.column_stack((x, y, z))
dx = goodwin((x, y, z), 0)[0]  # dx/dt evaluated along the trajectory

model.fit(X_train, dx, variable_names=["x", "y", "z"])

As far as I can tell, this is enough to reproduce the error. It occurs often enough that I usually have to restart Jupyter after fitting twice (this was not the case a month ago, for some reason).

Note: I thought this could be caused by the different datasets that were being fed to the model, but keeping the dataset fixed after the first run still leads to the bug. I also removed the division and multiplication operators, and it then managed to fit ~7 times before crashing, far more than the maximum of 2 that I was seeing with a larger pool of binary_operators.
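To put a number on "fits before crashing", a small harness along these lines could be used. This is only a sketch: `count_fits_until_crash` and `fake_fit` are hypothetical names, and `fake_fit` is a stand-in that fails on its third call so the helper can be run without PySR. It assumes the access violation surfaces as a catchable OSError, as in the traceback above; after a real access violation the Julia runtime may not be in a usable state, so the count is the only thing worth trusting.

```python
def count_fits_until_crash(fit_once, max_runs=10):
    """Call fit_once() repeatedly; return (successful_runs, error_or_None)."""
    for run in range(max_runs):
        try:
            fit_once()
        except OSError as err:  # the error type shown in the traceback above
            return run, err
    return max_runs, None


# Stand-in for:
#   lambda: model.fit(X_train, dx, variable_names=["x", "y", "z"])
# It raises on the third call purely for demonstration.
calls = {"n": 0}

def fake_fit():
    calls["n"] += 1
    if calls["n"] == 3:
        raise OSError("exception: access violation reading (simulated)")

runs, err = count_fits_until_crash(fake_fit)
print(f"{runs} successful fits before failure: {err}")
```

Swapping `fake_fit` for the real `model.fit` call would give the same kind of count for each operator set.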

Let me know if there is something else I need to provide. I will also try to run this same example on a different computer to see if I get similar behaviour.

MilesCranmer commented 1 year ago

Related to the third error I am seeing in the Windows tests: https://github.com/MilesCranmer/PySR/issues/238 (also posted here: https://github.com/JuliaLang/julia/issues/47957). I unfortunately don't have a Windows machine where I can replicate this so it's a bit difficult for me to debug, but I can offer some questions which will help me track it down:

Does your machine have low RAM by any chance?
Are you running it inside a VM?
Does this error still occur in multiprocessing mode, with multithreading=False, procs=procs (where procs is the number of processors you have)?
Does this error depend on how many procs you set? It could be a data race.
Does this error still occur in serial mode (multithreading=False, procs=0)?
If you run the pure-Julia example here: https://github.com/MilesCranmer/SymbolicRegression.jl/#quickstart, does the error still occur? If not, then it might be a PyJulia problem.
Does this error still occur if you do not import scipy and run odeint, but rather pre-compute that integral, and load it from a file? (sometimes Python libraries with C bindings can interfere).
Does the error frequency change if you pass julia_kwargs={"optimize": 0} to PySRRegressor?
Does the issue go away if you try PySR 0.10? I think this is where I noticed the access error in the tests.
You might also try Julia 1.8.5, though I'm not sure it will fix things.
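For reference, the three parallelism modes asked about above can be written down as keyword arguments. This is a sketch, not PySR code: `parallel_kwargs` is a hypothetical helper, the multiprocessing and serial entries follow the settings quoted in the questions, and the multithreading entry assumes that `multithreading=True` together with `procs` selects threaded mode.

```python
import os

def parallel_kwargs(mode, procs=None):
    """Return PySRRegressor keyword arguments for one of the three modes."""
    if procs is None:
        procs = os.cpu_count() or 1  # number of processors on this machine
    if mode == "multithreading":
        # Assumption: threaded mode, made explicit here.
        return {"multithreading": True, "procs": procs}
    if mode == "multiprocessing":
        return {"multithreading": False, "procs": procs}
    if mode == "serial":
        return {"multithreading": False, "procs": 0}
    raise ValueError(f"unknown mode: {mode!r}")

for mode in ("multithreading", "multiprocessing", "serial"):
    print(mode, parallel_kwargs(mode, procs=12))

# With PySR installed, these would be passed on, for example:
#   model = PySRRegressor(niterations=30, **parallel_kwargs("serial"))
```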

The more information the better – these questions will help me figure out where the problem could be lurking. Thanks! Miles

TadeuNP commented 1 year ago

Memory usage is at around 11 out of 16 GB available.
Not using a VM.
The pure Julia example works fine.
Skipping Scipy and loading from a file did not fix it.
julia_kwargs={"optimize": 0} did not fix it. As far as I can tell, it showed a similar error frequency as before.
Error seems to disappear when setting multithreading=False.
- It seems to work independently of the number of procs set. Using 0, 1 or 12 procs all worked.
I have not yet tried updating Julia or switching to PySR 0.10.

After running a bunch of successful tests with multithreading disabled, I decided to turn it on. To my surprise, it worked perfectly. Then I noticed a new warning message:

C:\Users\Tadeu\anaconda3\envs\thesis2\lib\site-packages\pysr\julia_helpers.py:217: UserWarning: Julia has already started. The new Julia options {'threads': 12} will be ignored.

I restarted the Jupyter kernel and attempted to fit a model twice, with multithreading enabled both times, and it failed.
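One plausible reading of that warning is that Julia can only be configured once per process, so a later request for threads is dropped and the search keeps running with the options from the first start. The toy model below illustrates that pattern only. `ToyRuntime` is a hypothetical class and is not how PySR or PyJulia are implemented.

```python
import warnings

class ToyRuntime:
    """Toy stand-in for an embedded runtime that can be started once per process."""
    _options = None  # options the runtime was actually started with

    @classmethod
    def init(cls, **options):
        if cls._options is None:
            cls._options = dict(options)  # first call: options take effect
        elif options != cls._options:
            warnings.warn(
                f"Runtime has already started. The new options {options} will be ignored."
            )
        return cls._options

first = ToyRuntime.init()  # like the first fit, with multithreading disabled
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    second = ToyRuntime.init(threads=12)  # like a later fit asking for threads
print(second, len(caught))
```

Under this reading, the "successful" multithreaded fit in the same kernel never actually ran with 12 threads, which would be consistent with the failure returning after a kernel restart.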

Thanks! Let me know if there are more tests I can run.

MilesCranmer commented 1 year ago

Awesome, thanks for answering those. So indeed using multiprocessing instead of multithreading seems like a good workaround for now (via multithreading=False). It’s actually just as fast, if not faster, than multithreading; it just takes a bit longer to start each search.

This is also very helpful for finding the bug, thanks. Because it only occurs for multithreading, but not multiprocessing, I think it is a data race issue. (Multiprocessing copies data between processes, whereas threads can access the same resources.) It’s interesting that it only seems to occur in a PyJulia context though…
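The shared-versus-copied distinction can be illustrated in plain Python. This is an analogy only and says nothing about where the race sits in Julia: a thread receives the same object as its caller, while data sent to another process is serialized and rebuilt as a copy (simulated here with a pickle round trip rather than a real child process).

```python
import pickle
import threading

def append_marker(buffer):
    buffer.append("written by worker")

# Thread: the worker gets the very same list object as the main thread.
shared = []
worker = threading.Thread(target=append_marker, args=(shared,))
worker.start()
worker.join()

# Process-style transfer: the worker would receive a serialized copy.
original = []
copy_for_process = pickle.loads(pickle.dumps(original))
append_marker(copy_for_process)

print(shared)    # the thread's write is visible to the main thread
print(original)  # the copy was modified, the original was not
```

With shared objects, two threads writing at once need synchronization to avoid a race; with copies, there is nothing to race on, which fits the observation that multiprocessing avoids the crash.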

MilesCranmer commented 6 months ago

Should be fixed on most recent version. Ping me if not!

MilesCranmer / PySR

[BUG] OSError: exception: access violation reading #266