MilesCranmer / PySR

High-Performance Symbolic Regression in Python and Julia
https://astroautomata.com/PySR
Apache License 2.0

Repeated CI failures on Windows #238

Closed · MilesCranmer closed this 6 months ago

MilesCranmer commented 1 year ago

Many of the Windows tests are now failing with various segmentation faults, which appear to be randomly triggered.

They seem to occur more frequently on older versions of Julia, and rarely on Julia 1.8.3. Regardless, a segfault anywhere is cause for concern and should be tracked down.

The errors include:

1. Early segmentation fault at the first run (Julia 1.6.7), segfaults during the noise test (Julia 1.6.7 and others), and segfaults during the warm-start test.

e.g., Windows:

```
D:\a\_temp\221410f9-8bf7-4099-901d-eb9813d86c45.sh: line 1:  1098 Segmentation fault      python -m pysr.test main
Started!
```
also occurs on Ubuntu sometimes:

```
signal (11): Segmentation fault
in expression starting at none:0
unknown function (ip: 0x7fd6a19bc215)
unknown function (ip: 0x7fd6a19947ff)
macro expansion at /home/runner/.julia/packages/PyCall/ygXW2/src/exception.jl:95 [inlined]
convert at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:94
pyjlwrap_getattr at /home/runner/.julia/packages/PyCall/ygXW2/src/pytype.jl:378
unknown function (ip: 0x7fd68d30b1bd)
unknown function (ip: 0x7fd6a19babda)
unknown function (ip: 0x7fd6a198e9d4)
pyisinstance at /home/runner/.julia/packages/PyCall/ygXW2/src/PyCall.jl:170 [inlined]
pysequence_query at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:752
pytype_query at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:773
pytype_query at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:806 [inlined]
convert at /home/runner/.julia/packages/PyCall/ygXW2/src/conversions.jl:831
julia_kwarg at /home/runner/.julia/packages/PyCall/ygXW2/src/callback.jl:19 [inlined]
#57 at ./none:0 [inlined]
iterate at ./generator.jl:47 [inlined]
collect_to! at ./array.jl:728
unknown function (ip: 0x7fd68d341d9a)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
collect_to! at ./array.jl:736
unknown function (ip: 0x7fd68d33e35a)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
collect_to! at ./array.jl:736
collect_to_with_first! at ./array.jl:706
unknown function (ip: 0x7fd68d33d775)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
collect at ./array.jl:687
unknown function (ip: 0x7fd68d33afb4)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
_pyjlwrap_call at /home/runner/.julia/packages/PyCall/ygXW2/src/callback.jl:31
unknown function (ip: 0x7fd68d3348d5)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
pyjlwrap_call at /home/runner/.julia/packages/PyCall/ygXW2/src/callback.jl:44
unknown function (ip: 0x7fd68d30aeee)
unknown function (ip: 0x7fd6a19980c7)
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:116 [inlined]
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:103 [inlined]
PyObject_Vectorcall at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:127 [inlined]
call_function at /home/runner/work/_temp/SourceCode/Python/ceval.c:5077 [inlined]
_PyEval_EvalFrameDefault at /home/runner/work/_temp/SourceCode/Python/ceval.c:3537
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a199a1e0)
unknown function (ip: 0x7fd6a19ed97b)
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a19ecdf6)
unknown function (ip: 0x7fd6a1998972)
unknown function (ip: 0x7fd6a199a1e0)
unknown function (ip: 0x7fd6a19ecb12)
unknown function (ip: 0x7fd6a1998972)
unknown function (ip: 0x7fd6a19ecdf6)
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a199a28d)
unknown function (ip: 0x7fd6a19ef9b1)
unknown function (ip: 0x7fd6a19ebbb7)
unknown function (ip: 0x7fd6a1997d4c)
unknown function (ip: 0x7fd6a1998f2b)
unknown function (ip: 0x7fd6a1a46421)
unknown function (ip: 0x7fd6a199802f)
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:116 [inlined]
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:103 [inlined]
PyObject_Vectorcall at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:127 [inlined]
call_function at /home/runner/work/_temp/SourceCode/Python/ceval.c:5077 [inlined]
_PyEval_EvalFrameDefault at /home/runner/work/_temp/SourceCode/Python/ceval.c:3520
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a199a28d)
unknown function (ip: 0x7fd6a19ef9b1)
unknown function (ip: 0x7fd6a19ebbb7)
unknown function (ip: 0x7fd6a1997d4c)
unknown function (ip: 0x7fd6a1998f2b)
unknown function (ip: 0x7fd6a1a46421)
unknown function (ip: 0x7fd6a199802f)
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:116 [inlined]
_PyObject_VectorcallTstate at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:103 [inlined]
PyObject_Vectorcall at /home/runner/work/_temp/SourceCode/./Include/cpython/abstract.h:127 [inlined]
call_function at /home/runner/work/_temp/SourceCode/Python/ceval.c:5077 [inlined]
_PyEval_EvalFrameDefault at /home/runner/work/_temp/SourceCode/Python/ceval.c:3520
unknown function (ip: 0x7fd6a1998972)
unknown function (ip: 0x7fd6a19ecdf6)
unknown function (ip: 0x7fd6a1998972)
unknown function (ip: 0x7fd6a19ecb12)
unknown function (ip: 0x7fd6a19ebbb7)
_PyEval_EvalCodeWithName at /home/runner/work/_temp/SourceCode/Python/ceval.c:4361
unknown function (ip: 0x7fd6a19eb876)
PyEval_EvalCode at /home/runner/work/_temp/SourceCode/Python/ceval.c:828
unknown function (ip: 0x7fd6a1a6399f)
cfunction_vectorcall_FASTCALL at /home/runner/work/_temp/SourceCode/Objects/methodobject.c:430
unknown function (ip: 0x7fd6a19ecb12)
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a19ecb12)
unknown function (ip: 0x7fd6a19ebbb7)
_PyFunction_Vectorcall at /home/runner/work/_temp/SourceCode/Objects/call.c:396
unknown function (ip: 0x7fd6a1a7fdd6)
unknown function (ip: 0x7fd6a1a7faae)
Py_BytesMain at /home/runner/work/_temp/SourceCode/Modules/main.c:731
unknown function (ip: 0x7fd6a1642d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at python (unknown line)
Allocations: 185387713 (Pool: 185351460; Big: 36253); GC: 470
/home/runner/work/_temp/bdd49862-48fd-4e82-bed8-685329606248.sh: line 1: 2324 Segmentation fault (core dumped) python -m pysr.test main
```
2. Git errors (Julia 1.8.2):
```
PyCall is installed and built successfully.
     Cloning git-repo `https://github.com/MilesCranmer/SymbolicRegression.jl`
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/runner/work/PySR/PySR/pysr/julia_helpers.py", line 87, in install
    _add_sr_to_julia_project(Main, io_arg)
  File "/Users/runner/work/PySR/PySR/pysr/julia_helpers.py", line 240, in _add_sr_to_julia_project
    Main.eval(f"Pkg.add([sr_spec, clustermanagers_spec], {io_arg})")
  File "/Users/runner/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/julia/core.py", line 627, in eval
    ans = self._call(src)
  File "/Users/runner/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/julia/core.py", line 555, in _call
    self.check_exception(src)
  File "/Users/runner/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/julia/core.py", line 609, in check_exception
    raise JuliaError(u'Exception \'{}\' occurred while calling julia code:\n{}'
julia.core.JuliaError: Exception 'failed to clone from https://github.com/MilesCranmer/SymbolicRegression.jl, error: GitError(Code:ERROR, Class:Net, SecureTransport error: connection closed via error)' occurred while calling julia code:
Pkg.add([sr_spec, clustermanagers_spec], io=stderr)
```
3. Access errors during the scikit-learn compatibility tests (these do not even fail the CI, which is a bit worrisome).

e.g.,

```
Failed check_fit2d_predict1d with:
    Traceback (most recent call last):
      File "D:\a\PySR\PySR\pysr\test\test.py", line 671, in test_scikit_learn_compatibility
        check(model)
      File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\sklearn\utils\_testing.py", line 188, in wrapper
        return fn(*args, **kwargs)
      File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\sklearn\utils\estimator_checks.py", line 1300, in check_fit2d_predict1d
        estimator.fit(X, y)
      File "D:\a\PySR\PySR\pysr\sr.py", line 1792, in fit
        self._run(X, y, mutated_params, weights=weights, seed=seed)
      File "D:\a\PySR\PySR\pysr\sr.py", line 1493, in _run
        Main = init_julia(self.julia_project, julia_kwargs=julia_kwargs)
      File "D:\a\PySR\PySR\pysr\julia_helpers.py", line 180, in init_julia
        Julia(**julia_kwargs)
      File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\julia\core.py", line 519, in __init__
        self._call("const PyCall = Base.require({0})".format(PYCALL_PKGID))
      File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\julia\core.py", line 554, in _call
        ans = self.api.jl_eval_string(src.encode('utf-8'))
    OSError: exception: access violation reading 0x000001BC1C501000
```
4. Torch errors.

One other curious thing is that this error is raised on some Windows tests (https://github.com/MilesCranmer/PySR/actions/runs/3664894286/jobs/6195713513), even though it should not take place there:

```
Run python -m pysr.test torch
D:\a\PySR\PySR\pysr\julia_helpers.py:139: UserWarning: `torch` was loaded before the Julia instance started. This may cause a segfault when running `PySRRegressor.fit`. To avoid this, please run `pysr.julia_helpers.init_julia()` *before* importing `torch`. For updates, see https://github.com/pytorch/pytorch/issues/78829
  warnings.warn(
D:\a\_temp\8727c9f4-d0f6-4345-84e6-e774762771ab.sh: line 1:   258 Segmentation fault      python -m pysr.test torch
Started!
```
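
For reference, a minimal sketch of the import order that warning asks for (this just writes out the warning's own recommendation; it assumes `pysr` and `torch` are installed):

```python
# Initialize the Julia runtime *before* importing torch, as the warning above suggests.
# `init_julia` is the helper referenced in the warning text.
from pysr.julia_helpers import init_julia

init_julia()  # start Julia first

import torch  # noqa: E402  (imported afterwards, to avoid the segfault described above)
```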
MilesCranmer commented 1 year ago

@mkitti, for error 3 in particular, do you have an idea of where I should look in PyJulia? It almost looks like Python garbage-collected the pointer to the Julia runtime, which is strange.

mkitti commented 1 year ago

What changed?

MilesCranmer commented 1 year ago

I have seen a few of these on and off for a while, especially on Windows. However, the rate has gone up recently. Perhaps this is because I have added more unit tests over time and tested more complex functionality (e.g., LoopVectorization.jl), so there is cumulatively a higher chance of each error occurring. I am really not sure what causes errors 1 and 3, though. Errors 2 and 4 seem tractable to debug but appear to be more related to CI than to the code itself, so I am mostly worried about 1 and 3.

MilesCranmer commented 1 year ago

I wonder if it has to do with the `_LIBJULIA` variable in PyJulia being cleaned up by the Python GC? https://github.com/JuliaPy/pyjulia/blob/1e3de7bbd27312f9abd200761a0c04a03c40a23d/src/julia/libjulia.py#L90-L94

`self.api` is set to the result of calling `get_libjulia`, which is defined at the link above and returns the module-level variable `_LIBJULIA`. However, that variable is not declared as `global` in that function; it is just referenced when the function is defined. I wonder if that is the source of the issue?

i.e., maybe the fix is

```diff
   def get_libjulia():
+      global _LIBJULIA
       return _LIBJULIA
```
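
For context, a self-contained sketch of the pattern under discussion (the module layout is simplified and illustrative rather than PyJulia's actual file; the function and variable names follow the linked source):

```python
# Illustrative module-level cache for the libjulia handle, with the proposed
# `global` declaration added.  Note that `global` only changes how *assignments*
# bind a name; a plain read of a module-level name already resolves against the
# module namespace, which is part of why this fix is speculative.
_LIBJULIA = None


def set_libjulia(api):
    # Assigning to a module-level name from inside a function does require `global`.
    global _LIBJULIA
    _LIBJULIA = api


def get_libjulia():
    # The proposed change: declare the name `global` here as well.
    global _LIBJULIA
    return _LIBJULIA
```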
MilesCranmer commented 1 year ago

Edit: it looks like the access error in particular was introduced between these two commits: https://github.com/MilesCranmer/PySR/compare/c97f60de90203bd5091c3f49e031f49b17a0c6fa..da0bef974b69dc9215a0986145c53f5f7f4462a9. Maybe it has to do with setting `optimize=3` on Julia?

MilesCranmer commented 1 year ago

Nope; neither the `optimize=2` nor the `global` change fixed it. Very confused...

It seems like the access errors first show up in `test_scikit_learn_compatibility`, which passes `PySRRegressor` through scikit-learn's internal estimator checks. I wonder if a recent change to that test suite is what suddenly caused this breakage in the Windows tests.

MilesCranmer commented 1 year ago

I can't reproduce the errors on a local copy of Windows (in Parallels) with Python 3.10 and Julia 1.8.3. I wonder if the GitHub Action is just running out of memory or something...

mkitti commented 1 year ago

Running out of memory would definitely put pressure on the garbage collector.

MilesCranmer commented 1 year ago

Indeed I think it is an overuse of memory from some sort of garbage not being properly collected from threads:

[Screenshot from 2022-12-20 at 6:58 PM]

I was launching searches repeatedly from IPython, and at one point there was 10 GB of RAM allocated. Even when I set `model = None`, none of the memory was cleared by the Python or Julia GCs, indicating that something is keeping it alive.

The short-term solution is to split the CI into separate launches of Python, so that memory is forced to clear after each group of tests (see the sketch below).
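
A hedged sketch of that short-term mitigation, assuming the test subsets are invoked the same way as in the logs above (`python -m pysr.test main` / `torch`); the loop and subset list are illustrative:

```python
# Run each test subset in its own Python process so that all memory (including
# the embedded Julia runtime) is released when the process exits.
import subprocess
import sys

for subset in ["main", "torch"]:  # illustrative subsets; only these two appear in this thread
    subprocess.run([sys.executable, "-m", "pysr.test", subset], check=True)
```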

The long-term solution is to debug exactly why memory is not being freed. Perhaps it has something to do with jobs being added to this list through the use of `@async` (https://github.com/MilesCranmer/SymbolicRegression.jl/blob/367d155f26c5a7f0faf26bf529b95f097f1f7f22/src/SymbolicRegression.jl#L652), and garbage then not being collected when the function exits?

MilesCranmer commented 1 year ago

Debugging list:

- [ ] Does the memory leak appear in Julia, or just PyJulia?
- [ ] Is the memory leak due to parallelism?
- [ ] Does the memory leak occur when running in serial mode?
- [ ] Does the memory leak occur when running until completion, rather than early stopping?
- [ ] How does the memory leak scale with # populations, dataset size, etc.?
- [ ] Does the memory leak appear only on some operating systems?
- [ ] Is the memory leak due to running everything directly on `Main` in PyJulia, rather than in a scope?
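
One way to probe the first item, sketched under the assumption that `psutil` is available as an extra dependency (`init_julia` returning `Main` matches the traceback earlier in this thread):

```python
# Measure process memory before and after forcing both garbage collectors, to see
# whether the growth can be reclaimed from the Python side, the Julia side, or neither.
import gc
import os

import psutil  # assumed extra dependency, used only to read RSS
from pysr.julia_helpers import init_julia

Main = init_julia()
rss_mb = lambda: psutil.Process(os.getpid()).memory_info().rss / 1e6

print(f"RSS before: {rss_mb():.0f} MB")
gc.collect()               # Python GC
Main.eval("GC.gc(true)")   # full Julia GC
print(f"RSS after:  {rss_mb():.0f} MB")
```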

Edit: seems like there isn't actually a memory leak; it's just the JIT cache.

MilesCranmer commented 1 year ago

Even just splitting it into 10 different subsets of tests seems to cause segfaults: https://github.com/MilesCranmer/PySR/actions/runs/3752052933.

MilesCranmer commented 1 year ago

Got some cloud compute to try to debug this. It looks like the test triggering the series of access violations is `TestPipeline.test_high_dim_selection_early_stop` in `test.py`. In particular, something in the second half of this test (the second `model.fit`) seems to trigger it:

https://github.com/MilesCranmer/PySR/blob/d04558686078f4b182ad01ca8fe589918883dab1/pysr/test/test.py#L300-L317
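(Not the linked test body; a rough sketch of the pattern that test exercises, i.e. two consecutive fits on an estimator configured to stop early, using parameter names from PySR's public API with illustrative values.)

```python
import numpy as np
from pysr import PySRRegressor

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X[:, 0] ** 2

model = PySRRegressor(
    niterations=30,
    early_stop_condition=1e-4,  # stop the search early once this loss is reached
)
model.fit(X, y)   # first search

y2 = X[:, 1] ** 2
model.fit(X, y2)  # the second fit, the part that seems to trigger the access violations
```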


Updates:

  1. Turned off `early_stop_condition`, and the bug went away. So perhaps stopping early triggers some sort of memory access bug (e.g., from threads which haven't completed yet?).
    • It looks like threads could continue to modify the contents of `returnPops` even after it has been returned to Python. Perhaps that is the issue (see the sketch after this list for an illustration of the failure mode).
    • This could be tested by checking whether the problem goes away when serial mode is used instead, or when `returnPops` stores an explicit copy of the populations.
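
As a concrete illustration of that hypothesis, here is a minimal Python analogy (not SymbolicRegression.jl code): a worker keeps appending to a container after the "search" has already returned it, so the caller observes it changing underneath them.

```python
import threading
import time


def search_with_early_stop():
    populations = []

    def worker():
        # Simulates a worker thread that keeps producing results.
        for i in range(1_000_000):
            populations.append(i)
            time.sleep(0.001)

    threading.Thread(target=worker, daemon=True).start()
    time.sleep(0.01)      # "early stop": return before the worker has finished
    return populations    # the caller now shares a still-mutating list


pops = search_with_early_stop()
n_at_return = len(pops)
time.sleep(0.1)
print(n_at_return, "->", len(pops))  # the length keeps changing after the return
```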
MilesCranmer commented 1 year ago

The poster in #266 confirmed that using multiprocessing got rid of their issue, so it seems like a data race. I wonder if this is because `EquationSearch` exits before some threads are finished, since there is no safe way to cancel threads, whereas for processes I simply call `rmprocs(procs)`: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/51d205c518eb3e99cfd45ac6a2d3dbbbd1944f32/src/SymbolicRegression.jl#L915

One possible solution is to implement a task handler that will safely kill tasks, as described here: https://discourse.julialang.org/t/how-to-kill-thread/34236/8 (see the sketch below for the general cooperative-cancellation pattern).
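
A minimal Python analogy (not the Julia implementation) of that cooperative-cancellation pattern: workers poll a shared flag and exit cleanly, and the caller waits for them instead of returning while they are still running.

```python
import threading
import time

stop_requested = threading.Event()


def worker():
    while not stop_requested.is_set():
        # ... do one unit of search work ...
        time.sleep(0.01)
    # the worker cleans up its own state before exiting


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

time.sleep(0.1)        # e.g., an early-stop condition fires
stop_requested.set()   # ask every worker to finish its current unit and exit
for t in threads:
    t.join()           # wait for workers rather than exiting while they still run
```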

MilesCranmer commented 6 months ago

Presumably fixed by #535