MilesCranmer / SymbolicRegression.jl

Distributed High-Performance Symbolic Regression in Julia
https://ai.damtp.cam.ac.uk/symbolicregression/
Apache License 2.0

Long-running parallel jobs have small percentage of processes hang #28

Closed by MilesCranmer 3 years ago

MilesCranmer commented 3 years ago

I suspect that some processes hang during long-running jobs, because the load decreases over time. Very infrequently, when running with procs=0, I see errors such as cos receiving Inf as input, even though I have special checks for that case. The same situation may occur in a long-running job, with the process crashing without reporting anything.
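The silent-failure hypothesis can be illustrated with a minimal sketch. It uses a local task and a Channel rather than real distributed workers, and the names `results` and `worker` are made up for illustration; it is not how SymbolicRegression.jl schedules its work.

```julia
# Hypothetical sketch: a worker task that errors before writing to its channel.
results = Channel{Float64}(1)

worker = @async begin
    x = sin(-Inf)        # throws DomainError, so the put! below never runs
    put!(results, x)
end

# Wait for the task to finish without rethrowing its exception.
while !istaskdone(worker)
    yield()
end

failed = istaskfailed(worker)   # the task ended with an exception, and nothing was printed
has_result = isready(results)   # no value was ever sent to the channel
```

A consumer blocked on `take!(results)` would wait indefinitely here, which from the outside looks like a hang rather than a crash.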

Wrapping code in try in Julia slows it down quite a bit, so catching errors that way is not practical.

Once #27 is implemented, perhaps one could monitor processes over time to see if they crash, and see what caused this to occur.

MilesCranmer commented 3 years ago

This should be completely occupied:

[Screenshot, 2021-06-04 1:05:47 PM]

I think this might be because the head worker has too much work to do. I should add a statistic that reports what proportion of the time the head node is occupied.
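One way such a statistic could be collected is sketched below. `OccupationTracker`, `track!`, and `occupation` are hypothetical names, not part of SymbolicRegression.jl; the idea is only to accumulate the time spent inside head-node work and divide it by the elapsed wall-clock time.

```julia
# Hypothetical helper (not part of SymbolicRegression.jl): tracks the fraction
# of wall-clock time the head node spends working rather than waiting.
mutable struct OccupationTracker
    busy::Float64     # seconds spent working so far
    start::Float64    # wall-clock time when tracking began
end

OccupationTracker() = OccupationTracker(0.0, time())

# Run `f` and add its elapsed time to the busy total.
function track!(f, tracker::OccupationTracker)
    t0 = time()
    result = f()
    tracker.busy += time() - t0
    return result
end

# Fraction of elapsed wall-clock time spent busy.
function occupation(tracker::OccupationTracker)
    total = time() - tracker.start
    return total > 0 ? tracker.busy / total : 0.0
end

tracker = OccupationTracker()
track!(tracker) do
    sum(rand(1000))      # stand-in for processing one worker's result
end
sleep(0.1)               # stand-in for idle time spent waiting on workers
frac = occupation(tracker)
```

Printing `frac` periodically would show whether the head node is becoming a bottleneck as the number of workers grows.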

The easiest way to check whether workers are failing before they send a result to the channel would be to run with procs=0. (The OS processes themselves have not crashed: I can still see all of them if I run ps -ef.) Here's an error log with procs=0:

ERROR: LoadError: DomainError with -Inf:
sin(x) is only defined for finite x.
Stacktrace:
  [1] sin_domain_error(x::Float64)
    @ Base.Math ./special/trig.jl:28
  [2] sin(x::Float64)
    @ Base.Math ./special/trig.jl:39
  [3] multiply_powers(eqn::SymbolicUtils.Term{Number}, op::typeof(sin))
    @ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:15
  [4] multiply_powers(eqn::SymbolicUtils.Term{Number})
    @ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:39
  [5] multiply_powers(eqn::SymbolicUtils.Term{Number}, op::typeof(sin))
    @ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:15
  [6] multiply_powers(eqn::SymbolicUtils.Term{Number})
    @ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:39
  [7] multiply_powers(eqn::SymbolicUtils.Term{Number}, op::typeof(sin))
    @ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:15
  [8] multiply_powers(eqn::SymbolicUtils.Term{Number})
    @ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:39
  [9] custom_simplify(init_eqn::SymbolicUtils.Term{Number}, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-)}, Tuple{typeof(sin)}, L2DistLoss})
    @ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:140
 [10] simplifyWithSymbolicUtils(tree::Node, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-)}, Tuple{typeof(sin)}, L2DistLoss}, curmaxsize::Int64)
    @ SymbolicRegression.../SimplifyEquation.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/SimplifyEquation.jl:124
 [11] macro expansion
    @ ~/.julia/packages/SymbolicRegression/WvHtm/src/SingleIteration.jl:65 [inlined]
 [12] macro expansion
    @ ./simdloop.jl:77 [inlined]
 [13] OptimizeAndSimplifyPopulation(dataset::SymbolicRegression.../Dataset.jl.Dataset{Float32}, baseline::Float32, pop::Population{Float32}, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-)}, Tuple{typeof(sin)}, L2DistLoss}, curmaxsize::Int64, record::Dict{String, Any})
    @ SymbolicRegression.../SingleIteration.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/SingleIteration.jl:62
 [14] EquationSearch(datasets::Vector{SymbolicRegression.../Dataset.jl.Dataset{Float32}}; niterations::Int64, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-)}, Tuple{typeof(sin)}, L2DistLoss}, numprocs::Int64, procs::Nothing, runtests::Bool)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/WvHtm/src/SymbolicRegression.jl:305
 [15] EquationSearch(X::Matrix{Float32}, y::Matrix{Float32}; niterations::Int64, weights::Nothing, varMap::Vector{String}, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-)}, Tuple{typeof(sin)}, L2DistLoss}, numprocs::Int64, procs::Nothing, runtests::Bool)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/WvHtm/src/SymbolicRegression.jl:144
 [16] #EquationSearch#22
    @ ~/.julia/packages/SymbolicRegression/WvHtm/src/SymbolicRegression.jl:156 [inlined]
 [17] top-level scope
    @ /tmp/tmpsw_2w8ef/runfile.jl:7

MilesCranmer commented 3 years ago

It looks like the simplification algorithm encounters infinities when it tries to reduce powers, so I need to add manual infinity and NaN checks to multiply_powers, just like the ones in the equation evaluation. This is probably the cause of the processes hanging.
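A guard of this kind might look like the following sketch. `safe_apply` is a hypothetical helper, not the actual multiply_powers code; it only shows the pattern of testing with `isfinite` before and after calling an operator, which avoids both the DomainError and a try block.

```julia
# Hypothetical guard (not the actual multiply_powers code): apply a unary
# operator only when its input is finite, and flag failure otherwise.
function safe_apply(op::Function, x::Real)
    isfinite(x) || return (x, false)      # Inf, -Inf, or NaN: skip the call
    y = op(x)
    return isfinite(y) ? (y, true) : (x, false)
end

val, ok = safe_apply(sin, -Inf)    # ok is false and no DomainError is thrown
val2, ok2 = safe_apply(sin, 0.0)   # ok2 is true
```

A caller can then skip the simplification step when the flag is false instead of letting the exception propagate.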

MilesCranmer commented 3 years ago

Fixed this by adding manual infinity and NaN checks to multiply_powers.

Seems to be doing better:

[Screenshot, 2021-06-04 6:51:14 PM]

Head worker load is very low, even for 128 worker processes: it is only occupied 5% of the time. Still, it is a good quantity to print, so that one can increase ncyclesperiteration if it gets too high.