MilesCranmer closed this issue 3 years ago
This should be completely occupied:
I think this might be from the head worker having too much work to do? I should add a statistic that says what proportion of time the head node is occupied.
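A minimal sketch of how such a statistic could be tracked, assuming the head node's work can be wrapped in a function call. The names `HeadOccupation`, `record_busy!`, and `occupation` are hypothetical and not part of SymbolicRegression.jl; the workload inside the loop is only a stand-in.

```julia
# Hypothetical helper (not part of SymbolicRegression.jl): tracks what fraction
# of wall-clock time the head node spends doing work versus waiting.
mutable struct HeadOccupation
    busy_seconds::Float64   # accumulated time spent processing results
    start_time::Float64     # when tracking began
end

HeadOccupation() = HeadOccupation(0.0, time())

# Run `f()` and count its duration as busy time.
function record_busy!(f, stats::HeadOccupation)
    t0 = time()
    result = f()
    stats.busy_seconds += time() - t0
    return result
end

# Fraction of elapsed wall-clock time that was spent busy, clamped to [0, 1].
function occupation(stats::HeadOccupation)
    elapsed = time() - stats.start_time
    return elapsed > 0 ? min(stats.busy_seconds / elapsed, 1.0) : 0.0
end

stats = HeadOccupation()
for _ in 1:5
    record_busy!(stats) do
        sum(abs2, rand(10_000))  # stand-in for merging a worker's result
    end
    sleep(0.01)                  # stand-in for waiting on the channel
end
println("Head worker occupation: ", round(100 * occupation(stats); digits=1), "%")
```

A value close to 100% would suggest the head node is the bottleneck; a low value would point to the workers themselves.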
The easiest way to debug whether processes are crashing before sending a result to the channel (the literal processes have not crashed; I can still see all of them if I run `ps -ef`) would be to run with `procs=0`. Here's an error log with `procs=0`:
ERROR: LoadError: DomainError with -Inf:
sin(x) is only defined for finite x.
Stacktrace:
[1] sin_domain_error(x::Float64)
@ Base.Math ./special/trig.jl:28
[2] sin(x::Float64)
@ Base.Math ./special/trig.jl:39
[3] multiply_powers(eqn::SymbolicUtils.Term{Number}, op::typeof(sin))
@ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:15
[4] multiply_powers(eqn::SymbolicUtils.Term{Number})
@ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:39
[5] multiply_powers(eqn::SymbolicUtils.Term{Number}, op::typeof(sin))
@ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:15
[6] multiply_powers(eqn::SymbolicUtils.Term{Number})
@ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:39
[7] multiply_powers(eqn::SymbolicUtils.Term{Number}, op::typeof(sin))
@ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:15
[8] multiply_powers(eqn::SymbolicUtils.Term{Number})
@ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:39
[9] custom_simplify(init_eqn::SymbolicUtils.Term{Number}, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-)}, Tuple{typeof(sin)}, L2DistLoss})
@ SymbolicRegression.../CustomSymbolicUtilsSimplification.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/CustomSymbolicUtilsSimplification.jl:140
[10] simplifyWithSymbolicUtils(tree::Node, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-)}, Tuple{typeof(sin)}, L2DistLoss}, curmaxsize::Int64)
@ SymbolicRegression.../SimplifyEquation.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/SimplifyEquation.jl:124
[11] macro expansion
@ ~/.julia/packages/SymbolicRegression/WvHtm/src/SingleIteration.jl:65 [inlined]
[12] macro expansion
@ ./simdloop.jl:77 [inlined]
[13] OptimizeAndSimplifyPopulation(dataset::SymbolicRegression.../Dataset.jl.Dataset{Float32}, baseline::Float32, pop::Population{Float32}, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-)}, Tuple{typeof(sin)}, L2DistLoss}, curmaxsize::Int64, record::Dict{String, Any})
@ SymbolicRegression.../SingleIteration.jl ~/.julia/packages/SymbolicRegression/WvHtm/src/SingleIteration.jl:62
[14] EquationSearch(datasets::Vector{SymbolicRegression.../Dataset.jl.Dataset{Float32}}; niterations::Int64, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-)}, Tuple{typeof(sin)}, L2DistLoss}, numprocs::Int64, procs::Nothing, runtests::Bool)
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/WvHtm/src/SymbolicRegression.jl:305
[15] EquationSearch(X::Matrix{Float32}, y::Matrix{Float32}; niterations::Int64, weights::Nothing, varMap::Vector{String}, options::Options{Tuple{typeof(+), typeof(*), typeof(/), typeof(-)}, Tuple{typeof(sin)}, L2DistLoss}, numprocs::Int64, procs::Nothing, runtests::Bool)
@ SymbolicRegression ~/.julia/packages/SymbolicRegression/WvHtm/src/SymbolicRegression.jl:144
[16] #EquationSearch#22
@ ~/.julia/packages/SymbolicRegression/WvHtm/src/SymbolicRegression.jl:156 [inlined]
[17] top-level scope
@ /tmp/tmpsw_2w8ef/runfile.jl:7
So it looks like the simplification algorithm is encountering infinities when it tries to reduce powers. I need to add manual infinity and NaN checks to `multiply_powers`, just like I have in the equation evaluation. This is probably the cause of the processes hanging.
Fixed this by adding manual infinity and NaN checks to `multiply_powers`.
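For illustration, a check of this kind might look roughly like the sketch below. This is not the actual change made to `multiply_powers`; `safe_apply` is a hypothetical helper that shows the general idea of testing a constant with `isfinite` before and after applying an operator, so that `sin(-Inf)` never gets called.

```julia
# Hypothetical helper: apply a unary operator to a constant during
# simplification, but bail out instead of throwing on non-finite values.
# Returns (value, ok); when ok is false the caller should skip the rewrite.
function safe_apply(op::F, x::Float64) where {F}
    isfinite(x) || return (NaN, false)   # catches Inf, -Inf, and NaN inputs
    y = op(x)
    isfinite(y) || return (NaN, false)   # catches overflow or NaN outputs
    return (y, true)
end

safe_apply(sin, -Inf)   # no DomainError; the constant is flagged as unusable
safe_apply(sin, 0.5)    # returns (sin(0.5), true)
safe_apply(exp, 1e6)    # exp overflows to Inf, so this is flagged too
```

Note that this only guards against non-finite values. A finite input outside an operator's domain, such as `log(-1.0)`, would still throw and needs its own check.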
Seems to be doing better:
Head worker load is very low, even for 128 worker processes: it is only occupied 5% of the time. But it is a good quantity to print so that one can increase `ncyclesperiteration` if it gets too high.
I have a suspicion that sometimes processes hang for long-running jobs, as the load decreases over time. Very infrequently, when running with `procs=0`, I will see errors such as `cos` receiving `Inf` as input, even though I have special checks for such behavior. It is possible that such a situation occurs for a long-running job, and the process crashes without stating anything.
Putting code within `try` in Julia slows it down quite a bit, so it is not practical to do error catching in that way.
Once #27 is implemented, perhaps one could monitor processes over time to see if they crash, and see what caused this to occur.
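One possible shape for that kind of monitoring, sketched with Julia's standard `Distributed` library rather than the package's own worker code: the `try` wraps only the `fetch` on the head node, so nothing is added to the inner loop running on the worker. The single local worker and the `cos(Inf)` call are stand-ins chosen to reproduce the error described above; how this would connect to #27 is an assumption.

```julia
using Distributed

# Assumption: one extra local worker stands in for a search worker.
worker_ids = addprocs(1)

# Stand-in for a unit of work that fails on a non-finite input.
future = remotecall(() -> cos(Inf), worker_ids[1])

# Only the fetch on the head node is wrapped in try/catch.
outcome = try
    fetch(future)
catch err
    err
end

# A failure on the worker surfaces here as a RemoteException that
# records which process failed and the original exception.
if outcome isa RemoteException
    println("Worker ", outcome.pid, " failed: ", outcome.captured.ex)
end

rmprocs(worker_ids)
```

If the worker process itself dies instead of throwing, `fetch` would raise a different exception (`ProcessExitedException`), which could be handled in the same `catch` branch.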