ProjectTorreyPines / FUSE.jl

FUsion Synthesis Engine
https://fuse.help/
Apache License 2.0

Workers out of memory during optimization runs #387

Closed daveweisberg closed 11 months ago

daveweisberg commented 1 year ago

I have been seeing `srun: error: ... Out Of Memory` failures on FUSE optimization runs that last longer than ~60 iterations. I don't understand why a worker node would be running out of memory, but we need to identify which process is causing this. It is preventing me from efficiently completing optimization runs that require hundreds of iterations.

The following error corresponds to the optimization run located here:

/fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop48_HTS0.1_qpol2.75_A3.5_Zeff2.0/

where the last saved run folder is from iteration 56.

Error:

Iteration  30%|████████▍                   |  ETA: 6:51:07 ( 3.26  m/it) Worker 42 terminated.
srun: error: omega26: task 40: Out Of Memory
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:945
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:953
 [3] unsafe_read
   @ ./io.jl:759 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:758
 [5] read!
   @ ./io.jl:760 [inlined]
 [6] deserialize_hdr_raw
   @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:172
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133
 [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed ./task.jl:484
orso82 commented 1 year ago

The new HFSsizing actor is using a lot more memory than it used to... it now has the largest memory footprint of all actors:

See for example https://github.com/ProjectTorreyPines/FUSE.jl/actions/runs/5742584439/job/15564988399

image

This likely points to some type instability (which would also make it slow).

daveweisberg commented 1 year ago

Ok, but in general, why does the allocated memory increase with the number of iterations? Shouldn't we stop this ratcheting effect somehow? Maybe by resetting all variables after each iteration?

smithsp commented 1 year ago

Are you requesting enough memory in the first place?

daveweisberg commented 1 year ago

Some more info. Compare these FUSE.timing outputs from the same Julia process (1218946) at the 1st, 2nd, and 56th iterations. By the last iteration, the total allocations are 1.37 TiB.

1st iteration (includes a lot of pre-compiling that is not repeated in subsequent iterations) Screen Shot 2023-08-09 at 12 00 59 PM

2nd iteration Screen Shot 2023-08-09 at 12 01 39 PM

56th iteration Screen Shot 2023-08-09 at 11 57 36 AM

bclyons12 commented 1 year ago

How about the 28th iteration, if you have it? This looks like a memory increase of 20-30 GB per iteration. Some sort of memory leak not captured by the garbage collector?

daveweisberg commented 1 year ago

28th iteration:

Screen Shot 2023-08-09 at 1 59 34 PM

orso82 commented 1 year ago

Indeed, some memory does not get freed. However, the memory usage within each actor does not seem to be growing.

daveweisberg commented 1 year ago

Note that in this case (which is typical for my optimization runs), there are 256 tasks being run in parallel on 4 Omega worker nodes. So whatever memory leakage is indicated in the above timing.txt files, multiply that by 256/4 = 64 to get the total amount for each node.

bclyons12 commented 1 year ago

I see some issues online related to garbage collection and multithreading, though those issues appear closed; for example, https://github.com/JuliaLang/julia/issues/31923. Perhaps it would be worth trying a single thread on a single core to see what the performance is. It should only take a few iterations to see whether the memory footprint is growing.

orso82 commented 1 year ago

@daveweisberg see 75089bc

I see that you are now writing a timing.txt file for each of the optimization outputs. Could you please add a memory.txt file that contains the output of the following function:

FUSE.varinfo(FUSE, all=true, imported=true, recursive=true, sortby=:size, minsize=1024)

This will contain output similar to the following (here shown with recursive=false), which we can then use to track down the memory issue that you are seeing:

| name                                 |        size | summary                                               |
|:------------------------------------ | -----------:|:----------------------------------------------------- |
| Interpolations                       |  20.557 MiB | Module                                                |
| Plots                                |  14.332 MiB | Module                                                |
| DataFrames                           |  13.287 MiB | Module                                                |
| FUSE                                 |  11.432 MiB | Module                                                |
| IMAS                                 |   9.902 MiB | Module                                                |
| OrderedCollections                   |   9.825 MiB | Module                                                |
| ForwardDiff                          |   9.316 MiB | Module                                                |
| LibGEOS                              |   9.124 MiB | Module                                                |
| SpecialFunctions                     |   9.111 MiB | Module                                                |
| CSV                                  |   9.036 MiB | Module                                                |
| Optim                                |   8.833 MiB | Module                                                |
| TEQUILA                              |   8.803 MiB | Module                                                |
| Random                               |   8.543 MiB | Module                                                |
| FiniteElementHermite                 |   8.507 MiB | Module                                                |
| Distributed                          |   8.477 MiB | Module                                                |
| GeoInterface                         |   8.475 MiB | Module                                                |
| Dates                                |   8.463 MiB | Module                                                |
| FileIO                               |   8.427 MiB | Module                                                |
| MXHEquilibrium                       |   8.360 MiB | Module                                                |
| JSON                                 |   8.206 MiB | Module                                                |
| AbstractTrees                        |   8.109 MiB | Module                                                |
| QuadGK                               |   8.099 MiB | Module                                                |
| TimerOutputs                         |   8.073 MiB | Module                                                |
| VacuumFields                         |   8.056 MiB | Module                                                |
| JpegTurbo                            |   8.004 MiB | Module                                                |
| NLsolve                              |   7.962 MiB | Module                                                |
| QED                                  |   7.911 MiB | Module                                                |
| LoggingExtras                        |   7.899 MiB | Module                                                |
| TGLFNN                               |   7.885 MiB | Module                                                |
| Logging                              |   7.874 MiB | Module                                                |
| BoundaryPlasmaModels                 |   7.872 MiB | Module                                                |
| EPEDNN                               |   7.867 MiB | Module                                                |
| TAUENN                               |   7.867 MiB | Module                                                |
| ClusterManagers                      |   7.863 MiB | Module                                                |
| CHEASE                               |   7.839 MiB | Module                                                |
| PolygonOps                           |   7.831 MiB | Module                                                |
| NNeutronics                          |   7.827 MiB | Module                                                |
| EFIT                                 |   7.823 MiB | Module                                                |
| elements                             | 331.727 KiB | PeriodicTable.Elements                                |
| ##meta#58                            | 279.988 KiB | IdDict{Any, Any} with 175 entries                     |
| memory                               | 123.646 KiB | FUSE.Memory                                           |
| timer                                |  33.063 KiB | TimerOutputs.TimerOutput                              |
| ##load_inflation_rate_memoized_cache |  15.069 KiB | IdDict{Any, Any} with 1 entry                         |
| SnoopPrecompile                      |  14.213 KiB | Module                                                |
| Memoize                              |  12.317 KiB | Module                                                |
| ConstraintFunctionsLibrary           |   1.964 KiB | Dict{Symbol, FUSE.ConstraintFunction} with 15 entries |
| ObjectiveFunctionsLibrary            |   1.720 KiB | Dict{Symbol, FUSE.ObjectiveFunction} with 11 entries  |

NOTE: FUSE.varinfo() behaves differently from the varinfo() available in InteractiveUtils. I found that the recursive=true option of InteractiveUtils.varinfo() does not work as advertised. @sjkelly maybe a bug in the JuliaLang source?
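
For reference, a minimal sketch of dumping this output to a memory.txt file in each run folder (the file name and the open/println pattern here are just illustrative, not an existing FUSE helper):

open("memory.txt", "w") do io
    # write the FUSE.varinfo summary table for later inspection
    println(io, FUSE.varinfo(FUSE, all=true, imported=true, recursive=true, sortby=:size, minsize=1024))
end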

You can now also do the following to see how memory gets used by your Julia process:

"""
    plot_memory(memory::Memory, n_big_jumps::Int=5, ignore_first_seconds::Int=0)

Plot the memory usage over time from a `Memory` object. 

# Arguments
- `n_big_jumps`: number of significant memory jumps to highlight in the plot.
- `ignore_first_seconds`: number of initial seconds to ignore in the plot. Memory usage will be plotted relative to the memory after this cutoff. Default is `0` (no seconds are ignored).

# Returns
A plot with the following characteristics:
- Time is displayed on the x-axis as delta seconds since the first recorded time (after ignoring the specified initial seconds).
- Memory usage is displayed on the y-axis in MB.
- Memory usage is shown as a scatter plot.
- The `n_big_jumps` largest jumps in memory are highlighted in red with annotations indicating the action causing each jump.
"""
plot(FUSE.memory, 5, 0)
image

I have not tried, but I suspect that you might be able to gather the memory history from each process and plot them this way:

using Distributed
p = plot(FUSE.memory, 5) #Main process
memories = [@fetchfrom worker FUSE.memory for worker in workers()]
for memory in memories
    plot!(memory, 5)
end
display(p)

daveweisberg commented 1 year ago

@orso82 Sorry but I don't have the time to help diagnose this issue. Can you please recreate this using my established optimization workflow?

orso82 commented 1 year ago

Ok. Let's say whoever gets there first.

daveweisberg commented 1 year ago

@orso82 New runs with the varinfo.txt logging are available on Omega here:

/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs
/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs2
/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs3

These runs each have 60 generations, and completed without any memory errors.

An ongoing run with 180 generations can be found here:

/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs4

daveweisberg commented 1 year ago

@orso82 Update: The 180-generation run errored out 64% of the way to completion. I've moved the run directory to this location:

/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs_fail

Error messages:

Iteration  64%|█████████████████▉          |  ETA: 4:06:56 ( 3.80  m/it) Worker 246 terminated.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:945
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:953
 [3] unsafe_read
   @ ./io.jl:759 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:758
 [5] read!
   @ ./io.jl:760 [inlined]
 [6] deserialize_hdr_raw
   @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:172
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133
 [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed ./task.jl:484
srun: error: omega29: task 244: Out Of Memory
Worker 116 terminated.
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base ./stream.jl:410
  [2] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:947
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:953
  [4] unsafe_read
    @ ./io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:758
  [6] read!
    @ ./io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed ./task.jl:484
Worker 19 terminated.
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base ./stream.jl:410
  [2] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:947
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:953
  [4] unsafe_read
    @ ./io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:758
  [6] read!
    @ ./io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed ./task.jl:484

ProcessExitedException(19)
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:871
  [2] wait()
    @ Base ./task.jl:931
  [3] wait(c::Base.GenericCondition{ReentrantLock})
    @ Base ./condition.jl:124
  [4] take_buffered(c::Channel{Any})
    @ Base ./channels.jl:416
  [5] take!(c::Channel{Any})
    @ Base ./channels.jl:410
  [6] take!(::Distributed.RemoteValue)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:726
  [7] remotecall_fetch(f::Function, w::Distributed.Worker, args::Vector{Float64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:461
  [8] remotecall_fetch(f::Function, w::Distributed.Worker, args::Vector{Float64})
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:454
  [9] remotecall_fetch(f::Function, id::Int64, args::Vector{Float64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:492
 [10] remotecall_fetch
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:492 [inlined]
 [11] remotecall_pool(rc_f::typeof(remotecall_fetch), f::Function, pool::WorkerPool, args::Vector{Float64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:123
 [12] remotecall_pool
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:120 [inlined]
 [13] #remotecall_fetch#200
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:229 [inlined]
 [14] remotecall_fetch
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:229 [inlined]
 [15] #208#209
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:274 [inlined]
 [16] #208
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:274 [inlined]
 [17] (::Base.var"#929#934"{Distributed.var"#208#210"{Distributed.var"#208#209#211"{WorkerPool, FUSE.var"#741#743"{FUSE.ParametersInits{Float64}, FUSE.ParametersActors{Float64}, typeof(my_workflow), Vector{FUSE.ObjectiveFunction}, Vector{FUSE.ConstraintFunction}, String, Bool, ProgressMeter.Progress}}}})(r::Base.RefValue{Any}, args::Tuple{Vector{Float64}})
    @ Base ./asyncmap.jl:100
 [18] macro expansion
    @ ./asyncmap.jl:234 [inlined]
 [19] (::Base.var"#945#946"{Base.var"#929#934"{Distributed.var"#208#210"{Distributed.var"#208#209#211"{WorkerPool, FUSE.var"#741#743"{FUSE.ParametersInits{Float64}, FUSE.ParametersActors{Float64}, typeof(my_workflow), Vector{FUSE.ObjectiveFunction}, Vector{FUSE.ConstraintFunction}, String, Bool, ProgressMeter.Progress}}}}, Channel{Any}, Nothing})()
    @ Base ./task.jl:484

Stacktrace:
  [1] (::Base.var"#939#941")(x::Task)
    @ Base ./asyncmap.jl:177
  [2] foreach(f::Base.var"#939#941", itr::Vector{Any})
    @ Base ./abstractarray.jl:2774
  [3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::Vector{Vector{Float64}})
    @ Base ./asyncmap.jl:177
  [4] wrap_n_exec_twice
    @ ./asyncmap.jl:153 [inlined]
  [5] #async_usemap#924
    @ ./asyncmap.jl:103 [inlined]
  [6] #asyncmap#923
    @ ./asyncmap.jl:81 [inlined]
  [7] pmap(f::Function, p::WorkerPool, c::Vector{Vector{Float64}}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::Vector{Any}, retry_check::Nothing)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:126
  [8] pmap
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:99 [inlined]
  [9] #pmap#233
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:156 [inlined]
 [10] pmap
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:156 [inlined]
 [11] optimization_engine(ini::FUSE.ParametersInits{Float64}, act::FUSE.ParametersActors{Float64}, actor_or_workflow::typeof(my_workflow), X::Matrix{Float64}, objectives_functions::Vector{FUSE.ObjectiveFunction}, constraints_functions::Vector{FUSE.ConstraintFunction}, save_folder::String, save_dd::Bool, p::ProgressMeter.Progress)
    @ FUSE /fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/src/optimization.jl:266
 [12] (::FUSE.var"#748#751"{String, Bool, FUSE.ParametersInits{Float64}, FUSE.ParametersActors{Float64}, typeof(my_workflow), Vector{FUSE.ObjectiveFunction}, Vector{FUSE.ConstraintFunction}, ProgressMeter.Progress})(X::Matrix{Float64})
    @ FUSE /fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/src/workflows/optimization_workflow.jl:101
 [13] create_solutions(X::Matrix{Float64}, problem::Metaheuristics.Problem{SearchSpaces.BoxConstrainedSpace{Float64}}; ε::Float64)
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/solutions/individual.jl:327
 [14] create_solutions
    @ ~/.julia/packages/Metaheuristics/oeGM5/src/solutions/individual.jl:323 [inlined]
 [15] update_state!(::Metaheuristics.State{Metaheuristics.xFgh_solution{Vector{Float64}}}, ::Metaheuristics.SPEA2, ::Metaheuristics.Problem{SearchSpaces.BoxConstrainedSpace{Float64}}, ::Metaheuristics.Information, ::Metaheuristics.Options; kargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/algorithms/SPEA2/SPEA2.jl:95
 [16] update_state!
    @ ~/.julia/packages/Metaheuristics/oeGM5/src/algorithms/SPEA2/SPEA2.jl:79 [inlined]
 [17] _during_optimization!(status::Metaheuristics.State{Metaheuristics.xFgh_solution{Vector{Float64}}}, parameters::Metaheuristics.SPEA2, problem::Metaheuristics.Problem{SearchSpaces.BoxConstrainedSpace{Float64}}, information::Metaheuristics.Information, options::Metaheuristics.Options, convergence::Vector{Metaheuristics.State{Metaheuristics.xFgh_solution{Vector{Float64}}}}, logger::Metaheuristics.var"#154#156", termination::Metaheuristics.Termination)
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/optimize/during.jl:5
 [18] optimize(f::Function, search_space::LinearAlgebra.Adjoint{Float64, Matrix{Float64}}, method::Metaheuristics.Algorithm{Metaheuristics.SPEA2}; logger::Function)
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/optimize/optimize.jl:52
 [19] optimize(f::Function, search_space::LinearAlgebra.Adjoint{Float64, Matrix{Float64}}, method::Metaheuristics.Algorithm{Metaheuristics.SPEA2})
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/optimize/optimize.jl:42
 [20] macro expansion
    @ ./timing.jl:262 [inlined]
 [21] workflow_multiobjective_optimization(ini::FUSE.ParametersInits{Float64}, act::FUSE.ParametersActors{Float64}, actor_or_workflow::typeof(my_workflow), objectives_functions::Vector{FUSE.ObjectiveFunction}, constraints_functions::Vector{FUSE.ConstraintFunction}; exploitation_vs_exploration::Float64, N::Int64, iterations::Int64, continue_state::Metaheuristics.State{Metaheuristics.xFgh_solution{Vector{Float64}}}, save_folder::String, save_dd::Bool)
    @ FUSE /fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/src/workflows/optimization_workflow.jl:101
 [22] top-level scope
    @ In[13]:22
orso82 commented 1 year ago

@daveweisberg pointed me to this memtrace: /fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/exploratory_new/opt_betaN_cost__Solovev_Kr_flattop48_HTS0.1_A3.5_B6.0/opt_runs2/99__2023-09-19T10:02:53.638__1294740/memtrace.txt

I did some analysis on it and I think the memory leak is occurring in Solovev

image

@lstagner could you please take a look and see if there's anything obvious? I have been burned by Memoize myself in the past... perhaps this is the case again?

lstagner commented 1 year ago

I made sure all uses of Memoize limit the number of instances they can save at once. If you think it's caused by Memoize, you can use MXHEquilibrium.set_cache_size!(0) to set the cache size to zero. The default cache size is five.

orso82 commented 1 year ago

OK, thank you for the good insight @lstagner !

@daveweisberg the next time you run, can you please try to set FUSE.MXHEquilibrium.set_cache_size!(0) and point me again to the folder where you did the run?

lstagner commented 1 year ago

Another thing to try is to periodically call GC.gc() and ccall(:malloc_trim, Cvoid, (Cint,), 0) to run the garbage collector and to tell Linux to free up memory the GC misses.

daveweisberg commented 1 year ago

@orso82 I've finished a new optimization run with FUSE.MXHEquilibrium.set_cache_size!(0).

It can be found here: /fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/study_1/opt_betaN_cost__Solovev_flattop72/opt_runs1/.

orso82 commented 1 year ago

Similar growth, but with an offset, which makes sense, since before we had a finite cache size of 5. So, while this reduces the memory usage, it does not solve the problem.

image

Next, as @lstagner suggested, could you please try to add

GC.gc()
ccall(:malloc_trim, Cvoid, (Cint,), 0)

at the bottom of your workflow function that you feed to the FUSE.workflow_multiobjective_optimization() function?
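
For concreteness, a minimal sketch of what that could look like (the workflow name, signature, and body here are illustrative placeholders, not the actual function used in these runs):

function my_workflow(dd, ini, act)
    # ... run the actors for this optimization case (e.g. ActorWholeFacility) ...
    # at the very end: force a full collection and ask glibc to return freed pages to the OS
    GC.gc()
    ccall(:malloc_trim, Cvoid, (Cint,), 0)
    return dd
end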

daveweisberg commented 1 year ago

@orso82 New run directories with both of Luke's suggestions implemented can be found here:

/fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/study_2/opt_betaN_cost__flattop72_A3.5_H1.0/opt_runs1/

orso82 commented 1 year ago

Better, but it's still increasing!

image
orso82 commented 1 year ago

@daveweisberg can you please try to set up a run where you do not execute ActorWholeFacility, and just do an init(dd, ini, act)? This will run ActorSolovev and ActorCXbuild (and a few other trivial actors), but none of the heavier actors. This would help rule out places where the issue might be.
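
A rough sketch of such a stripped-down workflow (the function name and signature are illustrative; the point is to call only init):

function init_only_workflow(dd, ini, act)
    # initialization only: runs ActorSolovev, ActorCXbuild, and a few other light actors,
    # while skipping ActorWholeFacility and the heavier actors
    FUSE.init(dd, ini, act)
    return dd
end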

TimSlendebroek commented 1 year ago

I have another memtrace showing similar behavior, from a TEQUILA & FluxMatcher run on SAGA.

Here is the memtrace of the last case that was run:

memtrace.txt

TimSlendebroek commented 1 year ago

Plot of the memtrace:

image

TimSlendebroek commented 1 year ago

> @daveweisberg pointed me to this memtrace: /fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/exploratory_new/opt_betaN_cost__Solovev_Kr_flattop48_HTS0.1_A3.5_B6.0/opt_runs2/99__2023-09-19T10:02:53.638__1294740/memtrace.txt
>
> I did some analysis on it and I think the memory leak is occurring in Solovev
>
> image
>
> @lstagner could you please take a look and see if there's anything obvious? I have been burned by Memoize myself in the past... perhaps this is the case again?

I am not sure it is only Solovev, as I see the same problem when running with TEQUILA (although of course it could be both).

TimSlendebroek commented 1 year ago

image

My conclusion is that the memory leakage relates to the garbage collector being unable to do its job inside a pmap while also being inside a try-catch structure... see https://github.com/ProjectTorreyPines/FUSE.jl/commit/5d8df73ae410fe239f7354cd2e0d37327e798248 for the fix.
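
Illustratively, the pattern being described could look something like this (a sketch only, not the contents of commit 5d8df73; cases, run_case, and extract_outputs are hypothetical placeholders):

using Distributed

results = pmap(cases) do case
    dd = nothing
    try
        dd = run_case(case)          # heavy per-case work producing a large dd
        return extract_outputs(dd)   # keep only the small results
    catch e
        @warn "case failed" exception = e
        return nothing
    finally
        dd = nothing                 # drop the reference to the large object...
        GC.gc()                      # ...so the GC can reclaim it before the next case
    end
end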

orso82 commented 1 year ago

help?> GC.gc
  GC.gc([full=true])

  Perform garbage collection. The argument full determines the kind of collection: A full collection (default) sweeps all objects, which
  makes the next GC scan much slower, while an incremental collection may only sweep so-called young objects.

  │ Warning
  │
  │  Excessive use will likely lead to poor performance.

@TimSlendebroek can you please try calling GC.gc(false) and see if this still works?

orso82 commented 11 months ago

The issue has gotten better, but it is not fully resolved. It would be great to have @mbolognaJH and @sjkelly take a look.

sjkelly commented 11 months ago

This flag can be used to force Julia to collect at a greater interval:

 --heap-size-hint=<size>    Forces garbage collection if memory usage is higher than that value.

TimSlendebroek commented 11 months ago

I'll add the minimal working example, but also the optimization case, for which memory goes up directly with the number of dd's processed.

orso82 commented 11 months ago

--heap-size-hint can be very handy. Set it to 1/2 or 1/4 of the system's memory. NOTE: since we have many Julia processes on each node, perhaps we need to set it to ~1/100th of the memory.

How do we pass heap-size-hint to the worker nodes? @mbolognaJH help! Something like:

JuliaArgs for Distributed.jl
heap-size-hint = Sys.total_physical_memory() / Sys.CPU_THREADS

v1.10.0 was supposed to improve the GC, but the changes did not make the cut. v1.11.0 might address this.

We may be able to find workarounds for v1.9.0 and v1.10.0.

TimSlendebroek commented 11 months ago

I made an example notebook that runs locally and shows the same problems (without wrapping and GC). @mbolognaJH see https://github.com/ProjectTorreyPines/FUSE.jl/commit/76cfed0e16152084894f5fd00400a629971f3996 (examples/debugging/memory.ipynb).

(Don't run the last two cells, as those run the optimization; the example already shows memory issues with a simple load and save):

image

mbolognaJH commented 11 months ago

@TimSlendebroek is there a specific dd.json file I should be using? Maybe I'm missing part of my setup for this example.

TimSlendebroek commented 11 months ago

Aah, sorry @mbolognaJH!

I can send you one, or you can do the following to create your own JSON file:

using FUSE
dd, ini, act = FUSE.init(:KDEMO)

IMAS.imas2json(dd, "<your path>")  # imas2json writes dd to a JSON file (json2imas reads one)

TimSlendebroek commented 11 months ago

@mbolognaJH can we add each other on Zoom/Discord so I can respond a bit quicker if you have questions, or need me to explain something or send files?

mbolognaJH commented 11 months ago

Yes sure thing, we can continue the conversation on Zoom (I haven't tried it before, but we'll see how it works): my email is michael.bologna@juliahub.com

mbolognaJH commented 11 months ago

I have been able to reproduce the MWE. By limiting the heap size available to the Julia GC with --heap-size-hint, we can avoid the memory growth behavior, as shown below:

image

The root cause of this is the multi-threading and multi-processing design pattern used by FUSE. Currently, FUSE uses multi-processing exclusively via Distributed.jl, including within each node, i.e. multiple Julia processes per node. This requires substantial memory overhead, since a copy of the base system image and user-space code is loaded for each process. It also produces the observed memory growth behavior, since each Julia process assumes that it is the only one on the node.

By adding --heap-size-hint set to roughly Sys.total_physical_memory() / Sys.CPU_THREADS, we get the expected behavior and the GC is able to collect at a regular interval. This also obviates the need for manual calls to GC.gc(). Within the addprocs command, the exeflags keyword argument can propagate a heap size hint to each child Julia process, e.g. exeflags="--heap-size-hint=2G".
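
For example, something along these lines (a sketch only; the exact cluster manager used to launch FUSE workers may differ, and Sys.total_memory() is used here as a stand-in for the physical-memory query):

using Distributed
# give each worker roughly its fair share of the node's memory as a GC heap hint
hint_gb = max(1, Sys.total_memory() ÷ Sys.CPU_THREADS ÷ 2^30)
addprocs(Sys.CPU_THREADS, exeflags="--heap-size-hint=$(hint_gb)G")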

The recommended long-term solution, which would also reduce memory overhead and improve performance, would be to use multi-threading within each node, as documented at https://docs.julialang.org/en/v1.9/manual/multi-threading/. This requires the work to be partitioned such that the master node is aware of the thread count and total memory on each child node.

bclyons12 commented 11 months ago

Thanks @mbolognaJH , this makes sense to me. I've been looking for how to implement your long-term solution, but I can't find a good example. How can we launch one process per node that has access to all the cores/threads on that node?

mbolognaJH commented 11 months ago

You can use the exeflags again to specify the number of threads you want to spawn:

For example:

# You won't need the heap-size-hint if you're doing the long-term solution,
# but here's how to pass multiple exeflags in one `addprocs()` command.
using Distributed
num_procs = 1
num_threads = 8  # you can use "auto" here as well
addprocs(num_procs, exeflags=["--threads=$num_threads", "--heap-size-hint=2G"])
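
For the long-term multi-threading option, the per-node work could then be threaded on each worker along these lines (a sketch only; do_case is a hypothetical per-case function, not a FUSE API):

@everywhere function run_cases_threaded(cases)
    results = Vector{Any}(undef, length(cases))
    Threads.@threads for i in eachindex(cases)
        results[i] = do_case(cases[i])   # hypothetical per-case work
    end
    return results
end
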
orso82 commented 11 months ago

Thank you @mbolognaJH ! This solves a major headache!!!

@bclyons12 @mbolognaJH I consider this solved now. I'll add the heap-size-hint as you suggest, and have @TimSlendebroek test it.

Please don't spend further time on this now. I have more pressing questions for both of you, @bclyons12 and @mbolognaJH; I'll open up some issues and tag you in those.

Thanks again!!!!

bclyons12 commented 11 months ago

I still think there are very good reasons to figure out how to do the hybrid multiprocessing-multithreading implementation. I understand that it's not a priority for now, but we should open a separate issue for a future upgrade if we're going to close this one.

orso82 commented 11 months ago

@TimSlendebroek reported that the --heap-size-hint fix seems to have fixed the issue!

image

@daveweisberg please note that, as part of d7c57a5, the default is now cpus_per_task=1, which means that on OMEGA you'll be able to push through double the number of workers for the same number of nodes. This should help significantly!

Thanks again @mbolognaJH !

TimSlendebroek commented 11 months ago

image

On a local run with a 2G heap-size hint.

orso82 commented 11 months ago

Thank you @TimSlendebroek for the thorough testing!!

daveweisberg commented 11 months ago

Thanks everyone for the hard work in solving this issue! I've started testing this with Julia 1.10 on the Omega cluster, and so far it looks good!

A quick question for @TimSlendebroek or @orso82: Should I still include the following in the function I'm running on the distributed workers?

GC.gc()
ccall(:malloc_trim, Cvoid, (Cint,), 0)

Or is this no longer necessary?