IBMDecisionOptimization / docplex-examples

These samples demonstrate how to use the DOcplex library to model and solve optimization problems.
https://ibmdecisionoptimization.github.io/
Apache License 2.0

docplex.cp.model out-of-memory running on a cluster #53

Closed shanakap closed 3 years ago

shanakap commented 3 years ago

I am trying to solve the following docplex.cp.model model on a large dataset. The example below uses randomly generated sample data of the same size:

import numpy as np
from docplex.cp.model import CpoModel
N = 180000  # number of rows, one objective term per row
S = 10      # number of columns in u_ij
k = 2       # at most k binary variables may be set to 1

u_i = np.random.rand(N)[:, np.newaxis]
u_ij = np.random.rand(N * S).reshape(N, S)
beta = np.random.rand(N)[:, np.newaxis]

m = CpoModel(name='model')
R = range(1, S)  # indices 1..S-1, i.e. 9 binary variables

idx = list(R)
I = m.binary_var_dict(idx)
m.add_constraint(m.sum(I[j] for j in R) <= k)

total_rev = m.sum(beta[i, 0] / (1 + u_i[i, 0] / sum(I[j] * u_ij[i, j] for j in R)) for i in range(N))

m.maximize(total_rev)

sol = m.solve(agent='local', execfile='/Users/Mine/Python/tf2_4_env/bin/cpoptimizer')

print(sol.get_solver_log())

I tried to run this on a cluster with the following Slurm settings:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=4571

The job stops with an out-of-memory error, as shown in the output:

! --------------------------------------------------- CP Optimizer 20.1.0.0 --
 ! Maximization problem - 9 variables, 1 constraint
 ! Presolve      : 360001 extractables eliminated
 ! Initial process time : 28.95s (28.77s extraction + 0.19s propagation)
 !  . Log search space  : 9.0 (before), 9.0 (after)
 !  . Memory usage      : 623.2 MB (before), 623.2 MB (after)
 ! Using parallel search with 28 workers.
 ! ----------------------------------------------------------------------------
 !          Best Branches  Non-fixed    W       Branch decision
                        0          9                 -
 + New bound is 80920.82
Traceback (most recent call last):
  File "sample.py", line 22, in <module>
    sol=m.solve(agent='local',execfile='/home/wbs/bstqhc/.local/bin/cpoptimizer') #agent='local',execfile='/Users/Mine/Python/tf2_4_env/bin/cpoptimizer')
  File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/model.py", line 1222, in solve
    msol = solver.solve()
  File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver.py", line 775, in solve
    raise e
  File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver.py", line 768, in solve
    msol = self.agent.solve()
  File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 209, in solve
    jsol = self._wait_json_result(EVT_SOLVE_RESULT)
  File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 545, in _wait_json_result
    data = self._wait_event(evt)
  File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 448, in _wait_event
    evt, data = self._read_message()
  File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 604, in _read_message
    frame = self._read_frame(6)
  File "/home/wbs/bstqhc/.local/lib/python3.7/site-packages/docplex/cp/solver/solver_local.py", line 664, in _read_frame
    raise CpoSolverException("Nothing to read from local solver process. Process seems to have been stopped (rc={}).".format(rc))
docplex.cp.solver.solver.CpoSolverException: Nothing to read from local solver process. Process seems to have been stopped (rc=-9).
slurmstepd: error: Detected 2 oom-kill event(s) in step 379869.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

What I observed is that the optimisation runs in parallel: the log says "Using parallel search with 28 workers", and there are 28 cores per node. However, it looks like it is only using one node.

Can you please help me to overcome the out-of-memory issue?

ooudot commented 3 years ago

The number of workers is set by default to the number of cores, which is apparently 28 in your case. Please try forcing it to a lower value, for example 4, which is enough in most cases, by setting the Workers parameter in the solve request, as follows:

sol=m.solve(agent='local',execfile='/Users/Mine/Python/tf2_4_env/bin/cpoptimizer', Workers=4)
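
One way to avoid hard-coding the value is to derive it from the Slurm allocation. The sketch below is an illustration, not part of the original reply: pick_workers and estimate_mb are hypothetical helper names, and it assumes the job runs under Slurm, which sets SLURM_CPUS_ON_NODE to the number of CPUs allocated on the node. The memory estimate further assumes that each worker needs roughly the 623.2 MB reported in the log before search. The log does not confirm how memory scales with the number of workers, so treat the figures as a rough guide only.

```python
import os


def pick_workers(env=None, cap=4):
    """Return a worker count no larger than `cap` or the Slurm CPU allocation.

    Hypothetical helper: reads SLURM_CPUS_ON_NODE and falls back to `cap`
    when the variable is missing or not an integer.
    """
    env = os.environ if env is None else env
    try:
        allocated = int(env["SLURM_CPUS_ON_NODE"])
    except (KeyError, ValueError):
        allocated = cap
    # Never return fewer than one worker.
    return max(1, min(allocated, cap))


def estimate_mb(workers, per_worker_mb):
    """Rough footprint, ASSUMING memory grows linearly with the worker count."""
    return workers * per_worker_mb


# Figures taken from the job script and the solver log in this thread.
per_node_mb = 4 * 4571                  # --ntasks-per-node x --mem-per-cpu
default_run_mb = estimate_mb(28, 623.2)  # 28 workers, the default in the log
capped_run_mb = estimate_mb(4, 623.2)    # 4 workers, as suggested above

# With docplex installed, the result would be passed to the solve call:
# sol = m.solve(agent='local', execfile='...', Workers=pick_workers())
```

Under those assumptions, 28 workers would need about 17.4 GB, close to the 4 x 4571 MB = 18284 MB requested per node (assuming one CPU per task, the Slurm default), while 4 workers would need about 2.5 GB. The execfile path is elided in the comment and should be the same one used in the original call.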