PyPSA / linopy

Linear optimization with N-D labeled arrays in Python
https://linopy.readthedocs.io
MIT License

Gurobi race condition #192

Closed dannyopts closed 9 months ago

dannyopts commented 9 months ago

I am currently experiencing a bug when using a remote compute gurobi solver.

The issue is that the env is destroyed before we retrieve the solution; if we lose the race, we can't retrieve the solution and blow up with an error:

AttributeError: Failure writing output to destination (code 23, node http://localhost:61000, command PUT http://localhost:61000/api/v1/jobs/b438a8b5-65db-4c6c-9a00-d5f4e9fefa53/worker?sync=true&cmd=13). Did you mean: 'ObjNVal'?

More details on how to reproduce below, but the TL;DR is: the job is aborted when we exit the stack on line 582. I presume deletion of this env on the compute-server side is then an asynchronous process. If we ask "in time" we can still get the solved values back, but if not, we error out.

I think the fix is simple: we just need to retrieve the solution before we exit the stack.
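The ordering bug can be sketched without Gurobi at all. Everything below (`FakeRemoteEnv`, `solve_buggy`, `solve_fixed`) is a hypothetical stand-in for linopy's `ExitStack` usage, not the actual library code:

```python
from contextlib import ExitStack

class FakeRemoteEnv:
    """Hypothetical stand-in for a remote Gurobi environment."""
    def __init__(self):
        self.alive = True

    def close(self):
        # On a real compute server, freeing the env kills the remote job.
        self.alive = False

    def read_solution(self):
        if not self.alive:
            raise RuntimeError("job killed because environment was freed")
        return 6.0

def solve_buggy():
    # Current ordering: the ExitStack frees the env, *then* we read the solution.
    with ExitStack() as stack:
        env = FakeRemoteEnv()
        stack.callback(env.close)
    return env.read_solution()  # env already freed -> raises

def solve_fixed():
    # Proposed ordering: read the solution while the env is still alive.
    with ExitStack() as stack:
        env = FakeRemoteEnv()
        stack.callback(env.close)
        solution = env.read_solution()
    return solution
```

With a local (in-process) env the buggy ordering happens to work, because nothing tears the solution down asynchronously; the remote case is what exposes it.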

A workaround is to pass in an env rather than letting linopy create one for you.
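A sketch of that workaround, assuming your linopy version forwards an `env` keyword through to gurobipy (check your release; the keyword is an assumption here). This requires a valid client license pointing at the compute server, so it is not runnable standalone:

```python
import gurobipy
from linopy import Model

m = Model()
x = m.add_variables(lower=0, name="x")
m.add_constraints(3 * x + 1 >= 10)
m.add_objective(2 * x)

# Create and own the env ourselves, so linopy never frees it mid-flight.
# GRB_LICENSE_FILE should point at the client license with COMPUTESERVER set.
with gurobipy.Env() as env:
    m.solve(solver_name="gurobi", env=env)
    print(m.objective.value)  # solution is read while the env is still open
```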

Happy to create a PR to make this fix.

Steps to reproduce

You need to have a Gurobi compute server running with a valid license; I am doing this via docker compose:

version: '3.9'
services:
  compute:
    image: gurobi/compute:latest
    restart: always
    command: --hostname=localhost
    volumes:
       - ./gurobi.lic:/opt/gurobi/gurobi.lic
    ports: 
      - "61000:61000"
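With the license file in place, the server can be brought up with (assuming the Docker Compose v2 CLI):

```shell
docker compose up -d compute
docker compose logs -f compute   # wait until the server reports it is listening
```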

Create a second lic file (e.g. local_gurobi.lic) to point at this compute server:

COMPUTESERVER=localhost:61000

Then build a simple model and solve it against the compute server:

from linopy import Model

m = Model()
x = m.add_variables(lower=0, name='x')
m.add_constraints(3*x + 1 >= 10)
m.add_objective(2 * x)
# m.add_constraints(3*x + 1 <= 1)
m.solve(solver_name="gurobi")
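For reference, this model's optimum can be checked by hand, without any solver: the constraint gives x >= 3, so minimizing 2x yields 6, matching the solver log further down.

```python
# Hand-check of the expected optimum (no solver needed):
# minimize 2*x subject to 3*x + 1 >= 10 and x >= 0.
x_opt = max(0.0, (10 - 1) / 3)   # binding constraint: x = 3
objective = 2 * x_opt
print(x_opt, objective)
```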

Now run the model:

GRB_LICENSE_FILE="./local_gurobi.lic" python x.py

This "should" solve with no issues.

Now add a time.sleep(5) before get_solver_solution in run_gurobi (simulating some slowness in the OS); this seems realistic when working with larger models:

status.legacy_status = condition

import time
time.sleep(5)

def get_solver_solution() -> Solution:
    objective = m.ObjVal
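Why the sleep flips the outcome can be simulated without a server. Everything below (`FakeRemoteJob`, the deferred kill via `threading.Timer`) is a hypothetical model of the async env teardown, with the 5-second sleep scaled down:

```python
import threading
import time

class FakeRemoteJob:
    """Hypothetical model of a job whose env is freed asynchronously server-side."""
    def __init__(self):
        self._killed = threading.Event()

    def free_env(self):
        # Deletion on the compute server is async: the job dies a moment later.
        threading.Timer(0.05, self._killed.set).start()

    def get_solution(self):
        if self._killed.is_set():
            raise RuntimeError("job killed because environment was freed")
        return 6.0

job = FakeRemoteJob()
job.free_env()
early = job.get_solution()  # asked "in time": the job is still alive

time.sleep(0.5)             # the issue's time.sleep(5), scaled down
late_error = None
try:
    job.get_solution()      # now we lose the race
except RuntimeError as exc:
    late_error = exc
```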

Now run the original model again

GRB_LICENSE_FILE="./local_gurobi.lic" python x.py

When I do this I see

Solved in 0 iterations and 0.04 seconds (0.00 work units)
Optimal objective  6.000000000e+00
Warning: environment still referenced so free is deferred
Warning: remote job 5c53a3ce-82df-4c2a-9741-0361032cb025 on server http://localhost:61000 killed because environment was freed

Followed by

AttributeError: Failure writing output to destination (code 23, node http://localhost:61000, command PUT http://localhost:61000/api/v1/jobs/b438a8b5-65db-4c6c-9a00-d5f4e9fefa53/worker?sync=true&cmd=13). Did you mean: 'ObjNVal'?
aurelije commented 9 months ago

This is the problem I have noticed and wanted to report.

What I have found is that the hack to save a token closes the environment as soon as a solution is found. Other parts of the code still expect the environment to be up and running, so when you try to explore the solution it throws this error.

I was thinking about the option of retrieving the solution before the env is closed. But if the problem is infeasible and you try to compute an IIS, you will hit the same problem.

dannyopts commented 9 months ago

Maybe computing the IIS when using a remote server should require you to manage the env yourself, and this could just be added to the docs?

Since computing the IIS could happen at any time, I can't think of how this could work otherwise, unless it were only allowed straight after the run, with some wrapping function like solve_or_computeIIS which would itself manage the env creation?
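The wrapper idea might look like the hypothetical sketch below: one function owns the env across both solve and IIS, so neither races against the teardown. `DummyEnv` and `DummyModel` are stand-ins for gurobipy objects; none of these names exist in linopy:

```python
class DummyEnv:
    """Stand-in for a remote solver environment (hypothetical)."""
    def __init__(self):
        self.open = True

    def close(self):
        self.open = False

class DummyModel:
    """Stand-in for a model; solve/IIS both require a live env."""
    def __init__(self, feasible):
        self.feasible = feasible

    def solve(self, env):
        assert env.open, "env freed before solve"
        return "optimal" if self.feasible else "infeasible"

    def compute_iis(self, env):
        assert env.open, "env freed before IIS"
        return ["3*x + 1 >= 10", "3*x + 1 <= 1"]

def solve_or_compute_iis(model, make_env):
    """Solve, and if infeasible compute the IIS, before the env is freed."""
    env = make_env()
    try:
        status = model.solve(env)
        iis = model.compute_iis(env) if status == "infeasible" else None
        return status, iis
    finally:
        env.close()
```

The trade-off is the one raised above: the IIS can only be computed inside this window, not at an arbitrary later time.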