metab0t opened 6 months ago

Hello, @staadecker!

I am the author of PyOptInterface, an efficient modeling interface for mathematical optimization in Python. Its performance is quite competitive with existing solutions (faster than the vendored Python bindings of some optimizers).

As a researcher in power systems, I deeply understand the need to construct large-scale optimization models efficiently. PyOptInterface provides a user-friendly, expression-based API to formulate problems with different optimizers, and the lightweight handles of variables and constraints can be stored freely in its built-in multidimensional container, in NumPy ndarrays, or in Python dataframes (as you like).

I think PyOptInterface might be a good abstraction layer for your package to handle different optimizers. You can try out our package and evaluate its performance (both memory consumption and speed) if you are interested; we welcome feedback and suggestions in any form.

We believe there are fundamental limitations to file-based I/O, as pointed out in a comment by the developer of JuMP.jl. For example, file-based I/O makes the following advanced features nearly impossible: incremental modification and re-solve, extensible solver-specific attributes, and on-demand querying of solutions. Large file I/O is also slower than in-memory operations.
Hi @metab0t!
Thank you for bringing PyOptInterface to my attention! Your approach, if I understand correctly, of calling Gurobi's C API directly is very neat! Great job implementing that, as I imagine it was non-trivial to get the Python-C bindings working properly.
Pyoframe is built on Polars, a Rust-based dataframe library that follows the Apache Arrow format (not NumPy's format). So one issue I foresee with building off of PyOptInterface is the conversion from Polars to your C++ API (which might be slow?). I think long-term this could be a good goal, as I agree file-based I/O is an inefficient way to build models. However, for now, I think file-based I/O is good enough: our Polars-based writer is extremely fast (~10s for very large models), Gurobi also reads in the model extremely quickly (~10s for very large models), and we don't use files to read back the results.
Incremental modification, re-solve, etc. are rather niche cases IMO, although something I'd like to support in the long term, at which point PyOptInterface might make a lot of sense.
If you have an easy way to integrate Polars dataframes with your library, do let me know, as it would be great to support 4 solvers. However, I would guess such an integration is non-trivial, and such a project would need to wait. Let me know! (Also happy to set up a call to discuss.)
Thanks for reaching out!!
Thanks for your explanation!
I don't think it would be difficult to integrate Polars DataFrames with PyOptInterface. Variables and constraints in PyOptInterface are just lightweight Python objects, and they can be stored in a `polars.Object` column.
There is no need to store the UB, LB, or RC of variables, because they are stored internally by Gurobi and can be queried on demand (see the short sketch below).

The file-based I/O can be skipped entirely, because variables and constraints are added to the Gurobi model as soon as they are created. Just call `model.optimize()`, and the solution can be queried directly through the Variable and Constraint handles previously stored in the Polars DataFrame (skipping the I/O mapping process as well).
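For illustration, a minimal sketch of on-demand querying (attribute names taken from PyOptInterface's attribute enum; the variable `v` here is created just for the example):

```python
import pyoptinterface as poi
from pyoptinterface import gurobi

model = gurobi.Model()
v = model.add_variable(lb=0.0, ub=2.0)

# Bounds live inside the Gurobi model; query them on demand rather than
# duplicating them in the DataFrame.
lb = model.get_variable_attribute(v, poi.VariableAttribute.LowerBound)
ub = model.get_variable_attribute(v, poi.VariableAttribute.UpperBound)
```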
I will give a brief example later.
```python
import polars as pl
import pyoptinterface as poi
from pyoptinterface import gurobi

model = gurobi.Model()

# Create a DataFrame with data columns and per-variable bounds
df = pl.DataFrame({
    "X": [1, 2, 3],
    "Y": [4, 5, 6],
    "lb": 0.0,
    "ub": 2.0,
})

def addvar(lb, ub):
    return model.add_variable(lb=lb, ub=ub)

# Add one variable per row and store its handle in an Object column
df = df.with_columns(
    pl.struct(["lb", "ub"])
    .map_elements(lambda x: addvar(x["lb"], x["ub"]), return_dtype=pl.Object)
    .alias("Variable")
)

vars = df["Variable"]
model.add_linear_constraint(poi.quicksum(vars), poi.Geq, 1.0)

obj = poi.quicksum(v * v for v in vars)
model.set_objective(obj)
model.optimize()

# Query solution values through the stored handles
df = df.with_columns(
    pl.col("Variable")
    .map_elements(lambda x: model.get_value(x), return_dtype=pl.Float64)
    .alias("Value")
)
print(df)
```
@staadecker This is a simple example combining PyOptInterface and Polars to solve a QP problem.
Thank you @metab0t!
I don't think we'd want to change the expression generation code over to `poi.quicksum`, as the whole benefit of this library is the rapid creation of very large expressions using Polars. Additionally, I'd be afraid that `.map_elements` would be quite slow (perhaps even slower than the file I/O).
In any case, I'm currently swamped with work so I need to put this on hold.
I have played with Polars and found that its support for Python objects is incomplete: https://github.com/pola-rs/polars/issues/10189
The design of Pyoframe is quite neat. `Constraint.lhs.data` and `Variable.data` are compact `polars.DataFrame`s storing their terms and indices, which makes a possible switch easy in the future.

In general, using a DataFrame to represent multidimensional indices and their sparse combinations is a great choice. I remember the GAMS benchmark and the JuMP.jl response, where using DataFrames.jl improved performance significantly. https://github.com/Gurobi/gurobipy-pandas is also an interesting project that uses `pandas.DataFrame` as a container for optimization objects.
@metab0t thank you, I'm glad you like it :)
Before building the library I actually tried to do something simple like gurobipy-pandas with Polars, but because Python objects are not fully supported, I couldn't store Gurobi Python expressions in a dataframe the way gurobipy-pandas does.
Support for Python objects does not seem to be a priority for Polars; otherwise, an API similar to gurobipy-pandas would be easy to implement (storing persistent variable/constraint objects in the DataFrame directly).

Besides, the expression system of PyOptInterface is quite fast at constructing expressions with many terms. The core is implemented with an efficient hashmap in C++.
I prepared an example based on the facility_problem in the Pyoframe repo at https://gist.github.com/metab0t/c3c685a8b2ec1f14171772bd7bc7ea3e
On my computer, the result is:
```
Pyoframe elapsed time: 28.71 seconds
POI elapsed time: 14.48 seconds
```
Very neat comparison. Do you have a breakdown of where the time is spent in Pyoframe (expression building vs. I/O)?
I have updated my gist to report the time of Pyoframe in detail. The time spent on expression building, writing the LP file, and reading the LP file is approximately 1:1:2, so expression building takes about 25% of the time and file I/O takes the other 75%.
Very neat, this confirms that file I/O is not ideal and that when I have time I should build a direct interface, perhaps using PyOptInterface. For context, expressions are stored in a "narrow" format where each row is a term, with one column for the term's coefficient and another with an ID indicating the variable (see the sketch below). Would that be something easily converted to your API? I'm thinking it is at that level that I'd want to pass things off to C.
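For instance, a minimal sketch of that narrow format (the column names here are illustrative, not Pyoframe's exact schema):

```python
import polars as pl

# Each row is one term of the expression 3*x0 + 2*x1 + 5*x2,
# keyed by an integer variable ID rather than a Python variable object.
expr = pl.DataFrame({
    "variable_id": [0, 1, 2],       # ID of the variable in each term
    "coefficient": [3.0, 2.0, 5.0],
})
print(expr)
```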
The file-based I/O also requires a lot of code (i.e. all of io.py and io_mappers.py), so it would be great to get rid of it (we can always use Gurobi to generate the .lp file for inspection).
That representation of expressions is fine. In fact, a variable in PyOptInterface is a thin wrapper around its ID, and a linear expression is two vectors representing the coefficients and the indices of variables.

Storing variable objects (from PyOptInterface or gurobipy) directly in Polars is not recommended because Polars supports Object columns poorly. Instead, you can build ONE big array storing all the variables in the model, with the variable ID pointing into that array. When you want to add a constraint to the model, just traverse the rows and construct the expression object, as in the sketch below.
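A minimal sketch of that pattern, reusing the illustrative narrow format from above (the column names are assumptions, and `poi.quicksum` is used for the construction):

```python
import polars as pl
import pyoptinterface as poi
from pyoptinterface import gurobi

model = gurobi.Model()

# ONE big list holding every variable handle; variable ID i points to vars[i].
vars = [model.add_variable(lb=0.0) for _ in range(3)]

# Narrow-format expression: each row is a term (variable_id, coefficient).
expr_df = pl.DataFrame({
    "variable_id": [0, 1, 2],
    "coefficient": [3.0, 2.0, 5.0],
})

# Traverse the rows, construct the expression object, then add the constraint.
expr = poi.quicksum(
    row["coefficient"] * vars[row["variable_id"]]
    for row in expr_df.iter_rows(named=True)
)
model.add_linear_constraint(expr, poi.Geq, 1.0)
```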
By the way, PyOptInterface supports writing the model to LP/MPS files as well. We use the native C API provided by Gurobi, so the output should be identical to gurobipy's.
Just for my future reference, I re-ran @metab0t's benchmark script with our new version (faster writing) and I get the following:
| | Pyoframe | PyOptInterface |
|---|---|---|
| Building model | 12s | |
| Writing .lp file | 16s | |
| Gurobi reads .lp file | 19s | |
| Total | 47s | 23s |
_Note: weirdly, if `use_var_names=False` in `.to_file`, the times become 11s, 15s, and 35s, which means Gurobi is slower at reading the file despite the file being smaller..._
I think I could build on top of PyOptInterface at the io_mapper stage by simply creating a "solver" that builds the variable and constraint mappings with IDs taken from `.add_variable()` and `.add_linear_constraint()`. This would be a really neat way to a) instantly support different solvers, and b) potentially halve the time it takes to build the model?
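A rough sketch of the idea (the mapping shape here is hypothetical, not Pyoframe's actual io_mapper API):

```python
import pyoptinterface as poi
from pyoptinterface import gurobi

model = gurobi.Model()

# Instead of writing variables to an .lp file, assign each Pyoframe
# variable ID a PyOptInterface handle and keep the mapping.
pyoframe_var_ids = [0, 1, 2, 3]  # illustrative IDs
var_map = {vid: model.add_variable() for vid in pyoframe_var_ids}

# Constraints would be mapped the same way via model.add_linear_constraint().
```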
You mean replacing the writing/reading LP file with using PyOptInterface to create the model?
Precisely!
It is a minimal-effort approach to supporting multiple solvers.

I see that the performance could be accelerated further if PyOptInterface could construct linear expressions from two NumPy arrays (coefficients and variables) directly, because Pyoframe has already formulated them as columns of tables. If you find this feature useful, I can implement it in the C++ core of PyOptInterface.
Yes, that would be very useful! A function to do that in your library would be key if it doesn't already exist.
By the way, Polars uses PyArrow under the hood, not NumPy, and I think PyArrow is generally considered the modern successor to NumPy. Any plans to use PyArrow instead?
A PyArrow array can be passed without copying, either as a NumPy array or via the DLPack protocol.
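For example, a minimal zero-copy sketch with PyArrow (`to_numpy(zero_copy_only=True)` raises if a copy would be required):

```python
import pyarrow as pa

# A float64 Arrow array viewed as a NumPy array without copying
# (possible for primitive types with no nulls).
arr = pa.array([3.0, 2.0, 5.0], type=pa.float64())
np_view = arr.to_numpy(zero_copy_only=True)
print(np_view.dtype)  # float64
```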
The proposed API could be named `build_linear_expression(coefs, indices)`, where `coefs` is an array of coefficients (float64) and `indices` is an array of PyOptInterface variable IDs (int32).
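A minimal usage sketch; `build_linear_expression` is the proposal above, not an existing PyOptInterface function, so it is left commented out next to an equivalent construction with the current expression API:

```python
import numpy as np
import pyoptinterface as poi
from pyoptinterface import gurobi

model = gurobi.Model()
xs = [model.add_variable(lb=0.0) for _ in range(3)]

# The two columns Pyoframe already holds: coefficients and variable IDs.
coefs = np.array([3.0, 2.0, 5.0], dtype=np.float64)
indices = np.array([v.index for v in xs], dtype=np.int32)

# Proposed fast path (not yet implemented):
# expr = poi.build_linear_expression(coefs, indices)

# Equivalent construction with the existing expression API:
expr = poi.quicksum(float(c) * v for c, v in zip(coefs, xs))
model.add_linear_constraint(expr, poi.Geq, 1.0)
```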
@metab0t sounds good to me!
@metab0t after some further investigation, this will be very difficult if not impossible. It turns out that to modify variable attributes through your API I need the `VariableIndex` object, not just the variable index integer. Since Polars doesn't support storing objects, I'm not too sure what to do. I could of course store the objects in a Python list or dictionary, but from experience I feel like that won't be very performant...
@staadecker
The `VariableIndex` is a wrapper around a monotonic numeric index starting from 0. In use cases where we do not remove variables, the index is the same as the index in Pyoframe (describing the column index of the variable). The reason to use `VariableIndex` instead of `int` is that a custom class supports operator overloading to construct expressions.

Since Pyoframe has its own abstraction for constructing expressions and you still want to modify variable attributes via their index, I think there are two solutions:

1. Don't store the `VariableIndex` object at all; just construct it on the fly using the constructor `VariableIndex(id)`. At the initialization step, you can iterate from `0` to `N_variables` and call `model.add_variable(lb=lbs[i], ub=ubs[i])`. If you want to modify the upper bound of the variable at column 4 afterward, you can use `model.set_variable_attribute(VariableIndex(4), poi.VariableAttribute.UpperBound, 1.0)`.
2. Let PyOptInterface construct the `VariableIndex` internally if the user passes an integer.

I prefer the first approach, because ordinary users of PyOptInterface will only use the `VariableIndex` returned by `model.add_variable()`. Advanced users with special needs like Pyoframe can exploit the fact that `VariableIndex.index` is the same as the column index of the variable and construct `VariableIndex` directly, as sketched below.
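A minimal sketch of the first approach, assuming `VariableIndex` is exposed at the top level as `poi.VariableIndex` and using made-up bound data:

```python
import pyoptinterface as poi
from pyoptinterface import gurobi

model = gurobi.Model()

# Bound columns as Pyoframe might hold them (illustrative data).
lbs = [0.0, 0.0, 0.0, 0.0, 0.0]
ubs = [10.0, 10.0, 10.0, 10.0, 10.0]

# Initialization: variable i in the model corresponds to column index i.
for i in range(len(lbs)):
    model.add_variable(lb=lbs[i], ub=ubs[i])

# Later: modify the upper bound of the variable at column 4 by
# reconstructing its handle on the fly instead of storing it.
v4 = poi.VariableIndex(4)
model.set_variable_attribute(v4, poi.VariableAttribute.UpperBound, 1.0)
```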
@metab0t this is helpful! I hadn't realized the IDs are monotonic from 0. Definitely your suggestion 1 is the one that makes more sense.
I'll still need to think about this a bit more, since it is possible that some Pyoframe variable IDs are skipped (e.g. if the user runs `Variable()` without assigning it to the model). I could either disallow this (somewhat hacky and a breaking change, but perhaps more aligned with the vision) or maintain a mapping of Pyoframe IDs to PyOptInterface IDs (pretty easy, although it adds overhead).
In any case, I'll generate the wrapper on the fly, thanks!
How does Pyoframe assign the indices of variables? I see that users can create free variables and attach them as attributes of the model. For example, if the model already has 10 variables and the user creates a 2-element `Variable` with indices `[0, 1]`, does Pyoframe automatically overwrite the indices of the new variables to `[10, 11]` after `m.X = x`?
@metab0t The index is incremented every time `Variable()` is called. So if `Variable()` is called but not saved to the model, then from the perspective of the Model there will be a missing variable index.

(For multi-dimensional variables it's effectively the same.)
Got it. All `Variable`s share one auto-incrementing counter to assign indices.

Most modeling frameworks adopt a model-bound design, where a variable/constraint is bound to a specific model and cannot be used in other model instances. The other design (like Pyoframe's) treats variables/constraints as freestanding entities that can be used interchangeably between models, so there is inevitably a translation layer from the object index to the internal model index, even without introducing PyOptInterface.
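A tiny sketch of such a translation layer (names and data are hypothetical):

```python
# Freestanding Pyoframe variable IDs (which may have gaps) mapped to
# dense internal model indices.
pyoframe_ids = [0, 1, 5, 6]  # e.g. IDs 2-4 were created but never attached
model_index_of = {pf_id: i for i, pf_id in enumerate(pyoframe_ids)}

assert model_index_of[5] == 2  # Pyoframe ID 5 becomes model index 2
```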