Closed Dr-Irv closed 1 year ago
I agree with @Dr-Irv here. That would be a very nice function to have for simpler constraints, where you don't need to create a new pd.DataFrame just to combine two columns into an expression when you could build it directly in the function from the pd.Series.
@Dr-Irv would you mind sharing a more complete snippet of the above example including where/how the variables are created? For instance I would have written this as follows (assuming daily_hours is data and h, pairday are variables):
daily_hours = pd.DataFrame(columns=["daily_hours"], data=...)
pairday_df = (
    daily_hours
    .grb.pd_add_vars(m, name="h")
    .grb.pd_add_vars(m, name="pairday")
)
At that point there's no concat needed, everything was already munged into the same dataframe and aligned on the same index so constraint building just works:
pairday_df.grb.pd_add_constrs(m, "h <= 2* daily_hours * pairday", name="PairDayUp")
pairday_df.grb.pd_add_constrs(m, "2 * pairday <= h", name="PairDayLow")
So I'm curious why the variables are separated out as series in the first place?
Discussed a bit further with @rluce. For the example above (hopefully I am not misrepresenting it) we think it makes more sense to create variables attached to the original dataframe and use the accessors, and we'll focus on that approach in the documentation and examples.
However, for some cases a top-level pd_add_constrs function makes sense, such as aggregations across a multi-index that need to be aligned with an existing series:
pd_add_constrs(model, df.groupby("Job")["Assign"].sum(), GRB.EQUAL, requirements)
or as a convenience function to avoid direct use of the gurobipy model and the associated API confusion that comes with it:
pd_add_constrs(model, series_var.sum(), GRB.EQUAL, 1.0)
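To illustrate the alignment issue in the multi-index case, here is a minimal pandas-only sketch. It uses integer stand-ins for the "Assign" variables, and the data values and the "Worker" level name are made up for illustration. It shows only how a groupby aggregation collapses the multi-index to one row per job so that it lines up with a requirements series; it does not call gurobipy or gurobipy-pandas.

```python
import pandas as pd

# Multi-index of (Job, Worker) pairs; level names and values are illustrative
index = pd.MultiIndex.from_tuples(
    [("j1", "w1"), ("j1", "w2"), ("j2", "w1"), ("j2", "w2")],
    names=["Job", "Worker"],
)

# Integer stand-ins for what would be a column of decision variables
df = pd.DataFrame({"Assign": [1, 0, 1, 1]}, index=index)

# An existing series indexed by job only
requirements = pd.Series({"j1": 1, "j2": 2}, name="Required")

# Grouping on the "Job" index level collapses the multi-index to one row per job
assigned = df.groupby("Job")["Assign"].sum()

# align() returns both sides reindexed to a common set of labels
lhs, rhs = assigned.align(requirements, join="outer")
print(pd.DataFrame({"lhs": lhs, "rhs": rhs}))
```

A top-level function could perform this kind of alignment internally before building one constraint per row.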
> Discussed a bit further with @rluce. For the example above (hopefully I am not misrepresenting it) we think it makes more sense to create variables attached to the original dataframe and use the accessors, and we'll focus on that approach in the documentation and examples.

I really disagree here. If the DataFrame contains data, you want the variables in a separate object. I don't want to mix my data with my decision variables. That's why I put them in a Series.
> So I'm curious why the variables are separated out as series in the first place?

From an object-oriented design standpoint, we avoid mixing DataFrames that have data and DataFrames (or Series) that are decision variables.
This is akin to how modeling languages work. You declare your data. You declare your decision variables. The decision variables are indexed on sets from the data, but you don't create objects that mix the two.
One technique we use is to leverage Python dataclass patterns. We could have
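import dataclasses as dc

import pandas as pd

@dc.dataclass
class MyData:
    # Input data tables only
    table1: pd.DataFrame
    table2: pd.DataFrame

@dc.dataclass
class MyDecisionVariables:
    # Decision variables only, each indexed on a set taken from the data
    var1: pd.Series
    var2: pd.Series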
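As a rough sketch of how two such containers might be populated, the example below builds the variable series on the index of a data table without adding columns to that table. The field names, the sample values, and the placeholder_vars helper are all hypothetical; the helper returns label strings as a stand-in for whatever call actually creates the solver variables.

```python
import dataclasses as dc

import pandas as pd


@dc.dataclass
class MyData:
    daily_hours: pd.DataFrame


@dc.dataclass
class MyDecisionVariables:
    h: pd.Series
    pairday: pd.Series


def placeholder_vars(index: pd.Index, name: str) -> pd.Series:
    # Hypothetical stand-in for variable creation: one label string per index entry
    return pd.Series([f"{name}[{key}]" for key in index], index=index, name=name)


days = pd.Index(["mon", "tue", "wed"], name="day")
data = MyData(daily_hours=pd.DataFrame({"daily_hours": [8.0, 6.5, 7.0]}, index=days))

# Variables are indexed on a set from the data but live in their own object
variables = MyDecisionVariables(
    h=placeholder_vars(data.daily_hours.index, "h"),
    pairday=placeholder_vars(data.daily_hours.index, "pairday"),
)
print(variables.h)
```

The data table keeps its single column, and each variable series can be found by name on the variables object.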
I am NOT a fan of the mixing of the data objects and the decision variable objects, which is what you have chosen to do.
I really believe that a cleaner design would have pd_add_constrs() and pd_add_vars() at the top level, which would give you the benefit of type checking for those two functions. The accessors would then only be used to get access to the gurobi attributes of the corresponding gurobi objects stored in a Series (i.e., don't use accessors for Index or DataFrame).
I agree with @Dr-Irv. There are several reasons to do this, but as Irv said at a minimum it is good programming practice for separation of data and model objects.
Even when the var has the same index as a DF, it's not necessarily clean to just add a new column to the DF. That may cause confusion down the road, because you have to remember which DF has each var in it, instead of just using the Series that matches the var's name.
In the model I rewrote to use your new pandas API, it has around 20 unique variable types, with 7-10 different index sets. Some of these don't even match an index in the input data. There are also many different data inputs that share an index but come from different sources, so which DF do you use to build the var set? And then I'd have to remember which of those DFs has the var column in it.
Also, it's about maintaining code: if you set it up the way Irv described above, it is much easier for a new person to come in and see what's going on. If I can't remember which data DF I assigned a variable to, there would be no chance for a new person to come in and understand it, whereas a Data-specific class and a Var-specific class make it easier for new teammates to understand where everything is.
Ok, so this seems to have changed from what I initially understood as a "nice to have for simple cases" global function to the dominant way you would prefer to write models?
@rrandall1471 you mentioned:

> In the model I rewrote to use your new pandas API, it has around 20 unique variable types, with 7-10 different index sets ...

Could you please share the code you wrote for this model (even if it's rough / without sample data)? I'd like to get a clearer idea of how you're naming your different data inputs and variable series to understand where the pain points come from.
This complete data/decision variable separation is quite a departure from the initial style, especially the pd.eval style, which requires variables and data to live in the one dataframe. The change would mean an almost complete rewrite of the code, tests, and examples, as we'd like to present a consistent approach. If that's the way we need to go, that's fine, but I'd like to thoroughly convince myself before committing to it. I'm rewriting my examples in that form first of all, but if you could share some complete examples you've formulated (maybe by email is easier?) then I can try rewriting them in the accessor style to see where it breaks down for me.
Thanks!
I will email you two examples that I did, using the accessor style. One is simple, the other more complex. Not sure if @rrandall1471 can send the complex example he tried, as it is based on client work.
In both of my examples, I store the decision variables in separate series. If I want to write a constraint that has both data and decision variables in it, I then concat the DataFrame and the Series that has decision variables. Then I can use the eval syntax.
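The concat step described here can be sketched with plain pandas. This is only an illustration under stated assumptions: the variable series hold numeric stand-ins rather than gurobi variables, the sample values are invented, and the expression is evaluated with the built-in DataFrame.eval rather than the grb accessor, so it returns booleans instead of constraints.

```python
import pandas as pd

days = pd.Index(["mon", "tue", "wed"], name="day")

# Data lives in its own DataFrame
daily_hours = pd.DataFrame({"daily_hours": [8.0, 6.0, 7.0]}, index=days)

# Numeric stand-ins for decision variables, kept in separate Series
h = pd.Series([10.0, 0.0, 20.0], index=days, name="h")
pairday = pd.Series([1, 0, 1], index=days, name="pairday")

# axis=1 aligns on the index; each Series name becomes a column name
combined = pd.concat([daily_hours, h, pairday], axis=1)

# With all columns in one frame, a string expression can refer to them by name
satisfied = combined.eval("h <= 2 * daily_hours * pairday")
print(combined.assign(satisfied=satisfied))
```

The combined frame is a temporary object built only for the expression, so the data and the variable series stay separate everywhere else in the model code.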
Model/data separation is really important. That's one of the (few) things modeling languages got right.
@simonbowly I'm in the partner webinar, and they showed an example using the existing API. Have you decided to keep that, or to take in our suggestions for making pd_add_constrs() and pd_add_vars() top-level functions, and not mixing data and variables?
I am partway through the additions at the moment, but no documentation done yet so I couldn't give a complete example for the partner talk.
Current thinking is that we'll allow both styles, since for some examples the accessors approach looks useful, but we will definitely have the global functions implemented as you suggested.
@Dr-Irv one more open question on that actually - with the top level pd_add_vars and pd_add_constrs implemented, is there any use at all for the series accessors pd.Series.grb.add_vars and pd.Series.grb.add_constrs?
I think from your previous comments you would be ok with removing them? We would keep only the dataframe and index accessor methods for specific situations (e.g. method chaining). Series accessors would be restricted to vectorized operations on the Var/Constr objects that they hold.
New functions added, documentation to come
>>> import pandas as pd
>>> import gurobipy as gp
>>> from gurobipy_pandas import pd_add_constrs, pd_add_vars
>>> model = gp.Model()
>>> x = pd_add_vars(model, pd.RangeIndex(5))
>>> y = pd_add_vars(model, pd.RangeIndex(5))
>>> pd_add_constrs(model, x, '<', y)
0 <gurobi.Constr *Awaiting Model Update*>
1 <gurobi.Constr *Awaiting Model Update*>
2 <gurobi.Constr *Awaiting Model Update*>
3 <gurobi.Constr *Awaiting Model Update*>
4 <gurobi.Constr *Awaiting Model Update*>
dtype: object
> @Dr-Irv one more open question on that actually - with the top level pd_add_vars and pd_add_constrs implemented, is there any use at all for the series accessors pd.Series.grb.add_vars and pd.Series.grb.add_constrs?
>
> I think from your previous comments you would be ok with removing them? We would keep only the dataframe and index accessor methods for specific situations (e.g. method chaining). Series accessors would be restricted to vectorized operations on the Var/Constr objects that they hold.

IMHO, you shouldn't have the accessors for add_vars and add_constrs on DataFrame or Index either. One reason is that you don't get type checking. Another is that it's not really an "accessor", i.e., you're not getting a property - you're creating a new object.
To include them on Series is all about how you want the design to appear. For completeness, if you are going to have the accessors for DataFrame and Index, then you should have them for Series as well.
Also, from an examples perspective, you should choose one style and stick to it, because people will follow that style as evidence of "best practice". Both @rrandall1471 and I think that add_vars and add_constrs should not be on the accessors, nor should the examples demonstrate using them that way.
I realize that the latter statements are something we disagree about.
Thanks Irv. Yes we disagree on the last point (the accessors have just proven too useful to discard for me) but I appreciate you making us aware of the potential downsides. And I take your point re: consistency.
FWIW, creating new objects via accessors seems a very natural usage to me: string accessors apply formatting to create new strings, pdvega creates plots by combining dataframe columns, geopandas applies transformations, etc
Thanks again for the feedback!
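To make the two styles under discussion concrete, here is a small pandas-only sketch. The names make_labels, DemoAccessor, and the "demo" namespace are hypothetical and are not part of gurobipy-pandas; the function builds label strings rather than solver variables. It shows the same operation exposed once as a typed top-level function and once through an accessor registered with pandas' register_index_accessor.

```python
import pandas as pd


def make_labels(index: pd.Index, *, name: str) -> pd.Series:
    """Top-level style: explicit, annotated arguments; returns a new Series."""
    return pd.Series([f"{name}[{key}]" for key in index], index=index, name=name)


@pd.api.extensions.register_index_accessor("demo")
class DemoAccessor:
    """Accessor style: the same operation reached through ``index.demo``."""

    def __init__(self, pandas_obj: pd.Index) -> None:
        self._index = pandas_obj

    def make_labels(self, *, name: str) -> pd.Series:
        # Delegates to the top-level function, so both styles give the same result
        return make_labels(self._index, name=name)


idx = pd.RangeIndex(3)
via_function = make_labels(idx, name="x")
via_accessor = idx.demo.make_labels(name="x")
print(via_function.equals(via_accessor))
```

Because the accessor is attached at runtime, static type checkers and editor autocompletion generally will not see it unless stubs are provided, whereas the annotated function signature is visible directly. On the other hand, the accessor form fits into method chains.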
> FWIW, creating new objects via accessors seems a very natural usage to me: string accessors apply formatting to create new strings, pdvega creates plots by combining dataframe columns, geopandas applies transformations, etc

Point taken.
If your plan is to provide both ways - at the accessor level and at the top level, then we'll most likely just use the top level method in order to maintain model/data separation. The other advantage is typing support.
I just did a simple model the other day using gurobipy-pandas and the lack of typing support for those 2 main methods slowed down my development. Had to pull up the online manual to get all the parameters right.
> I just did a simple model the other day using gurobipy-pandas and the lack of typing support for those 2 main methods slowed down my development.

Yep, will follow up on typing in #41
I wrote a model that had constraints that look like this:
But it would have been really cool if I could have written the constraints like this:
In this example, the LHS and RHS are just Series, so you really don't need the pairday_df at all (especially if you switch to the idea of making pd_add_constrs() a function rather than using it through the accessor). You could check if the two Series were aligned and raise an Exception if not.
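A minimal sketch of such an alignment check, using only pandas. The helper name check_aligned and the choice of ValueError are assumptions for illustration, not the library's behavior. It treats two series as aligned when they carry the same index labels, regardless of row order.

```python
import pandas as pd


def check_aligned(lhs: pd.Series, rhs: pd.Series) -> None:
    """Raise ValueError unless both series carry the same index labels.

    Hypothetical helper, not part of gurobipy-pandas.
    """
    # Sorting both indexes makes the comparison independent of row order
    if not lhs.index.sort_values().equals(rhs.index.sort_values()):
        unmatched = lhs.index.symmetric_difference(rhs.index)
        raise ValueError(f"series are not aligned; unmatched labels: {list(unmatched)}")


h = pd.Series([1.0, 2.0, 3.0], index=["mon", "tue", "wed"], name="h")
pairday = pd.Series([0, 1, 1], index=["wed", "mon", "tue"], name="pairday")

check_aligned(h, pairday)  # same labels in a different order: no error

try:
    check_aligned(h, pairday.drop("wed"))  # "wed" is missing on the right-hand side
    aligned = True
except ValueError as err:
    aligned = False
    print(err)
```

A stricter variant could also require identical ordering by comparing the indexes without sorting.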